Google File System in Big Data

The Google File System (GFS) is a proprietary distributed file system, created as a solution to serve the highly concurrent streaming data applications managed by Google. Like other distributed file systems, GFS aims to provide a reliable, scalable, performant, and highly available storage system, relying on many of the same system primitives, from operation logging to data replication, snapshots, checkpoints, and more.

Despite these similarities with existing systems, GFS is distinguished by a design tightly coupled to the workloads of its clients, an embrace of massive aggregations of commodity hardware combined with a fault-tolerant design, and many other astute architectural decisions. Together, these have allowed it to serve as a critical system at one of the world's premier data companies.


System Assumptions

Several core assumptions constrain the design of GFS, all centered around the typical workload of GFS clients:

  • read workload consists mainly of large streaming reads (several hundred KB to multiple MB, over a contiguous region of a file), with some small random reads
  • write workload consists mainly of large sequential appends to files (similar in size to the large reads), with few small random writes (files tend to be unmodified after being written)
  • GFS files typically experience highly concurrent, append-heavy workloads, and it is crucial that the system implements these operations efficiently
  • clients tend to run large data processing jobs, which depend on sustained high bandwidth more than on the latency of any individual operation
  • GFS is designed for large files (typically hundreds of MB to a few GB), divided into 64 MB chunks

High-level Architecture

GFS architecture (The Google File System, Ghemawat et al.)

GFS clusters consist of a single master node, which stores all file system metadata, including the following:

  • file and chunk namespace
  • file-to-chunk mapping
  • chunk replica locations

One of the design decisions that vastly optimizes GFS is the combination of large files and little metadata stored per chunk. Each chunk requires approximately 64 bytes of metadata, and since the files stored in GFS are typically very large, the overall number of chunks that must be accounted for is fairly low. For this reason, all GFS metadata is stored in memory, which heavily optimizes metadata operations as a whole.
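
To see why in-memory metadata is practical, consider a rough back-of-the-envelope calculation. The 64 MB chunk size and ~64 bytes of metadata per chunk come from the paper; the 1 PB of stored data below is an illustrative assumption:

```python
# Back-of-the-envelope: master memory needed for chunk metadata.
# 64 MB chunks and ~64 B of metadata per chunk are from the paper;
# the 1 PB of stored file data is an illustrative assumption.
CHUNK_SIZE = 64 * 1024**2        # 64 MB per chunk
METADATA_PER_CHUNK = 64          # ~64 bytes of master metadata per chunk

stored_bytes = 1024**5           # assume 1 PB of file data
num_chunks = stored_bytes // CHUNK_SIZE
metadata_bytes = num_chunks * METADATA_PER_CHUNK

print(f"{num_chunks:,} chunks -> {metadata_bytes / 1024**3:.1f} GB of metadata")
# 16,777,216 chunks -> 1.0 GB of metadata, easily held in RAM
```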

In addition to the master node, GFS clusters consist of a variable number of chunkserver nodes, across which data is partitioned and replicated. Chunks are stored on chunkservers as Linux files and are accessed by their respective chunk handles combined with a byte offset.

Since chunk replica locations can be determined by communicating with all chunkservers on startup, they are the only metadata the master does not persist.

Decoupling data and control flow

Separating the file system metadata (stored solely by the master) from the file data (stored across the chunkservers) ensures that the single master does not become a bottleneck.

“The client translates the file name and byte offset … into a chunk index …it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information…”

From this point onward, data operations require no interaction between the client and the master node and can be sent directly to the chunk replicas, while the master continues to serve metadata operations.
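
The read path can be sketched roughly as follows. The class and method names (`Master.lookup`, the client-side cache) are illustrative, not the actual GFS API:

```python
# Sketch of the GFS client read path: translate (file, offset) into a chunk
# index, ask the master once, cache the answer, then read directly from a
# chunkserver. All class and method names here are illustrative.
CHUNK_SIZE = 64 * 1024**2

class Chunkserver:
    def __init__(self):
        self.chunks = {}                      # chunk handle -> chunk bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class Master:
    def __init__(self):
        self.file_to_chunks = {}              # filename -> [(handle, replicas)]

    def lookup(self, filename, chunk_index):
        # metadata-only RPC: returns the chunk handle and replica locations
        return self.file_to_chunks[filename][chunk_index]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}                       # (filename, chunk index) -> lookup

    def read(self, filename, offset, length):
        index = offset // CHUNK_SIZE          # which chunk holds this offset
        key = (filename, index)
        if key not in self.cache:             # ask the master once, then cache
            self.cache[key] = self.master.lookup(filename, index)
        handle, replicas = self.cache[key]
        # data flows straight from a chunkserver; the master is not involved
        return replicas[0].read(handle, offset % CHUNK_SIZE, length)

master, cs = Master(), Chunkserver()
cs.chunks["h0"] = b"hello, gfs!"
master.file_to_chunks["/logs/a"] = [("h0", [cs])]
print(Client(master).read("/logs/a", 0, 11))  # b'hello, gfs!'
```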

The importance of involving the master only in metadata operations lies in simplicity. It simplifies how the namespace and metadata are handled, since there is only a single source of truth, and it also simplifies the decisions that must be made for chunk placement and replication.

Operation Log

The operation log serves a similar purpose to a database transaction log and is used to ensure a consistent logical timeline of file system metadata mutations. For any metadata mutation, the master first writes the operation in serial order to the operation log and replicates the log, before committing the mutation and responding to the client.

Maintaining an operation log means the master's state can be restored by simply replaying the series of mutations written to the log, and it ensures that failed operations do not leave the file system metadata in a corrupted state.

To further speed up recovery, GFS takes checkpoints of the master state. A checkpoint covers every operation logged up to that point, allowing the log to be truncated to only post-checkpoint operations, which are replayed on top of the checkpoint to restore the master state. Checkpoints are persisted to disk, and a handful of older operation logs and checkpoints are also kept for disaster recovery.
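
A minimal sketch of the log-then-apply discipline and checkpoint-based recovery, assuming a simple JSON-lines record format (the real GFS on-disk formats are not public):

```python
import json

# Log-then-apply plus checkpointing on the master, assuming a simple
# JSON-lines record format (the real GFS formats are not public).
class MasterState:
    def __init__(self):
        self.namespace = {}              # filename -> [chunk handles]
        self.log = []                    # stands in for a replicated log file

    def mutate(self, op):
        # 1. record the mutation durably (and replicate the log) first...
        self.log.append(json.dumps(op))
        # 2. ...only then apply it and respond to the client
        self._apply(op)

    def _apply(self, op):
        if op["type"] == "create":
            self.namespace[op["file"]] = []
        elif op["type"] == "add_chunk":
            self.namespace[op["file"]].append(op["handle"])

    def checkpoint(self):
        # snapshot the state; the log can then be truncated at this point
        snapshot = json.dumps(self.namespace)
        self.log.clear()
        return snapshot

    @classmethod
    def recover(cls, snapshot, log_tail):
        state = cls()
        state.namespace = json.loads(snapshot)    # restore the checkpoint...
        for record in log_tail:                   # ...then replay newer ops
            state._apply(json.loads(record))
        return state

m = MasterState()
m.mutate({"type": "create", "file": "/logs/a"})
snap = m.checkpoint()
m.mutate({"type": "add_chunk", "file": "/logs/a", "handle": "h0"})
print(MasterState.recover(snap, m.log).namespace)   # {'/logs/a': ['h0']}
```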

Chunk Leases

Since the master is not involved in data operations, it must delegate to one of a chunk's replicas the authority to determine a serial mutation order. This is handled by granting a lease for the chunk to one of its replicas, which becomes the primary.

Chunk leases expire after a short timeout, and can also be revoked manually by the master if a chunk needs to be shielded from mutations. For the most part, though, chunks that are being mutated continuously will keep their leases, as chunkservers can request extensions through their regular communication with the master.
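
A toy sketch of that lease bookkeeping on the master (the paper gives an initial lease timeout of 60 seconds; the data structures here are invented for illustration):

```python
import time

LEASE_TIMEOUT = 60.0   # the paper gives an initial lease timeout of 60 seconds

# Toy lease table on the master; the data structures are illustrative.
class LeaseManager:
    def __init__(self):
        self.leases = {}   # chunk handle -> (primary replica, expiry time)

    def grant_or_get(self, handle, replicas):
        primary, expiry = self.leases.get(handle, (None, 0.0))
        if primary is None or expiry < time.time():
            primary = replicas[0]          # choose a live replica as primary
            self.leases[handle] = (primary, time.time() + LEASE_TIMEOUT)
        return primary

    def extend(self, handle, replica):
        # extension requests piggyback on heartbeat messages from the primary
        primary, _ = self.leases.get(handle, (None, 0.0))
        if primary is replica:
            self.leases[handle] = (primary, time.time() + LEASE_TIMEOUT)

    def revoke(self, handle):
        # the master revokes a lease to shield a chunk from further mutations
        self.leases.pop(handle, None)
```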

For the chunk it is mutating, a client initially queries the master for the replica holding the lease and the locations of all replicas. The master returns this information, which the client caches for continued use (if no replica holds a lease, the master selects one to grant it to).

From this point onward, the client simply pushes the data for the operation to all replicas. Once every replica has acknowledged receiving the data, the client issues the write request to the primary replica (the replica holding the lease), which determines a serial order for the mutations; that order is then applied identically at all replicas.

Since the transmission of data from the client to the replicas is decoupled from the primary's determination of serial order and the subsequent commit of the mutations, the data flow can be scheduled around the network topology of the chunk replicas, rather than always sending the data to the primary first. This is a meaningful network optimization.
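
The two phases, data flow and then commit, might be sketched like this. All names are illustrative, and in real GFS the data is pipelined along a chain of chunkservers chosen by network topology:

```python
# Two-phase mutation flow: (1) push data to every replica in any order,
# (2) the primary assigns a serial order and all replicas commit in it.
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffered = {}       # data id -> bytes, held until the commit
        self.applied = []        # (serial number, data id) in commit order

    def push_data(self, data_id, data):
        self.buffered[data_id] = data             # phase 1: data flow only

    def apply(self, serial, data_id):
        self.applied.append((serial, data_id))    # phase 2: commit

class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        # only the lease holder chooses the serial order for mutations
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        for s in self.secondaries:                # same order everywhere
            s.apply(serial, data_id)
        return serial

secondaries = [Replica("s1"), Replica("s2")]
primary = Primary("p", secondaries)
# in real GFS the data is pipelined along a chain chosen by network topology
for node in [primary] + secondaries:
    node.push_data("d1", b"record bytes")
primary.write("d1")                               # commit after all replicas ack
print(secondaries[0].applied)                     # [(0, 'd1')]
```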

Consistency Model

The GFS consistency model reflects the general philosophy of GFS: it is designed to be flexible with respect to the typical workload of clients. It is based on the idea that following mutations, which may or may not succeed and may be concurrent, a file region can be in one of three states:

  • consistent: all clients see the same data regardless of which replica they read from; concurrent successful mutations can leave a region consistent, but the data may be made up of mingled fragments of multiple mutations
  • defined: the region is consistent and the mutation succeeded without interference from concurrent writers (all clients see exactly what the mutation wrote)
  • undefined: different clients may see different data (the result of a failed mutation)

GFS clients only need to be able to distinguish between defined and undefined regions. This is possible because the serial mutation order is selected by the chunk replica currently holding the lease, so mutations are applied in the same order at all replicas. Additionally, since GFS versions its chunks, stale replicas will not have mutations applied and will not have their locations sent to clients by the master, ensuring that all readable replicas are consistent with one another.

Although stale chunk replicas may have their locations cached by the client (and therefore could be read from), the append-heavy workload of GFS clients tends to minimize the adverse effects of reading stale data.
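
A sketch of how chunk version numbers detect staleness, under the rule (stated in the paper) that the master bumps a chunk's version whenever it grants a new lease; the data structures here are illustrative:

```python
# Chunk version numbers detect stale replicas: the master bumps a chunk's
# version each time it grants a new lease, so a replica that was down and
# missed the bump falls behind.
class VersionTracker:
    def __init__(self):
        self.current = {}           # chunk handle -> latest version
        self.replica_version = {}   # (chunk handle, replica id) -> version

    def grant_lease(self, handle, live_replicas):
        self.current[handle] = self.current.get(handle, 0) + 1
        for r in live_replicas:     # only live replicas learn the new version
            self.replica_version[(handle, r)] = self.current[handle]

    def fresh_replicas(self, handle, replicas):
        # stale replicas are never returned to clients and are later GC'd
        return [r for r in replicas
                if self.replica_version.get((handle, r)) == self.current[handle]]

vt = VersionTracker()
vt.grant_lease("h0", ["cs1", "cs2", "cs3"])
vt.grant_lease("h0", ["cs1", "cs2"])            # cs3 was down for this bump
print(vt.fresh_replicas("h0", ["cs1", "cs2", "cs3"]))   # ['cs1', 'cs2']
```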

It is important to recognize that GFS is able to relax its consistency model because of an intimate understanding of its clients' workloads. In particular, GFS emphasizes that clients rely on appends for most writes, on checkpointing, and on writing self-validating, self-identifying records to ensure that the relaxed consistency model does not adversely affect them.
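
As an illustration of a self-validating, self-identifying record, here is one possible encoding; this exact layout is an assumption, not the format GFS applications actually used:

```python
import hashlib
import struct

# One possible self-validating, self-identifying record layout:
# 8-byte checksum | 8-byte record id | 4-byte length | payload.
# This exact format is an assumption, not what GFS applications used.
def encode_record(record_id: int, payload: bytes) -> bytes:
    header = struct.pack(">QI", record_id, len(payload))
    checksum = hashlib.sha256(header + payload).digest()[:8]
    return checksum + header + payload

def decode_record(blob: bytes):
    checksum, header, payload = blob[:8], blob[8:20], blob[20:]
    if hashlib.sha256(header + payload).digest()[:8] != checksum:
        return None                    # padding or a failed/partial append
    record_id, length = struct.unpack(">QI", header)
    if length != len(payload):
        return None
    return record_id, payload          # reader drops duplicate record ids

blob = encode_record(42, b"some application record")
print(decode_record(blob))             # (42, b'some application record')
print(decode_record(b"\x00" * 24))     # None: checksum rejects padding
```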

Replica Placement

Chunk replica placement occurs for several reasons:

  • replicating chunks on creation to provide data integrity
  • re-replicating chunks when existing replicas have become stale or chunkservers become unresponsive
  • load balancing disk utilization across chunkservers

The factors considered when placing chunk replicas are choosing chunkservers with low disk utilization (to even out disk usage across chunkservers), avoiding creating many chunks on a single chunkserver at once, and spreading replicas across racks. The first point is self-explanatory and serves as a load balancing heuristic. The second point follows from the workload patterns of GFS clients, which typically stream large volumes of writes to newly created chunks, so placing many new replicas on one chunkserver at the same time would create hot spots. The last point adds an extra layer of fault tolerance through replica placement itself.
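
A toy scoring function capturing those three heuristics (the criteria are from the paper, but the ranking logic and input format below are invented for illustration):

```python
# Toy placement heuristic: prefer low disk utilization and few recent
# creations, and spread replicas across racks. Ranking logic is invented.
def place_replicas(chunkservers, n=3):
    """chunkservers: list of dicts with 'name', 'disk_util',
    'recent_creates', and 'rack' keys."""
    ranked = sorted(chunkservers,
                    key=lambda cs: (cs["disk_util"], cs["recent_creates"]))
    chosen, used_racks = [], set()
    for cs in ranked:                      # spread replicas across racks first
        if len(chosen) < n and cs["rack"] not in used_racks:
            chosen.append(cs)
            used_racks.add(cs["rack"])
    for cs in ranked:                      # fall back if racks run out
        if len(chosen) < n and cs not in chosen:
            chosen.append(cs)
    return [cs["name"] for cs in chosen]

servers = [
    {"name": "cs1", "disk_util": 0.40, "recent_creates": 2, "rack": "r1"},
    {"name": "cs2", "disk_util": 0.30, "recent_creates": 9, "rack": "r1"},
    {"name": "cs3", "disk_util": 0.55, "recent_creates": 0, "rack": "r2"},
    {"name": "cs4", "disk_util": 0.35, "recent_creates": 1, "rack": "r3"},
]
print(place_replicas(servers))   # ['cs2', 'cs4', 'cs3']
```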

Chunks are re-replicated when the healthy replica count falls below the configured replication level, which defaults to 3 but is configurable.

Finally, the master periodically rebalances chunk replicas as a background task. This rebalancing follows the same heuristics as initial replica placement, and it also prefers removing a replica from a chunkserver with above-average disk utilization rather than from one with below-average utilization.

Garbage Collection

Like GFS file allocation, which is executed lazily, garbage collection is a lazily handled background task, combining periodic communication between the chunkservers and the master with the master's own scanning and modification of the file system metadata.

When files are deleted, they are simply hidden by name in the file system namespace, and their chunks are not immediately reclaimed (leaving a window of time during which they can still be read under the hidden name, and even undeleted). After some time, the master removes the file from the namespace, purging all of its in-memory metadata and severing every mapping from the deleted file to the chunks it consisted of. The master will eventually scan for orphaned chunks and remove their metadata as well.
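
A sketch of that lazy deletion flow, using a hidden-name convention invented for illustration (the paper mentions a default grace period of three days):

```python
import time

# Lazy deletion sketch: rename to a hidden name carrying the deletion time,
# then purge after a grace period. The hidden-name convention is invented;
# the paper mentions a default grace period of three days.
GRACE_SECONDS = 3 * 24 * 3600

def delete(namespace, filename, now=None):
    now = int(now if now is not None else time.time())
    hidden = f".deleted.{now}.{filename}"
    namespace[hidden] = namespace.pop(filename)   # still readable/undeletable

def purge_expired(namespace, now=None):
    now = now if now is not None else time.time()
    for name in list(namespace):
        if name.startswith(".deleted."):
            deleted_at = int(name.split(".")[2])
            if now - deleted_at > GRACE_SECONDS:
                # severs the file-to-chunk mapping; chunks become orphans
                del namespace[name]

ns = {"/logs/a": ["h0", "h1"]}
delete(ns, "/logs/a", now=0)
purge_expired(ns, now=GRACE_SECONDS + 1)
print(ns)   # {} -- chunks h0 and h1 are now orphaned, awaiting reclamation
```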

Since there is a clear mapping of files to chunks, maintained entirely by a single master, garbage collection is especially easy for GFS: it is simply a matter of identifying orphaned chunks, purging their metadata, and informing the chunkservers.

Periodically, the chunkservers and the master exchange messages: each chunkserver reports the chunk replicas it holds, and the master responds with those that are no longer present in its metadata, which the chunkserver is then free to reclaim.
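
That exchange is essentially a set difference, as this minimal sketch shows (names and message shapes are illustrative):

```python
# The reclamation exchange reduces to a set difference: the chunkserver
# reports what it holds, the master answers with what is now orphaned.
def orphaned_chunks(master_metadata: set, chunkserver_report: set) -> set:
    return chunkserver_report - master_metadata

live = {"h1", "h2", "h3"}                  # handles still referenced by files
held = {"h1", "h2", "h3", "h7", "h9"}      # replicas on one chunkserver
print(sorted(orphaned_chunks(live, held))) # ['h7', 'h9'] reclaimed in a batch
```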

This asynchronous process of first purging chunk metadata and eventually reclaiming the chunks through communication is very simple. It also handles chunks left unusable by failed creation, which are reclaimed through the same chunkserver-master exchange.

This process also amortizes the cost of storage reclamation by batching the chunks each chunkserver reclaims, rather than reclaiming them eagerly one at a time.

Key Takeaways

There are many remarkable elements of GFS, and I think some of the key ones are the following:

  • simplifying metadata operations and replica placement/management with a single master architecture
  • decoupling metadata operations from data operations, and using chunk leases to avoid involving the master in serializing concurrent mutation order
  • relying on checkpointing and logging for consistency, and fast recovery of master
  • implementing robust methods for identifying and remedying component failure
  • lazy disk allocation and garbage collection through messaging and metadata purging
  • separating data flow from commit of mutations to chunks
  • aggressively designing all elements of the system for specific client workload

The first three points work together to greatly simplify the architecture of GFS without sacrificing performance. Having a single master consolidates metadata operations at a single point in the system, while also making communication with chunkservers and placement/replication decisions easier for the same reason. The latter two of these points make this practical by effectively removing the main barriers: the master becoming a performance bottleneck, or the system being left in an inconsistent state.

The fourth point is critical to one of the defining traits of GFS: the use of densely packed clusters of commodity hardware. By recognizing from the initial design that failure is inherent in systems built from cheaper components, any problems hardware failure could cause are remedied effectively and at scale, which makes the use of cheaper hardware feasible.

The fifth and sixth points are both very interesting disk-level optimizations that make complicated mechanisms like garbage collection and replication fairly simple and performant. By allowing the data to be transmitted in any order from the client, while gatekeeping commits through the chunk lease, GFS makes its two-phase, commit-like replication scheme performant while still ensuring a serial mutation order. In a similar vein, GFS amortizes garbage collection through batching, while reducing the engineering complexity of identifying chunks to reclaim to a simple exchange of two messages between a chunkserver and the master.

The final point is unique to the proprietary nature of GFS. Unlike open-source distributed file systems such as HDFS, which can be set up and administered for any application and by anyone, GFS was conceived as highly domain-specific storage infrastructure, with far more knowledge of its clients' patterns than any general-purpose distributed file system could hope to have. As a result, many well-reasoned design tradeoffs and optimizations make GFS as performant as possible for these clients.

Overall, the Google File System is a remarkable technological feat, and the paper is one I would truly recommend to anyone interested in distributed systems or large-scale storage infrastructure.
