Wednesday, May 14, 2014

GFS (Google File System)

GFS:

- Provides a familiar file system interface.
- Files are organized in directories and identified by path names.
- It supports the usual operations to create, delete, open, close, read, and write files.
- GFS also has snapshot and record append operations.
- Snapshot creates a copy of a file or a directory tree at low cost.
- Record append allows multiple clients to append data to the same file concurrently.
- Record append is useful for multi-way merge results. (A purely illustrative sketch of the client-facing interface follows.)
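
The operations above can be pictured as a small client-side API. The sketch below is a minimal Python rendering under assumed names; GFSClient, its method signatures, and the returned handle type are illustrative, not the actual GFS client library.

# Illustrative only: a hypothetical Python rendering of the client-facing
# operations listed above. Names and signatures are assumptions for the sketch.
from abc import ABC, abstractmethod


class GFSClient(ABC):
    @abstractmethod
    def create(self, path: str) -> None: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...

    @abstractmethod
    def open(self, path: str) -> int: ...            # returns a file handle

    @abstractmethod
    def close(self, handle: int) -> None: ...

    @abstractmethod
    def read(self, handle: int, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, handle: int, offset: int, data: bytes) -> None: ...

    @abstractmethod
    def snapshot(self, src_path: str, dst_path: str) -> None: ...   # low-cost copy

    @abstractmethod
    def record_append(self, handle: int, data: bytes) -> int: ...   # returns the offset GFS chose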

GFS Architecture:

A GFS cluster consists of:
- a single master
- multiple chunkservers, accessed by multiple clients
- Files are divided into fixed-size chunks.
- Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation.
- Chunkservers store chunks on local disks as Linux files.
- Each chunk is replicated on multiple chunkservers.
- By default we store three replicas.
- Users can designate different replication levels for different regions of the file namespace.
- The master maintains all file system metadata (sketched in code after this list). This includes
Namespace
Access control information
Mapping from files to chunks
Current locations of chunks
- The master also controls chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers.
- The master communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
- Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.
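
As a rough mental model of what the master holds, here is a minimal Python sketch. ChunkInfo, MasterMetadata, and all field names are assumptions for illustration; they are not the real data structures.

# Rough, illustrative model of the master's metadata (see the list above).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChunkInfo:
    handle: int                                          # immutable 64-bit chunk handle
    replicas: List[str] = field(default_factory=list)    # chunkserver locations (not persisted)


@dataclass
class MasterMetadata:
    namespace: Dict[str, dict] = field(default_factory=dict)         # path -> access control info
    file_chunks: Dict[str, List[int]] = field(default_factory=dict)  # path -> ordered chunk handles
    chunks: Dict[int, ChunkInfo] = field(default_factory=dict)       # handle -> chunk state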

Single Master:

- Having a single master simplifies our design.
- It enables the master to make sophisticated chunk placement and replication decisions using global knowledge.
- However, clients never read or write file data through the master.
- Instead, a client asks the master which chunkservers it should contact.

Simple read operation:

- Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
- It then sends the master a request containing the file name and chunk index.
- The master replies with the corresponding chunk handle and the locations of the replicas.
- The client caches this information, using the file name and chunk index as the key.
- The client then sends a request to one of the replicas, most likely the closest one.
- The request specifies the chunk handle and a byte range within that chunk.
- Further reads of the same chunk require no more client-master interaction until the cached information expires.
- In fact, the client typically asks for multiple chunks in the same request, and the master can also include information for chunks
immediately following those requested. A minimal sketch of this read path follows.
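
A minimal sketch of that read path, assuming 64 MB chunks; master.lookup and replica.read are hypothetical RPC stubs rather than real GFS calls.

# Minimal sketch of the read path above, assuming 64 MB chunks.
CHUNK_SIZE = 64 * 1024 * 1024

chunk_cache = {}   # (file_name, chunk_index) -> (chunk_handle, replica_list)


def gfs_read(master, file_name, offset, length):
    chunk_index = offset // CHUNK_SIZE            # translate byte offset to chunk index
    key = (file_name, chunk_index)
    if key not in chunk_cache:                    # one master request per uncached chunk
        chunk_cache[key] = master.lookup(file_name, chunk_index)
    handle, replicas = chunk_cache[key]
    start = offset % CHUNK_SIZE                   # byte range within the chunk
    return replicas[0].read(handle, start, length)   # ideally the closest replica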

Chunk Size:

- Chunk size is one of the key design parameters.
- We have chosen 64 MB, which is much larger than typical file system block sizes.
- Each chunk replica is stored as a plain Linux file on a chunkserver and extended only as needed.
- A large chunk size offers several important advantages:
It reduces clients' need to interact with the master, because reads and writes on the same chunk require only one initial request to the master (see the example after this list).
Since chunks are large, a client is more likely to perform many operations on a given chunk.
It reduces network overhead by letting the client keep a persistent TCP connection to the chunkserver over an extended period of time.
It reduces the size of the metadata stored on the master.
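
A quick back-of-the-envelope example of the first advantage: how many master lookups a sequential scan of a 1 GiB file would need with 64 MB chunks versus a traditional 4 KB block size (the 4 KB figure is only a point of comparison, not something GFS uses).

# One master lookup per block touched during a sequential 1 GiB scan.
GIB = 1024 ** 3

for block_size in (64 * 1024 * 1024, 4 * 1024):       # 64 MB chunk vs 4 KB block
    lookups = (GIB + block_size - 1) // block_size
    print(f"{block_size:>10} bytes per block -> {lookups} lookups for 1 GiB")
# prints 16 lookups for 64 MB chunks and 262144 lookups for 4 KB blocks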

Disadvantages of large chunk size:

- Chunkservers storing those chunks may become hot spots if many clients are accessing the same file.

Metadata:

- The master stores three major types of metadata:
the file and chunk namespaces
the mapping from files to chunks
the locations of each chunk's replicas
- All metadata is kept in the master's memory.
- The namespaces and the file-to-chunk mapping are also kept persistent by logging mutations
to an operation log stored on the master's local disk and replicated on remote machines.
- Using a log allows us to update the master state simply and reliably, without risking inconsistencies in the event of a master crash.
- The master does not store chunk location information persistently.
- Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster (a sketch follows).
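
A sketch of how the master can rebuild the chunk locations it never persists from chunkserver reports; the report format and function name here are invented for illustration.

# At startup (and when a chunkserver joins), each chunkserver reports the
# chunk handles it stores; the master rebuilds the location map from that.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rebuild_chunk_locations(
    reports: Iterable[Tuple[str, List[int]]],       # (chunkserver address, handles it stores)
) -> Dict[int, List[str]]:
    locations: Dict[int, List[str]] = defaultdict(list)
    for server, handles in reports:
        for handle in handles:
            locations[handle].append(server)        # the chunkserver has the final word
    return locations


# Example: chunk 0x2EF0 is replicated on two of the reporting chunkservers.
locs = rebuild_chunk_locations([("cs1:7000", [0x2EF0]), ("cs2:7000", [0x2EF0, 0x2EF1])])
assert locs[0x2EF0] == ["cs1:7000", "cs2:7000"]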

Operation Log:

- The operation log contains a historical record of critical metadata changes.
- It is central to GFS.
- It also serves as a logical timeline that defines the order of concurrent operations.
- The master recovers its file system state by replaying the operation log.
- To minimize startup time, we must keep the log small.
- The master checkpoints its state whenever the log grows beyond a certain size.
- In this way, it can recover by loading the latest checkpoint from local disk and replaying only the log records written after it.
- The master switches to a new log file and creates the new checkpoint in a separate thread.
- The new checkpoint includes all mutations before the switch.
- Recovery needs only the latest complete checkpoint and the subsequent log files.
- Older checkpoints and log files can be freely deleted, though we keep a few around to guard against
catastrophes. (A toy recovery sketch follows.)
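
A toy sketch of checkpoint-plus-log recovery under assumed record formats; the real checkpoint and log formats differ (the paper describes a compact B-tree-like checkpoint), but the replay idea is the same.

# Load the latest complete checkpoint, then replay only the log records
# written after it. The JSON record format here is invented for the sketch.
import json


def recover(checkpoint: str, log_records: list) -> dict:
    state = json.loads(checkpoint) if checkpoint else {}   # path -> list of chunk handles
    for record in log_records:                             # replay in log order
        op = json.loads(record)
        if op["type"] == "create":
            state[op["path"]] = []
        elif op["type"] == "add_chunk":
            state[op["path"]].append(op["handle"])
    return state


# Example: the checkpoint already contains /a; the later log adds a chunk and a file.
ckpt = json.dumps({"/a": [1]})
log = [json.dumps({"type": "add_chunk", "path": "/a", "handle": 2}),
       json.dumps({"type": "create", "path": "/b"})]
assert recover(ckpt, log) == {"/a": [1, 2], "/b": []}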

Consistency Model:

Guarantees by GFS:

- File namespace mutations are atomic.
- They are handled exclusively by the master: namespace locking guarantees atomicity and correctness.
- A file region is consistent if all clients see the same data, regardless of which replicas they read from.
- A failed mutation makes the region inconsistent; different clients may see different data at different times.
- Data mutations may be writes or record appends.
- A write causes data to be written at an application-specified file offset.
- A record append causes data to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing.

SYSTEM INTERACTIONS:

Leases and Mutation Order:

- A mutation is an operation that changes the contents or metadata of a chunk, such as a write or an append.
- Each mutation is performed at all of the chunk's replicas.
- The master grants a chunk lease to one of the replicas, which we call the primary.
- The primary picks a serial order for all mutations to the chunk.
- All replicas follow this order when applying mutations.
- The lease mechanism is designed to minimize management overhead at the master.
- A lease has an initial timeout of 60 seconds, but as long as the chunk is being mutated the primary can request and receive extensions, piggybacked on the HeartBeat messages.
- The master may sometimes try to revoke a lease before it expires.
- Even if the master loses communication with the primary, it can safely grant a new lease to another replica after the old lease expires. (A small bookkeeping sketch follows.)
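
A small sketch of the master-side lease bookkeeping, assuming the 60-second initial timeout above; class and function names are invented for illustration.

# Master-side lease bookkeeping sketch.
import time

LEASE_TIMEOUT = 60.0   # seconds


class ChunkLease:
    def __init__(self, primary: str):
        self.primary = primary                     # replica chosen as primary
        self.expires = time.time() + LEASE_TIMEOUT

    def is_valid(self) -> bool:
        return time.time() < self.expires

    def extend(self) -> None:
        # in GFS, extension requests and grants ride on HeartBeat messages
        self.expires = time.time() + LEASE_TIMEOUT


def grant_or_reuse(lease, replicas):
    """Grant a new lease only if no valid one is held. It is safe to grant a
    new lease once the old one has expired, even if the master has lost
    contact with the old primary."""
    if lease is not None and lease.is_valid():
        return lease
    return ChunkLease(primary=replicas[0])         # master picks one replica as primary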

Write control flow:

- The client asks the master which chunkserver holds the current lease for the chunk and where the other replicas are located.
- If no one has a lease, the master grants one to a replica it chooses.
- The master replies with the identity of the primary and the locations of the other (secondary) replicas.
The client caches this data for future mutations. It needs to contact the master again only when the
primary becomes unreachable or replies that it no longer holds a lease.
- The client pushes the data to all the replicas. It can do so in any order. Each chunkserver
stores the data in an internal LRU buffer cache until the data is used or aged out.
- Once all the replicas have acknowledged receiving the data, the client sends a write request
to the primary. The request identifies the data pushed earlier to all the replicas. The primary assigns
consecutive serial numbers to all the mutations it receives, possibly from multiple clients.
- The primary forwards the write request to all secondary replicas. Each secondary applies
mutations in the same serial-number order assigned by the primary.
- The secondaries all reply to the primary, indicating that they have completed the operation.
- The primary replies to the client. Any errors encountered at any of the replicas are reported to
the client.

If a write by the application is large or straddles a chunk boundary, GFS client code breaks
it down into multiple write operations. (A condensed sketch of the write flow follows.)
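
A condensed, self-contained sketch of the write flow. Replica is a toy stand-in for a chunkserver; real GFS uses RPCs and an LRU buffer cache, and the primary assigns serial numbers rather than the single offset used here.

# Toy write flow: push data to all replicas, then have the primary and the
# secondaries apply the mutation at the same position.
class Replica:
    def __init__(self):
        self.buffered = {}                 # data_id -> bytes (buffer cache stand-in)
        self.chunk = bytearray()

    def push(self, data_id, data):         # data flow: client pushes raw data
        self.buffered[data_id] = data

    def apply(self, data_id, offset):      # control flow: apply the mutation
        data = self.buffered.pop(data_id)
        self.chunk[offset:offset + len(data)] = data
        return None                        # None means "no error"


def write(data, primary, secondaries, offset):
    data_id = id(data)
    for r in [primary] + secondaries:                          # 1. push data to every replica
        r.push(data_id, data)
    errors = [primary.apply(data_id, offset)]                  # 2. primary applies first
    errors += [s.apply(data_id, offset) for s in secondaries]  # 3. secondaries apply in the same order
    return "ok" if not any(errors) else "error: client retries"


p, s1, s2 = Replica(), Replica(), Replica()
assert write(b"hello", p, [s1, s2], 0) == "ok" and p.chunk == s1.chunk == s2.chunk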

Data flow:

- While control flows from the client to the primary and then to all secondaries, data is pushed
linearly along a carefully picked chain of chunkservers in a pipelined fashion.
- Our goals are to fully utilize each machine's network bandwidth, avoid network bottlenecks
and high-latency links, and minimize the latency to push through all the data.
- To avoid network bottlenecks and high-latency links as much as possible, each machine forwards
the data to the closest machine in the network topology that has not yet received it.
- Once a chunkserver receives some data, it starts forwarding it immediately.
- Sending data immediately does not reduce the receive rate, because the links are full duplex.
- Our network links typically run at 100 Mbps (the per-link throughput T), so 1 MB can ideally be distributed in about 80 ms; a worked check follows.
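
A worked check of the 80 ms figure. The GFS paper models the ideal elapsed time for pushing B bytes to R replicas over the pipeline as B/T + R*L, where T is per-link throughput and L is per-hop latency; the concrete numbers below are assumptions.

# Ideal pipelined distribution time for 1 MB to three replicas.
B = 1_000_000                  # 1 MB payload
T = 100_000_000 / 8            # 100 Mbps link = 12.5 MB/s
R = 3                          # three replicas in the chain
L = 0.001                      # assume roughly 1 ms per hop (the paper treats L as far below 1 ms)

ideal_ms = (B / T + R * L) * 1000
print(f"ideal time to distribute 1 MB to {R} replicas: ~{ideal_ms:.0f} ms")   # ~83 ms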

Atomic record appends:

- Record append is heavily used by our distributed applications, in which many clients on different machines append to the same file concurrently.
- The client pushes the data to all replicas of the last chunk of the file.
- Then it sends its request to the primary.
- The primary checks whether appending the record to the current chunk would cause the
chunk to exceed the maximum size (64 MB).
- If so, it pads the chunk to the maximum size, tells the secondaries to do the same, and replies
to the client indicating that the operation should be retried on the next chunk.
- If the record fits within the maximum size, the primary appends the data to its replica, tells the
secondaries to write the data at the exact offset where it did, and finally replies success to
the client.
- GFS does not guarantee that all replicas are bytewise identical.
It only guarantees that the data is written at least once as an atomic unit. (A toy model of the primary's decision follows.)
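
A toy model of the primary's size check and offset choice described above; it omits coordination with the secondaries and failure handling.

# Primary-side decision for a record append.
MAX_CHUNK = 64 * 1024 * 1024


def record_append(primary_chunk: bytearray, record: bytes):
    """Return ('retry', None) if the client must retry on the next chunk,
    otherwise ('ok', offset) with the offset the primary chose."""
    if len(primary_chunk) + len(record) > MAX_CHUNK:
        primary_chunk.extend(b"\0" * (MAX_CHUNK - len(primary_chunk)))  # pad to the maximum size
        return "retry", None
    offset = len(primary_chunk)        # primary picks the offset,
    primary_chunk.extend(record)       # appends locally, and would tell the
    return "ok", offset                # secondaries to write at the same offset


chunk = bytearray()
assert record_append(chunk, b"record-1") == ("ok", 0)
assert record_append(chunk, b"record-2") == ("ok", 8)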

Snapshot:

- The snapshot operation makes a copy of a file or a directory tree almost instantaneously, while
minimizing any interruptions to ongoing mutations.


