Notes on The Google File System
2012-02-26 17:13
Assumptions:
built from many commodity machines and able to detect and recover from failures
Stores a modest number of large files and is optimized for them
Primarily two kinds of reads
large streaming reads & small random reads
thus batch and sort small reads to advance steadily through the file
Most writes are large, sequential appends to files
Atomicity for concurrent appends to the same file
High sustained bandwidth more important than low latency
Normal file system interface
Snapshot on a file or directory & record append, useful for implementing multi-way merging of results and producer-consumer queues
Architecture
Chunkservers store chunks on local disk as Linux files, and each chunk is replicated on multiple chunkservers (by default 3; each chunk replica is a normal file)
Master responsibilities
Maintain file system metadata
name space
access control information
mapping from files to chunks and chunks' current locations
control system activities
chunk lease management
garbage collection of orphaned chunks
chunk migration between chunkservers
No file data cache on chunkserver or client
no cache-coherence issues
clients cache metadata (no file data, because clients stream through files far too large to cache)
the chunkserver's Linux buffer cache handles data caching
Chunk Size(default 64MB)
A large chunk size reduces metadata size (enabling the master to keep all metadata in memory), reduces network-interaction overhead between clients and the master, and lets a client keep a long-lived TCP connection to a chunkserver, reducing connection-setup overhead.
For small files, a large chunk size may make some chunkservers hot spots. One solution is to increase the replication factor.
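With a fixed chunk size, a client can translate any byte offset in a file into a chunk index plus an offset within that chunk. A minimal sketch (hypothetical helper, not GFS's actual client code), assuming the paper's default 64 MB chunks:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # default 64 MB chunk size from the paper

def locate(byte_offset: int) -> tuple[int, int]:
    """Map a file byte offset to (chunk index, offset inside that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE
```

The client sends the chunk index to the master, which returns the chunk handle and replica locations.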
Metadata
all kept in master's memory
easy and efficient for the master to scan through it
file names stored compactly with prefix compression
file and chunk namespaces, and the file-to-chunk mapping, are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines
chunk locations not persistent
poll chunkservers for the info at its startup and periodically thereafter
eliminate problem of keeping master and chunkservers in sync
Operation log
central to GFS
master checkpoints its state when log grows beyond a certain size
on recovery, first load the latest checkpoint from local disk, then replay the log
two log files; the master switches to the new one when a checkpoint is created
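The recovery rule above (checkpoint plus log replay) can be sketched as a toy in-memory model. This is a hypothetical `recover` helper over an assumed record format, not Google's implementation:

```python
def recover(checkpoint_state: dict, log_records: list) -> dict:
    """Rebuild master state: start from the latest checkpoint, then
    replay the operation-log records written after it, in log order."""
    state = dict(checkpoint_state)
    for op, path, value in log_records:
        if op == "set":
            state[path] = value        # mutation: create/update metadata
        elif op == "delete":
            state.pop(path, None)      # mutation: remove metadata
    return state
```

Checkpointing bounds recovery time: only the log suffix after the latest checkpoint must be replayed.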
Consistency Model
file namespace mutations are atomic, via exclusive locking in the master
file region state after mutations
A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.
A record append causes data (the "record") to be appended atomically at least once (possibly with duplicates), even in the presence of concurrent mutations, but at an offset of GFS's choosing.
a file region is guaranteed defined after a sequence of successful mutations and contains the data written by the last mutation
apply mutations in the same order to all replicas
chunk version numbers detect stale replicas, which are garbage collected
client caches may return stale data until the cache entry times out
detect data corruption by checksum and restore from valid replicas
Implications for applications
relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
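The "self-validating, self-identifying records" discipline can be sketched as follows. This is a hypothetical record format (CRC32 header plus an 8-byte record id), assumed for illustration: the checksum lets readers drop corrupted records, and the id lets them drop the duplicates that at-least-once append can leave behind:

```python
import zlib

def write_record(rec_id: int, payload: bytes) -> bytes:
    """Frame a record as: 4-byte CRC32 | 8-byte record id | payload."""
    body = rec_id.to_bytes(8, "big") + payload
    return zlib.crc32(body).to_bytes(4, "big") + body

def read_records(records: list) -> list:
    """Return payloads, skipping corrupted records and duplicates."""
    seen, out = set(), []
    for raw in records:
        crc, body = int.from_bytes(raw[:4], "big"), raw[4:]
        if zlib.crc32(body) != crc:   # self-validating: drop corruption
            continue
        rec_id = int.from_bytes(body[:8], "big")
        if rec_id in seen:            # self-identifying: drop duplicates
            continue
        seen.add(rec_id)
        out.append(body[8:])
    return out
```

With this framing, a retried append that succeeded twice is harmless: the reader sees the record id again and ignores the second copy.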
System Interactions
1. The client asks the master which chunkserver holds the current lease for the chunk (the primary) and for the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses.
2. The master replies with the identity of the primary and the locations of the other replicas, which the client caches. The client contacts the master again only when the primary becomes unreachable or replies that it no longer holds a lease.
3. The client pushes the data to all the replicas, in any suitable order; each chunkserver stores the data in an internal LRU buffer cache until it is used or aged out.
4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The primary assigns consecutive serial numbers to all the mutations it receives and applies each mutation to its own local state in serial-number order.
5. The primary forwards the write request to all secondary replicas.
6. The secondaries all reply to the primary indicating that they have completed the operation.
7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client (possible improvement: could the primary initiate the retry instead?); the client usually retries on error.
A write that crosses a chunk boundary is broken down into multiple mutations by the GFS client code.
Data flow is decoupled from control flow, and data is pushed linearly along a carefully picked chain of chunkservers.
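The key invariant of the control flow above is that the primary's serial numbers impose a single mutation order on every replica. A toy sketch (hypothetical `Primary` class, not the real RPC protocol), where each secondary is modeled as a list of applied mutations:

```python
class Primary:
    """Toy model of a primary replica assigning mutation serial numbers."""

    def __init__(self, secondaries):
        self.serial = 0
        self.log = []                  # mutations applied locally, in order
        self.secondaries = secondaries # each secondary: a list of mutations

    def write(self, mutation):
        self.serial += 1
        entry = (self.serial, mutation)
        self.log.append(entry)         # apply to local state first
        for sec in self.secondaries:   # then forward in serial-number order
            sec.append(entry)
        return self.serial
```

Because every replica applies mutations in the primary's serial-number order, concurrent writers cannot leave replicas with different mutation orders (though a failed-then-retried write can still leave duplicated or inconsistent regions, as the consistency model section notes).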
Snapshot
makes a copy of a file or a directory tree (the “source”) almost instantaneously
When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot.
After the leases have been revoked or have expired, the master logs the operation to disk
It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree (the GFS namespace is just a prefix tree; directories hold no per-entry contents to protect, unlike an ordinary file system). The newly created snapshot files point to the same chunks as the source files.
The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder
The master notices that the reference count for chunk C is greater than one. It defers replying to the client request and instead picks a new chunk handle C'. It then asks each chunkserver that has a current replica of C to create a new chunk called C'.
From this point, request handling is no different from that for any chunk: the master grants one of the replicas a lease on the new chunk C' and replies to the client, which can write the chunk normally.
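The copy-on-write bookkeeping above can be sketched as a toy master (hypothetical `Master` class with single-chunk-granularity reference counts, for illustration only): snapshot just bumps reference counts, and the first write to a shared chunk allocates a new handle:

```python
class Master:
    """Toy copy-on-write metadata: snapshot shares chunks, write unshares."""

    def __init__(self):
        self.files = {}        # file name -> list of chunk handles
        self.refs = {}         # chunk handle -> reference count
        self.next_handle = 0

    def create(self, name):
        h, self.next_handle = self.next_handle, self.next_handle + 1
        self.files[name] = [h]
        self.refs[h] = 1
        return h

    def snapshot(self, src, dst):
        self.files[dst] = list(self.files[src])  # duplicate metadata only
        for h in self.files[dst]:
            self.refs[h] += 1                    # chunks are now shared

    def write(self, name, idx):
        h = self.files[name][idx]
        if self.refs[h] > 1:                     # shared: copy on write
            self.refs[h] -= 1
            h, self.next_handle = self.next_handle, self.next_handle + 1
            self.refs[h] = 1
            self.files[name][idx] = h
        return h                                 # handle the client writes to
```

The snapshot itself touches no chunk data, which is why it completes almost instantaneously; the copy cost is deferred to the first write of each shared chunk.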
Master Operations
Namespace management and locking
No per-directory structures listing all files in a directory; hard/symbolic links are not supported
The namespace is represented as a prefix tree mapping full paths to metadata (how is it persisted??)
Each node has a read-write lock
File creation does not require a write lock on the parent directory, only a write lock on the created file name
For each operation, acquire read locks on all nodes along the path and a read or write lock on the leaf
Locks are acquired in a consistent total order: by level in the namespace tree, and lexicographically within the same level
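The lock-acquisition rule can be sketched with a hypothetical `locks_for` helper (the path name is an invented example): read locks on every ancestor of the path, plus a read or write lock on the leaf itself:

```python
def locks_for(path: str, leaf_write: bool) -> list:
    """List (path, mode) locks to acquire for an operation on `path`:
    read locks on ancestors, read or write lock on the leaf."""
    parts = path.strip("/").split("/")
    acc, out = "", []
    for i, part in enumerate(parts):
        acc += "/" + part
        is_leaf = (i == len(parts) - 1)
        out.append((acc, "w" if (is_leaf and leaf_write) else "r"))
    return out
```

Because a file creation takes only a read lock on the parent directory, multiple files can be created concurrently in the same directory; acquiring locks in one global order (by tree level, then lexicographically) prevents deadlock.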
Replica Management
two purposes
maximize data reliability and availability
maximize network bandwidth utilization
Chunk creation, re-replication, re-balancing
equalizing disk space utilization
limiting active clone operations on any single chunkserver
spreading replicas across racks
priority-based
Garbage collection
deletion is logged and the file is renamed to a hidden name that includes the deletion timestamp
the hidden file's metadata is removed during the master's regular scan of the file namespace if it has existed for more than 3 days
orphaned chunks' metadata is erased similarly
in a HeartBeat message regularly exchanged with the master, each chunkserver reports a subset of the chunks it holds, and the master replies with the identities of all chunks no longer present in the master's metadata; the chunkserver is then free to delete its replicas of such chunks
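The heartbeat exchange above reduces to a simple set difference on the master's side. A minimal sketch (hypothetical `orphaned` helper, identifying chunks by integer handles):

```python
def orphaned(reported_chunks: set, master_metadata: set) -> set:
    """Chunks a chunkserver reported that the master no longer knows about;
    the chunkserver may safely delete its replicas of these."""
    return reported_chunks - master_metadata
```

Doing deletion lazily this way makes it uniform and reliable: any replica the master has forgotten, for whatever reason, is eventually reclaimed.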
Stale Replica detection
chunk version numbers are updated each time a lease is granted; a replica that missed the update is stale
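Detection is then a comparison against the master's version. A sketch with a hypothetical `stale_replicas` helper (chunkserver names and the version map are invented for illustration):

```python
def stale_replicas(master_version: int, replica_versions: dict) -> list:
    """Replicas whose chunk version is behind the master's are stale
    (they missed a mutation while down) and are left for garbage collection."""
    return [server for server, version in replica_versions.items()
            if version < master_version]
```

A replica reporting a version *higher* than the master's record would instead tell the master it failed while granting the lease, and the master adopts the higher version.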
HA
Fast recovery and chunk/master replications
Data Integrity
checksums over 64 KB blocks, stored on each chunkserver separately from the user data
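Per-block checksumming can be sketched as follows (hypothetical helpers using CRC32; the paper does not specify the checksum function): each 64 KB block gets its own checksum, so a read can verify just the blocks it touches and a corrupted block can be re-fetched from another replica.

```python
import zlib

BLOCK = 64 * 1024  # 64 KB checksum granularity, as in the paper

def checksums(data: bytes) -> list:
    """One CRC32 per 64 KB block of chunk data."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def corrupted_blocks(data: bytes, sums: list) -> list:
    """Indices of blocks whose recomputed checksum disagrees with the stored one."""
    return [i for i, s in enumerate(checksums(data)) if s != sums[i]]
```

Keeping checksums at block granularity (rather than per chunk) means a small read verifies only a few blocks, and an append only needs to update the checksum of the final, partially filled block.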
Performance
Write: about half the theoretical limit
Record append: 6 MB for 16 clients; performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients (why can't record appends from different clients to one file proceed concurrently??)