Sector/Sphere: High Performance Distributed File System and Parallel Data Processing Engine
2011-10-24 22:49
1. Overview
Sector/Sphere was created by Dr. Yunhong Gu in 2006 and is now maintained by a group of open source developers. It is available from http://sector.sourceforge.net/
Sector: distributed file system
Sphere: parallel data processing framework
Benchmarks have shown that, in some cases, Sector/Sphere is about twice as fast as Hadoop.
2. Sector
Sector system architecture:
The figure shows the overall architecture of the Sector system, which consists of three parts:
Security Server: maintains user accounts, user passwords, file access information, and the IP addresses of the authorized slave nodes
Master: maintains the metadata of the files stored in the system, controls the running of all slave nodes, and responds to users' requests
Slaves: the nodes that store the files managed by the system and process the data upon the request of a Sector client
The clients include:
1. Sector file system client API: access Sector files in applications using the C++ API
2. Sector system tools
3. FUSE: mount the Sector file system as a local directory
4. Sphere programming API
A more detailed figure:
Features:
1. Unlike Hadoop, Sector does not split user files into blocks; instead, every Sector slice is stored as a single file in the native file system
2. Sector runs an independent security server. This design allows different security service providers to be deployed; in addition, multiple Sector masters can use the same security service
3. Topology aware and application aware
4. Uses UDP for message passing and UDT for data transfer
Replication:
1. Provides software-level fault tolerance (no hardware RAID is required)
2. Every file is replicated to a configured number of copies by default
3. By default, a replica is created on the furthest node
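The "furthest node" placement can be sketched as follows. This is an illustrative sketch of topology-aware replica placement, not Sector's actual implementation; the topology paths and node names are invented for the example:

```python
def common_prefix_len(a, b):
    """Number of leading topology path components two nodes share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_replica_target(source_path, candidates):
    """Choose the candidate topologically furthest from the source,
    i.e. the one sharing the shortest topology prefix with it."""
    return min(candidates, key=lambda c: common_prefix_len(source_path, c[1]))

# Hypothetical slaves, each with a (datacenter, rack) topology path.
slaves = [
    ("slave2", ["dc1", "rack1"]),  # same rack as the source
    ("slave3", ["dc1", "rack2"]),  # same datacenter, different rack
    ("slave4", ["dc2", "rack1"]),  # different datacenter
]
target = pick_replica_target(["dc1", "rack1"], slaves)
print(target[0])  # → slave4
```

Placing replicas far apart maximizes the chance that a rack or datacenter failure leaves at least one copy intact, at the cost of slower replication traffic across the wide-area links.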
UDT:
A high-performance data transfer protocol designed for transferring large volumetric datasets over high-speed wide area networks. Such settings are typically disadvantageous for the more common TCP protocol.
UDT uses UDP to transfer bulk data with its own reliability control and congestion control mechanisms, and can transfer data at a much higher speed than TCP.
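The core idea of building reliability on top of an unreliable datagram layer can be illustrated with a toy stop-and-wait scheme: number each packet and retransmit until it gets through. This is only a conceptual sketch with a simulated lossy channel; real UDT uses far more sophisticated windowed reliability and congestion control:

```python
import random

random.seed(42)  # deterministic simulation

def unreliable_send(packet, loss_rate=0.3):
    """Simulated lossy channel: returns the packet, or None if dropped."""
    return None if random.random() < loss_rate else packet

def reliable_transfer(data, max_retries=50):
    """Stop-and-wait: resend each numbered packet until it is received."""
    received = []
    for seq, payload in enumerate(data):
        for _ in range(max_retries):
            pkt = unreliable_send((seq, payload))
            if pkt is not None:  # receiver got it; ACK assumed instantaneous
                received.append(pkt[1])
                break
    return received

chunks = ["sector", "sphere", "udt"]
print(reliable_transfer(chunks) == chunks)  # → True
```

Unlike TCP, whose congestion control backs off aggressively on high-bandwidth, high-latency links, UDT's custom control loop is tuned for exactly those wide-area bulk transfers.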
Limitations:
1. File size is limited by the available space on individual storage nodes
2. Users may need to split their datasets into pieces of a proper size
3. Sector is designed to provide high throughput on large datasets, rather than extremely low latency on small files
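Because a Sector file must fit on a single slave node, users may split a large dataset client-side before uploading. A minimal sketch (the slice size here is an arbitrary example, not a Sector default):

```python
def split_dataset(data: bytes, slice_size: int):
    """Split a dataset into consecutive slices of at most slice_size bytes."""
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]

slices = split_dataset(b"x" * 10, 4)
print([len(s) for s in slices])  # → [4, 4, 2]
```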
3. Sphere
Sphere is a parallel data processing engine integrated with Sector, and it can be used to process data stored in Sector in parallel.
Sphere uses a stream processing computing paradigm. A stream is an abstraction in Sphere that represents either a dataset or a part of a dataset (a Sector dataset consists of one or more physical files).
This figure illustrates how sphere processes the segments in a stream.
SPE: Sphere Processing Engine
This figure illustrates the basic model that Sphere supports. Sphere also supports some extensions of this model, which occur quite frequently:
1. Processing multiple input streams.
2. Shuffling input streams.
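The basic model and the shuffle extension can be sketched as follows. This is an illustrative sketch, not the real SPE: a user-defined function (UDF) is applied to each segment of the stream independently (as SPEs would on slave nodes), and the shuffle step routes each output record to a bucket chosen by a key function:

```python
from concurrent.futures import ThreadPoolExecutor

def udf(segment):
    """Example UDF: square every record in a segment."""
    return [x * x for x in segment]

def process_stream(segments, udf, workers=4):
    """Apply the UDF to every segment in parallel, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(udf, segments))

def shuffle(results, n_buckets):
    """Shuffle extension: route each record to a bucket by key (here, modulo)."""
    buckets = [[] for _ in range(n_buckets)]
    for segment in results:
        for record in segment:
            buckets[record % n_buckets].append(record)
    return buckets

stream = [[1, 2], [3, 4], [5, 6]]       # a stream of three segments
out = process_stream(stream, udf)
print(out)              # → [[1, 4], [9, 16], [25, 36]]
print(shuffle(out, 2))  # → [[4, 16, 36], [1, 9, 25]]
```

Processing multiple input streams is the natural generalization: the UDF receives one segment from each stream instead of a single segment.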
Interested readers can refer to "Sector and Sphere: The Design and Implementation of a High Performance Data Cloud".
4. References
Sector and Sphere: The Design and Implementation of a High Performance Data Cloud
http://sector.sourceforge.net/
http://en.wikipedia.org/wiki/Sector/Sphere
http://dongxicheng.org/mapreduce/streaming-mapreduce-sphere/
http://en.wikipedia.org/wiki/UDP-based_Data_Transfer_Protocol
http://udt.sourceforge.net/