Awesome Hadoop
2015-12-26 12:17
A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.
Awesome Hadoop
Hadoop
YARN
NoSQL
SQL on Hadoop
Data Management
Workflow, Lifecycle and Governance
Data Ingestion and Integration
DSL
Libraries and Tools
Realtime Data Processing
Distributed Computing and Programming
Packaging, Provisioning and Monitoring
Monitoring
Search
Security
Benchmark
Machine learning and Big Data analytics
Misc.
Resources
Websites
Presentations
Books
Other Awesome Lists
Hadoop
Apache Hadoop - Apache Hadoop
Apache Tez
SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
dumbo - Python module that allows you to easily write and run Hadoop programs.
hadoopy - Python MapReduce library written in Cython.
mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
pydoop - Pydoop is a package that provides a Python API for Hadoop.
hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
White Elephant - Hadoop log aggregator and dashboard
Kiji Project
Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
Apache Ignite - Distributed in-memory platform
YARN
Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
mpich2-yarn - Running MPICH2 on Yarn
NoSQL
Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.
Apache HBase - Apache HBase
Apache Phoenix - A SQL skin over HBase supporting secondary indices
happybase - A developer-friendly Python library to interact with Apache HBase.
Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
hindex - Secondary Index for HBase
Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
OpenTSDB - The Scalable Time Series Database
Apache Cassandra
SQL on Hadoop
Apache Hive
Apache Phoenix - A SQL skin over HBase supporting secondary indices
Pivotal HAWQ - Parallel Postgres on Hadoop
Lingual - SQL interface for Cascading (MR/Tez job generator)
Cloudera Impala
Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
Apache Tajo - Data warehouse system for Apache Hadoop
Apache Drill
Data Management
Apache Calcite - A Dynamic Data Management Framework
Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies
Workflow, Lifecycle and Governance
Apache Oozie - Apache Oozie
Azkaban
Apache Falcon - Data management and processing platform
Apache NiFi - A dataflow system
AirFlow - AirFlow is a platform to programmatically author, schedule and monitor data pipelines
Luigi - Python package that helps you build complex pipelines of batch jobs
Data Ingestion and Integration
Apache Flume - Apache Flume
Suro - Netflix’s distributed Data Pipeline
Apache Sqoop - Apache Sqoop
Apache Kafka - Apache Kafka
Gobblin from LinkedIn - Universal data ingestion framework for Hadoop
DSL
Apache Pig - Apache Pig
Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
vahara - Machine learning and natural language processing with Apache Pig
packetpig - Open Source Big Data Security Analytics
akela - Mozilla’s utility library for Hadoop, HBase, Pig, etc.
seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it.
Libraries and Tools
Kite Software Development Kit - A set of libraries, tools, examples, and documentation
gohadoop - Native go clients for Apache Hadoop YARN.
Hue - A Web interface for analyzing data with Apache Hadoop.
Apache Zeppelin - A web-based notebook that enables interactive data analytics
Jumbune - Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs.
Apache Thrift
Apache Avro - Apache Avro is a data serialization system.
Elephant Bird - Twitter’s collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
Spring for Apache Hadoop
hdfs - A native go client for HDFS
Realtime Data Processing
Apache Storm
Apache Samza
Apache Spark
Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly once stream processing.
Distributed Computing and Programming
Apache Spark
Spark Packages - A community index of packages for Apache Spark
SparkHub - A community site for Apache Spark
Apache Crunch
Cascading - Cascading is the proven application development platform for building data applications on Hadoop.
Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Packaging, Provisioning and Monitoring
Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
Apache Ambari - Apache Ambari
Ganglia Monitoring System
ankush - A big data cluster management tool that creates and manages clusters of different technologies.
Apache Zookeeper - Apache Zookeeper
Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
Buildoop - Hadoop Ecosystem Builder
Deploop - The Hadoop Deploy System
Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
Search
ElasticSearch
Apache Solr
SenseiDB - Open-source, distributed, realtime, semi-structured database
Banana - Kibana port for Apache Solr
Security
Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
Apache Sentry - An authorization module for Hadoop
Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.
Benchmark
Big Data Benchmark
HiBench
Big-Bench
hive-benchmarks
hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.
Machine learning and Big Data analytics
Apache Mahout
Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning
MLlib - MLlib is Apache Spark’s scalable machine learning library.
R - R is a free software environment for statistical computing and graphics.
RHadoop - including RHDFS, RHBase, RMR2, plyrmr
RHive - RHive, for launching Hive queries from R
Apache Lens
Misc.
Hive Plugins
UDF
http://nexr.github.io/hive-udf/
https://github.com/edwardcapriolo/hive_cassandra_udfs
https://github.com/livingsocial/HiveSwarm
https://github.com/ThinkBigAnalytics/Hive-Extensions-from-Think-Big-Analytics
https://github.com/karthkk/udfs
https://github.com/kevinweil/elephant-bird - Twitter
https://github.com/lovelysystems/ls-hive
https://github.com/stewi2/hive-udfs
https://github.com/klout/brickhouse
https://github.com/markgrover/hive-translate (PostgreSQL translate())
https://github.com/deanwampler/HiveUDFs
https://github.com/myui/hivemall (Machine Learning UDF/UDAF/UDTF)
https://github.com/edwardcapriolo/hive-geoip (GeoIP UDF)
https://github.com/Netflix/Surus
Storage Handler
https://github.com/dvasilen/Hive-Cassandra
https://github.com/yc-huang/Hive-mongo
https://github.com/balshor/gdata-storagehandler
https://github.com/karthkk/hive-hbase-json
https://github.com/sunsuk7tp/hive-hbase-integration
https://bitbucket.org/rodrigopr/redisstoragehandler
https://github.com/zhuguangbin/HiveJDBCStorageHanlder
https://github.com/chimpler/hive-solr
https://github.com/bfemiano/accumulo-hive-storage-manager
SerDe
https://github.com/rcongiu/Hive-JSON-Serde
https://github.com/mochi/hive-json-serde
https://github.com/ogrodnek/csv-serde
https://github.com/parag/HiveJsonSerde
https://github.com/johanoskarsson/hive-json-serde
https://github.com/electrum/hive-serde - JSON
https://github.com/karthkk/hive-hbase-json
Libraries and tools
https://github.com/forward/rbhive
https://github.com/synctree/activerecord-hive-adapter
https://github.com/hrp/sequel-hive-adapter
https://github.com/forward/node-hive
https://github.com/recruitcojp/WebHive
shib - WebUI for query engines: Hive and Presto
clive - Clojure library for interacting with Hive via Thrift
http://www.phphiveadmin.net/
https://github.com/anjuke/hwi
https://code.google.com/a/apache-extras.org/p/hipy/
https://github.com/dmorel/Thrift-API-HiveClient2 (Perl - HiveServer2)
PyHive - Python interface to Hive and Presto
https://github.com/recruitcojp/OdbcHive
Hive-Sharp
HiveRunner - An open source unit test framework for Hadoop Hive queries based on JUnit4
Beetest - A super simple utility for testing Apache Hive scripts locally for non-Java developers.
Hive_test - Unit test framework for Hive and hive-service
Flume Plugins
Flume MongoDB Sink
Flume HornetQ Channel
Flume MessagePack Source
Flume RabbitMQ source and sink
Flume UDP Source
Stratio Ingestion - Custom sinks: Cassandra, MongoDB, Stratio Streaming and JDBC
Flume Custom Serializers
Real-time analytics in Apache Flume
.Net FlumeNG Clients
Resources
Various resources, such as books, websites and articles.
Websites
Useful websites and articles
Hadoop Weekly
The Hadoop Ecosystem Table
Hadoop 1.x vs 2
Apache Hadoop YARN: Yet Another Resource Negotiator
Introducing Apache Hadoop YARN
Apache Hadoop YARN - Background and an Overview
Apache Hadoop YARN - Concepts and Applications
Apache Hadoop YARN - ResourceManager
Apache Hadoop YARN - NodeManager
Migrating to MapReduce 2 on YARN (For Users)
Migrating to MapReduce 2 on YARN (For Operators)
Hadoop and Big Data: Use Cases at Salesforce.com
All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.
What is Bigtop, and Why Should You Care?
Hadoop - Distributions and Commercial Support
Ganglia configuration for a small Hadoop cluster and some troubleshooting
Hadoop illuminated - Open Source Hadoop Book
NoSQL Database
10 Best Practices for Apache Hive
Hadoop Operations at Scale
AWS BigData Blog
Presentations
Hadoop Summit Presentations - Slide decks from Hadoop Summit
Hadoop 24/7
An example Apache Hadoop Yarn upgrade
Apache Hadoop In Theory And Practice
Hadoop Operations at LinkedIn
Hadoop Performance at LinkedIn
Docker based Hadoop provisioning
Books
Hadoop: The Definitive Guide
Hadoop Operations
Apache Hadoop Yarn
HBase: The Definitive Guide
Programming Pig
Programming Hive
Hadoop in Practice, Second Edition
Hadoop in Action, Second Edition
Other Awesome Lists
Other amazingly awesome lists can be found in the awesome-awesomeness list.