Awesome Hadoop

2015-12-26 12:17

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python and Awesome Sysadmin.

Awesome Hadoop

Hadoop

YARN

NoSQL

SQL on Hadoop

Data Management

Workflow, Lifecycle and Governance

Data Ingestion and Integration

DSL

Libraries and Tools

Realtime Data Processing

Distributed Computing and Programming

Packaging, Provisioning and Monitoring

Monitoring

Search

Security

Benchmark

Machine learning and Big Data analytics

Misc.

Resources

Websites

Presentations

Books

Other Awesome Lists

Hadoop

Apache Hadoop - Apache Hadoop

Apache Tez

SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.

GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework

Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.

dumbo - Python module that allows you to easily write and run Hadoop programs.

hadoopy - Python MapReduce library written in Cython.

mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

pydoop - Pydoop is a package that provides a Python API for Hadoop.

hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.

White Elephant - Hadoop log aggregator and dashboard

Kiji Project

Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.

Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

Crunch - Go-based toolkit for ETL and feature extraction on Hadoop

Apache Ignite - Distributed in-memory platform
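
Several of the Python entries above (dumbo, hadoopy, mrjob) help write jobs in the map, shuffle/sort, reduce style. As a rough illustration of that model only, here is a minimal word-count sketch in plain Python that runs in a single process. It does not use the API of any listed library or of Hadoop itself, and the names `mapper`, `reducer` and `run_local` are hypothetical, chosen for this example.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Reduce step: sum the counts per word.

    Assumes the pairs arrive sorted by key, which is what the
    shuffle/sort phase guarantees in a real MapReduce job.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce locally (no cluster involved)."""
    shuffled = sorted(mapper(lines))  # stands in for the shuffle/sort phase
    return dict(reducer(shuffled))
```

Calling `run_local(["a b a", "B c"])` counts words case-insensitively across both input lines. On a real cluster the map and reduce steps would run as separate distributed tasks, with the framework handling the sort between them.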

YARN

Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.

Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.

mpich2-yarn - Running MPICH2 on Yarn

NoSQL

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.

Apache HBase - Apache HBase

Apache Phoenix - A SQL skin over HBase supporting secondary indices

happybase - A developer-friendly Python library to interact with Apache HBase.

Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.

hindex - Secondary Index for HBase

Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

OpenTSDB - The Scalable Time Series Database

Apache Cassandra

SQL on Hadoop

Apache Hive

Apache Phoenix - A SQL skin over HBase supporting secondary indices

Pivotal HAWQ - Parallel Postgres on Hadoop

Lingual - SQL interface for Cascading (MR/Tez job generator)

Cloudera Impala

Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.

Apache Tajo - Data warehouse system for Apache Hadoop

Apache Drill

Data Management

Apache Calcite - A Dynamic Data Management Framework

Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies

Workflow, Lifecycle and Governance

Apache Oozie - Apache Oozie

Azkaban

Apache Falcon - Data management and processing platform

Apache NiFi - A dataflow system

AirFlow - AirFlow is a platform to programmatically author, schedule and monitor data pipelines

Luigi - Python package that helps you build complex pipelines of batch jobs

Data Ingestion and Integration

Apache Flume - Apache Flume

Suro - Netflix’s distributed Data Pipeline

Apache Sqoop - Apache Sqoop

Apache Kafka - Apache Kafka

Gobblin from LinkedIn - Universal data ingestion framework for Hadoop

DSL

Apache Pig - Apache Pig

Apache DataFu - A collection of libraries for working with large-scale data in Hadoop

varaha - Machine learning and natural language processing with Apache Pig

packetpig - Open Source Big Data Security Analytics

akela - Mozilla’s utility library for Hadoop, HBase, Pig, etc.

seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop

Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig

PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don’t need to know much about Pig to use it.

Libraries and Tools

Kite Software Development Kit - A set of libraries, tools, examples, and documentation

gohadoop - Native Go clients for Apache Hadoop YARN.

Hue - A Web interface for analyzing data with Apache Hadoop.

Apache Zeppelin - A web-based notebook that enables interactive data analytics

Jumbune - Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs.

Apache Thrift

Apache Avro - Apache Avro is a data serialization system.

Elephant Bird - Twitter’s collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Spring for Apache Hadoop

hdfs - A native Go client for HDFS

Realtime Data Processing

Apache Storm

Apache Samza

Apache Spark

Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.

Distributed Computing and Programming

Apache Spark

Spark Packages - A community index of packages for Apache Spark

SparkHub - A community site for Apache Spark

Apache Crunch

Cascading - Cascading is the proven application development platform for building data applications on Hadoop.

Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.

Packaging, Provisioning and Monitoring

Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem

Apache Ambari - Apache Ambari

Ganglia Monitoring System

ankush - A big data cluster management tool that creates and manages clusters of different technologies.

Apache ZooKeeper - Apache ZooKeeper

Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework

Buildoop - Hadoop Ecosystem Builder

Deploop - The Hadoop Deploy System

Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.

inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

Search

Elasticsearch

Apache Solr

SenseiDB - Open-source, distributed, realtime, semi-structured database

Banana - Kibana port for Apache Solr

Security

Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

Apache Sentry - An authorization module for Hadoop

Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.

Benchmark

Big Data Benchmark

HiBench

Big-Bench

hive-benchmarks

hive-testbench - Testbench for experimenting with Apache Hive at any data scale.

YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Machine learning and Big Data analytics

Apache Mahout

Oryx 2 - Lambda architecture on Spark, Kafka for real-time large scale machine learning

MLlib - MLlib is Apache Spark’s scalable machine learning library.

R - R is a free software environment for statistical computing and graphics.

RHadoop - including RHDFS, RHBase, RMR2, plyrmr

RHive - RHive, for launching Hive queries from R

Apache Lens

Misc.

Resources

Various resources, such as books, websites and articles.

Websites

Useful websites and articles

Hadoop Weekly

The Hadoop Ecosystem Table

Hadoop 1.x vs 2

Apache Hadoop YARN: Yet Another Resource Negotiator

Introducing Apache Hadoop YARN

Apache Hadoop YARN - Background and an Overview

Apache Hadoop YARN - Concepts and Applications

Apache Hadoop YARN - ResourceManager

Apache Hadoop YARN - NodeManager

Migrating to MapReduce 2 on YARN (For Users)

Migrating to MapReduce 2 on YARN (For Operators)

Hadoop and Big Data: Use Cases at Salesforce.com

All you wanted to know about Hadoop, but were too afraid to ask: genealogy of elephants.

What is Bigtop, and Why Should You Care?

Hadoop - Distributions and Commercial Support

Ganglia configuration for a small Hadoop cluster and some troubleshooting

Hadoop illuminated - Open Source Hadoop Book

NoSQL Database

10 Best Practices for Apache Hive

Hadoop Operations at Scale

AWS BigData Blog

Presentations

Hadoop Summit Presentations - Slide decks from Hadoop Summit

Hadoop 24/7

An example Apache Hadoop Yarn upgrade

Apache Hadoop In Theory And Practice

Hadoop Operations at LinkedIn

Hadoop Performance at LinkedIn

Docker based Hadoop provisioning

Books

Hadoop: The Definitive Guide

Hadoop Operations

Apache Hadoop Yarn

HBase: The Definitive Guide

Programming Pig

Programming Hive

Hadoop in Practice, Second Edition

Hadoop in Action, Second Edition

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness list.
