easy-hadoop - EasyHadoop, Rapid Hadoop Programming - Google Project Hosting
2012-03-10 12:43
EasyHadoop is a collection of Ruby scripts for quickly accomplishing basic tasks (such as extracting certain columns from a dataset, calculating frequencies, or normalizing strings). Each script can be customized through several command-line options. Below is an example of how these scripts can be used in a Hadoop streaming environment.
Task: Identify all the different search terms issued by a user and their frequency.
Scenario I: Frequency Count
Dataset: Consider tab-delimited search-engine logs with the following fields:
Request ID | User ID | Search Term | Time | Browser | IP Address | Country
89aASDF89SAD | 234JKLKLJ | walmart | 09:00 | FF | 201.342.345.23 | USA
sdf234ljlk23 | 234JKLKLJ | safeway | 13:34 | FF | 201.342.345.23 | USA
Subtasks: Let's list all the steps we need to perform to accomplish the task above:
Extract Relevant Columns: User ID and Search Term:
Apply normalization techniques to the Search Term so that we do not differentiate "restaurants" from "restaurant" (note the missing s)
Count Frequency
EasyHadoop: Now let's see how to accomplish each of the subtasks above using EasyHadoop.
Extracting columns: Use ColExtractor.rb to extract the relevant fields, User ID (index 1) and Search Term (index 2):
ruby ColExtractor.rb --key 1 --value 2
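The script itself isn't reproduced here, but its core behavior is easy to picture. Below is a minimal, hypothetical Ruby sketch of a ColExtractor-style mapper step (the function name and signature are illustrative, not EasyHadoop's actual API): it splits each tab-delimited record and emits the chosen key and value columns.

```ruby
# Hypothetical sketch of a ColExtractor-style mapper step.
# Splits a tab-delimited record and emits "key<TAB>value";
# multiple key columns are joined with a delimiter, matching
# the --key 0,1 --key_delimiter usage shown later.
def extract_columns(line, key_cols, value_cols, key_delimiter = "|")
  fields = line.chomp.split("\t")
  key   = key_cols.map   { |i| fields[i] }.join(key_delimiter)
  value = value_cols.map { |i| fields[i] }.join("\t")
  value.empty? ? key : "#{key}\t#{value}"
end

# In a streaming job the mapper would simply loop over stdin:
#   $stdin.each_line { |l| puts extract_columns(l, [1], [2]) }
```

Applied to the first sample log line above, `extract_columns(line, [1], [2])` would emit `234JKLKLJ<TAB>walmart`.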
Normalization: Use Normalizer.rb to normalize a particular field. In this case, we only want to normalize the search term. Let's assume we want to remove any non-alphanumeric characters and apply Porter stemming:
ruby Normalizer.rb --field 1:porter,alphanum
Note that above we use column index 1 instead of 2 because the output of ColExtractor has only two fields (UserID TAB SearchTerm). That output is the input to Normalizer.rb, so the column index of SearchTerm is now 1 (not 2).
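To make the normalization step concrete, here is a hypothetical sketch of what such a field normalizer might do. The "alphanum" rule keeps only letters and digits; the crude trailing-"s" strip below is only a stand-in for real Porter stemming, which an actual script would delegate to a stemming library.

```ruby
# Hypothetical sketch of a Normalizer-style step for one field.
# Only the named field is rewritten; the rest pass through.
def normalize_field(line, field_index)
  fields = line.chomp.split("\t")
  term = fields[field_index].downcase.gsub(/[^a-z0-9]/, "") # alphanum filter
  term = term.sub(/s\z/, "") if term.length > 3             # toy "stemming"
  fields[field_index] = term
  fields.join("\t")
end
```

With this sketch, `normalize_field("234JKLKLJ\trestaurants!", 1)` yields `234JKLKLJ<TAB>restaurant`, so "restaurants" and "restaurant" collapse to the same key.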
Count: Before we can count, we need to join UserID and SearchTerm so that together they form a single key. Again, use ColExtractor.rb to achieve this. Assume we join the two fields with the pipe (|) character:
ruby ColExtractor.rb --key 0,1 --key_delimiter='|'
Now we just need to count the number of times each key appears in our dataset. We can do this with Stat.rb, as shown below:
ruby Stat.rb --operations:count
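In Hadoop streaming, reducer input arrives sorted by key, so a count operation like Stat.rb's only needs to tally runs of identical keys rather than hold every key in memory. A minimal, hypothetical sketch:

```ruby
# Hypothetical sketch of a count reducer (what Stat.rb's count
# operation would do). Streaming reducer input is sorted by key,
# so counting consecutive identical keys is sufficient.
def count_keys(lines)
  results = []
  current, n = nil, 0
  lines.each do |line|
    key = line.chomp.split("\t").first
    if key == current
      n += 1
    else
      results << "#{current}\t#{n}" if current
      current, n = key, 1
    end
  end
  results << "#{current}\t#{n}" if current
  results
end
```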
First Attempt: Hadoop Streaming Using EasyHadoop. Now let's convert the steps above into a Hadoop streaming job. Save the code below in an executable shell script (.sh):
EASY_HADOOP="../easy-hadoop"
YP_DEFAULTS="../defaults"
BASE="/user/example"
INPUT="$BASE/raw"
JOB1="$BASE/tmp1"
JOB2="$BASE/output"

hadoop dfs -rmr $JOB1
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
    -input $INPUT \
    -output $JOB1 \
    -mapper "ruby easyhadoop/Mapper/ColExtractor.rb --key 0,1 --key_delimiter |" \
    -reducer "easyhadoop/Reducer/Stat.rb --operations:count" \
    -jobconf mapred.map.tasks=50 \
    -jobconf mapred.reduce.tasks=50 \
    -jobconf mapred.output.compress=true \
    -jobconf stream.recordreader.compression=gzip \
    -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -jobconf mapred.task.timeout=1800000 \
    -cacheArchive hdfs://hadoop.cluster:9000/user/easyhadoop/lib/easyhadoop.zip#easyhadoop

# Wait for the above execution to finish
RT=$?
[ $RT -ne 0 ] && echo "First Job Failed:$RT" && exit $RT

hadoop dfs -rmr $JOB2
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar
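Before submitting a job like the one above to a cluster, the pipeline logic can be sanity-checked locally. The sketch below simulates the three phases in plain Ruby: a map step emitting UserID|SearchTerm keys, Hadoop's shuffle/sort, and the count reduce. The sample rows are adapted from the dataset above (the second search term repeated) so a count greater than 1 appears; everything here is illustrative, not EasyHadoop code.

```ruby
# Local sanity check of the pipeline: map -> shuffle/sort -> reduce.
logs = [
  "89aASDF89SAD\t234JKLKLJ\twalmart\t09:00\tFF\t201.342.345.23\tUSA",
  "sdf234ljlk23\t234JKLKLJ\twalmart\t13:34\tFF\t201.342.345.23\tUSA"
]
# Map step: emit "UserID|SearchTerm" as the key
# (indices 1 and 2 of the raw log become the joined key).
mapped = logs.map { |l| f = l.split("\t"); "#{f[1]}|#{f[2]}" }
# Shuffle/sort: Hadoop sorts mapper output by key before reducing.
sorted = mapped.sort
# Reduce step: count occurrences of each key (Ruby 2.7+ for #tally).
counts = sorted.tally
counts.each { |key, n| puts "#{key}\t#{n}" }
```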