
Apache Sqoop 1.99.3 + Hadoop 2.5.2 + MySQL 5.0.7: environment setup and data import/export

2016-05-11 10:47

Overview

To survey the tooling in the Hadoop ecosystem, I looked into tools for moving data between an RDBMS and HDFS and compared several similar products. The conclusion: most of them are secondary development on top of Sqoop, or web-UI wrappers around it; under the hood it is still Sqoop. Pentaho's PDI and Oracle's ODI both take this approach. Going a step further, Hortonworks' Sandbox, the Hue web UI, and Cloudera Manager are even better: they integrate just about every major Hadoop-ecosystem component, are not especially complicated to deploy, and are quite powerful.

About Sqoop

Apache Sqoop currently ships two product lines: Sqoop 1 and Sqoop 2. Sqoop 1 is the more mature of the two, with fewer bugs but a fairly monolithic architecture; its current stable release is 1.4.6. Sqoop 2 builds on Sqoop 1 with major changes: the client and server are separated, and jobs and connections are managed centrally on the server. From a usage standpoint it is much simpler than Sqoop 1, but deployment is more involved, and Sqoop 1 is not compatible with Sqoop 2, so most existing application scripts would have to be rewritten. The longer-term trend, though, is for Sqoop 2 to become the mainstream line.

# Note: since 1.99.2, Sqoop 2 has been unable to import data into HBase; this is expected to be resolved in the Sqoop 2.0.0 stable release.

Environment setup

The setup follows the official documentation; the points below deserve particular attention:

1. What needs to change in server/conf/sqoop.properties

org.apache.sqoop.repository.jdbc.url=jdbc:derby:@BASEDIR@/repository/sqoop;create=true

The line above is the shipped default, which stores the metadata repository in embedded Derby. In this article, sqoop is a database created beforehand on the MySQL side, with privileges granted:

create database sqoop;

create user sqoop identified by '123456';

grant all privileges on sqoop.* to sqoop;

flush privileges;
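If the metadata repository is to live in that MySQL database instead of embedded Derby, the repository settings in sqoop.properties would be changed along these lines (a sketch only; your-mysql-ip is a placeholder, and you should verify these property names against the sqoop.properties shipped with your release, since MySQL repository support in the 1.99.x line varies by version):

```
org.apache.sqoop.repository.jdbc.url=jdbc:mysql://your-mysql-ip:3306/sqoop
org.apache.sqoop.repository.jdbc.driver=com.mysql.jdbc.Driver
org.apache.sqoop.repository.jdbc.user=sqoop
org.apache.sqoop.repository.jdbc.password=123456
```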


2. Also in sqoop.properties, set the location of the Hadoop configuration directory:

org.apache.sqoop.submission.engine.mapreduce.configuration.directory=your-hadoop-cluster-location

3. In server/conf/catalina.properties, append all of the jar files under hadoop/share to the common.loader line.

common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,your-hadoop-libs
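Assuming a standard Hadoop 2.x tarball layout, with /home/project/hadoop-2.5.2 as the install path (the path that appears in the startup log later in this article), the your-hadoop-libs placeholder might expand to entries such as the following, all comma-separated on the single common.loader line:

```
/home/project/hadoop-2.5.2/share/hadoop/common/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/common/lib/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/hdfs/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/hdfs/lib/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/mapreduce/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/mapreduce/lib/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/yarn/*.jar,
/home/project/hadoop-2.5.2/share/hadoop/yarn/lib/*.jar
```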

4. [Important] Edit Hadoop's yarn-site.xml file and append the following:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
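As a quick sanity check that both entries made it into yarn-site.xml, the file can be parsed with a throwaway script like the one below (a sketch; it inlines a sample document, but site_props could just as well be fed the real file's contents):

```python
import xml.etree.ElementTree as ET

# Sample of what the relevant part of yarn-site.xml should contain.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>"""

def site_props(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style *-site.xml."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}

props = site_props(SAMPLE)
print(props["yarn.nodemanager.aux-services"])  # mapreduce_shuffle
```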


Testing

Start the Hadoop and Sqoop environments.

1. Start Hadoop with the start-all.sh script

2. Start Sqoop

1. Once the Sqoop server is started as a daemon, you will see output like the following:

[root@sv001 sqoop-1.99.3-bin-hadoop200]# ./bin/sqoop.sh server run
Sqoop home directory: /home/project/sqoop-1.99.3-bin-hadoop200
Setting SQOOP_HTTP_PORT:     12000
Setting SQOOP_ADMIN_PORT:     12001
Using   CATALINA_OPTS:
Adding to CATALINA_OPTS:    -Dsqoop.http.port=12000 -Dsqoop.admin.port=12001
Using CATALINA_BASE:   /home/project/sqoop-1.99.3-bin-hadoop200/server
Using CATALINA_HOME:   /home/project/sqoop-1.99.3-bin-hadoop200/server
Using CATALINA_TMPDIR: /home/project/sqoop-1.99.3-bin-hadoop200/server/temp
Using JRE_HOME:        /usr/java/jdk1.7.0_67
Using CLASSPATH:       /home/project/sqoop-1.99.3-bin-hadoop200/server/bin/bootstrap.jar
May 11, 2016 6:56:00 PM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
May 11, 2016 6:56:00 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-12000
May 11, 2016 6:56:00 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 634 ms
May 11, 2016 6:56:00 PM org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
May 11, 2016 6:56:00 PM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.36
May 11, 2016 6:56:00 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive sqoop.war
2016-05-11 18:56:00,972 INFO  [main] core.SqoopServer (SqoopServer.java:initialize(47)) - Booting up Sqoop server
2016-05-11 18:56:00,979 INFO  [main] core.PropertiesConfigurationProvider (PropertiesConfigurationProvider.java:initialize(96)) - Starting config file poller thread
log4j: Parsing for [root] with value=[WARN, file].
log4j: Level token is [WARN].
log4j: Category root set to WARN
log4j: Parsing appender named "file".
log4j: Parsing layout options for "file".
log4j: Setting property [conversionPattern] to [%d{ISO8601} %-5p %c{2} [%l] %m%n].
log4j: End of parsing for "file".
log4j: Setting property [file] to [@LOGDIR@/sqoop.log].
log4j: Setting property [maxBackupIndex] to [5].
log4j: Setting property [maxFileSize] to [25MB].
log4j: setFile called: @LOGDIR@/sqoop.log, true
log4j: setFile ended
log4j: Parsed "file" options.
log4j: Parsing for [org.apache.sqoop] with value=[DEBUG].
log4j: Level token is [DEBUG].
log4j: Category org.apache.sqoop set to DEBUG
log4j: Handling log4j.additivity.org.apache.sqoop=[null]
log4j: Parsing for [org.apache.derby] with value=[INFO].
log4j: Level token is [INFO].
log4j: Category org.apache.derby set to INFO
log4j: Handling log4j.additivity.org.apache.derby=[null]
log4j: Finished configuring.
log4j: Could not find root logger information. Is this OK?
log4j: Parsing for [default] with value=[INFO,defaultAppender].
log4j: Level token is [INFO].
log4j: Category default set to INFO
log4j: Parsing appender named "defaultAppender".
log4j: Parsing layout options for "defaultAppender".
log4j: Setting property [conversionPattern] to [%d %-5p %c: %m%n].
log4j: End of parsing for "defaultAppender".
log4j: Setting property [file] to [@LOGDIR@/default.audit].
log4j: setFile called: @LOGDIR@/default.audit, true
log4j: setFile ended
log4j: Parsed "defaultAppender" options.
log4j: Handling log4j.additivity.default=[null]
log4j: Finished configuring.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/project/sqoop-1.99.3-bin-hadoop200/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/project/sqoop-1.99.3-bin-hadoop200/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/project/hadoop-2.5.2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
May 11, 2016 6:56:03 PM org.apache.catalina.startup.HostConfig deployDirectory
INFO: Deploying web application directory ROOT
May 11, 2016 6:56:03 PM org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-12000
May 11, 2016 6:56:03 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 3605 ms


Running the jps command should show a Bootstrap process; that confirms the Sqoop server started successfully.

2. Start the Sqoop client

Command: sqoop.sh client

[root@sv001 sqoop-1.99.3-bin-hadoop200]# ./bin/sqoop.sh client
Sqoop home directory: /home/project/sqoop-1.99.3-bin-hadoop200
Sqoop Shell: Type 'help' or '\h' for help.

sqoop:000>


3. Test preparation and execution

Confirm the version information:

sqoop:000> show version -all
client version:
Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
server version:
Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
Protocol version:
[1]


Set the server the client connects to (host, port, and webapp name):

set server --host localhost --port 12000 --webapp sqoop

sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully


Create a connection. In the transcript below, the first attempt (connector id 2) fails with a server-side exception; retrying with connector id 1 succeeds:

sqoop:000> create connection --cid 2
Creating connection for connector with id 2
Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException Message: CLIENT_0001:Server has returned exception
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: test-mysql2hdfs

Connection configuration

JDBC Driver Class: com.mysql.jdbc.Driver
JDBC Connection String: jdbc:mysql://your-mysql-ip:3306/sqoop
Username: sqoop
Password: ******
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Security related configuration options

Max connections: 10
New connection was successfully created with validation status FINE and persistent id 6


# Note: the connection string, username, and password above must match what was prepared on the MySQL side beforehand, and the sqoop database must be the same one referenced in sqoop.properties.

The connection created here has id = 6.

Based on that connection, create the import job [mysql --> HDFS]:

sqoop:000> create job --xid 6 --type import
Creating job for connection with id 6
Please fill following values to create new job object
Name: importmysql2hdfs

Database configuration

Schema name: sqoop
Table name: t1
Table SQL statement:
Table column names:
Partition column name: id
Nulls in partition column:
Boundary query:

Output configuration

Storage type:
0 : HDFS
Choose: 0
Output format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
Choose: 0
Compression format:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
Choose: 0
Output directory: /sqoopuse

Throttling resources

Extractors:
Loaders:
New job was successfully created with validation status FINE  and persistent id 4


Note: the table definition and its data were also prepared beforehand on the MySQL side.

mysql> select * from t1;
+------+---------+----------+
| id   | int_col | char_col |
+------+---------+----------+
|    2 |       2 | b        |
|    4 |       4 | d        |
|    1 |       1 | a        |
|    3 |       3 | c        |
+------+---------+----------+
4 rows in set (0.00 sec)
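For context on why several part files appear later: Sqoop parallelizes the import by splitting on the partition column (id here), roughly by querying the column's min and max and dividing that range among the map tasks. The sketch below mimics that splitting in spirit only; it is not Sqoop's actual algorithm:

```python
def split_ranges(lo, hi, n_splits):
    """Divide the closed integer interval [lo, hi] into n_splits contiguous sub-ranges."""
    step = (hi - lo + 1) / n_splits
    bounds = [lo + round(i * step) for i in range(n_splits)] + [hi + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(n_splits)]

# With ids 1..4 and three map tasks, each mapper gets a contiguous id range,
# which is why three part-m-* files show up in the import output below.
print(split_ranges(1, 4, 3))  # → [(1, 1), (2, 3), (4, 4)]
```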


The data imported into HDFS is stored under the output directory /sqoopuse,

and the job id is 4.

Create the job [hdfs --> mysql]:

sqoop:000> create job --xid 4 --type export
Creating job for connection with id 4
Please fill following values to create new job object
Name: hdfs2mysqlInfo

Database configuration

Schema name: sqoop
Table name: t1
Table SQL statement:
Table column names:
Stage table name:
Clear stage table:

Input configuration

Input directory: /sqoopuse

Throttling resources

Extractors:
Loaders:
New job was successfully created with validation status FINE  and persistent id 11


4. Running the tests

1. Start the job [mysql --> hdfs]:

sqoop:000> start job --jid 4
Submission details
Job ID: 4
Server URL: http://localhost:12000/sqoop/ Created by: root
Creation date: 2016-05-11 19:19:53 JST
Lastly updated by: root
External ID: job_1462962692840_0001 http://sv004:8088/proxy/application_1462962692840_0001/ 2016-05-11 19:19:53 JST: BOOTING  - Progress is not available


2. Wait about 30 seconds, then check the job status:

sqoop:000> status job --jid 4
Submission details
Job ID: 4
Server URL: http://localhost:12000/sqoop/ Created by: root
Creation date: 2016-05-11 19:37:16 JST
Lastly updated by: root
External ID: job_1462962692840_0001 http://sv004:8088/proxy/application_1462962692840_0001/ 2016-05-11 19:37:57 JST: SUCCEEDED
Counters:
org.apache.hadoop.mapreduce.JobCounter
SLOTS_MILLIS_MAPS: 38212
MB_MILLIS_MAPS: 39129088
TOTAL_LAUNCHED_MAPS: 3
MILLIS_MAPS: 38212
VCORES_MILLIS_MAPS: 38212
SLOTS_MILLIS_REDUCES: 0
OTHER_LOCAL_MAPS: 3
org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
BYTES_READ: 0
org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
BYTES_WRITTEN: 32
org.apache.hadoop.mapreduce.TaskCounter
MAP_INPUT_RECORDS: 0
MERGED_MAP_OUTPUTS: 0
PHYSICAL_MEMORY_BYTES: 497262592
SPILLED_RECORDS: 0
FAILED_SHUFFLE: 0
CPU_MILLISECONDS: 3520
COMMITTED_HEAP_BYTES: 603979776
VIRTUAL_MEMORY_BYTES: 2741444608
MAP_OUTPUT_RECORDS: 4
SPLIT_RAW_BYTES: 346
GC_TIME_MILLIS: 96
org.apache.hadoop.mapreduce.FileSystemCounter
FILE_READ_OPS: 0
FILE_WRITE_OPS: 0
FILE_BYTES_READ: 0
FILE_LARGE_READ_OPS: 0
HDFS_BYTES_READ: 346
FILE_BYTES_WRITTEN: 318117
HDFS_LARGE_READ_OPS: 0
HDFS_BYTES_WRITTEN: 32
HDFS_READ_OPS: 12
HDFS_WRITE_OPS: 6
org.apache.sqoop.submission.counter.SqoopCounters
ROWS_READ: 4
Job executed successfully


3. List the output files stored in HDFS:

[root@sv001 bin]# ./hadoop fs -ls /sqoopuse
16/05/11 19:43:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   3 root supergroup          0 2016-05-11 19:37 /sqoopuse/_SUCCESS
-rw-r--r--   3 root supergroup          8 2016-05-11 19:37 /sqoopuse/part-m-00000
-rw-r--r--   3 root supergroup          8 2016-05-11 19:37 /sqoopuse/part-m-00001
-rw-r--r--   3 root supergroup         16 2016-05-11 19:37 /sqoopuse/part-m-00002


4. Verify the imported data. (Note that the three part files total 8 + 8 + 16 = 32 bytes, which matches HDFS_BYTES_WRITTEN: 32 in the counters above.)

[root@sv001 bin]# ./hadoop fs -cat /sqoopuse/part*
16/05/11 19:43:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1,1,'a'
2,2,'b'
4,4,'d'
3,3,'c'


This matches the data in MySQL, which shows that no data was lost during the import.
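That eyeball comparison can be automated. The sketch below parses the TEXT_FILE lines shown above back into typed rows and compares them, order-independently, against the source rows (an illustration only; it assumes the simple format shown here, with no embedded commas or quotes in the data):

```python
def parse_line(line):
    """Parse one line of Sqoop TEXT_FILE output: 'quoted' fields -> str, bare fields -> int."""
    out = []
    for field in line.strip().split(","):
        if field.startswith("'") and field.endswith("'"):
            out.append(field[1:-1])
        else:
            out.append(int(field))
    return tuple(out)

hdfs_lines = ["1,1,'a'", "2,2,'b'", "4,4,'d'", "3,3,'c'"]          # cat of part-m-* files
mysql_rows = [(2, 2, "b"), (4, 4, "d"), (1, 1, "a"), (3, 3, "c")]  # select * from t1

# Row order differs between the two systems, so compare as sorted lists.
assert sorted(parse_line(l) for l in hdfs_lines) == sorted(mysql_rows)
print("round trip OK")
```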

Testing the job [hdfs --> mysql]

1. Clear the data on the MySQL side:

mysql> select * from t1;
+------+---------+----------+
| id   | int_col | char_col |
+------+---------+----------+
|    2 |       2 | b        |
|    4 |       4 | d        |
|    1 |       1 | a        |
|    3 |       3 | c        |
+------+---------+----------+
4 rows in set (0.00 sec)

mysql> delete from t1;
Query OK, 4 rows affected (0.27 sec)

mysql> select * from t1;
Empty set (0.00 sec)

mysql>


2. Start the job [hdfs --> mysql]:

sqoop:000> start job --jid 11
Submission details
Job ID: 11
Server URL: http://localhost:12000/sqoop/ Created by: root
Creation date: 2016-05-11 19:50:42 JST
Lastly updated by: root
External ID: job_1462962692840_0002 http://sv004:8088/proxy/application_1462962692840_0002/ 2016-05-11 19:50:42 JST: BOOTING  - Progress is not available


3. Check the job's running status:

sqoop:000> status job --jid 11
Submission details
Job ID: 11
Server URL: http://localhost:12000/sqoop/ Created by: root
Creation date: 2016-05-11 19:50:42 JST
Lastly updated by: root
External ID: job_1462962692840_0002 http://sv004:8088/proxy/application_1462962692840_0002/ 2016-05-11 19:51:39 JST: SUCCEEDED
Counters:
org.apache.hadoop.mapreduce.JobCounter
SLOTS_MILLIS_MAPS: 204363
MB_MILLIS_MAPS: 209267712
TOTAL_LAUNCHED_MAPS: 8
MILLIS_MAPS: 204363
VCORES_MILLIS_MAPS: 204363
SLOTS_MILLIS_REDUCES: 0
OTHER_LOCAL_MAPS: 8
org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
BYTES_WRITTEN: 0
org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
BYTES_READ: 0
org.apache.hadoop.mapreduce.TaskCounter
MAP_INPUT_RECORDS: 0
MERGED_MAP_OUTPUTS: 0
PHYSICAL_MEMORY_BYTES: 1327665152
SPILLED_RECORDS: 0
COMMITTED_HEAP_BYTES: 1610612736
CPU_MILLISECONDS: 7590
FAILED_SHUFFLE: 0
VIRTUAL_MEMORY_BYTES: 7262990336
SPLIT_RAW_BYTES: 1224
MAP_OUTPUT_RECORDS: 4
GC_TIME_MILLIS: 316
org.apache.hadoop.mapreduce.FileSystemCounter
FILE_WRITE_OPS: 0
FILE_READ_OPS: 0
FILE_LARGE_READ_OPS: 0
FILE_BYTES_READ: 0
HDFS_BYTES_READ: 1320
FILE_BYTES_WRITTEN: 839664
HDFS_LARGE_READ_OPS: 0
HDFS_WRITE_OPS: 0
HDFS_READ_OPS: 32
HDFS_BYTES_WRITTEN: 0
org.apache.sqoop.submission.counter.SqoopCounters
ROWS_READ: 4
Job executed successfully


4. On the MySQL client, confirm that the export succeeded and that no data was lost:


mysql> select * from t1;        <-------- verify with select
+------+---------+----------+
| id   | int_col | char_col |
+------+---------+----------+
|    1 |       1 | a        |
|    2 |       2 | b        |
|    4 |       4 | d        |
|    3 |       3 | c        |
+------+---------+----------+
4 rows in set (0.00 sec)


The export succeeded, with no data loss.

---over----