
Working with Hive and MySQL from Python

2018-12-21 23:34

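Hive query results cannot be written directly to MySQL. Some people use Sqoop, but that is still relatively inconvenient. There are certainly other Hive-to-MySQL transfer tools as well, but the usual approach is to drive HiveServer2 from Python (the Hive CLI is not officially recommended) and to use Python again to save the results to MySQL. So what does querying Hive from Python and saving the results to MySQL look like?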

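1. Required packages

A search on Baidu turns up essentially the following three packages; this article uses the second.

1. pyhs2, no longer updated or maintained
https://github.com/BradRuderman/pyhs2

2. pyhive
https://github.com/dropbox/PyHive

3. impyla
https://github.com/cloudera/impyla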

2. We choose pyhive

Installation was mostly smooth. The steps are:

# Without this package you get an error, with an error message saying so
pip install thrift
pip install pyhive

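Think the installation is finished? Connecting to Hive fails with:

ImportError: No module named sasl

The message means sasl needs to be installed:

pip install sasl

# On the different systems I tried, some Linux systems do not report an error here
# If installing sasl fails, run the following
yum install cyrus-sasl-devel saslwrapper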

Stack Overflow question and answers: https://stackoverflow.com/questions/37452024/install-thrift-sasl-for-python-in-windows

Still an error:

ImportError: No module named thrift_sasl

thrift_sasl needs to be installed:

pip install thrift-sasl

Following the error messages step by step, pyhive is now installed.
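
Instead of discovering the missing modules one ImportError at a time, the environment can be checked in one pass. The following is a minimal sketch, not part of the original article: the helper name missing_modules and the REQUIRED list are illustrative, and the list simply mirrors the import names of the packages installed above.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the packages installed above (assumed to match the pip packages).
REQUIRED = ["thrift", "sasl", "thrift_sasl", "pyhive"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Still missing: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running the script before the first connection attempt lists everything that is still missing at once.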

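3. An example

We need to save the results of a group-by aggregation to MySQL so that a display system can present them. The steps are:

1) Get the Hive connection conn: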

def getConn(self, host='0.0.0.0', port=10000, username='user', password='pass', database='database', auth='LDAP'):
    """
    create connection to hive server2
    """
    # requires: from pyhive import hive
    self.conn = hive.Connection(host=host,
                                port=port,
                                username=username,
                                password=password,
                                database=database,
                                auth=auth)

2) Query

Suppose sql = 'select date,count(*) from table group by date;'

We need a query function:

def query(self, sql):
    """
    query
    """
    with self.conn.cursor() as cursor:
        cursor.execute(sql)
        return cursor.fetchall()

3) A method to close the connection

def close(self):
    """
    close connection
    """
    self.conn.close()

4) Save to MySQL

def insert(self, sql):
    # self.conn here is the MySQL connection, not the Hive one
    cursor = self.conn.cursor()
    cursor.execute(sql)
    cursor.close()
    self.conn.commit()

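The query returns a list of tuples, for example [(date, count)] for the query above, so we iterate over the list and save each tuple as one row. The loop below handles a three-column result (a product name and two counts):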

for item in result:
    product_name = item[0]
    device_count = item[1]
    device_count_all = item[2]
    # date is assumed to be defined earlier; single quotes are stripped from
    # product_name so that the hand-built SQL string stays valid
    sql = '''insert into statistic_v2_productname(product_name, date, hour, device_count, device_count_all) values\
    ('%s', '%s', %d, %d, %d)''' % (product_name.replace("'", ""), date, 0, device_count, device_count_all)
    mclient.insert(sql)
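
Building the insert statement with string formatting works, but it relies on stripping quotes by hand. As a hedged alternative, the sketch below lets the MySQL driver do the escaping through query parameters and sends all rows in one batch. It is not the author's code: build_params and save_rows are hypothetical names, the %s placeholder style is the one used by drivers such as PyMySQL and MySQLdb (the article does not say which driver is used), and conn is assumed to be a DB-API connection to MySQL.

```python
INSERT_SQL = (
    "insert into statistic_v2_productname"
    "(product_name, date, hour, device_count, device_count_all) "
    "values (%s, %s, %s, %s, %s)"
)


def build_params(result, date, hour=0):
    """Turn Hive rows (product_name, device_count, device_count_all) into insert parameters."""
    return [
        (product_name, date, hour, device_count, device_count_all)
        for product_name, device_count, device_count_all in result
    ]


def save_rows(conn, result, date):
    """Insert all rows in one batch; `conn` is assumed to be a DB-API MySQL connection."""
    cursor = conn.cursor()
    try:
        # The driver escapes each value, so product names containing quotes are safe.
        cursor.executemany(INSERT_SQL, build_params(result, date))
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cursor.close()
```

With this approach the product name no longer needs the replace("'", "") workaround, and a failed batch is rolled back as a whole.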

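5) Close the connections:

As in other languages, remember to close database connections in Python when you are done. Likewise, a file opened with open() for writing must be closed afterwards; otherwise it may look as if nothing was written.

mclient.close()
hclient.close()

Here mclient and hclient are the MySQL and Hive connection objects.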
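
Remembering to call close() by hand is easy to get wrong when a query raises an exception. One option from the standard library is contextlib.closing, which calls close() on any object when the with block exits. The sketch below uses a DummyClient stand-in (hypothetical, only for illustration) so that it runs without a Hive server; the article's client classes expose the same query() and close() methods.

```python
from contextlib import closing


class DummyClient(object):
    """Stand-in for the article's clients; only query() and close() matter here."""

    def __init__(self):
        self.closed = False

    def query(self, sql):
        if "fail" in sql:
            raise RuntimeError("query failed")
        return [("2018-12-21", 42)]

    def close(self):
        self.closed = True


def run_query(client, sql):
    """Run one query and close the client afterwards, even if the query raises."""
    with closing(client) as c:
        return c.query(sql)
```

The same pattern applies to files: a with open(...) block closes the file automatically.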

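4. Simplifying the code

Clearly there will be many similar statistics jobs. Re-initializing the Hive and MySQL connections, getting a cursor, executing (query, insert, delete), and closing the database every time is a waste of time. We want to free ourselves from tedious repetitive code, so the code needs simplifying, mainly in two ways:

1) encapsulate functionality;

2) then encapsulate business logic.

Encapsulating business logic at an early stage makes later use awkward, so here only functionality is encapsulated, leaving room for other business features later.

1. Getting a connection can be a singleton, the database can be a test or a production one, and the query, execute, and close operations can all be encapsulated.

2. Wrapping the code as a method inside one script is convenient for that script, but other scripts would have to rewrite it, which is not acceptable. The answer is to write our own module, package it, and keep it locally for ourselves and our colleagues to use.

The package mainly wraps commonly used time helpers, such as getting the time 2 hours ago or the list of dates for the previous 5 days, and the sql modules, which are the focus and live in the sql directory. Taking the hive module as an example: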

# -*- coding:utf-8 -*-

from pyhive import hive


class HiveClient(object):
    """Thin wrapper around a HiveServer2 connection."""

    def __init__(self, host='0.0.0.0', port=10000, username='user', password='pass', database='database', auth='LDAP'):
        """
        create connection to hive server2
        """
        self.conn = hive.Connection(host=host,
                                    port=port,
                                    username=username,
                                    password=password,
                                    database=database,
                                    auth=auth)

    def query(self, sql):
        """
        query
        """
        with self.conn.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetchall()

    def insert(self, sql):
        """
        insert action
        """
        # Hive has no transactions: PyHive's commit() does nothing and
        # rollback() is not supported, so errors are simply left to propagate.
        with self.conn.cursor() as cursor:
            cursor.execute(sql)

    def close(self):
        """
        close connection
        """
        self.conn.close()

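This defines Hive connection setup, query, insert, and close. The default parameters are the test database's connection details; to use another database, just pass in its connection parameters.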
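
The article mentions that the connection can be a singleton and that there are test and production databases, but shows no code for either. One possible way to combine the two is sketched below. Everything here is an assumption made for illustration: the HIVE_CONFIGS dictionary, the host name hive.example.internal, and the make_cached_factory helper are hypothetical and not part of the author's package.

```python
import functools

# Hypothetical connection settings; replace with real hosts and credentials.
HIVE_CONFIGS = {
    "test": {"host": "0.0.0.0", "port": 10000, "database": "database"},
    "prod": {"host": "hive.example.internal", "port": 10000, "database": "database"},
}


def make_cached_factory(client_cls, configs):
    """Return get_client(env), which builds one client per environment and reuses it."""

    @functools.lru_cache(maxsize=None)
    def get_client(env):
        # The first call for an environment creates the client; later calls return the same object.
        return client_cls(**configs[env])

    return get_client
```

With the class above, get_hive = make_cached_factory(HiveClient, HIVE_CONFIGS) followed by get_hive("test") would return the same client object on every call for that environment.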

Next time, you only need to import the corresponding class.
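
The time helpers mentioned earlier (the time 2 hours ago, the list of dates for the previous 5 days) are not shown in the article. A minimal sketch of what such helpers might look like follows; the function names, the date formats, and the optional today/now arguments are assumptions, not the contents of the author's timedef module.

```python
from datetime import datetime, timedelta


def last_n_days(n, today=None, fmt="%Y-%m-%d"):
    """Return the n dates before `today` as strings, oldest first."""
    today = today or datetime.now()
    return [(today - timedelta(days=i)).strftime(fmt) for i in range(n, 0, -1)]


def hours_ago(n, now=None, fmt="%Y-%m-%d %H"):
    """Return the hour that lies n hours before `now`, as a formatted string."""
    now = now or datetime.now()
    return (now - timedelta(hours=n)).strftime(fmt)
```

Passing the reference time explicitly keeps the helpers easy to test; in a scheduled job the defaults fall back to the current time.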

5. The payoff

Add the module directory to the path with sys.path.append('/home/hadoop/scripts/python_module'), then import the packages you need.

# -*- coding: utf-8 -*-

import sys
sys.path.append('/home/hadoop/scripts/python_module')
import keguang.timedef as timedef
import keguang.sql.hiveclient as hive
import keguang.sql.mysqlclient as mysql

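This gives us the Hive and MySQL modules. Getting a connection, querying, inserting, and closing are just method calls, with the right arguments, on the objects they provide.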

hclient = hive.HiveClient()

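This gives a Hive connection; with no arguments, it connects to the test database.

After that you only need to define the SQL and call the corresponding method. For example, define a query: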

# date is assumed to be defined earlier. Inside a triple-quoted string the
# newlines already separate tokens, so no line-continuation backslashes are used.
sql = '''
select t3.productname, t3.ct, t2.cou from (select t.productname,count(t.guid) ct from
(select (case when productname = '' or productname is null then 'null' else productname end)
as productname, guid from hm2.author where dt = '%s' group by productname, guid)t group by t.productname) t3
inner join
(select (case when productname = '' or productname is null then 'null' else productname end)
as productname,count(guid) cou from hm2.author where dt = '%s' group by productname)t2
on t2.productname = t3.productname
''' % (date, date)

Then call query() to get the result.

result = hclient.query(sql)
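
query() returns plain tuples, so downstream code has to remember which index holds which column. A small, optional refinement is to wrap each row in a namedtuple. This is a sketch rather than part of the original module: ProductStat and to_stats are hypothetical names, and the field names are borrowed from the variables used in the insert loop of section 3.

```python
from collections import namedtuple

# Field order follows the select list: productname, ct, cou.
ProductStat = namedtuple("ProductStat", ["product_name", "device_count", "device_count_all"])


def to_stats(result):
    """Wrap each (productname, ct, cou) tuple in a ProductStat."""
    return [ProductStat(*row) for row in result]
```

A loop can then read stat.product_name and stat.device_count instead of item[0] and item[1], while each ProductStat still behaves like the original tuple.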

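After writing a series of statistics jobs this way, the screen is full of SQL and there is very little plumbing code, which is exactly the effect we want.

We can then focus on the actual business logic instead of rewriting the same generic code, and the modules can be extended as real needs arise.

Likewise, saving results to MySQL only requires calling the corresponding method of the mysql module.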
The original post is also published on my personal blog.
