Web Scraping with Python, Part 8 (translator: 哈雷)

2016-08-11 14:38

Chapter 6: Reading Documents

1. Reading txt files. This is very simple.

from urllib.request import urlopen
textPage = urlopen("http://www.pythonscraping.com/pages/warandpeace/chapter1.txt")
print(textPage.read())

If the file is local, open() and read() are enough to read its entire contents; write() writes data, and readline() is another commonly used function.
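As a minimal sketch of the local-file functions mentioned above, the following writes a small file and then reads it back with read() and readline(). The file name notes.txt and its contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical file name and contents, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# write() stores text in the file; mode "w" creates or truncates it.
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# read() returns the whole file as one string.
with open(path, encoding="utf-8") as f:
    whole_text = f.read()

# readline() returns one line per call, including the trailing newline.
with open(path, encoding="utf-8") as f:
    first_line = f.readline()
    second_line = f.readline()

print(whole_text)
print(first_line.strip())
```

Using a with block closes the file automatically, so no explicit close() call is needed.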

2. Reading CSV files. I have rarely run into these, so I will skip them for now and come back to them when I need to.
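For reference only, CSV data can be read with Python's standard csv module. This is a sketch under the assumption that the data is already in memory as a string; the sample rows are made up. For a remote file, the decoded text returned by urlopen() could be wrapped in StringIO the same way.

```python
import csv
from io import StringIO

# Made-up sample data standing in for the contents of a CSV file.
csv_text = "name,year\nWar and Peace,1869\nAnna Karenina,1878\n"

# csv.reader accepts any iterable of lines; StringIO makes the string file-like.
reader = csv.reader(StringIO(csv_text))
rows = [row for row in reader]

# The first row is treated as the header, the rest as records.
header, records = rows[0], rows[1:]
print(header)
for record in records:
    print(record)
```

Every field comes back as a string, so numeric columns have to be converted explicitly.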

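3. Reading PDF documents. This requires an extra package, pdfminer (available at https://pypi.python.org/pypi/pdfminer3k). Download and unpack it, change into the unpacked directory, and install it with python3 setup.py install. Note that content read this way comes back as plain strings only: images are not shown, and tables lose their formatting.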

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
# pdfFile = open("chapter1.pdf", 'rb')  # use this line instead if the PDF is a local file
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()

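Here is another approach, which I find simpler and easier to follow. The previous method used the pdfminer package; this one uses PyPDF2. The code is as follows: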

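import PyPDF2
# Note: this is the legacy PyPDF2 API; newer releases renamed these calls
# to PdfReader, len(reader.pages), reader.pages[0] and extract_text().
pdfFileObj = open('1.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)  # number of pages
pageObj = pdfReader.getPage(0)  # first page
print(pageObj.extractText())
pdfFileObj.close()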

This method needs little explanation; readers can work through it by comparing it with how the txt file was read above.

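4. Reading Word documents with Python on Windows is fairly simple: download the appropriate package. There are plenty of tutorials, so readers can search Google for them. On Linux: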

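import docx  # provided by the python-docx package

def getText(filename):
    doc = docx.Document(filename)  # create the doc object
    fullText = []
    print(len(doc.paragraphs))  # number of paragraphs in the doc object
    print(doc.paragraphs[0].text)  # text of the first paragraph
    # Text of the first run in the first paragraph; for example, if the paragraph
    # mixes two typefaces such as a round face and italics, this prints the first one.
    print(doc.paragraphs[0].runs[0].text)
    for para in doc.paragraphs:  # collect all the text in doc
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText("1.docx"))  # python-docx reads .docx files, not the older .doc format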
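As an alternative that needs no third-party package, a .docx file is a ZIP archive whose body text lives in word/document.xml, so the paragraphs can be pulled out with the standard library alone. This is a swapped-in technique, not the docx package used above. The helper name docx_paragraphs and the sample XML are made up for illustration, and the in-memory archive below is only a stand-in, not a complete Word file.

```python
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(file_obj):
    """Return the text of each paragraph in a .docx path or file-like object."""
    with zipfile.ZipFile(file_obj) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs

# Build a tiny stand-in archive in memory (hypothetical content).
sample_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)

paragraphs = docx_paragraphs(buffer)
print("\n".join(paragraphs))
```

With a real file, passing its path (for example docx_paragraphs("1.docx")) should work the same way, since zipfile.ZipFile accepts either a path or a file-like object.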

5. Reading Excel spreadsheets on Linux. Reading and writing require two packages, xlrd and xlwt; readers should download and install them.

# -*- coding: utf-8 -*-
import os
import xlrd
import xlwt

rootdir = '/home/name1'
rootdir_2 = '/home/name2'

for filename in os.listdir(rootdir):
    filepath = rootdir + '/' + filename
    data = xlrd.open_workbook(filepath)
    book = xlwt.Workbook()  # create a workbook object
    sheet = book.add_sheet('sheet1', cell_overwrite_ok=True)  # add a sheet
    sheet.col(0).width = 1000  # set the width of the first column
    sheet.write(0, 0, 'Name')  # write Name into the first row, first column
    table = data.sheets()[0]  # first sheet
    nrows = table.nrows  # number of rows
    for i in range(nrows):
        try:
            old_name = table.row(i)[1].value  # cell value from each row (second column)
            # Clean-up step (a plain string replace, not a regex); more rules can go here
            new_name = old_name.replace(',', ' ')
            sheet.write(i, 0, new_name)  # write the cleaned value
        except Exception as e:
            print(e)
    # Save; filename[:-1] drops the last character of the file name
    book.save(rootdir_2 + '/' + filename[:-1])
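The script above builds the output name with filename[:-1], which drops the last character of the name (turning a name ending in .xlsx into one ending in .xls, presumably because xlwt writes the older .xls format). A slightly more explicit sketch of the same idea follows; the helpers output_path and clean_name and the sample values are hypothetical and not part of the original script.

```python
import os

def output_path(out_dir, filename):
    """Build the path of the .xls file to write for a given input file name."""
    stem, _ext = os.path.splitext(filename)  # "report.xlsx" -> ("report", ".xlsx")
    return os.path.join(out_dir, stem + ".xls")

def clean_name(value):
    """Same clean-up as the script: replace commas with spaces."""
    return value.replace(",", " ")

print(output_path("/home/name2", "report.xlsx"))
print(clean_name("Smith,John"))
```

Replacing the extension explicitly avoids producing odd names when an input file already ends in .xls, where slicing off one character would leave .xl.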
