
Reading PDF documents with Python

2017-10-31 15:43

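This diagram shows the overall workflow. It looks a little convoluted, so let's break it down step by step (I learned this from 慕课, imooc).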

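First, use open or urlopen to open a local document or a remote one and obtain a file object (working from a local copy is usually preferable, since a large document also puts a heavy load on the web server). The URL in the commented-out line below points to a document that already exists.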

# Get the file object for the document
pdf0 = open('sampleFORtest.pdf', 'rb')
# pdf1 = urlopen('http://www.tencent.com/zh-cn/content/ir/an/2016/attachments/20160321.pdf')
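
One practical difference between the two sources: a local file opened in binary mode supports seek(), while the response returned by urlopen is a forward-only stream, and PDF parsing generally relies on random access. A minimal sketch of a helper that hides this difference follows. The name open_pdf_source is hypothetical (it is not part of pdfminer), and the sketch assumes the remote file is small enough to hold in memory.

```python
import io
from urllib.request import urlopen


def open_pdf_source(source):
    """Return a seekable binary file object for a local path or an http(s) URL.

    Hypothetical helper for illustration; it is not part of pdfminer.
    """
    if source.startswith(('http://', 'https://')):
        # An HTTP response can only be read forward, so copy it into memory
        # to get a stream that supports seek().
        with urlopen(source) as response:
            return io.BytesIO(response.read())
    # Local files opened in binary mode are already seekable.
    return open(source, 'rb')
```

With this in place, pdf0 = open_pdf_source('sampleFORtest.pdf') could stand in for the open call above, and the same call would accept a URL.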

Next, create a document parser and a PDF document object, and link the two to each other.

# Create a parser associated with the file
parser = PDFParser(pdf0)

# Create a PDF document object
doc = PDFDocument()

# Link the two together
parser.set_document(doc)
doc.set_parser(parser)

Initialize the PDF document object. If the document is encrypted, pass its password as the argument.

# Initialize the document (empty password: the file is not encrypted)
doc.initialize('')

First create a PDF resource manager and a layout parameter object.

# Create the PDF resource manager
resources = PDFResourceManager()

# Create the layout parameters
laparam = LAParams()

Then create an aggregator, passing it the PDF resource manager and the layout parameters as arguments.

# Create an aggregator that takes the resource manager and the layout parameters
device = PDFPageAggregator(resources, laparams=laparam)

Finally, create a page interpreter, passing it the PDF resource manager and the aggregator as arguments.

# Create a page interpreter
interpreter = PDFPageInterpreter(resources, device)

The page interpreter is now able to decode the PDF document and interpret it into a form that Python can work with.

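To finish, call the PDF document object's get_pages() method to read the collection of pages from the document, then have the page interpreter process the pages one at a time. After each page, call the aggregator's get_result() method to obtain that page's layout, and call get_text() on the objects in the layout to get the page's text.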

for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Get the page's layout from the aggregator
    layout = device.get_result()

    for out in layout:
        if hasattr(out, 'get_text'):     # a page holds more than just text
            pprint(out.get_text())

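Note that a PDF document does not contain only text; it may also contain images and other objects. To avoid errors, first check whether each object has a get_text() method.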
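
The check can be tried out without a PDF at hand. The sketch below uses two stand-in classes, FakeTextBox and FakeImage, which are made up for this demo and are not pdfminer classes; collect_text is likewise a hypothetical helper name. It only illustrates how the hasattr test skips objects that carry no text.

```python
def collect_text(layout):
    """Return the text of every item in `layout` that offers get_text()."""
    return [item.get_text() for item in layout if hasattr(item, 'get_text')]


# Stand-in classes for this demo only; a real layout holds pdfminer objects.
class FakeTextBox:
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeImage:
    """Has no get_text(), like a non-text object on a page."""


page = [FakeTextBox('Abstract\n'), FakeImage(), FakeTextBox('good results.\n')]
print(collect_text(page))   # prints ['Abstract\n', 'good results.\n']
```

Checking for the method rather than for a specific class keeps the loop independent of which object types a given page happens to contain.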

Complete code

# encoding:utf-8
'''
@author:
@time:
'''
# Import paths as in pdfminer3k (the Python 3 port); other pdfminer forks
# may organize these modules differently.
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pprint import pprint
from urllib.request import urlopen

# Get the file object for the document
pdf0 = open('sampleFORtest.pdf', 'rb')
# pdf1 = urlopen('http://www.tencent.com/zh-cn/content/ir/an/2016/attachments/20160321.pdf')

# Create a parser associated with the file
parser = PDFParser(pdf0)

# Create a PDF document object
doc = PDFDocument()

# Link the two together
parser.set_document(doc)
doc.set_parser(parser)

# Initialize the document (empty password: the file is not encrypted)
doc.initialize('')

# Create the PDF resource manager
resources = PDFResourceManager()

# Create the layout parameters
laparam = LAParams()

# Create an aggregator that takes the resource manager and the layout parameters
device = PDFPageAggregator(resources, laparams=laparam)

# Create a page interpreter
interpreter = PDFPageInterpreter(resources, device)

# Get the collection of pages from the document object
for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Get the page's layout from the aggregator
    layout = device.get_result()

    for out in layout:
        if hasattr(out, 'get_text'):     # a page holds more than just text
            pprint(out.get_text())

The sample file used here is the one provided by the official project.

Output:

'Preemptive Information Extraction using Unrestricted Relation Discovery\n'
'Yusuke Shinyama\n'
'Satoshi Sekine\n'
'New York University\n715, Broadway, 7th Floor\nNew York, NY, 10003\n'
'{yusuke,sekine}@cs.nyu.edu\n'
'Abstract\n'
('We are trying to extend the boundary of\n'
'Information Extraction (IE) systems. Ex-\n'
'isting IE systems require a lot of time and\n'
'human effort to tune for a new scenario.\n'
'Preemptive Information Extraction is an\n'
'attempt to automatically create all feasible\n'
'IE systems in advance without human in-\n'
'tervention. We propose a technique called\n'
'Unrestricted Relation Discovery that dis-\n'
'covers all possible relations from texts and\n'
'presents them as tables. We present a pre-\n'
'liminary system that obtains reasonably\n'
'good results.\n')
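
The output above is one string per text box, with no indication of which page it came from. If per-page text is more convenient, the final loop of the complete listing could be wrapped in a generator along the following lines. This is only a sketch: iter_page_text is a hypothetical name, and it assumes doc, interpreter and device are the objects built in the complete listing.

```python
def iter_page_text(doc, interpreter, device):
    """Yield (page_number, text) for each page, numbering pages from 1.

    Hypothetical helper: `doc`, `interpreter` and `device` are expected to be
    the objects built in the complete listing.
    """
    for number, page in enumerate(doc.get_pages(), start=1):
        interpreter.process_page(page)   # feed the page through the interpreter
        layout = device.get_result()     # layout of the page just processed
        # Keep only the objects that carry text, then join them per page.
        parts = [obj.get_text() for obj in layout if hasattr(obj, 'get_text')]
        yield number, ''.join(parts)
```

A caller would then write for number, text in iter_page_text(doc, interpreter, device) and handle each page's text as a single string.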
