
Batch-converting PDF files to TXT files with the open-source PDFBox library

2012-11-12 21:25

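A couple of days ago a classmate was stuck with more than a thousand PDF reports that needed converting to TXT files, and asked me to write a program to automate it. I came across the open-source PDFBox library online, looked into it out of curiosity, and read quite a few posts about it. Building on other people's code with some additions and changes, I finally solved my classmate's problem. I'm posting the result here in the hope that it helps anyone with the same trouble.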

PDFBox and FontBox can be downloaded from http://pdfbox.apache.org/download.html.

Download the pdfbox and fontbox jars.

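Create a new project in Eclipse and add the pdfbox and fontbox jars to its build path. For test code, you can paste directly from http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html and /article/9230304.html. After a few corrections (including setting the project encoding to UTF-8 and importing the correct packages) it runs as is, provided you supply a PDF to test with.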
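Before starting a long batch, it can help to confirm that the jars really are on the build path. The following is a minimal sketch, not part of the original code: the class name ClasspathCheck and the choice of probe classes (one representative class per jar, assuming the PDFBox 1.x package layout) are illustrative assumptions. It only asks the class loader whether each class can be found.

```java
// Hypothetical helper, not part of the original post: reports whether the
// classes the converter depends on can be found on the classpath.
final class ClasspathCheck {

    // One representative class per jar (assumed names for the PDFBox 1.x line).
    static final String[] REQUIRED = {
        "org.apache.pdfbox.pdmodel.PDDocument", // pdfbox jar
        "org.apache.fontbox.afm.AFMParser"      // fontbox jar
    };

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but one of its own dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : REQUIRED) {
            System.out.println((isOnClasspath(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Running main inside the Eclipse project prints one line per probe class. A MISSING line means the corresponding jar still has to be added to the build path.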

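To convert PDFs to TXT in batch, I made small changes to the code from http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html. It reads each file in an input directory and writes the extracted text to an output directory, both relative to the working directory. The code is as follows: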

package test;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

    // Extract text from a PDF document in the input/ directory.
    // Returns null if the file is missing or cannot be parsed.
    String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File("input/" + fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }

        FileInputStream in = null;
        COSDocument cosDoc = null;
        PDDocument pdDoc = null;
        try {
            in = new FileInputStream(f);
            PDFParser parser = new PDFParser(in);
            parser.parse();
            cosDoc = parser.getDocument();
            pdDoc = new PDDocument(cosDoc);
            String parsedText = new PDFTextStripper().getText(pdDoc);
            System.out.println("Done.");
            return parsedText;
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Release resources on success as well as on failure; otherwise a
            // batch of a thousand files keeps every document and stream open.
            try {
                if (pdDoc != null) {
                    pdDoc.close(); // also closes the underlying COSDocument
                } else if (cosDoc != null) {
                    cosDoc.close();
                }
                if (in != null) {
                    in.close();
                }
            } catch (IOException e1) {
                e1.printStackTrace();
            }
        }
    }

    // Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {
        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        PrintWriter pw = null;
        try {
            // Write UTF-8 explicitly instead of relying on the platform default.
            pw = new PrintWriter(fileName, "UTF-8");
            pw.print(pdfText);
            System.out.println("Done.");
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
        } finally {
            if (pw != null) {
                pw.close();
            }
        }
    }

    // Extracts text from every PDF in input/ and writes it to output/
    public static void main(String[] args) {
        File input = new File("input");
        if (!input.isDirectory()) {
            System.out.println("Directory input does not exist.");
            return;
        }
        new File("output").mkdirs(); // make sure the output directory exists

        PDFTextParser ptp = new PDFTextParser();
        for (String f : input.list()) {
            // Skip anything that is not a PDF, so that dropping the last
            // four characters below always removes exactly ".pdf".
            if (!f.toLowerCase().endsWith(".pdf")) {
                continue;
            }
            String pdfTxt = ptp.pdftoText(f);
            if (pdfTxt == null) {
                System.out.println("PDF to Text Conversion failed.");
            } else {
                String outTxtName = f.substring(0, f.length() - 4) + ".txt";
                ptp.writeTexttoFile(pdfTxt, "output/" + outTxtName);
            }
        }
    }
}
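The output name in the code above is formed by dropping the last four characters of the input name, which is only correct for names that end in a dot plus three letters, such as .pdf. A slightly more general mapping could look like the sketch below. The class and method names (OutputNames, isPdf, toTxtName) are hypothetical and not part of the original code.

```java
import java.util.Locale;

// Hypothetical helpers for the batch loop; the names are illustrative only.
final class OutputNames {

    // True if the file name ends in ".pdf", ignoring case.
    static boolean isPdf(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Replaces the final extension with ".txt". If the name has no extension
    // (or only a leading dot), ".txt" is simply appended.
    static String toTxtName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = (dot > 0) ? fileName.substring(0, dot) : fileName;
        return base + ".txt";
    }
}
```

In the loop, this would mean skipping entries for which isPdf(f) is false and passing "output/" + OutputNames.toTxtName(f) to writeTexttoFile.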

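This converted my classmate's 1,000-plus PDFs without trouble, although messages like the following sometimes appear during the run: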

十一月 12, 2012 9:22:12 下午 org.apache.pdfbox.util.PDFStreamEngine processOperator

信息: unsupported/disabled operation: EI

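These are INFO-level messages (信息 is the Chinese-locale label for INFO) and do not affect the result; I have not yet looked into a fix. I also ran into a case where bcprov-jdk15on-147.jar (the Bouncy Castle provider) was missing. Downloading it from that jar's own website and adding it to the project solved the problem.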
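The timestamped, two-line shape of the message above looks like the default output format of java.util.logging, which suggests that PDFBox's log output is being routed there. Assuming that is the case, one possible way to hide these INFO messages is to raise the level of the org.apache.pdfbox logger before converting. This is a sketch under that assumption, and the class and method names are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: hides INFO-level messages from PDFBox, assuming its
// log output is routed through java.util.logging.
final class QuietPdfBox {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    // After this call, loggers under org.apache.pdfbox that have no level of
    // their own publish only WARNING and above.
    static void silenceInfo() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);
    }
}
```

Calling QuietPdfBox.silenceInfo() once at the start of main would be enough. Warnings and errors still get through, so real parsing problems stay visible.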

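PDFBox does a good job on well-formatted PDFs (papers, notices, financial reports, and similar documents with a regular layout). Results are mediocre for less regular PDFs, such as ones exported from PPT slides or ones containing many images and odd symbols.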
