
Batch-converting PDF files to TXT with the open-source PDFBox library

2012-11-12 21:25
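A couple of days ago a classmate was stuck with over a thousand PDF reports that needed converting to TXT, and asked me to write a program to automate the job. I came across the open-source PDFBox library online, read up on it along with a number of existing posts, and adapted the code from those posts until it solved the problem. I'm posting the result in the hope that it helps anyone facing the same chore.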

PDFBox and FontBox can be downloaded from http://pdfbox.apache.org/download.html

Download the PDFBox and FontBox jar files.

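Create a new project in Eclipse and add both jars to its build path. For a quick test, the code from http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html can be pasted in directly; after a few fixes (setting the project encoding to UTF-8 and importing the correct packages) it runs as is. You will also need to supply a PDF to test with.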

To convert PDFs to TXT in batches, I made a few small changes to the code from http://www.prasannatech.net/2009/01/convert-pdf-text-parser-java-api-pdfbox.html, as follows:

package test;

import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

    // Extract text from a PDF document in the input/ directory.
    // Returns null if the file is missing or cannot be parsed.
    String pdftoText(String fileName) {
        System.out.println("Parsing text from PDF file " + fileName + "....");
        File f = new File("input/" + fileName);
        if (!f.isFile()) {
            System.out.println("File " + fileName + " does not exist.");
            return null;
        }
        PDFParser parser;
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            System.out.println("Unable to open PDF Parser.");
            return null;
        }
        // Locals rather than fields, so nothing leaks from one file to the next
        COSDocument cosDoc = null;
        PDDocument pdDoc = null;
        String parsedText;
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            PDFTextStripper pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.out.println("An exception occurred in parsing the PDF Document.");
            e.printStackTrace();
            return null;
        } finally {
            // Release the document on success as well as on failure;
            // closing the PDDocument also closes its COSDocument
            try {
                if (pdDoc != null) {
                    pdDoc.close();
                } else if (cosDoc != null) {
                    cosDoc.close();
                }
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        System.out.println("Done.");
        return parsedText;
    }

    // Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {
        System.out.println("\nWriting PDF text to output text file " + fileName + "....");
        try {
            PrintWriter pw = new PrintWriter(fileName);
            pw.print(pdfText);
            pw.close();
        } catch (Exception e) {
            System.out.println("An exception occurred in writing the pdf text to file.");
            e.printStackTrace();
        }
        System.out.println("Done.");
    }

    // Extracts text from every PDF in input/ and writes it to a text file in output/
    public static void main(String args[]) {
        File input = new File("input");
        if (input.isDirectory()) {
            // Make sure the output directory exists before writing into it
            new File("output").mkdirs();
            String[] fileList = input.list();
            PDFTextParser ptp = new PDFTextParser();
            for (String f : fileList) {
                String pdfTxt = ptp.pdftoText(f);
                if (pdfTxt == null) {
                    System.out.println("PDF to Text Conversion failed.");
                } else {
                    // Assumes a four-character extension such as ".pdf"
                    String outTxtName = f.substring(0, f.length() - 4) + ".txt";
                    ptp.writeTexttoFile(pdfTxt, "output/" + outTxtName);
                }
            }
        }
    }
}
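
The loop in main derives each output name with f.substring(0, f.length() - 4), which assumes every entry in input/ ends in a four-character extension such as .pdf. A minimal sketch of a stricter version follows, assuming a hypothetical helper class PdfBatchNames (my own name, not part of PDFBox or of the code above). It filters the directory listing down to .pdf names and builds the .txt name from the actual suffix:

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

// Hypothetical helper for illustration; not part of PDFBox.
final class PdfBatchNames {

    // True for names ending in ".pdf", ignoring case ("A.PDF" also matches).
    static boolean isPdfName(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // "report.pdf" -> "report.txt"; a name without a ".pdf" suffix
    // simply gets ".txt" appended.
    static String toTxtName(String pdfName) {
        String base = isPdfName(pdfName)
                ? pdfName.substring(0, pdfName.length() - ".pdf".length())
                : pdfName;
        return base + ".txt";
    }

    // Entries of dir whose names end in ".pdf"; an empty array if dir
    // does not exist or cannot be listed.
    static String[] listPdfNames(File dir) {
        String[] names = dir.list(new FilenameFilter() {
            @Override
            public boolean accept(File parent, String name) {
                return isPdfName(name);
            }
        });
        return names == null ? new String[0] : names;
    }

    public static void main(String[] args) {
        // Prints the planned input -> output mapping without converting anything
        for (String name : listPdfNames(new File("input"))) {
            System.out.println(name + " -> output/" + toTxtName(name));
        }
    }
}
```

In the batch loop, listPdfNames(input) could stand in for input.list() and toTxtName(f) for the substring call, so that stray non-PDF files in input/ are skipped instead of being reported as failed conversions.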

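With this I converted more than 1,000 PDFs for my classmate without trouble. During the run, a message like this is sometimes logged:

Nov 12, 2012 9:22:12 PM org.apache.pdfbox.util.PDFStreamEngine processOperator

INFO: unsupported/disabled operation: EI

It does not affect the output, and I have not yet looked into how to get rid of it. I also ran into a missing bcprov-jdk15on-147.jar at one point; downloading that jar from its project website and adding it to the project fixed the problem.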
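
One possible way to quiet that message, sketched under an assumption: the format of the logged lines suggests PDFBox's logging ends up in java.util.logging here, in which case raising the level of the org.apache.pdfbox logger should drop INFO messages. If another logging backend (for example log4j) is on the classpath, this would have no effect. The class name QuietPdfBoxLogging is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of PDFBox.
final class QuietPdfBoxLogging {

    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    // Raise the threshold so INFO messages are dropped;
    // WARNING and SEVERE messages still get through.
    static void silenceInfo() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        silenceInfo();
        // Child loggers inherit the level of "org.apache.pdfbox"
        Logger engine = Logger.getLogger("org.apache.pdfbox.util.PDFStreamEngine");
        engine.info("unsupported/disabled operation: EI"); // suppressed
        engine.warning("warnings are still reported");      // still logged
    }
}
```

Calling QuietPdfBoxLogging.silenceInfo() once at the start of main, before the first PDF is parsed, would be enough for the whole batch.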

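PDFBox does a good job on well-formatted PDFs (papers, notices, financial reports, and similar documents with a regular layout), but the results are mediocre for less regular ones, such as PDFs exported from PowerPoint slides or ones full of images and odd symbols.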
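
With over a thousand files it is impractical to check every output by hand, so a rough automatic check can help pick out the conversions worth a second look. The sketch below is a heuristic of my own, not a PDFBox feature: it flags text that is very short or consists mostly of symbols. The class name ExtractionCheck, the 200-character minimum, and the 0.5 ratio are all arbitrary assumptions that would need tuning on real data.

```java
// Hypothetical helper for illustration; the thresholds are arbitrary assumptions.
final class ExtractionCheck {

    // Fraction of non-whitespace characters that are letters or digits
    // (CJK characters count as letters). Returns 0.0 for empty text.
    static double readableRatio(String text) {
        int visible = 0;
        int readable = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            visible++;
            if (Character.isLetterOrDigit(cp)) {
                readable++;
            }
        }
        return visible == 0 ? 0.0 : (double) readable / visible;
    }

    // True if the extracted text is very short or mostly symbols.
    static boolean looksSuspicious(String text) {
        return text.trim().length() < 200 || readableRatio(text) < 0.5;
    }

    public static void main(String[] args) {
        System.out.println(looksSuspicious(""));           // empty output is flagged
        System.out.println(readableRatio("abc 123 #@!"));  // mix of text and symbols
    }
}
```

In the batch loop, the string returned by pdftoText could be passed to looksSuspicious and the file name printed when it returns true, giving a short list of PDFs to inspect manually.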