Parsing PDF Documents (Text and Images) with Apache PDFBox 2.0.x

2018-03-09 17:18

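        A recent project required parsing PDF files and extracting their content into a database. For the parsing I used the open-source Apache PDFBox library, version 2.0.8, which was the latest release at the time of writing. The basic steps for reading and parsing a document are recorded below.
        1. Add the jars. The basics are pdfbox-2.0.8.jar and fontbox-2.0.8.jar; PDFBox also needs commons-logging on the classpath at runtime.

            Apache download link:
                    https://pdfbox.apache.org/download.cgi
            With Maven, add the following: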
                    <dependency>
                            <groupId>org.apache.pdfbox</groupId>
                            <artifactId>pdfbox</artifactId>
                            <version>2.0.8</version>
                    </dependency>
                    <dependency>
                            <groupId>org.apache.pdfbox</groupId>
                            <artifactId>fontbox</artifactId>
                            <version>2.0.8</version>
                    </dependency>
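        2. Extract the text content from a PDF:
            First read the file, or obtain the stream of a file uploaded through the web. Then load it into a PDDocument. Finally, run the document through a text stripper and wrap the result in whatever data or objects you need. The parsing code is as follows: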
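   import java.io.InputStream;
   import org.apache.pdfbox.pdmodel.PDDocument;
   import org.apache.pdfbox.text.PDFTextStripper;

   /**
    * Extracts the content of a PDF file as a single string. Only the text is
    * returned; images and table structure are not extracted.
    * @param fileStream stream of the PDF file
    * @param sort whether to sort the text by its position on the page
    * @return the extracted text, or null if parsing failed
    */
   public static String getTextFromPdf(InputStream fileStream, boolean sort) {
       String content = null;
       // try-with-resources closes the document even if extraction fails
       try (PDDocument document = PDDocument.load(fileStream)) {
           PDFTextStripper stripper = new PDFTextStripper();
           stripper.setSortByPosition(sort);
           // Page numbers are 1-based: extract from the first page to the last
           stripper.setStartPage(1);
           stripper.setEndPage(document.getNumberOfPages());
           content = stripper.getText(document);
           log.info("PDF parsed, content: " + content);
       } catch (Exception e) {
           // log is the enclosing class's logger; passing e keeps the stack trace
           log.error("Failed to parse PDF: " + e.getMessage(), e);
       }
       return content;
   }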
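
When the stream comes from a web upload, it can be worth rejecting obvious non-PDF input before handing it to the parser. The sketch below is a minimal, JDK-only check that the stream begins with the "%PDF-" signature. The class and method names (PdfHeaderCheck, looksLikePdf) are hypothetical helpers, not part of PDFBox, and the check is deliberately strict: a parser may accept files that this rejects.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not a PDFBox class): checks for the "%PDF-" signature. */
final class PdfHeaderCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfHeaderCheck() {
    }

    /**
     * Returns true if the stream starts with "%PDF-". The stream position is
     * restored afterwards, so the same stream can still be passed to a parser.
     */
    static boolean looksLikePdf(BufferedInputStream in) {
        try {
            in.mark(MAGIC.length);
            byte[] head = new byte[MAGIC.length];
            int total = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            in.reset();
            return total == MAGIC.length && Arrays.equals(head, MAGIC);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would wrap the upload in a BufferedInputStream, call PdfHeaderCheck.looksLikePdf(stream), and only then pass the same stream to getTextFromPdf.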

    3. Extract the list of images from a PDF document (straight to the code):
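   import java.util.ArrayList;
   import java.util.List;
   import org.apache.pdfbox.cos.COSName;
   import org.apache.pdfbox.pdmodel.PDDocument;
   import org.apache.pdfbox.pdmodel.PDPageTree;
   import org.apache.pdfbox.pdmodel.PDResources;
   import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

   /**
    * Reads all images from a PDF document.
    *
    * @param document the loaded document
    * @param startPage zero-based index of the first page to scan; null means 0
    * @return the images found from startPage to the last page
    * @throws Exception if an image object cannot be read
    */
   public static List<PDImageXObject> getImageListFromPDF(PDDocument document, Integer startPage) throws Exception {
       List<PDImageXObject> imageList = new ArrayList<PDImageXObject>();
       if (null == document) {
           return imageList;
       }
       PDPageTree pages = document.getPages();
       // Treat a missing or negative start page as the first page
       int first = startPage == null ? 0 : Math.max(0, startPage);
       for (int i = first; i < pages.getCount(); i++) {
           PDResources resources = pages.get(i).getResources();
           if (null == resources) {
               continue; // a page without resources has no images
           }
           for (COSName imageObjectName : resources.getXObjectNames()) {
               // Only image XObjects referenced directly by the page are collected
               if (resources.isImageXObject(imageObjectName)) {
                   imageList.add((PDImageXObject) resources.getXObject(imageObjectName));
               }
           }
       }
       return imageList;
   }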

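Note: the list returned by the method above contains PDImageXObject objects, not Java Image objects, so they cannot be saved locally or submitted to a server directly. A simple conversion is needed first, for example: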
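   import java.awt.image.BufferedImage;
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import javax.imageio.ImageIO;
   import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

   /**
    * Returns the image data as an input stream.
    * @param image the PDF image object
    * @return a stream of the encoded image, or null if there is no image
    * @throws Exception if the image cannot be decoded or encoded
    */
   public static InputStream getImageInputStream(PDImageXObject image) throws Exception
   {
       // Decode once; getImage() can be expensive for large images
       BufferedImage bufferImage = null == image ? null : image.getImage();
       if (null == bufferImage)
       {
           return null;
       }
       String suffix = image.getSuffix();
       ByteArrayOutputStream os = new ByteArrayOutputStream();
       // ImageIO.write returns false when no writer handles the suffix
       if (null == suffix || !ImageIO.write(bufferImage, suffix, os))
       {
           throw new IOException("No ImageIO writer for suffix: " + suffix);
       }
       return new ByteArrayInputStream(os.toByteArray());
   }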
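
One thing to watch in the conversion above: ImageIO.write returns false, without throwing, when the JDK has no writer registered for the requested format name, and it rejects a null format name. The sketch below shows one way to handle that by falling back to PNG. It uses only the JDK; the class and method names (ImageEncodingSketch, encode) are hypothetical, and choosing PNG as the fallback is an assumption, not something the original code does.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Hypothetical helper: encodes an image, falling back to PNG when needed. */
final class ImageEncodingSketch {

    private ImageEncodingSketch() {
    }

    /**
     * Encodes the image in preferredFormat if ImageIO has a writer for it,
     * otherwise as PNG. preferredFormat may be null.
     */
    static byte[] encode(BufferedImage image, String preferredFormat) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (preferredFormat != null && ImageIO.write(image, preferredFormat, out)) {
                return out.toByteArray();
            }
            // Discard anything a failed attempt may have written, then retry as PNG
            out.reset();
            if (!ImageIO.write(image, "png", out)) {
                throw new IllegalStateException("No PNG writer available");
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a PDImageXObject named image, a call would look like ImageEncodingSketch.encode(image.getImage(), image.getSuffix()). The file extension used when saving should then match the format that was actually written, which this sketch does not report.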

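With that you can read the corresponding image. It can also be written to disk through a File object, for example:
                    // name is a file name chosen by the caller; image is a PDImageXObject
                    BufferedImage imageb = image.getImage();
                    File imgFile = new File("e:\\" + name + "." + image.getSuffix());
                    ByteArrayOutputStream os = new ByteArrayOutputStream();
                    ImageIO.write(imageb, image.getSuffix(), os);
                    // try-with-resources closes both streams even if a write fails
                    try (InputStream is = new ByteArrayInputStream(os.toByteArray());
                         FileOutputStream fout = new FileOutputStream(imgFile))
                    {
                        int byteCount;
                        byte[] bytes = new byte[1024];
                        while ((byteCount = is.read(bytes)) > 0)
                        {
                            fout.write(bytes, 0, byteCount);
                        }
                    }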
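
The manual copy loop above can also be expressed with java.nio.file.Files.copy, which streams the data to the target file in one call. The following is a minimal sketch under that assumption; the class and method names (StreamToFileSketch, save) are hypothetical, and replacing an existing file is a choice made here for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: writes an input stream to a file. */
final class StreamToFileSketch {

    private StreamToFileSketch() {
    }

    /**
     * Copies the stream to target, replacing any existing file, and closes
     * the stream. Returns the number of bytes written.
     */
    static long save(InputStream in, Path target) {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, the stream returned by getImageInputStream could be passed to StreamToFileSketch.save together with a Path built from the desired file name and suffix.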
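The above is for reference only. In testing, the code extracted both text and images, which could then be stored in a database and displayed or downloaded from the view layer. It only demonstrates the principle and has not been optimized further; corrections are welcome. Thanks.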
