Apache POI: parsing Microsoft Word documents, extracting both text and images
2011-10-21 16:47
Overview
The following are components of the entire POI project and a brief summary of their purpose.
POIFS for OLE 2 Documents
POIFS is the oldest and most stable part of the project. It is our port of the OLE 2 Compound Document Format to pure Java. It supports both read and write functionality. All of our components ultimately rely on it by definition. Please see the POIFS project page for more information.
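Because every format listed here sits inside an OLE 2 container, a quick way to tell a legacy .doc/.xls/.ppt file from a newer ZIP-based file (.docx and friends) is to look at the first eight bytes. The following is a minimal sketch using only the JDK; the class name Ole2Sniffer and the method looksLikeOle2 are made up for illustration and are not part of POI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Ole2Sniffer {
    // The 8-byte signature that opens an OLE 2 compound file.
    private static final int[] OLE2_SIGNATURE =
            {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    /**
     * Returns true if the stream starts with the OLE 2 signature.
     * The stream must support mark/reset; it is rewound before returning,
     * so it can still be handed to a parser afterwards.
     */
    public static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_SIGNATURE.length);
        try {
            for (int expected : OLE2_SIGNATURE) {
                // read() yields 0..255, or -1 at end of stream
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] ole2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                       (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0, 0};
        byte[] zip = {'P', 'K', 3, 4, 0, 0, 0, 0}; // ZIP-based files start with "PK"
        System.out.println(looksLikeOle2(new ByteArrayInputStream(ole2)));
        System.out.println(looksLikeOle2(new ByteArrayInputStream(zip)));
    }
}
```

Wrapping a FileInputStream in a BufferedInputStream gives it the mark/reset support this check needs.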
HSSF for Excel Documents
HSSF is our port of the Microsoft Excel 97(-2003) file format (BIFF8) to pure Java. It supports read and write capability. (Support for Excel 2007 .xlsx files is in progress). Please see the HSSF project page for more information.
HWPF for Word Documents
HWPF is our port of the Microsoft Word 97 file format to pure Java. It supports reading and has limited write capabilities. Please see the HWPF project page for more information. This component is in the early stages of development. It can already read and write simple files.
Presently we are looking for a contributor to foster the HWPF development. Jump in!
HSLF for PowerPoint Documents
HSLF is our port of the Microsoft PowerPoint 97(-2003) file format to pure Java. It supports read and write capabilities. Please see the HSLF project page for more information.
HDGF for Visio Documents
HDGF is our port of the Microsoft Visio 97(-2003) file format to pure Java. It currently only supports reading at a very low level, and simple text extraction. Please see the HDGF project page for more information.
HPSF for Document Properties
HPSF is our port of the OLE 2 property set format to pure Java. Property sets are mostly used to store a document's properties (title, author, date of last modification etc.), but they can be used for application-specific purposes as well.
HPSF supports reading and writing of properties. However, you will need to be using version 3.0 of POI to utilise the write support.
Please see the HPSF project page for more information.
package org.osforce.document.extractor;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.model.PicturesTable;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.Range;

/**
 * Microsoft Word document extractor: extracts text and pictures.
 *
 * @author huhaozhong
 * @version 1.0 date 2008.7.27
 */
public class MSWordExtractor {

    private HWPFDocument msWord;

    /**
     * @param input
     *            InputStream from the file system that holds the Word document
     * @throws IOException
     *             if the document cannot be read
     */
    public MSWordExtractor(InputStream input) throws IOException {
        msWord = new HWPFDocument(input);
    }

    /**
     * @return the text of every paragraph, one array element per paragraph
     */
    public String[] extractParagraphTexts() {
        Range range = msWord.getRange();
        int numParagraph = range.numParagraphs();
        String[] paragraphs = new String[numParagraph];
        for (int i = 0; i < numParagraph; i++) {
            Paragraph p = range.getParagraph(i);
            // Store each paragraph at its own index
            paragraphs[i] = p.text();
        }
        return paragraphs;
    }

    /**
     * @return all text of the Word document
     */
    public String extractMSWordText() {
        Range range = msWord.getRange();
        return range.text();
    }

    /**
     * @param directory
     *            local directory in which the images are stored
     * @throws IOException
     *             if an image cannot be written
     */
    public void extractImagesIntoDirectory(String directory) throws IOException {
        PicturesTable pTable = msWord.getPicturesTable();
        Range range = msWord.getRange();
        int numCharacterRuns = range.numCharacterRuns();
        for (int i = 0; i < numCharacterRuns; i++) {
            CharacterRun characterRun = range.getCharacterRun(i);
            if (pTable.hasPicture(characterRun)) {
                Picture pic = pTable.extractPicture(characterRun, false);
                String fileName = pic.suggestFullFileName();
                System.out.println("Found picture: " + fileName);
                // Close the stream even if writing the image fails
                try (OutputStream out = new FileOutputStream(new File(directory, fileName))) {
                    pic.writeImageContent(out);
                }
            }
        }
    }
}
The code is fairly simple and carries brief comments of its own, so I won't go through it in detail.
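Two practical details are left to the caller of the class above. As far as I understand HWPF, the paragraph text it returns typically still ends with the paragraph mark (a carriage return, or a bell character for table cells), and extractImagesIntoDirectory assumes the target directory already exists. The sketch below shows one way to handle both with the JDK alone; ExtractorHelpers, trimParagraphMark and resolveImageFile are hypothetical names and not part of POI, and the control-character assumption should be checked against your own documents.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractorHelpers {

    /** Strips trailing control characters such as the '\r' paragraph mark. */
    public static String trimParagraphMark(String paragraphText) {
        int end = paragraphText.length();
        while (end > 0 && Character.isISOControl(paragraphText.charAt(end - 1))) {
            end--;
        }
        return paragraphText.substring(0, end);
    }

    /**
     * Creates the directory if needed and resolves fileName inside it,
     * rejecting names that would land outside the directory.
     */
    public static Path resolveImageFile(String directory, String fileName) throws IOException {
        Path dir = Paths.get(directory).toAbsolutePath().normalize();
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName).normalize();
        if (!target.startsWith(dir)) {
            throw new IOException("file name escapes target directory: " + fileName);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("[" + trimParagraphMark("First paragraph\r") + "]");
        Path dir = Files.createTempDirectory("word-images");
        System.out.println(resolveImageFile(dir.toString(), "0.png"));
    }
}
```

One way to use this would be to map trimParagraphMark over the array returned by extractParagraphTexts, and to call resolveImageFile before opening the output stream for each picture.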