How to split files and do simple content filtering in Java

2017-01-30 19:21

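I. Background

Last year a project required me to wrap an arbitrary file inside an XML file without altering the file's content, and to be able to restore the original file from that XML later. Small files are easy to handle, but with a large XML file (several GB, for example) trying to pull the data straight out of a node almost always ends in an OutOfMemoryError. So I switched to a stream-based approach, and this file-cutting tool is the result.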

II. Use cases

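Possible use cases for this tool:

1. Cutting or trimming any file. Working on the byte stream, you give a start position and an end position, and the tool cuts out the part between them.
2. Adding a given string to the head or tail of any file. This makes it easy to embed a file inside a node.
3. Simple text extraction. You define your own rules to pull out the text you want, and the extracted text can be processed further (this is plain text extraction only, nothing intelligent or elaborate T_T).
4. Text filtering. You define your own rules to filter out specified text.

The whole tool is only a thin wrapper over Java's file I/O API, and it does not use NIO, so think twice before using it where high-throughput file processing is required. This article simply presents my own solution; if you know a better one, suggestions are welcome.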
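
Since the tool deliberately avoids NIO, here is a minimal sketch, for comparison only, of what a byte-range cut might look like with `java.nio.channels.FileChannel`. It is not part of the author's tool: `NioCut` and `cut` are hypothetical names, and `end` is treated as inclusive to match the tool's convention.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not from the article.
class NioCut {
    /** Copies bytes start..end (inclusive) of src into dst and returns the number of bytes copied. */
    static long cut(Path src, Path dst, long start, long end) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long last = Math.min(end, in.size() - 1); // clamp to the last byte of the file
            long position = start;
            while (position <= last) {
                // transferTo may move fewer bytes than requested, so loop until done
                long n = in.transferTo(position, last - position + 1, out);
                if (n <= 0) {
                    break;
                }
                position += n;
            }
            return Math.max(0, position - start);
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("nio-src", ".bin");
        Path dst = Files.createTempFile("nio-dst", ".bin");
        Files.write(src, "0123456789".getBytes(StandardCharsets.US_ASCII));
        long copied = cut(src, dst, 2, 5);
        String content = new String(Files.readAllBytes(dst), StandardCharsets.US_ASCII);
        System.out.println(copied + " bytes: " + content); // prints "4 bytes: 2345"
    }
}
```

Whether this is actually faster than the stream-based tool depends on the platform and the file system, so treat it as something to measure rather than a guaranteed improvement.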

III. How to use

Enough talk, let's see how to use it!

1. Read a specified segment of a file

Read the content between byte 0 and byte 1048.

public void readasbytes(){
    FileExtractor cuter = new FileExtractor();
    byte[] bytes = cuter.from("D:\\11.txt").start(0).end(1048).readAsBytes();
}

2. Cut a file

Cut the part between byte 0 and byte 1048 out into a new file.

public File splitAsFile(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("D:\\11.txt").to("D:\\22.txt").start(0).end(1048).splitAsFile();
}

3. Embed a file in an XML node

Write the whole content of the file as the Body node of an XML file, and return the newly generated XML file object.

public File appendText(){

    FileExtractor cuter = new FileExtractor();
    return cuter.from("D:\\11.txt").to("D:\\44.xml").appendAsFile("<Document><Body>", "</Body></Document>");

}
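
The article does not show how `appendAsFile` works internally (its Javadoc later mentions `RandomAccessFile`). As a rough sketch of the same idea using plain stream copying instead, one can write the prefix, copy the source in fixed-size chunks, then write the suffix, so memory use stays constant however large the file is. `WrapDemo` and `wrap` are hypothetical names, and UTF-8 for the prefix and suffix is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not the author's implementation.
class WrapDemo {
    /** Writes prefix + the bytes of src + suffix to dst, 8 KB at a time. */
    static void wrap(Path src, Path dst, String prefix, String suffix) throws IOException {
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            out.write(prefix.getBytes(StandardCharsets.UTF_8));
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // only the first n bytes of buf are valid
            }
            out.write(suffix.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("wrap-src", ".txt");
        Path dst = Files.createTempFile("wrap-dst", ".xml");
        Files.write(src, "hello".getBytes(StandardCharsets.UTF_8));
        wrap(src, dst, "<Document><Body>", "</Body></Document>");
        System.out.println(new String(Files.readAllBytes(dst), StandardCharsets.UTF_8));
        // prints <Document><Body>hello</Body></Document>
    }
}
```

The content is copied byte for byte and is not XML-escaped, which keeps the embedded file unchanged as the article requires.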

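4. Read and process specified content of a file

Suppose the requirement is: read the first three lines of 11.txt. The character "帅" ("handsome") must not appear in the first and second lines, and the string "我好帅!" ("I'm so handsome!") must be appended to the third line.

public String extractText(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("D:\\11.txt").extractAsString(new EasyProcesser() {
        @Override
        public String finalStep(String line, int lineNumber, Status status) {

            if(lineNumber==3){
                status.shouldContinue = false;// stop reading the file after this line
                return line+"我好帅!";
            }
            return line.replaceAll("帅","");
        }
    });

}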
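
`EasyProcesser` is used here, but its source is not shown in the article. Judging from this usage (only `finalStep` is overridden) and from the `FlowLineProcesser` interface described later, it is presumably an adapter whose `firstStep` passes the line through unchanged. The sketch below is an assumption along those lines, with `Status` and `FlowLineProcesser` cut down to the minimum needed to compile.

```java
class EasyProcesserDemo {
    public static void main(String[] args) {
        FlowLineProcesser p = new EasyProcesser() {
            @Override
            public String finalStep(String line, int lineNumber, Status status) {
                return line.replaceAll("bug", "");
            }
        };
        Status status = new Status();
        String afterFirst = p.firstStep("a bug here", 1, status);
        System.out.println(p.finalStep(afterFirst, 1, status)); // prints "a  here"
    }
}

/** Assumed shape of EasyProcesser: firstStep is a no-op, so callers override finalStep only. */
abstract class EasyProcesser implements FlowLineProcesser {
    @Override
    public String firstStep(String line, int lineNumber, Status status) {
        return line; // pass the raw line straight through to finalStep
    }
}

// Minimal copies of the article's types so that this sketch compiles on its own.
interface FlowLineProcesser {
    String firstStep(String line, int lineNumber, Status status);
    String finalStep(String line, int lineNumber, Status status);
}

class Status {
    boolean overFirstStep = false;
    boolean overFinalStep = false;
    boolean shouldContinue = true;
}
```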

5. Simple text filtering

Remove every "bug" from a file and return the processed content as a new file.

public File killBugs(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("D:\\bugs.txt").to("D:\\nobug.txt").extractAsFile(new EasyProcesser() {
        @Override
        public String finalStep(String line, int lineNumber, Status status) {
            return line.replaceAll("bug", "");
        }
    });
}

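IV. Basic flow

Interface callbacks separate reading a file from processing it. The IteratorFile class is responsible for traversing a file and reading its content, and it can traverse either by bytes or by lines. The rest of the article follows the same split and covers reading and processing separately.

V. Reading files

Defining the callback interface

Define an interface, Process, that exposes two methods for handling file content: one for reading by bytes and one for reading by lines.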

public interface Process{

    /**
     * @param b the data read in this pass
     * @param length the number of valid bytes read in this pass
     * @param currentIndex the position reached so far
     * @param available the total length of the file being read
     * @return true to continue reading the file, false to stop
     * @time 2017-01-22 16:56:41
     */
    public boolean doWhat(byte[] b,int length,int currentIndex,int available);

    /**
     *
     * @param line the line read in this pass
     * @param currentIndex the line number
     * @return true to continue reading the file, false to stop
     * @time 2017-01-22 16:59:03
     */
    public boolean doWhat(String line,int currentIndex);
}

IteratorFile itself implements this interface, but by default both methods simply return true and do no processing, as shown below:

public class IteratorFile implements Process
{
    ......
    /**
     * Traverses the file content by bytes; override this method as needed
     */
    @Override
    public boolean doWhat(byte[] b, int length,int currentIndex,int available) {
        return true;
    }

    /**
     * Traverses the file content by lines; override this method as needed
     */
    @Override
    public boolean doWhat(String line,int currentIndex) {
        return true;
    }
    ......
}
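
The pattern here is a class that owns the read loop and exposes an overridable hook whose boolean result decides whether the loop continues. A stripped-down sketch of that idea follows, with hypothetical names (`LineWalker`, `walk`) and an in-memory `Reader` instead of a file so that it runs anywhere. It illustrates the pattern only and is not the author's IteratorFile.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LineWalkerDemo {
    public static void main(String[] args) throws IOException {
        final List<String> seen = new ArrayList<>();
        // Override the hook in an anonymous subclass, as the article does with IteratorFile
        new LineWalker() {
            @Override
            public boolean doWhat(String line, int currentIndex) {
                seen.add(currentIndex + ":" + line);
                return currentIndex < 2; // stop after the second line
            }
        }.walk(new StringReader("a\nb\nc\n"));
        System.out.println(seen); // prints [1:a, 2:b]
    }
}

/** Owns the loop; subclasses override doWhat to process each line. */
class LineWalker {
    /** Default hook: do nothing and keep going. */
    public boolean doWhat(String line, int currentIndex) {
        return true;
    }

    /** Reads lines until the source ends or doWhat returns false; returns the lines visited. */
    public int walk(Reader source) throws IOException {
        int index = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                index++;
                if (!doWhat(line, index)) {
                    break;
                }
            }
        }
        return index;
    }
}
```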

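Traversing file content by bytes

This method traverses (reads) the file by bytes. skip() controls the position from which reading starts. Then, as the file stream is read, each chunk is passed to the callback method. Note that each chunk is stored in a byte array, bytes, and that the number of bytes actually read must be passed to the callback as well. It is easy to see that as soon as doWhat() returns false, reading stops immediately.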

public void iterator2Bytes(){
    init();
    int length = -1;
    FileInputStream fis = null;
    try {
        file = new File(in);
        fis = new FileInputStream(file);
        available = fis.available();
        // Skip ahead so that reading starts at the start position
        fis.skip(getStart());
        readedIndex = getStart();
        if (!beforeItrator()) return;
        while ((length=fis.read(bytes))!=-1) {
            readedIndex+=length;
            // Hand the chunk to the callback; false stops the traversal
            if(!doWhat(bytes, length,readedIndex,available)){
                break;
            }
        }
        if(!afterItrator()) return;
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }finally{
        try {
            // fis is still null if the file could not be opened
            if (fis != null) {
                fis.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
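
Two details of this method are worth guarding against. `InputStream.skip` is allowed to skip fewer bytes than requested, and `available()` is only an estimate of the bytes readable without blocking, not a guaranteed file length. Below is a small sketch of a more defensive skip; `SkipDemo` and `skipFully` are hypothetical names and not part of the author's code.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not from the article.
class SkipDemo {
    /** Skips exactly n bytes, or throws EOFException if the stream ends first. */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) { // skip made no progress: probe for end of stream
                throw new EOFException("stream ended with " + remaining + " bytes left to skip");
            } else {
                remaining--; // the probe consumed one byte
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {10, 20, 30, 40, 50});
        skipFully(in, 3);
        System.out.println(in.read()); // prints 40
    }
}
```

For the total size, `File.length()` returns the file length as a `long`. That also matters for the multi-gigabyte files mentioned at the start, because `available()` and the `int` position counters cannot represent values above roughly 2 GB.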

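Traversing file content by lines

This is the usual way of reading a file: inside the while loop, the callback method is called and given the relevant data.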

public void iterator2Line(){
    init();
    BufferedReader reader = null;
    FileReader read = null;
    String line = null;
    try {
        file = new File(in);
        read = new FileReader(file);
        reader = new BufferedReader(read);
        if (!beforeItrator()) return;
        while ( null != (line=reader.readLine())) {
            readedIndex++;
            // Hand the line and its number to the callback; false stops the traversal
            if(!doWhat(line,readedIndex)){
                break;
            }
        }
        if(!afterItrator()) return ;
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }finally{
        try {
            // reader is still null if the file could not be opened;
            // closing the BufferedReader also closes the wrapped FileReader
            if (reader != null) {
                reader.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
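
Before Java 18, `FileReader` decodes with the platform default charset, so the same file can read differently on different machines. Here is a sketch of the same loop with an explicit charset and try-with-resources, which closes the reader even when an exception is thrown. `LineReadDemo`, `LineCallback` and `forEachLine` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not from the article.
class LineReadDemo {
    /** Callback with the same contract as doWhat(String, int): return false to stop. */
    interface LineCallback {
        boolean doWhat(String line, int lineNumber);
    }

    /** Reads path line by line in the given charset; returns the number of lines visited. */
    static int forEachLine(Path path, Charset charset, LineCallback callback) throws IOException {
        int lineNumber = 0;
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (!callback.doWhat(line, lineNumber)) {
                    break;
                }
            }
        }
        return lineNumber;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lines", ".txt");
        Files.write(file, "one\ntwo\nthree\n".getBytes(StandardCharsets.UTF_8));
        int visited = forEachLine(file, StandardCharsets.UTF_8, (line, n) -> {
            System.out.println(n + ": " + line);
            return n < 2; // stop after the second line
        });
        System.out.println("visited " + visited); // prints "visited 2"
    }
}
```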

Finally, a method is needed to set the path of the source file to read.

public IteratorFile from(String in){
    this.in = in;
    return this;
}

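VI. Processing file content

Introducing FileExtractor

The FileExtractor class encapsulates the operations that process file content. It relies on IteratorFile, the class that traverses the file.

Basic methods of FileExtractor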

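/**
 * Inserts a string at the head or the tail of a file
 * @tips Do not call this method repeatedly with the same output path, or the text will be corrupted, because RandomAccessFile is used. If you need to, delete the existing file of the same name manually before calling.
 * @param startStr the string to insert at the head of the file
 * @param endStr the string to insert at the tail of the file
 * @return the newly generated file
 * @time 2017-01-22 17:05:35
 */
public File appendAsFile(final String startStr,String endStr){}

/**
 * Cuts the file at the specified positions
 * @tips Suitable for all file types
 * @return
 * @time 2017-01-22 17:06:36
 */
public File splitAsFile(){}

/**
 * Special handling for text files (scenarios: text extraction, text replacement, and so on)
 * @tips Only suitable for text files. With binary files, line terminators may be changed, which can leave the file unusable (for example, no longer executable).
 * @time 2017-01-22 17:09:14
 */
public File extractAsFile(FlowLineProcesser method) {}

/**
 * Special handling for text files (scenarios: text extraction, text replacement, and so on)
 * @tips Only suitable for text files. With binary files, line terminators may be changed, which can leave the file unusable (for example, no longer executable).
 * @time 2017-01-22 17:09:14
 */
public String extractAsString(FlowLineProcesser method) {}

/**
 * Reads the file content at the specified positions into a byte array
 * @return
 * @time 2017-01-23 11:06:18
 */
public byte[] readAsBytes(){}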

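Every method whose return type is File returns, once processing is complete, a new file holding the processed content.

Other methods

Likewise, there are methods for setting the source file location and the cut positions.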

/**
 * Sets the source file
 */
public FileExtractor from(String in){
    this.in = in;
    return this;
}

/**
 * Sets where the generated temporary file is written (required by every method that returns a File)
 */
public FileExtractor to(String out) {
    this.out = out;
    return this;
}

/**
 * The position at which cutting starts (inclusive); required by all byte-based methods
 */
public FileExtractor start(int start){
    this.startPos = start;
    return this;
}

/**
 * The position at which cutting ends (inclusive); required by all byte-based methods
 */
public FileExtractor end(int end) {
    this.endPos = end;
    return this;
}

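Processing file content when reading by bytes

A few key points:

1. Because the file is cut by byte position, it has to be traversed by bytes, so the byte-based doWhat() method must be overridden. An OutputStream is created outside the callback to write the new file.
2. Each chunk read during the traversal is stored in the byte array b, but not all of b is necessarily valid data, so the valid length of b, length, must be passed along.
3. readedIndex records the position reached in the file so far, that is, the start offset plus the number of bytes read up to and including the current read.
4. When traversing by bytes, how do we know that the end position has been reached? When readedIndex > endPos (the end position), the current read has gone past the position where it should stop, so part of the data in b was over-read and must not be saved. The number of over-read bytes is readedIndex-endPos-1, so only the first length-(readedIndex-endPos-1) bytes of b need to be saved.

Reading a specified segment of a file: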

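// Not recommended when a lot of data has to be read: a byte[] has a fixed length, so accumulating
// the result over many reads means copying the byte[] again and again, which is painfully slow.
public byte[] readAsBytes(){

    try {
        checkIn();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }

    // Temporary container for the bytes
    final BytesBuffer buffer = new BytesBuffer();

    IteratorFile c = new IteratorFile(){
        @Override
        public boolean doWhat(byte[] b, int length, int currentIndex,
                int available) {
            // BytesBuffer is not shown in this article; the trailing -1 below suggests that the
            // third argument of addBytes is an inclusive end index rather than a length
            if(readedIndex>endPos){
                // Reading has reached endPos and gone past it
                buffer.addBytes(b, 0, length-(readedIndex-endPos-1)-1);
                return false;
            }else{
                buffer.addBytes(b, 0, length-1);
            }
            return true;
        }
    };
    // Traverse by bytes
    c.from(in).start(startPos).iterator2Bytes();

    return buffer.toBytes();

}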
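
The cut-off arithmetic in the callback is easy to get wrong by one, so here is a small worked check. `bytesToKeep` is a hypothetical helper that restates the formula (it is not in the original code), and the chunk size of 1024 is an assumption, since the article does not show how `bytes` is sized.

```java
// Hypothetical helper restating the article's formula.
class OvershootDemo {
    /** How many of the `length` bytes just read belong to the range that ends at endPos (inclusive). */
    static int bytesToKeep(int length, int readedIndex, int endPos) {
        if (readedIndex > endPos) {
            int overshoot = readedIndex - endPos - 1; // bytes read past endPos
            return length - overshoot;
        }
        return length; // still inside the range: keep the whole chunk
    }

    public static void main(String[] args) {
        // Start at 0, end at 1048 (inclusive), assumed chunk size 1024.
        System.out.println(bytesToKeep(1024, 1024, 1048)); // prints 1024 (bytes 0..1023)
        System.out.println(bytesToKeep(1024, 2048, 1048)); // prints 25 (bytes 1024..1048)
    }
}
```

The two reads together keep 1024 + 25 = 1049 bytes, which is exactly the inclusive range 0..1048.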

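When the file is large, generating a new file is the more reliable approach. So, much like the version that returns a byte[] directly, an OutputStream is set up before the file is read, and during the read loop the content that is read is written to the new file.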

public File splitAsFile(){
    ......
    final OutputStream os = FileUtils.openOut(file);
    try {
        IteratorFile itFile = new IteratorFile(){
            @Override
            public boolean doWhat(byte[] b, int length,int readedIndex,int available) {
                try {
                    if(readedIndex>endPos){
                        // Reading has reached endPos and gone readedIndex-endPos-1 bytes past it
                        os.write(b, 0, length-(readedIndex-endPos-1));
                        return false;// stop reading
                    }else{
                        os.write(b, 0, length);
                    }
                    return true;
                } catch (IOException e) {
                    e.printStackTrace();
                    return false;
                }
            }
        }.from(in).start(startPos);

        itFile.iterator2Bytes();

    } catch (Exception e) {
        e.printStackTrace();
        this.tempFile = null;
    }finally{
        try {
            os.flush();
            os.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return getTempFile();
}
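
A byte-range cut is also what makes the round trip from the background section possible: if the prefix and suffix that were added around a file are known, the original is the byte range between them. The sketch below illustrates this under stated assumptions: `UnwrapDemo` and `unwrap` are hypothetical names, the prefix and suffix are assumed to be UTF-8, and the code does not verify that the file really starts and ends with them. It tracks the number of bytes still wanted instead of computing the overshoot afterwards.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not the author's implementation.
class UnwrapDemo {
    /** Copies the bytes between a known prefix and suffix of wrapped into dst; returns bytes copied. */
    static long unwrap(Path wrapped, Path dst, String prefix, String suffix) throws IOException {
        long start = prefix.getBytes(StandardCharsets.UTF_8).length;
        long end = Files.size(wrapped) - suffix.getBytes(StandardCharsets.UTF_8).length - 1; // inclusive
        long remaining = end - start + 1;
        long copied = 0;
        try (InputStream in = Files.newInputStream(wrapped);
             OutputStream out = Files.newOutputStream(dst)) {
            byte[] buf = new byte[8192];
            long toDiscard = start;
            while (toDiscard > 0) { // read and drop the prefix
                int n = in.read(buf, 0, (int) Math.min(buf.length, toDiscard));
                if (n == -1) {
                    return copied;
                }
                toDiscard -= n;
            }
            while (remaining > 0) { // never ask for more than is still wanted
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n == -1) {
                    break;
                }
                out.write(buf, 0, n);
                remaining -= n;
                copied += n;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path wrapped = Files.createTempFile("wrapped", ".xml");
        Path restored = Files.createTempFile("restored", ".txt");
        Files.write(wrapped, "<Document><Body>hello</Body></Document>".getBytes(StandardCharsets.UTF_8));
        unwrap(wrapped, restored, "<Document><Body>", "</Body></Document>");
        System.out.println(new String(Files.readAllBytes(restored), StandardCharsets.UTF_8)); // prints hello
    }
}
```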

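Processing file content when reading by lines

First, to repeat: traversing a file by lines is only suitable for text files, unless you do not care whether each line ends in \r or \n. Take an exe file: if you traverse it by lines and write it out as a new file, a single wrong line terminator can turn the exe into an "unexe"! Along the way I used:

- a helper class, Status, which helps control the flow of the traversal;
- an interface, FlowLineProcesser, which works like a pipeline for processing text.

Status and FlowLineProcesser complement each other: FlowLineProcesser is the concrete pipeline, and Status controls how the processing proceeds. I went back and forth many times on whether this needed to be so complicated, but I will keep it for now. First, the helper class Status: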

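public class Status{
    /**
     * Whether the beginning has been found. Defaults to false; once true, firstStep() is no longer called for the rest of the traversal
     */
    public boolean overFirstStep = false;

    /**
     * Whether the end has been found. Defaults to false; once true, finalStep() is no longer called for the rest of the traversal
     */
    public boolean overFinalStep = false;

    /**
     * Whether to keep reading the source file. Defaults to true (keep reading); false means the traversal stops after the current operation
     */
    public boolean shouldContinue = true;
}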

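Next is the FlowLineProcesser interface. It works like a pipeline and defines two steps, corresponding to the two methods firstStep() and finalStep(), both of which return a String. The line that firstStep receives is the line actually read from the file; firstStep processes it and hands the processed line to finalStep. So the line that finalStep sees is really the result of firstStep, and what is finally returned to the main processing flow is the return value of finalStep.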

public interface FlowLineProcesser{
    /**
     *
     * @param line the line that was read
     * @param lineNumber the line number, starting from 1
     * @param status the controller
     * @return
     * @time 2017-01-22 17:02:02
     */
    String firstStep(String line,int lineNumber,Status status);

    /**
     * @tips
     * @param line the line that was read (the result of firstStep())
     * @param lineNumber the line number, starting from 1
     * @param status the controller
     * @return
     * @time 2017-01-22 17:02:09
     */
    String finalStep(String line,int lineNumber,Status status);
}

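Now we can look at how text extraction is implemented. Every line that is kept is collected in a StringBuilder. firstStep processes the line first, its return value is passed to finalStep for a second pass, and the result is saved. If the final result is null, nothing is saved for that line.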

public String extractAsString(FlowLineProcesser method) {

    try {
        checkIn();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }

    final StringBuilder builder = new StringBuilder();

    this.mMethod = method;

    new IteratorFile(){
        Status status = new Status();
        @Override
        public boolean doWhat(String line, int currentIndex) {
            // Start from the raw line so that it still reaches finalStep once firstStep is switched off
            String lineAfterProcess = line;

            if(!status.overFirstStep){
                lineAfterProcess = mMethod.firstStep(line, currentIndex,status);
            }

            if(!status.shouldContinue){
                return false;
            }

            if(!status.overFinalStep){
                lineAfterProcess = mMethod.finalStep(lineAfterProcess,currentIndex,status);
            }

            if(lineAfterProcess!=null){
                builder.append(lineAfterProcess);
                builder.append(getLineStr());// the line separator is hard-coded here
            }

            if(!status.shouldContinue){
                return false;
            }
            return true;
        }

    }.from(in).iterator2Line();

    return builder.toString();

}
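
To see the two-step pipeline and the Status flags working together without any files, here is a compact, self-contained restatement of the loop that reads from a `Reader`. `PipelineDemo` and `extract` are hypothetical names, `Status` and `FlowLineProcesser` are minimal copies of the article's types, and the line separator is passed in by the caller rather than taken from `getLineStr()`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PipelineDemo {
    /** Runs every line through firstStep, then finalStep, and joins the non-null results. */
    static String extract(Reader source, FlowLineProcesser p, String lineSeparator) throws IOException {
        StringBuilder out = new StringBuilder();
        Status status = new Status();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (!status.overFirstStep) line = p.firstStep(line, lineNumber, status);
                if (!status.shouldContinue) break;
                if (!status.overFinalStep) line = p.finalStep(line, lineNumber, status);
                if (line != null) out.append(line).append(lineSeparator);
                if (!status.shouldContinue) break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        FlowLineProcesser p = new FlowLineProcesser() {
            @Override
            public String firstStep(String line, int lineNumber, Status status) {
                return line.trim(); // step 1: normalize whitespace
            }
            @Override
            public String finalStep(String line, int lineNumber, Status status) {
                if (line.equals("END")) {
                    status.shouldContinue = false; // stop at the marker and drop it
                    return null;
                }
                return line.isEmpty() ? null : line; // step 2: drop blank lines
            }
        };
        System.out.print(extract(new StringReader("  a \n\n b\nEND\nc\n"), p, "\n"));
        // prints "a" and "b" on separate lines
    }
}

// Minimal copies of the article's types so that this sketch compiles on its own.
interface FlowLineProcesser {
    String firstStep(String line, int lineNumber, Status status);
    String finalStep(String line, int lineNumber, Status status);
}

class Status {
    boolean overFirstStep = false;
    boolean overFinalStep = false;
    boolean shouldContinue = true;
}
```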

When the text to extract is too large, a new file can be generated instead. The flow is essentially the same as when returning a String.

public File extractAsFile(FlowLineProcesser method) {

    try {
        checkIn();
        checkOut();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }

    this.mMethod = method;
    File file = initOutFile();
    if(file==null){
        return null;
    }

    FileWriter fileWriter = null;
    try {
        fileWriter = new FileWriter(file);
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }

    final BufferedWriter writer = new BufferedWriter(fileWriter);

    IteratorFile itfile = new IteratorFile(){
        Status status = new Status();
        @Override
        public boolean doWhat(String line, int currentIndex) {
            // Start from the raw line so that it still reaches finalStep once firstStep is switched off
            String lineAfterProcess = line;

            if(!status.overFirstStep){
                lineAfterProcess = mMethod.firstStep(line, currentIndex,status);
            }

            if(!status.shouldContinue){
                return false;
            }

            if(!status.overFinalStep){
                lineAfterProcess = mMethod.finalStep(lineAfterProcess,currentIndex,status);
            }

            if(lineAfterProcess!=null){
                try {
                    writer.write(lineAfterProcess);
                    writer.newLine();//TODO the line separator is hard-coded here
                } catch (IOException e) {
                    e.printStackTrace();
                    return false;
                }
            }

            if(!status.shouldContinue){
                return false;
            }
            return true;

        }
    };

    itfile.from(in).iterator2Line();

    // Closing the BufferedWriter flushes it and also closes the wrapped FileWriter
    try {
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return getTempFile();

}
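
Both the String and the File versions hard-code the line separator, as their comments admit. One way around that is to let the caller pass the separator, and to use try-with-resources so that the reader and writer are closed even if an exception is thrown. The sketch below uses hypothetical names (`LineFilterDemo`, `filterLines`), and both the UTF-8 encoding and the single-function callback are simplifying assumptions in place of FlowLineProcesser.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.UnaryOperator;

// Hypothetical helper, not the author's implementation.
class LineFilterDemo {
    /** Applies fn to each line of src; writes non-null results to dst, each followed by lineSeparator. */
    static int filterLines(Path src, Path dst, UnaryOperator<String> fn, String lineSeparator)
            throws IOException {
        int written = 0;
        try (BufferedReader reader = Files.newBufferedReader(src, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(dst, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String processed = fn.apply(line);
                if (processed != null) { // null means "drop this line"
                    writer.write(processed);
                    writer.write(lineSeparator);
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("bugs", ".txt");
        Path dst = Files.createTempFile("nobug", ".txt");
        Files.write(src, "fix bug one\nno bugs\n".getBytes(StandardCharsets.UTF_8));
        int lines = filterLines(src, dst, s -> s.replace("bug", ""), "\r\n");
        System.out.println(lines + " lines written"); // prints "2 lines written"
    }
}
```

This still normalizes every line ending to the chosen separator, so the earlier warning stands: it is for text files only.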

Reposted from: http://blog.csdn.net/qq_35101189/article/details/54782303
