
How to remove duplicate lines in a large text file?

2018-09-15 13:22 · 751 views

How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent; suppose the file is 10 times larger than RAM.

A better solution is to use a HashSet to store each line of input.txt as it is read. Because a set ignores duplicate values, hs.add(line) reports whether the line has been seen before: write the line to output.txt only when it was not already present in the set. Note that this stores every distinct line, so it assumes the unique lines (not the whole file) fit in memory; if even the unique lines exceed RAM, an external technique such as hash partitioning or external merge sort is needed.

Java:

// Efficient Java program to remove
// duplicates from input.txt and
// save output to output.txt

import java.io.*;
import java.util.HashSet;

public class FileOperation
{
    public static void main(String[] args) throws IOException
    {
        // PrintWriter object for output.txt
        PrintWriter pw = new PrintWriter("output.txt");

        // BufferedReader object for input.txt
        BufferedReader br = new BufferedReader(new FileReader("input.txt"));

        String line = br.readLine();

        // set that stores the unique lines seen so far
        HashSet<String> hs = new HashSet<String>();

        // loop over each line of input.txt
        while (line != null)
        {
            // add() returns true only if the line
            // was not already present in the set
            if (hs.add(line))
                pw.println(line);

            line = br.readLine();
        }

        pw.flush();

        // close resources
        br.close();
        pw.close();

        System.out.println("File operation performed successfully");
    }
}
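The HashSet program works only while the set of unique lines fits in memory. For a file genuinely larger than RAM, one standard workaround is to first scatter lines into N temporary partition files by hash: all copies of a given line land in the same partition, so each partition can then be deduplicated independently with an in-memory HashSet. The sketch below illustrates this idea; the class and method names (ExternalDedup, dedup) and the partition count are my own assumptions, not from the article.

```java
import java.io.*;
import java.nio.file.*;
import java.util.HashSet;

// Sketch of hash-partitioned external deduplication (assumed design,
// not the article's code). Duplicates always hash to the same
// partition, so each partition is deduplicated independently.
public class ExternalDedup
{
    // Tune so that one partition's unique lines fit in RAM.
    static final int PARTITIONS = 16;

    public static void dedup(String inputPath, String outputPath) throws IOException
    {
        // Phase 1: scatter each line into a partition file chosen by hash.
        Path tmpDir = Files.createTempDirectory("dedup");
        PrintWriter[] parts = new PrintWriter[PARTITIONS];
        for (int i = 0; i < PARTITIONS; i++)
            parts[i] = new PrintWriter(new FileWriter(tmpDir.resolve("part" + i).toFile()));

        try (BufferedReader br = new BufferedReader(new FileReader(inputPath)))
        {
            String line;
            while ((line = br.readLine()) != null)
                parts[Math.floorMod(line.hashCode(), PARTITIONS)].println(line);
        }
        for (PrintWriter pw : parts)
            pw.close();

        // Phase 2: deduplicate each partition with an in-memory set.
        try (PrintWriter out = new PrintWriter(new FileWriter(outputPath)))
        {
            for (int i = 0; i < PARTITIONS; i++)
            {
                HashSet<String> seen = new HashSet<>();
                File part = tmpDir.resolve("part" + i).toFile();
                try (BufferedReader br = new BufferedReader(new FileReader(part)))
                {
                    String line;
                    while ((line = br.readLine()) != null)
                        if (seen.add(line))
                            out.println(line);
                }
                part.delete();
            }
        }
        Files.delete(tmpDir);
    }
}
```

One trade-off to note: because partitions are written out one after another, the output order differs from the input order. If the original line order must be preserved, an external merge sort keyed on (line, first position) is the usual alternative.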
