
Fetching article content from a web page URL in Java and reformatting it into paragraphs (two approaches)

2017-07-07 11:50
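WebCollector exposes its content extraction API as static methods of the ContentExtractor class. It can extract a structured news item, or only the main text of a page (or the Element that contains the main text).
The source code is available at https://github.com/CrawlScript/WebCollector. You can also download webcollector-version-bin.zip from the same repository, unzip it, and add all of the jar files to your project.
Two classes are worth knowing:
ContentExtractor : wraps the content extraction algorithm and its API; every extraction API is a static method of this class.
News : the model that represents a structured news item.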
package spiderWorker.testWebCollector;

// The java.io reader and file imports are only needed by approach 2 (commented out in main).
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import cn.edu.hfut.dmic.contentextractor.ContentExtractor;
import cn.edu.hfut.dmic.contentextractor.News;

public class testdemo1 {

    /**
     * Fetches the HTML source of the page at the given URL.
     *
     * @param url the page URL
     * @return the page source as a String
     * @throws Exception if the request or the read fails
     */
    public static String getURLSource(URL url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5 * 1000);
        InputStream inStream = conn.getInputStream(); // read the HTML as a raw byte stream
        byte[] data = readInputStream(inStream);      // drain the stream into a byte array
        // Decode explicitly instead of relying on the platform default charset.
        // Assumption: the page is UTF-8, the same encoding approach 2 below uses.
        return new String(data, StandardCharsets.UTF_8);
    }

    /**
     * Reads an input stream to the end and returns its bytes.
     *
     * @param instream the stream to read; it is closed before returning
     * @return the bytes that were read
     * @throws Exception if reading fails
     */
    public static byte[] readInputStream(InputStream instream) throws Exception {
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len = 0;
        while ((len = instream.read(buffer)) != -1) {
            outStream.write(buffer, 0, len);
        }
        instream.close();
        return outStream.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Approach 1: fetch the HTML for a given URL and format the article text
        // directly, without saving the page first.
        URL url = new URL("http://www.sohu.com/a/154612018_555775");
        String urlsource = getURLSource(url);
        System.out.println(urlsource);

        // Requires the WebCollector package:
        // <dependency>
        //     <groupId>cn.edu.hfut.dmic.webcollector</groupId>
        //     <artifactId>WebCollector</artifactId>
        //     <version>2.52</version>
        // </dependency>
        News news = ContentExtractor.getNewsByHtml(urlsource);
        String content = " " + news.getContent();
        String time = news.getTime();
        String title = news.getTitle();
        // Start a new tab-indented line at every space so the text prints as paragraphs.
        content = content.replaceAll(" ", "\r\n\t");
        System.out.println(title);
        System.out.println(time);
        System.out.println(content);

        /*
        // Approach 2: save the HTML to a file first, then read the file and format
        // the article text. To try it, comment out approach 1 above and uncomment this.
        File file = new File("C:\\Users\\admin\\Desktop\\test1.txt");
        String encoding = "UTF-8";
        InputStreamReader read = new InputStreamReader(new FileInputStream(file), encoding); // honor the file's encoding
        BufferedReader bufferedReader = new BufferedReader(read);
        StringBuilder sb = new StringBuilder();
        String lineTxt = null;
        while ((lineTxt = bufferedReader.readLine()) != null) {
            sb.append(lineTxt);
        }
        try {
            News news = ContentExtractor.getNewsByHtml(sb.toString());
            String content = " " + news.getContent();
            String time = news.getTime();
            String title = news.getTitle();
            content = content.replaceAll(" ", "\r\n\t");
            System.out.println(title);
            System.out.println(time);
            System.out.println(content);
        } catch (Exception e) {
            e.printStackTrace();
        }
        */
    }
}