Web Crawling: Converting a String to Html
2015-11-27 17:42
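// Rewrites every link matched by the pattern so that its target is an absolute URL.
// Group 1 of the pattern is the tag text up to "href=", group 2 is the link target.
public static String replaceByPattern(String html, String url, Pattern pattern) {
    StringBuilder stringBuilder = new StringBuilder();
    Matcher matcher = pattern.matcher(html);
    int lastEnd = 0;
    boolean modified = false;
    while (matcher.find()) {
        modified = true;
        // Copy the untouched text between the previous match and this one.
        stringBuilder.append(StringUtils.substring(html, lastEnd, matcher.start()));
        stringBuilder.append(matcher.group(1));
        // Always emit the resolved URL in double quotes.
        stringBuilder.append("\"").append(canonicalizeUrl(matcher.group(2), url)).append("\"");
        lastEnd = matcher.end();
    }
    if (!modified) {
        // Nothing matched, so return the input unchanged.
        return html;
    }
    // Copy whatever follows the last match.
    stringBuilder.append(StringUtils.substring(html, lastEnd));
    return stringBuilder.toString();
}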
// Matches <a ... href="..."> or <a ... href='...'>.
private static Pattern patternForHrefWithQuote =
        Pattern.compile("(<a[^<>]*href=)[\"']([^\"'<>]*)[\"']", Pattern.CASE_INSENSITIVE);

// Matches <a ... href=...> where the target is not quoted.
private static Pattern patternForHrefWithoutQuote =
        Pattern.compile("(<a[^<>]*href=)([^\"'<>\\s]+)", Pattern.CASE_INSENSITIVE);

// Runs both patterns in turn: quoted hrefs first, then unquoted ones.
public static String fixAllRelativeHrefs(String html, String url) {
    html = replaceByPattern(html, url, patternForHrefWithQuote);
    html = replaceByPattern(html, url, patternForHrefWithoutQuote);
    return html;
}
// Wraps the page body in an Html object after making all relative links absolute.
public static Html getHtml(String body, String url) {
    return new Html(UrlUtils.fixAllRelativeHrefs(body, url));
}
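
The snippets above depend on `canonicalizeUrl`, `StringUtils`, and `Html`, none of which are shown in this post. To try the link-rewriting step on its own, here is a minimal sketch that uses only the JDK. It is an illustration under assumptions, not the original code: the class name `RelativeHrefDemo` and the helper `resolve` are hypothetical, and `resolve` stands in for `canonicalizeUrl` by calling `java.net.URI.resolve`, which may not behave identically. Only the quoted-href pattern is covered.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelativeHrefDemo {

    // Same regex as patternForHrefWithQuote in the post: group 1 is the tag
    // text up to "href=", group 2 is the quoted link target.
    private static final Pattern QUOTED_HREF = Pattern.compile(
            "(<a[^<>]*href=)[\"']([^\"'<>]*)[\"']", Pattern.CASE_INSENSITIVE);

    // Hypothetical stand-in for canonicalizeUrl (assumption: the real helper
    // resolves a link against the page URL in a broadly similar way).
    public static String resolve(String href, String baseUrl) {
        try {
            return URI.create(baseUrl).resolve(href).toString();
        } catch (IllegalArgumentException e) {
            // Leave targets that are not valid URIs untouched.
            return href;
        }
    }

    // Rewrites every quoted href in the HTML to an absolute URL.
    public static String fixQuotedHrefs(String html, String baseUrl) {
        Matcher matcher = QUOTED_HREF.matcher(html);
        StringBuilder out = new StringBuilder();
        int lastEnd = 0;
        while (matcher.find()) {
            // Text between the previous match and this one is copied as-is.
            out.append(html, lastEnd, matcher.start());
            out.append(matcher.group(1));
            out.append('"').append(resolve(matcher.group(2), baseUrl)).append('"');
            lastEnd = matcher.end();
        }
        // Tail after the last match (the whole input if nothing matched).
        out.append(html, lastEnd, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String body = "<p><a class='x' href='/docs/a.html'>A</a> <a href=\"b.html\">B</a></p>";
        System.out.println(fixQuotedHrefs(body, "http://example.com/news/index.html"));
    }
}
```

With the page URL `http://example.com/news/index.html`, the first link resolves to `http://example.com/docs/a.html` and the second to `http://example.com/news/b.html`. Links that are already absolute pass through unchanged.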