
Jsoup: An HTML Page Parsing Component

2016-11-08 10:20
Original article: http://blog.sina.com.cn/s/blog_7227719a0100lpix.html

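Parsing HTML page content on the Java side

Jsoup parses HTML into a DOM, so working with a page feels much like manipulating it directly with JS in the browser.

Usage:

Document doc = Jsoup.parse(new URL("http://www.baidu.com"), 5000);

This fetches the HTML page from a URL (with a 5000 ms timeout) and parses it straight into a DOM object. You can also pass in an HTML String you already have, or even a File object or an input stream.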
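
As a rough sketch of the other input sources mentioned above, the following parses the same markup once from a String and once from an input stream. The class name ParseSources, the sample HTML, and the base URI http://example.com/ are assumptions made up for this illustration; a File can be parsed in much the same way with Jsoup.parse(file, "UTF-8").

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseSources {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>";

    // Parse an HTML string that is already in memory.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parse from an input stream; charset name and base URI are given explicitly.
    // The base URI here is an assumed placeholder.
    static Document fromStream(InputStream in) throws IOException {
        return Jsoup.parse(in, "UTF-8", "http://example.com/");
    }

    public static void main(String[] args) throws IOException {
        Document a = fromString(HTML);
        InputStream in = new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8));
        Document b = fromStream(in);

        System.out.println(a.title());            // title from the String source
        System.out.println(b.select("p").text()); // paragraph text from the stream source
    }
}
```

Both calls produce the same kind of Document, so the rest of the code does not need to care where the HTML came from.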

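A single element is wrapped in an Element object.

A collection of elements is wrapped in an Elements object (an ordered collection; in current jsoup releases it extends ArrayList<Element>).

Elements elems = doc.getElementsByTag("a");

// jsoup has no getElementsByName; match on the name attribute instead
Elements elems = doc.getElementsByAttributeValue("name", "name");

...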
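
To make the DOM-style lookups above a little more concrete, here is a minimal sketch that runs several getElement(s)By... calls against a small in-memory page. The class name DomLookup and the sample markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomLookup {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<div id='logo' class='masthead'>"
        + "<a href='http://example.com/'>Home</a>"
        + "<input name='q' value='jsoup'>"
        + "</div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();

        Elements links = doc.getElementsByTag("a");            // by tag name
        Element logo = doc.getElementById("logo");             // by id (single element)
        Elements byClass = doc.getElementsByClass("masthead"); // by class name
        Elements named = doc.getElementsByAttributeValue("name", "q"); // by name attribute

        // Elements is iterable, so a for-each loop works directly.
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
        System.out.println(logo.tagName());
        System.out.println(byClass.size());
        System.out.println(named.first().attr("value"));
    }
}
```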

Most convenient of all is the select method, which plays a role similar to XPath but takes CSS-style selectors.

Elements elems = doc.select("a[href^=http]"); // href starts with http
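
The select call above can be tried without any network access. This is a small sketch, assuming a made-up class name SelectLinks and sample links, that collects the href of every anchor whose href starts with http.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectLinks {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<a href='http://example.com/a'>absolute</a>"
        + "<a href='/local'>relative</a>"
        + "<a href='https://example.com/b'>secure</a>";

    // Return the href of every <a> whose href attribute starts with "http".
    static List<String> httpHrefs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element a : doc.select("a[href^=http]")) {
            hrefs.add(a.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // The relative link "/local" is filtered out by the selector.
        System.out.println(httpHrefs(HTML));
    }
}
```

Note that "https://..." also starts with "http", so it is matched as well.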

More rules:

Selector overview

tagname: find elements by tag, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: elements with attribute, e.g. [href]
[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: elements with attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
[attr~=regex]: elements with attribute values that match the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: all elements, e.g. *
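
The basic selectors listed above can be tried out with a sketch like the following, which counts the matches of a few of them against a tiny page. The class name BasicSelectors, the helper count, and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;

public class BasicSelectors {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<div id='logo' class='masthead' data-role='banner'>"
        + "<img src='a.PNG' width='500'>"
        + "<img src='b.gif'>"
        + "<a href='/path/page'>link</a>"
        + "</div>";

    // Number of elements in the sample page matched by the given selector.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "#logo",                          // by id
            ".masthead",                      // by class
            "[^data-]",                       // attribute name prefix
            "[width=500]",                    // exact attribute value
            "[href*=/path/]",                 // attribute value contains
            "img[src~=(?i)\\.(png|jpe?g)]"    // attribute value matches regex
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

In Java source the backslash in the regex has to be doubled, which is why the last query is written with \\. inside the string literal.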


Selector combinations

el#id: elements with ID, e.g. div#logo
el.class: elements with class, e.g. div.masthead
el[attr]: elements with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
el, el, el: group multiple selectors, find unique elements that match any of the selectors, e.g. div.masthead, div.logo
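
The difference between the descendant, child, and sibling combinators is easiest to see side by side. The sketch below, with an assumed class name Combinators and made-up markup, returns the combined text of the matched elements for each query.

```java
import org.jsoup.Jsoup;

public class Combinators {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<div class='content'><p>one</p><section><p>two</p></section></div>"
        + "<h1>Title</h1><p>three</p><p>four</p>";

    // Text of all matched elements, joined by single spaces.
    static String texts(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));   // any depth below div.content
        System.out.println(texts("div.content > p")); // direct children only
        System.out.println(texts("h1 + p"));          // the p right after h1
        System.out.println(texts("h1 ~ p"));          // every later p sibling of h1
        System.out.println(texts("h1, section"));     // union of two selectors
    }
}
```

Here "div.content p" picks up the nested paragraph inside the section, while "div.content > p" does not.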


Pseudo selectors

:lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n, e.g. td:lt(3)
:gt(n): find elements whose sibling index is greater than n, e.g. div p:gt(2)
:eq(n): find elements whose sibling index is equal to n, e.g. form input:eq(1)
:has(selector): find elements that contain elements matching the selector, e.g. div:has(p)
:not(selector): find elements that do not match the selector, e.g. div:not(.logo)
:contains(text): find elements that contain the given text. The search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the specified regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): find elements whose own text matches the specified regular expression

Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.

See the Selector API reference for the full supported list and details.
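
As a quick check of the 0-based indexing and the text-matching pseudo selectors described above, here is a minimal sketch. The class name PseudoSelectors, its two helpers, and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class PseudoSelectors {
    // Hypothetical sample markup, used only for this sketch.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
        + "<div class='logo'><p>Logo</p></div>"
        + "<div><p>Please LOGIN here</p></div>"
        + "<div><span>no paragraph</span></div>";

    static Elements select(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery);
    }

    // Text of all matched elements, joined by single spaces.
    static String texts(String cssQuery) {
        return select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(2)")); // cells at sibling index 0 and 1
        System.out.println(texts("td:eq(0)")); // first cell, because indexes are 0-based
        System.out.println(texts("td:gt(2)")); // cells after index 2

        System.out.println(select("div:has(p)").size());             // divs containing a p
        System.out.println(select("div:not(.logo)").size());         // divs without class logo
        System.out.println(select("p:contains(login)").size());      // case-insensitive text search
        System.out.println(select("div:matches((?i)login)").size()); // regex on element text
    }
}
```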

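Advantages:

1. Very simple to use: working with the DOM is much like doing so in JS, so it feels intuitive and familiar.

2. The selectors are powerful and make it easy to find elements.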