
HTML Page Parsing Component: Using Jsoup

2016-11-08 10:20

Original article: http://blog.sina.com.cn/s/blog_7227719a0100lpix.html

Parsing HTML page content on the Java side

Jsoup turns HTML parsing into DOM-style manipulation, much like working on the page directly with JavaScript.

Usage:

Document doc = Jsoup.parse(new URL("http://www.baidu.com"), 5000); // 5000 ms timeout

This fetches the HTML page from a URL and parses it directly into a DOM object. You can also pass in an HTML string you already have, or even a File object or an input stream.
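As a rough sketch of the non-URL inputs mentioned above, the following parses an in-memory string and a local file. The class name ParseSources, the sample markup, and the "UTF-8" charset are assumptions made for illustration; Jsoup.parse(String) and Jsoup.parse(File, String) are the library calls being shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;

public class ParseSources {
    // Parse HTML that is already in memory; no network access is needed.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parse a local file; "UTF-8" is an assumption about the file's encoding.
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) {
        // Hypothetical sample markup; jsoup adds the missing html/head/body structure.
        Document doc = fromString("<title>Demo</title><p>Hello</p>");
        System.out.println(doc.title());        // Demo
        System.out.println(doc.body().text());  // Hello
    }
}
```

Parsing from a string is also the easiest way to experiment with selectors, since it avoids timeouts and network errors entirely.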

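A single element is wrapped in an Element object.

A collection of elements is wrapped in an Elements object (an ordered list of Element; in current jsoup releases it extends ArrayList<Element>).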

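Elements elems = doc.getElementsByTag("a");

or

// jsoup has no getElementsByName; matching on the name attribute is the closest equivalent
Elements elems = doc.getElementsByAttributeValue("name", "name");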
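To make the Element / Elements relationship concrete, here is a minimal sketch that walks an Elements collection and reads an attribute from each Element. The class name LinkLister and the sample HTML are hypothetical; only the jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class LinkLister {
    // Collect the href attribute of every <a> element in the given HTML.
    static List<String> hrefs(String html) {
        Document doc = Jsoup.parse(html);
        Elements links = doc.getElementsByTag("a"); // tag names match case-insensitively in HTML
        List<String> result = new ArrayList<>();
        for (Element link : links) {                // Elements is iterable, in document order
            result.add(link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup
        System.out.println(hrefs("<a href='/a'>A</a><p>x</p><a href='/b'>B</a>"));
    }
}
```

Returning a plain List<String> keeps the rest of the program independent of jsoup types, which is a design choice of this sketch rather than anything the library requires.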

...

Most convenient of all is the select method, which plays a role similar to XPath but uses CSS-style selector syntax:

Elements elems = doc.select("a[href^=http]"); // href starts with "http"
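The prefix selector above can be tried without a network connection. This is a small sketch under assumed sample markup; the class name ExternalLinks and the helper method are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class ExternalLinks {
    // Return the text of every <a> whose href value starts with "http".
    static List<String> externalLinkTexts(String html) {
        Elements links = Jsoup.parse(html).select("a[href^=http]");
        List<String> texts = new ArrayList<>();
        for (Element link : links) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: two absolute links and one relative link
        String html = "<a href='http://example.com/'>ext</a>"
                + "<a href='/local'>local</a>"
                + "<a href='https://example.org/'>secure</a>";
        System.out.println(externalLinkTexts(html)); // [ext, secure]
    }
}
```

Note that "https://..." also starts with "http", so the prefix selector picks up both schemes while skipping relative links.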

More rules:

Selector overview

tagname: find elements by tag, e.g. a

ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements

#id: find elements by ID, e.g. #logo

.class: find elements by class name, e.g. .masthead

[attribute]: elements with attribute, e.g. [href]

[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes

[attr=value]: elements with attribute value, e.g. [width=500]

[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]

[attr~=regex]: elements with attribute values that match the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]

*: all elements, e.g. *
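The basic and attribute selectors listed above can be checked against a small document. The following is only a sketch: the class name AttributeSelectors, the count helper, and the sample markup are all made up for illustration.

```java
import org.jsoup.Jsoup;

public class AttributeSelectors {
    // Count how many elements in the HTML match the given selector.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    // Hypothetical sample markup used by main and easy to reason about by hand.
    static final String HTML =
            "<div id='logo' class='masthead' data-role='banner'>"
            + "<img src='a.PNG' width='500'><img src='b.gif'>"
            + "<a href='/path/x'>x</a></div>";

    public static void main(String[] args) {
        String[] queries = {
            "#logo",                           // by ID
            ".masthead",                       // by class name
            "[^data-]",                        // attribute name prefix
            "[width=500]",                     // exact attribute value
            "[href*=/path/]",                  // attribute value contains
            "[src$=.gif]",                     // attribute value ends with
            "img[src~=(?i)\\.(png|jpe?g)]"     // regex; (?i) makes it case-insensitive
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(HTML, q));
        }
    }
}
```

With this markup each query matches exactly one element; the regex query matches a.PNG only because of the (?i) flag.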

Selector combinations

el#id: elements with ID, e.g. div#logo

el.class: elements with class, e.g. div.masthead

el[attr]: elements with attribute, e.g. a[href]

Any combination, e.g. a[href].highlight

ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"

parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag

siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div

siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p

el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
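The difference between the descendant, child, and sibling combinators is easiest to see side by side. Below is a minimal sketch; the class name Combinators, the texts helper, and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class Combinators {
    // Return the text of each element matched by the selector, in document order.
    static List<String> texts(String html, String cssQuery) {
        List<String> out = new ArrayList<>();
        for (Element e : Jsoup.parse(html).select(cssQuery)) {
            out.add(e.text());
        }
        return out;
    }

    // Hypothetical sample markup
    static final String HTML =
            "<div class='content'><p>one</p><section><p>two</p></section></div>"
            + "<div class='head'>h</div><div>after head</div>"
            + "<h1>t</h1><p>three</p><span>s</span><p>four</p>";

    public static void main(String[] args) {
        System.out.println(texts(HTML, "div.content p"));   // [one, two]      any depth
        System.out.println(texts(HTML, "div.content > p")); // [one]           direct children only
        System.out.println(texts(HTML, "div.head + div"));  // [after head]    immediately following sibling
        System.out.println(texts(HTML, "h1 ~ p"));          // [three, four]   any following sibling
        System.out.println(texts(HTML, "h1, span"));        // [t, s]          either selector
    }
}
```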

Pseudo selectors

:lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)

:gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)

:eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)

:has(selector): find elements that contain elements matching the selector; e.g. div:has(p)

:not(selector): find elements that do not match the selector; e.g. div:not(.logo)

:contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)

:containsOwn(text): find elements that directly contain the given text

:matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)

:matchesOwn(regex): find elements whose own text matches the specified regular expression

Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.

See the Selector API reference for the full supported list and details.
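The pseudo selectors, in particular the 0-based indexes and the difference between :contains and :containsOwn, can be explored with a sketch like the one below. The class name PseudoSelectors, the texts helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class PseudoSelectors {
    // Return the text of each element matched by the selector, in document order.
    static List<String> texts(String html, String cssQuery) {
        List<String> out = new ArrayList<>();
        for (Element e : Jsoup.parse(html).select(cssQuery)) {
            out.add(e.text());
        }
        return out;
    }

    // Hypothetical sample markup
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
            + "<div><p>jsoup rocks</p></div>"
            + "<div class='logo'>Login</div>"
            + "<div>plain</div>";

    public static void main(String[] args) {
        System.out.println(texts(HTML, "td:lt(2)"));               // [a, b]   indexes 0 and 1
        System.out.println(texts(HTML, "td:eq(1)"));               // [b]
        System.out.println(texts(HTML, "td:gt(2)"));               // [d]
        System.out.println(texts(HTML, "div:has(p)"));             // [jsoup rocks]
        System.out.println(texts(HTML, "div:not(.logo)"));         // [jsoup rocks, plain]
        System.out.println(texts(HTML, "p:contains(JSOUP)"));      // [jsoup rocks]   case-insensitive
        System.out.println(texts(HTML, "div:containsOwn(jsoup)")); // []   the text belongs to the <p>, not the <div>
        System.out.println(texts(HTML, "div:matches((?i)login)")); // [Login]
    }
}
```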

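Advantages:

1. Very simple to use: it works like manipulating the DOM in JavaScript, so it feels intuitive and familiar.

2. The selectors are powerful and make it easy to find elements.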
