
XPath usage, with examples

2013-03-10 00:11
This is a translated article. The translation is not very good, so please bear with it.

Original article: http://manual.calibre-ebook.com/xpath.html

This tutorial is a gentle introduction to XPath, a query language that can be used to select arbitrary parts of HTML documents in calibre. XPath is a widely used standard, and a web search will turn up plenty of material on it. This tutorial, however, focuses on using XPath for ebook-related tasks, such as finding chapter headings in an unstructured HTML document.

The simplest form of selection is to select tags by name. For example, suppose you want to select all the <h2> tags in a document. The XPath query for this is simply:

//h:h2        (Selects all <h2> tags)
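As a rough illustration outside calibre, the same query can be tried with Python's standard xml.etree.ElementTree module. This is only a sketch: the xmlns attribute on the sample markup and the NS prefix mapping are assumptions made here so that the h: prefix has something to resolve to, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced version of the sample ebook (the xmlns is an assumption).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1 class="bookTitle">A very short ebook</h1>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
</body>
</html>"""

# Map the h: prefix used in the queries to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)
# ElementTree spells "at any level below this element" as .// rather than //
headings = root.findall(".//h:h2", NS)
print([h.text for h in headings])  # ['Chapter One', 'Chapter Two']
```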


The prefix // means search at any level of the document. Now suppose you want to search for <span> tags that are directly inside <a> tags. That can be achieved with:

//h:a/h:span    (Selects <span> tags that are children of <a> tags)
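A minimal sketch of the parent/child step, again using Python's standard ElementTree module rather than calibre itself. The markup below is invented for illustration and is not part of the sample ebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one <span> inside a link, one outside.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p><a href="#c1"><span>linked</span></a> <span>plain</span></p>
</body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(DOC)
# Find every <a> at any depth, then keep only its direct <span> children.
spans = root.findall(".//h:a/h:span", NS)
print([s.text for s in spans])  # ['linked']
```

The <span> that sits directly in the <p> is not selected, because the single / step only follows parent-to-child links.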


If you want to search for tags at a particular level in the document, change the prefix:

/h:body/h:div/h:p (Selects <p> tags that are children of <div> tags that are children of the <body> tag)

This will match only <p>A very short ebook to demonstrate the use of XPath.</p> in the Sample ebook but not any of the other <p> tags.
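The difference between a fixed-level path and an any-level search can be checked with a short sketch in Python's standard ElementTree module. Two assumptions apply: the xmlns attribute is added here for illustration, and because ElementTree evaluates paths relative to the root <html> element, the leading / is dropped.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical namespaced copy of the sample ebook body.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<p style="text-align:right">Written by Kovid Goyal</p>
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
<p>This is a truly fascinating chapter.</p>
</body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(DOC)
# Fixed level: <p> children of <div> children of <body>.
at_level = root.findall("h:body/h:div/h:p", NS)
# Any level: every <p> in the document.
anywhere = root.findall(".//h:p", NS)
print(len(at_level), len(anywhere))  # 1 3
```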
The h: prefix in the above examples is needed to match XHTML tags. This is because internally, calibre represents all content as XHTML. In XHTML, tags have a namespace, and h: is the namespace prefix for HTML tags.

Now suppose you want to select both <h1> and <h2> tags. To do that, we need an XPath construct called a predicate. A predicate is simply a test that is used to select tags. Tests can be arbitrarily powerful, and as this tutorial progresses you will see more powerful examples. A predicate is created by enclosing the test expression in square brackets:

//*[name()='h1' or name()='h2']
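Python's standard ElementTree module has no name() function and no or operator in its XPath subset, so the sketch below does not run the query above. It reproduces the same selection in plain Python by comparing tag names with the namespace stripped. The local_name helper and the namespaced markup are hypothetical, introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced excerpt of the sample ebook.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p>Written by Kovid Goyal</p>
<h2 class="chapter">Chapter One</h2>
<h2 class="chapter">Chapter Two</h2>
</body>
</html>"""


def local_name(element):
    """Return the tag name without its {namespace} part."""
    return element.tag.rpartition("}")[2]


root = ET.fromstring(DOC)
# Visit every element (the role of the * wildcard) and apply the test.
matches = [el for el in root.iter() if local_name(el) in ("h1", "h2")]
print([(local_name(el), el.text) for el in matches])
```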


There are several new features in this XPath expression. The first is the use of the wildcard *. It means match any tag. Now look at the test expression name()='h1' or name()='h2'. name() is an example of a built-in function. It simply evaluates to the name of the tag. So by using it, we can select tags whose names are either h1 or h2. Note that the name() function ignores namespaces, so there is no need for the h: prefix. XPath has several useful built-in functions. A few more will be introduced in this tutorial.


Selecting by attributes

To select tags based on their attributes, the use of predicates is required:
//*[@style]              (Select all tags that have a style attribute)
//*[@class="chapter"]    (Select all tags that have class="chapter")
//h:h1[@class="bookTitle"] (Select all h1 tags that have class="bookTitle")
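These three attribute predicates fall within the XPath subset of Python's standard ElementTree module, so they can be tried almost verbatim. As before, this is a sketch run outside calibre, and the xmlns attribute on the markup is an assumption added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced excerpt of the sample ebook.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p style="text-align:right">Written by Kovid Goyal</p>
<h2 class="chapter">Chapter One</h2>
<h2 class="chapter">Chapter Two</h2>
</body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(DOC)
styled = root.findall(".//*[@style]")               # any tag with a style attribute
chapters = root.findall(".//*[@class='chapter']")   # any tag with class="chapter"
titles = root.findall(".//h:h1[@class='bookTitle']", NS)  # only <h1 class="bookTitle">

print([el.text for el in styled])    # ['Written by Kovid Goyal']
print([el.text for el in chapters])  # ['Chapter One', 'Chapter Two']
print([el.text for el in titles])    # ['A very short ebook']
```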


Here, the @ operator refers to the attributes of the tag. You can use some of the XPath built-in functions to perform more sophisticated matching on attribute values.


Selecting by tag content

Using XPath, you can even select tags based on the text they contain. The best way to do this is to use the power of regular expressions via the built-in function re:test():

//h:h2[re:test(., 'chapter|section', 'i')] (Selects <h2> tags that contain the words chapter or section)
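Python's standard ElementTree module has no re:test(), so the sketch below approximates the query instead of running it: it selects the <h2> tags with an XPath step and then applies the regular expression with Python's re module. The headings in the markup are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical headings, not taken from the sample ebook.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>Chapter One</h2>
<h2>Epilogue</h2>
<h2>SECTION <em>Two</em></h2>
</body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}

# re.IGNORECASE plays the role of the 'i' flag.
pattern = re.compile(r"chapter|section", re.IGNORECASE)

root = ET.fromstring(DOC)
matches = [
    h2
    for h2 in root.findall(".//h:h2", NS)
    # Joining itertext() yields all text inside the tag, including
    # text in nested tags, which is what the . operator refers to.
    if pattern.search("".join(h2.itertext()))
]
print(["".join(h2.itertext()) for h2 in matches])  # ['Chapter One', 'SECTION Two']
```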


Here the . operator refers to the contents of the tag, just as the @ operator referred to its attributes.


Sample ebook

<html>
<head>
<title>A very short ebook</title>
<meta name="charset" value="utf-8" />
</head>
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p style="text-align:right">Written by Kovid Goyal</p>
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>

<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>

<h2 class="chapter">Chapter Two</h2>
<p>A worthy continuation of a fine tradition.</p>
</body>
</html>



XPath built-in functions

name()
The name of the current tag.

contains()
contains(s1, s2) returns true if s1 contains s2.

re:test()
re:test(src, pattern, flags) returns true if the string src matches the regular expression pattern. A particularly useful flag is i, which makes matching case-insensitive. A good primer on the syntax for regular expressions can be found at regexp syntax.