
Parsing XML with Python (study notes compiled from online sources)

2013-12-12 14:17

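I. First, a look at XML
If you already know XML, you can skip this part.
XML is a general-purpose way of describing hierarchical, structured data. An XML document contains one or more elements, each delimited by a start tag and an end tag. The following is a complete (if rather empty) XML document: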

<foo>   ①
</foo>  ②

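①	This is the start tag of the foo element.
②	This is the matching end tag of the foo element. Just as parentheses must balance in writing, mathematics, or code, every start tag must be closed (matched) by a corresponding end tag.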

Elements can be nested to any depth. An element bar inside an element foo is called a subelement, or child, of foo.

<foo>
<bar></bar>
</foo>

The first element in an XML document is called the root element, and a document can have only one. The following is not an XML document, because it has two "root elements".

<foo></foo>
<bar></bar>

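Elements can have attributes, which are name-value pairs. Attributes are listed inside the start tag of an element, separated by whitespace. Attribute names cannot be repeated within one element. Attribute values must be quoted; either single or double quotes may be used.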

<foo lang='en'>                          ①
<bar id='papayawhip' lang="fr"></bar>  ②
</foo>

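①	The foo element has one attribute, named lang. The value of its lang attribute is en.
②	The bar element has two attributes, named id and lang. The value of its lang attribute is fr. This does not conflict with the lang attribute of foo; each element has its own set of attributes.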

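If an element has more than one attribute, the order in which they are written does not matter. An element's attributes form an unordered set of key-value pairs, like a Python dictionary (not a list). There is no limit to the number of attributes an element can have.
Elements can also have text content: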

<foo lang='en'>
<bar lang='fr'>PapayaWhip</bar>
</foo>

An element that has neither text content nor children is called an empty element.

<foo></foo>

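There is a shorthand for writing empty elements. By putting a / character at the end of the start tag, you can omit the end tag altogether. The XML document in the previous example can be written like this: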

<foo/>

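Just as Python functions can be declared in different modules, XML elements can be declared in different namespaces. Namespaces usually look like URLs. You use an xmlns declaration to define a default namespace. A namespace declaration looks similar to an attribute, but it serves a different purpose.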

<feed xmlns='http://www.w3.org/2005/Atom'>  ①
<title>dive into mark</title>             ②
</feed>

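①	The feed element is in the http://www.w3.org/2005/Atom namespace.
②	So is the title element. A namespace declaration affects the element on which it is declared and all of that element's children.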

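You can also use an xmlns:prefix declaration to define a namespace and associate it with the name prefix. Each element in that namespace must then be explicitly declared with the prefix.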

<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'>  ①
<atom:title>dive into mark</atom:title>             ②
</atom:feed>

①	The feed element is in the http://www.w3.org/2005/Atom namespace.
②	The title element is in that namespace too.

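As far as an XML parser is concerned, the two XML documents above are identical. Namespace + element name = XML identity. Prefixes exist only to refer to namespaces, so the actual prefix name (atom:) is irrelevant to the parser. If the namespaces match, the element names match, the attributes (or lack of attributes) match, and each element's text content matches, then the XML documents are the same.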
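
A minimal sketch of this equivalence, assuming Python's standard xml.etree.ElementTree module (introduced later in this article). The two strings below are the default-namespace and prefixed documents from above, squeezed onto one line each; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The same document, once with a default namespace and once with a prefix.
default_ns = ("<feed xmlns='http://www.w3.org/2005/Atom'>"
              "<title>dive into mark</title></feed>")
prefixed = ("<atom:feed xmlns:atom='http://www.w3.org/2005/Atom'>"
            "<atom:title>dive into mark</atom:title></atom:feed>")

root_a = ET.fromstring(default_ns)
root_b = ET.fromstring(prefixed)

# After parsing, the prefix is gone: both roots carry the same
# {namespace}localname tag, and so do their children.
print(root_a.tag)
print(root_a.tag == root_b.tag)
print([child.tag for child in root_a] == [child.tag for child in root_b])
```

The parser keeps only the namespace URI and the local name, which is why the prefix chosen by the document author does not matter.
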
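Finally, character encoding information can appear on the first line of an XML document, before the root element. (There is a catch-22 here: intuitively, the parser needs the encoding information to read the document, yet that information is stored inside the document. If you are curious how XML resolves this, see section F of the XML specification.)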

<?xml version='1.0' encoding='utf-8'?>
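
As a rough illustration of the encoding declaration, the sketch below asks ElementTree (Python's standard library, covered later in this article) to write a declaration and then parses the bytes back. The element name foo and the sample text are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

# A tiny document whose text contains a non-ASCII character.
root = ET.Element('foo')
root.text = 'caf\u00e9'

# Serialize to bytes, asking for an XML declaration that names the encoding.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding='utf-8', xml_declaration=True)
data = buffer.getvalue()

print(data.splitlines()[0])       # the declaration line
print(ET.fromstring(data).text)   # the parser decodes the bytes using it
```
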

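1 The XML declaration

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
This is the XML declaration. It has the same form as a processing instruction: it begins with <? and ends with ?>, and the first word after <? is the name, here xml.
version, encoding, and standalone are three attributes. Each is a name-value pair joined by an equals sign: the name is on the left and the value, enclosed in quotes, is on the right. When present, they must appear in this order.

Notes:
version: states that the document conforms to version 1.0 of the XML specification.
standalone: states whether the document relies on external markup declarations (such as an external DTD). The value yes means everything it needs is in this one file.
encoding: the character encoding of the document.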

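2 The root element
The tree structure of an XML document requires exactly one root element. The start tag of the root element comes before the start tags of all other elements, and its end tag comes after the end tags of all other elements, for example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<Settings>

<Person>Zhang San</Person>

</Settings>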

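3 XML elements
The basic structure of an element is a start tag, its content, and an end tag, for example:

<Person>

<Name>Zhang San</Name>

</Person>

Note that:
Element tags are case-sensitive: <Name> and <name> are two different tags.
An end tag must contain a forward slash, as in </Name>.
The rules for naming XML elements are:
A name can contain letters, digits, and certain other characters (such as hyphens, underscores, and periods).
A name must start with a letter or an underscore; it cannot start with a digit, a hyphen, or a period.
Names beginning with xml (in any letter case) are reserved and should not be used.
A name cannot contain spaces. A colon is reserved for separating a namespace prefix from the local name and should not be used otherwise.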

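4 Comments in XML
A comment in XML looks like this:

<!-- This is a comment -->

Note that:
A comment must not contain "--" in its body and must not end with "-".
A comment must not be placed inside a tag.
Comments cannot be nested.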

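5 PI (Processing Instruction)
PI stands for Processing Instruction. A PI begins with "<?" and ends with "?>" and passes information to the application that processes the document.

<?xml-stylesheet href="core.css" type="text/css"?>
This example says that the XML document is displayed using the stylesheet core.css.
Reference: http://www.cnblogs.com/jb8164/articles/736515.html (a brief introduction to XML)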

Now we know enough XML to start exploring!
II. Working with XML in Python
For example, consider the following Atom feed:
<?xml version='1.0' encoding='utf-8'?>

<feed
xmlns='http://www.w3.org/2005/Atom'
xml:lang='en'>

<title>dive into mark</title>

<subtitle>currently between addictions</subtitle>

<id>tag:diveintomark.org,2001-07-29:/</id>

<updated>2009-03-27T21:56:07Z</updated>

<link
rel='alternate'
type='text/html'
href='http://diveintomark.org/'/>

<link
rel='self'
type='application/atom+xml'
href='http://diveintomark.org/feed/'/>

<entry>

<author>

<name>Mark</name>

<uri>http://diveintomark.org/</uri>

</author>

<title>Dive into history, 2009 edition</title>

<link
rel='alternate'
type='text/html'

href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>

<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>

<updated>2009-03-27T21:56:07Z</updated>

<published>2009-03-27T17:20:42Z</published>

<category
scheme='http://diveintomark.org'
term='diveintopython'/>

<category
scheme='http://diveintomark.org'
term='docbook'/>

<category
scheme='http://diveintomark.org'
term='html'/>

<summary
type='html'>Putting an entire chapter on one page sounds

bloated, but consider this — my longest chapter so far

would be 75 printed pages, and it loads in under 5 seconds…

On dialup.</summary>

</entry>

<entry>

<author>

<name>Mark</name>

<uri>http://diveintomark.org/</uri>

</author>

<title>Accessibility is a harsh mistress</title>

<link
rel='alternate'
type='text/html'

href='http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'/>

<id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>

<updated>2009-03-22T01:05:37Z</updated>

<published>2009-03-21T20:09:28Z</published>

<category
scheme='http://diveintomark.org'
term='accessibility'/>

<summary
type='html'>The accessibility orthodoxy does not permit people to

question the value of features that are rarely useful and rarely used.</summary>

</entry>

<entry>

<author>

<name>Mark</name>

</author>

<title>A gentle introduction to video encoding, part 1: container formats</title>

<link
rel='alternate'
type='text/html'

href='http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'/>

<id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>

<updated>2009-01-11T19:39:22Z</updated>

<published>2008-12-18T15:54:22Z</published>

<category
scheme='http://diveintomark.org'
term='asf'/>

<category
scheme='http://diveintomark.org'
term='avi'/>

<category
scheme='http://diveintomark.org'
term='encoding'/>

<category
scheme='http://diveintomark.org'
term='flv'/>

<category
scheme='http://diveintomark.org'
term='GIVE'/>

<category
scheme='http://diveintomark.org'
term='mp4'/>

<category
scheme='http://diveintomark.org'
term='ogg'/>

<category
scheme='http://diveintomark.org'
term='video'/>

<summary
type='html'>These notes will eventually become part of a

tech talk on video encoding.</summary>

</entry>

</feed>

Python can parse XML documents in several ways. It ships with DOM and SAX parsers, but we will focus on a different library called ElementTree.
>>> import xml.etree.ElementTree as etree ①
>>> tree = etree.parse('examples/feed.xml') ②
>>> root = tree.getroot() ③
>>> root ④
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>

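①	The ElementTree library is part of the Python standard library, in xml.etree.ElementTree.
②	The parse() function is the primary entry point of the ElementTree library. It accepts a filename or a file-like object, and it parses the entire document at once. If memory is tight, there are ways to parse an XML document incrementally instead.
③	The parse() function returns an object that represents the entire document. This is not the root element. To get a reference to the root element, call the getroot() method.
④	As expected, the root element is the feed element in the http://www.w3.org/2005/Atom namespace. The string representation of this object reinforces an important point: an XML element is the combination of its namespace and its tag name (also called the local name). Every element in this document is in the Atom namespace, so the root element is represented as {http://www.w3.org/2005/Atom}feed.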
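
The incremental parsing mentioned in ② can be sketched with the standard library's iterparse() function. This is one possible approach, not necessarily the one the original notes had in mind; the miniature feed below is a made-up stand-in for examples/feed.xml, and the names ATOM and titles are illustrative.

```python
import io
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# Hypothetical miniature feed standing in for examples/feed.xml.
xml_bytes = (b"<feed xmlns='http://www.w3.org/2005/Atom'>"
             b"<entry><title>first</title></entry>"
             b"<entry><title>second</title></entry>"
             b"</feed>")

titles = []
# iterparse() yields (event, element) pairs while reading the stream;
# an 'end' event fires once an element's closing tag has been seen.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=('end',)):
    if elem.tag == ATOM + 'entry':
        titles.append(elem.find(ATOM + 'title').text)
        elem.clear()  # discard the entry's children to free memory

print(titles)
```

Each entry is handled as soon as it is complete and then cleared, so the whole document never has to be held in memory with all its children.
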

☞ElementTree represents XML elements as {namespace}localname. You will see this form in many places in the ElementTree API.
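
To make the {namespace}localname form concrete, here is a small sketch. The helper split_tag() is hypothetical (it is not part of ElementTree); it simply takes such a tag string apart.

```python
import xml.etree.ElementTree as ET

def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname).

    Hypothetical helper for illustration; not an ElementTree function.
    """
    if tag.startswith('{'):
        namespace, _, localname = tag[1:].partition('}')
        return namespace, localname
    return None, tag  # the element is in no namespace

root = ET.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom'><title>t</title></feed>")

print(split_tag(root.tag))
print(split_tag('foo'))
```
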

Elements are lists

In the ElementTree API, an element behaves like a list. The items of the list are the element's children.

# continued from the previous example
>>> root.tag                        ①
'{http://www.w3.org/2005/Atom}feed'
>>> len(root)                       ②
8
>>> for child in root:              ③
...   print(child)                  ④
...
<Element {http://www.w3.org/2005/Atom}title at e2b5d0>
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
<Element {http://www.w3.org/2005/Atom}entry at e2b750>

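①	Continuing from the previous example, the root element is {http://www.w3.org/2005/Atom}feed.
②	The "length" of the root element is the number of its child elements.
③	You can iterate over the element itself to visit each of its children.
④	As the output shows, there are 8 child elements: all of the feed-level metadata (title, subtitle, id, updated, and link), followed by the three entry elements.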

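You may have guessed this already, but it is worth pointing out explicitly: the list contains only direct children. Each entry element has children of its own, but they are not included in this list. They would be included in the list of each entry's children, but they are not children of feed. However deeply elements are nested, there are ways to find them; two such ways are covered later in this article.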
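
The list-like behaviour can be tried on a small document. The sketch below uses a made-up, namespace-free feed rather than examples/feed.xml, purely to keep the tags short.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified feed (no namespace, two entries).
root = ET.fromstring(
    "<feed>"
    "<title>dive into mark</title>"
    "<entry><author><name>Mark</name></author></entry>"
    "<entry><author><name>Mark</name></author></entry>"
    "</feed>")

print(len(root))                          # number of direct children
print([child.tag for child in root])      # iteration visits direct children only
print([child.tag for child in root[1:]])  # slicing works as it does on a list
print(len(root[1]))                       # the first entry has its own child list
```

The author and name elements never show up when iterating over root; they only appear when you iterate over the entry (or author) that contains them.
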

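Attributes are dictionaries

XML is not just a collection of elements; each element also has its own set of attributes. Once you have a reference to a specific element, you can get its attributes as easily as working with a Python dictionary.

# continuing from the previous example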
>>> root.attrib                           ①
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
>>> root[4]                               ②
<Element {http://www.w3.org/2005/Atom}link at e181b0>
>>> root[4].attrib                        ③
{'href': 'http://diveintomark.org/',
'type': 'text/html',
'rel': 'alternate'}
>>> root[3]                               ④
<Element {http://www.w3.org/2005/Atom}updated at e2b4e0>
>>> root[3].attrib                        ⑤
{}

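①	The attrib property is a dictionary of the element's attributes. The original markup here was <feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>. The xml: prefix refers to a built-in namespace that every XML document can use without declaring it.
②	The fifth child, [4] in a 0-based list, is the link element.
③	The link element has three attributes: href, type, and rel.
④	The fourth child, [3], is the updated element.
⑤	The updated element has no attributes, so its .attrib is an empty dictionary. In short: elements behave like lists and support indexing and slicing, while attributes form a dictionary of keys and values.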
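
A short sketch of the dictionary-style access, using a cut-down version of the feed (only the updated and link children are kept). The constant name XML_LANG is illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>"
    "<updated>2009-03-27T21:56:07Z</updated>"
    "<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>"
    "</feed>")

# xml:lang is stored under the built-in XML namespace.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'
link = root[1]

print(root.attrib[XML_LANG])          # value of xml:lang on <feed>
print(sorted(link.keys()))            # attribute names of <link>
print(link.get('rel'))                # dictionary-style lookup
print(link.get('title', 'untitled'))  # default when the attribute is absent
print(root[0].attrib)                 # <updated> has no attributes
```

Note that the xmlns declaration itself does not appear in attrib; it only determines the namespace part of each tag.
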

⁂

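Searching for nodes within an XML document

So far we have worked with the XML document "from the top down", starting at the root element, moving to its children, and so on through the document. But many uses of XML require you to find specific elements. ElementTree can do that too.

>>> import xml.etree.ElementTree as etree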
>>> tree = etree.parse('examples/feed.xml')
>>> root = tree.getroot()
>>> root.findall('{http://www.w3.org/2005/Atom}entry')    ①
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]
>>> root.tag
'{http://www.w3.org/2005/Atom}feed'
>>> root.findall('{http://www.w3.org/2005/Atom}feed')     ②
[]
>>> root.findall('{http://www.w3.org/2005/Atom}author')   ③
[]

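①	The findall() method finds child elements that match a specific query. (More on the query format in a moment.)
②	Every element, including the root element and its children, has a findall() method. It finds all matching elements among the element's children. But why are there no results here? It may not be obvious, but this query searches only the element's children. Since the root feed element has no child named feed, the query returns an empty list.
③	This result may surprise you too. There are author elements in this document; in fact there are three (one in each entry). But those author elements are not direct children of the root element. You can search for author elements at any nesting level, but the query format is slightly different.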

>>> tree.findall('{http://www.w3.org/2005/Atom}entry')    ①
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]
>>> tree.findall('{http://www.w3.org/2005/Atom}author')   ②
[]

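①	For convenience, the tree object (returned by etree.parse()) has several methods that mirror the methods on the root element. The results are the same as if you had called tree.getroot().findall().
②	Perhaps surprisingly, this query does not find the author elements in the document either. Why not? Because it is just a shortcut for tree.getroot().findall('{http://www.w3.org/2005/Atom}author'), which means "find all author elements that are children of the root element". The author elements are children of the entry elements, so the query returns no matches.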
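
The behaviour described in the last two listings can be reproduced on a small in-memory document. This is a sketch with a made-up two-entry feed; the names ATOM, via_tree, and via_root are illustrative.

```python
import io
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'
doc = ("<feed xmlns='http://www.w3.org/2005/Atom'>"
       "<entry><author><name>Mark</name></author></entry>"
       "<entry><author><name>Mark</name></author></entry>"
       "</feed>")

# parse() also accepts a file-like object, so no file on disk is needed.
tree = ET.parse(io.BytesIO(doc.encode('utf-8')))
root = tree.getroot()

via_tree = tree.findall(ATOM + 'entry')
via_root = root.findall(ATOM + 'entry')

print(via_tree == via_root)           # the tree method mirrors the root's
print(root.findall(ATOM + 'author'))  # authors are grandchildren: no match
print(len(via_root[0].findall(ATOM + 'author')))  # found when searching from an entry
```
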

The find() method returns the first matching element. This is useful when you expect only one match, or when there are several matches but you only care about the first one.

>>> entries = tree.findall('{http://www.w3.org/2005/Atom}entry')           ①
>>> len(entries)
3
>>> title_element = entries[0].find('{http://www.w3.org/2005/Atom}title')  ②
>>> title_element.text
'Dive into history, 2009 edition'
>>> foo_element = entries[0].find('{http://www.w3.org/2005/Atom}foo')      ③
>>> foo_element
>>> type(foo_element)
<class 'NoneType'>

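①	As seen in the previous example, this finds all the atom:entry elements.
②	The find() method takes an ElementTree query (not an ElementTree object) and returns the first matching element.
③	There is no element named foo in entries[0], so this returns None.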

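☞There is a "gotcha" with the find() method that will eventually bite you. In a boolean context, an ElementTree element object evaluates to False if it has no children (that is, if len(element) is 0). This means that if element.find('...') is not testing whether find() found a match; it is testing whether the matching element has any child elements! To test whether find() returned an element, use if element.find('...') is not None.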

You can also search for descendant elements, that is, children, grandchildren, and elements at any nesting level.

>>> all_links = tree.findall('//{http://www.w3.org/2005/Atom}link')  ①
>>> all_links
[<Element {http://www.w3.org/2005/Atom}link at e181b0>,
<Element {http://www.w3.org/2005/Atom}link at e2b570>,
<Element {http://www.w3.org/2005/Atom}link at e2b480>,
<Element {http://www.w3.org/2005/Atom}link at e2b5a0>]
>>> all_links[0].attrib                                              ②
{'href': 'http://diveintomark.org/',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[1].attrib                                              ③
{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[2].attrib
{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress',
'type': 'text/html',
'rel': 'alternate'}
>>> all_links[3].attrib
{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',
'type': 'text/html',
'rel': 'alternate'}

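①	This query, //{http://www.w3.org/2005/Atom}link, is very similar to the previous examples, except for the two slashes at the beginning. Those two slashes tell findall() "don't just look at direct children; look for elements at any nesting level". (In current Python versions, write the query as .//{http://www.w3.org/2005/Atom}link: a leading // on the tree object triggers a FutureWarning, and on an Element it raises an error.)
②	The first result is a direct child of the root element. As its attributes show, it is the alternate link that points to the HTML version of the feed.
③	The other three results are each entry-level alternate links. Each entry has a single link child element, and because of the double slash at the start of the query, the search finds them as well.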
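
The same descendant search can be sketched with the .// form and with findall()'s optional namespaces mapping, which saves spelling out the full {namespace} on every tag. The miniature feed, its /a and /b URLs, and the atom prefix chosen for the mapping are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom'>"
    "<link rel='alternate' href='http://diveintomark.org/'/>"
    "<entry><link rel='alternate' href='http://diveintomark.org/a'/></entry>"
    "<entry><link rel='alternate' href='http://diveintomark.org/b'/></entry>"
    "</feed>")

# Prefix-to-URI mapping used only inside the query; 'atom' is our own choice.
ns = {'atom': 'http://www.w3.org/2005/Atom'}

direct = root.findall('atom:link', ns)       # direct children only
anywhere = root.findall('.//atom:link', ns)  # any depth below root

print(len(direct), len(anywhere))
print([link.get('href') for link in anywhere])
```

Results come back in document order, so the feed-level link is listed before the entry-level ones.
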

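Overall, ElementTree's findall() method is a very powerful feature, but its query language can be a bit surprising. It is officially described as "limited support for XPath expressions". XPath is a W3C standard for querying XML documents. For basic searches, ElementTree's query language is similar enough to XPath, but the differences may annoy you if you already know XPath. Third-party XML libraries such as lxml extend the ElementTree API with full XPath support.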
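Key points:
Start from the root element; from there you can reach any element.
Use find() to get the first matching element.
Use findall() to get all matching direct children of an element.
Add two slashes in front of the query (written .// in current Python versions) to tell findall() "don't just look at direct children; look for elements at any nesting level".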
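Traversal and lookup on an Element
Element.iter(tag=None): iterates over the element itself and all of its descendants; pass a tag to visit only elements with that tag.

Element.findall(path): finds all elements under the current element that match the tag or path. With a plain tag name, only direct children are matched; a path such as .//tag can reach deeper.

Element.find(path): like findall(), but returns only the first match, or None if there is none.

Element.text: the text content of the current element.

Element.get(key, default=None): returns the value of the attribute named key, or default if the element has no such attribute.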
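
The methods listed above can be exercised together on a small made-up document (namespace-free, to keep the tags short). The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified feed.
root = ET.fromstring(
    "<feed>"
    "<title>dive into mark</title>"
    "<entry><title>first</title><category term='html'/></entry>"
    "<entry><title>second</title></entry>"
    "</feed>")

# iter() walks the element and all of its descendants in document order.
all_titles = [el.text for el in root.iter('title')]

# findall()/find() with a plain tag look at direct children only.
entries = root.findall('entry')
first_entry = root.find('entry')

# get() reads an attribute, with an optional default.
terms = [c.get('term') for c in root.iter('category')]
fallback = first_entry.get('lang', 'en')

print(all_titles)
print(len(entries), first_entry.find('title').text)
print(terms, fallback)
```
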

References:
/article/5248707.html The Python standard library: xml.etree.ElementTree

http://www.w3school.com.cn/xmldom/dom_methods.asp XML DOM: properties and methods

/article/1588518.html Examples

/article/10464810.html Python: parsing XML files with ElementTree

http://www.cnblogs.com/jb8164/articles/736515.html A brief introduction to XML

http://woodpecker.org.cn/diveintopython3/xml.html A systematic treatment of XML and of parsing it in Python; recommended
