
How to parse XML documents in Python

2008-12-20 14:11

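Processing XML is an important and common task in real applications, and there are many ways to do it. This article focuses on general-purpose XML handling, but for simplicity it covers only Python's xml.dom.minidom module.
xml.dom.minidom is a lightweight yet practical DOM interface for working with XML in Python.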

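1) Creating an XML object
An XML application usually starts by creating an XML object. With minidom this is simple, and three kinds of input are accepted: a file name, a file object, or a string.
For example: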
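# -*- coding: utf-8 -*-
from xml.dom.minidom import parse, parseString

fileName = 'example.xml'

dom1 = parse(fileName)  # parse an XML file by name

# Open in binary mode so the parser can honour the declared encoding,
# and close the file as soon as parsing is finished.
with open(fileName, 'rb') as datasource:
    dom2 = parse(datasource)  # parse an open file object

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')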

The XML document is as follows:
<?xml version='1.0' encoding='utf-8'?>
<parent>
    <childs name='childs'>
        <child name='1' />
        <child name='2' />
    </childs>
</parent>
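As shown above, an XML object can be built from a string that has no XML declaration on its first line. A declaration is allowed in a string, but it must sit at the very start: if anything precedes it, even whitespace or a newline (for example the line break right after an opening triple quote), parsing fails with an exception like this: parser.Parse(string, True) xml.parsers.expat.ExpatError: XML or text declaration not at start of entity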
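A short sketch of when this ExpatError shows up in practice: the declaration is rejected when a newline precedes it and accepted once it is the first thing in the string. The names `declared` and `try_parse` are made up for this illustration; only `parseString` and `ExpatError` come from the standard library.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# The triple-quoted literal starts with a newline, so the declaration
# is NOT the first thing in the string.
declared = """
<?xml version='1.0' encoding='utf-8'?>
<parent><childs name='childs'><child name='1'/><child name='2'/></childs></parent>
"""


def try_parse(text):
    """Hypothetical helper: return (dom, None) on success, (None, message) on failure."""
    try:
        return parseString(text), None
    except ExpatError as exc:
        return None, str(exc)


# Leading newline before the declaration: parsing fails.
dom_bad, error = try_parse(declared)

# Same text with the leading whitespace removed: parsing succeeds.
dom_ok, _ = try_parse(declared.lstrip())
names = [child.getAttribute('name')
         for child in dom_ok.getElementsByTagName('child')]
```

Stripping leading whitespace (or starting the literal with a backslash right after the opening quotes) is usually enough to avoid the error.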

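Specifically, the call signatures are:
1>xml.dom.minidom.parse(filename_or_file[, parser])
2>xml.dom.minidom.parseString(string[, parser])
Both functions return a Document object. The optional parser argument, if given, must be a SAX2 parser object; when it is omitted, minidom uses its built-in Expat-based builder.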
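As a rough illustration of the optional second argument, the sketch below passes a SAX2 parser created with the standard library's `xml.sax.make_parser()`. The variable names are chosen for this example only; in most programs the argument can simply be left out.

```python
import xml.sax
from xml.dom.minidom import parseString

# A SAX2 parser object from the standard library.
sax_parser = xml.sax.make_parser()

# Hand the parser to minidom; the result is still a Document object.
dom = parseString('<myxml>Some data<empty/> some more data</myxml>', sax_parser)
root_tag = dom.documentElement.tagName
empty_count = len(dom.getElementsByTagName('empty'))

dom.unlink()
```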

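Note: once you are done with the XML, release the DOM. DOM trees contain circular references, and some older versions of Python could not garbage-collect them; calling unlink() on the dom object breaks those cycles so that the memory can be reclaimed promptly.
For example: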
dom1.unlink()
dom2.unlink()
dom3.unlink()
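In Python 3.2 and later a Document can also be used in a with statement, which calls unlink() automatically when the block exits. A minimal sketch, with variable names chosen only for this example:

```python
from xml.dom.minidom import parseString

# The Document acts as a context manager; unlink() runs on exit.
with parseString('<myxml>Some data<empty/> some more data</myxml>') as dom:
    root = dom.documentElement
    tag = root.tagName
    child_count = len(root.childNodes)  # text, <empty/>, text

# After the block the tree has been torn down.
root_after = dom.documentElement
```

Anything needed from the tree should be copied out inside the block, since the nodes are no longer linked together afterwards.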
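
2) xml.dom.minidom and the DOM Level 1 standard
xml.dom.minidom is a Python implementation of the DOM standard recommended by the W3C, but the two differ in some respects. Specifically, the following methods are minidom extensions that are not part of the DOM specification:
1>node.unlink()
2>node.writexml(writer[, indent=""[, addindent=""[, newl=""[, encoding=""]]]])
3>node.toxml([encoding])
4>node.toprettyxml([indent=""[, newl=""[, encoding=""]]])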
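The serialization methods can be tried out with a sketch like the following. The sample document and variable names are made up for this illustration; the method calls are the ones listed above.

```python
import io
from xml.dom.minidom import parseString

dom = parseString('<parent><child name="1"/></parent>')
root = dom.documentElement

# toxml() on an element returns its markup as a single string.
compact = root.toxml()

# toprettyxml() adds line breaks and the requested indentation.
pretty = root.toprettyxml(indent='  ')

# With an encoding, toxml() on the document returns bytes and the
# XML declaration records that encoding.
encoded = dom.toxml(encoding='utf-8')

# writexml() writes to any object with a write() method.
buffer = io.StringIO()
dom.writexml(buffer, addindent='  ', newl='\n')
written = buffer.getvalue()

dom.unlink()
```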

Below is the example given in the Python documentation. It is simple and typical, so it is reproduced here.
import xml.dom.minidom

# The backslash keeps the string from starting with a blank line.
document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

def getText(nodelist):
    # Concatenate the data of all text nodes in the list.
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    # Print each slide title as a table-of-contents entry.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)

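In addition, xml.dom.minidom leaves some DOM interfaces unimplemented. The Python documentation of the time listed the following; current releases implement most of them, and only DOMTimeStamp and EntityReference remain unimplemented:
DOMTimeStamp
DocumentType
DOMImplementation
CharacterData
CDATASection
Notation
Entity
EntityReference
DocumentFragment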
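Which of these interfaces an installed Python actually provides can be checked with a rough sketch like the one below. It only tests whether xml.dom.minidom exposes a name matching each interface, which is an approximation chosen for this illustration and not an official conformance check.

```python
import xml.dom.minidom as minidom

interfaces = [
    'DOMTimeStamp', 'DocumentType', 'DOMImplementation', 'CharacterData',
    'CDATASection', 'Notation', 'Entity', 'EntityReference',
    'DocumentFragment',
]

# Names for which the module has no attribute at all.
missing = [name for name in interfaces if not hasattr(minidom, name)]
present = [name for name in interfaces if hasattr(minidom, name)]

print('missing:', missing)
print('present:', present)
```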
