Python 3 Standard Library: xml.etree.ElementTree, the XML Manipulation API (2)


from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.findall('.//outline/outline'):
    url = node.attrib.get('xmlUrl')
    print(url)

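All outline nodes nested two levels deep in the input are assumed to have an xmlUrl attribute that refers to the podcast feed, so the loop can use the attribute without first checking that it exists.

This version is tied to the current structure, however. If the outline nodes are ever reorganized into a deeper tree, it will stop working.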
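One way to avoid depending on the nesting depth is to match every outline element that carries an xmlUrl attribute, wherever it sits in the tree. The sketch below is only an illustration: it assumes a small inline OPML document (the titles and example.com URLs are made up) and uses the `[@attrib]` predicate from ElementTree's limited XPath support.

```python
from xml.etree import ElementTree

# Hypothetical OPML sample; the titles and URLs are placeholders.
SAMPLE_OPML = """
<opml version="1.0">
  <body>
    <outline text="Group">
      <outline text="Feed A" xmlUrl="http://example.com/a.xml"/>
      <outline text="Subgroup">
        <outline text="Feed B" xmlUrl="http://example.com/b.xml"/>
      </outline>
    </outline>
  </body>
</opml>
"""


def feed_urls(root):
    """Return the xmlUrl of every outline element that has one."""
    # './/outline[@xmlUrl]' matches outline elements at any depth,
    # but only those that actually carry the attribute.
    return [node.get('xmlUrl') for node in root.findall('.//outline[@xmlUrl]')]


root = ElementTree.fromstring(SAMPLE_OPML)
urls = feed_urls(root)
for url in urls:
    print(url)
```

Grouping nodes without an xmlUrl attribute are skipped by the predicate, so the loop needs no explicit check.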

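1.4 Parsed Node Attributes

The items returned by findall() and iter() are Element objects, each representing one node in the XML parse tree. Every Element has attributes for accessing the data pulled out of the XML. This can be illustrated with a somewhat contrived example input file, data.xml.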

<?xml version="1.0" encoding="UTF-8"?>
<top>
  <child>Regular text.</child>
  <child_with_tail>Regular text.</child_with_tail>"Tail" text.
  <with_attributes name="value" foo="bar"/>
  <entity_expansion attribute="This &amp; That">
    That &amp; This
  </entity_expansion>
</top>

The XML attributes of a node are available through its attrib property, which acts like a dictionary.

from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

node = tree.find('./with_attributes')
print(node.tag)
for name, value in sorted(node.attrib.items()):
    print(name, value)

The node on line 5 of the input file has two attributes, name and foo.
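Because attrib behaves like a dictionary, the usual dictionary idioms apply. The following sketch parses the with_attributes element from an inline string rather than from data.xml (an assumption made to keep it self-contained) and also shows Element.get(), which reads a single attribute and accepts an optional default. The attribute name missing is made up for the example.

```python
from xml.etree import ElementTree

node = ElementTree.fromstring('<with_attributes name="value" foo="bar"/>')

# attrib supports ordinary dictionary operations.
names = sorted(node.attrib)
has_foo = 'foo' in node.attrib

# Element.get() looks up one attribute and takes a default for
# attributes that are absent ('missing' is a hypothetical name).
name = node.get('name')
fallback = node.get('missing', 'n/a')

print(names, has_foo, name, fallback)
```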

The text content of a node is also available, along with the tail text that follows its closing tag.

from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for path in ['./child', './child_with_tail']:
    node = tree.find(path)
    print(node.tag)
    print('child node text:', node.text)
    print('and tail text:', node.tail)

The child node on line 3 contains embedded text, and the node on line 4 has text followed by a tail (which includes the whitespace).
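The tail attribute is easy to misread, so here is a small sketch of where each piece of text ends up. It uses the same two elements as data.xml but with the whitespace between tags removed, an assumption made so the values are exact.

```python
from xml.etree import ElementTree

root = ElementTree.fromstring(
    '<top><child>Regular text.</child>'
    '<child_with_tail>Regular text.</child_with_tail>"Tail" text.</top>'
)

child = root.find('child')
with_tail = root.find('child_with_tail')

# Nothing sits between </child> and the next start tag, so tail is None.
print(repr(child.text), repr(child.tail))

# The text after </child_with_tail> is stored as that element's tail,
# not as text of the enclosing <top> element.
print(repr(with_tail.text), repr(with_tail.tail))
print(repr(root.text))
```

In the real data.xml the line breaks and indentation between tags are text too, which is why the tail values there contain whitespace.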

XML entity references embedded in the document are converted to the appropriate characters before the values are returned.

from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

node = tree.find('entity_expansion')
print(node.tag)
print('in attribute:', node.attrib['attribute'])
print('in text:', node.text.strip())

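This automatic conversion means that the implementation detail of how certain characters are represented in an XML document can be ignored.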
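The conversion works in both directions. As a minimal sketch, assuming the entity_expansion element is supplied as an inline string instead of being read from data.xml, the code below shows that parsed values hold a plain ampersand and that ElementTree.tostring() escapes it again when the node is serialized.

```python
from xml.etree import ElementTree

node = ElementTree.fromstring(
    '<entity_expansion attribute="This &amp; That">'
    'That &amp; This'
    '</entity_expansion>'
)

# After parsing, both values contain a literal "&".
attribute_value = node.attrib['attribute']
text_value = node.text

# Serializing the node escapes the ampersand again.
serialized = ElementTree.tostring(node, encoding='unicode')

print(attribute_value)
print(text_value)
print(serialized)
```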

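1.5 Watching Events While Parsing

The other API for processing XML documents is event-based. The parser generates start events for opening tags and end events for closing tags. Data can be extracted from the document during the parsing phase by iterating over the event stream. This is convenient when the whole document does not need to be processed afterwards, or when there is no need to hold the entire parsed document in memory.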

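The following event types are available:

start: A new tag has been encountered. The closing angle bracket of the tag was processed, but not the contents.

end: The closing angle bracket of a closing tag has been processed. All of the child nodes were already processed.

start-ns: Start a namespace declaration.

end-ns: End a namespace declaration.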

iterparse() returns an iterable that produces tuples containing the name of the event and the node that triggered it.
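To make the shape of those tuples concrete, the sketch below requests all four event types for a tiny inline document. The document, the prefix ex, and its URI are made up for illustration. For start and end the second item is an Element; for start-ns it is a (prefix, uri) tuple, and for end-ns it is None.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical document; the prefix "ex" and its URI are placeholders.
SAMPLE = b'<root xmlns:ex="http://example.com/ns"><ex:item>text</ex:item></root>'

events = []
for event, payload in iterparse(io.BytesIO(SAMPLE),
                                events=('start', 'end', 'start-ns', 'end-ns')):
    if event in ('start', 'end'):
        # payload is an Element; record only its tag
        events.append((event, payload.tag))
    else:
        # start-ns: payload is (prefix, uri); end-ns: payload is None
        events.append((event, payload))

for entry in events:
    print(entry)
```

When no events argument is given, iterparse() reports only end events.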

from xml.etree.ElementTree import iterparse

depth = 0
prefix_width = 8
prefix_dots = '.' * prefix_width
line_template = '.'.join([
    '{prefix:<0.{prefix_len}}',
    '{event:<8}',
    '{suffix:<{suffix_len}}',
    '{node.tag:<12}',
    '{node_id}',
])

# The four event types described above.
EVENT_NAMES = ['start', 'end', 'start-ns', 'end-ns']
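The listing above sets up a depth counter and an output template but does not show a loop that consumes the events. As a rough sketch of how depth tracking with iterparse() might look (not necessarily how the original example continues), the code below indents each event by the nesting depth of its element. The inline document standing in for podcasts.opml and the simplified output format are assumptions.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical stand-in for podcasts.opml.
SAMPLE = b"""<opml>
  <body>
    <outline text="Group">
      <outline text="Feed"/>
    </outline>
  </body>
</opml>"""

depth = 0
lines = []
for event, node in iterparse(io.BytesIO(SAMPLE), events=('start', 'end')):
    if event == 'end':
        # Leaving an element: step back to that element's own level.
        depth -= 1
    lines.append('{}{} {}'.format('  ' * depth, event, node.tag))
    if event == 'start':
        # Anything reported before the matching end is one level deeper.
        depth += 1

print('\n'.join(lines))
```

Decrementing before printing and incrementing after keeps each end event at the same indentation as its matching start event.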
