
Aspose.Words Programming Guide: Revisiting the DOM Tree and the Relationships Between Its Layers

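2015-09-01 10:58, 579 views

When reposting, please credit /article/1363990.html

In the previous post, "Aspose.Words Programming Guide: A First Look at the DOM Tree Structure, Node Class Inheritance and Descriptions", I ran a first simple application and described how documents are loaded, saved, and converted. Then, starting from the library's design, I introduced the concept of the DOM. This post continues with the basic DOM concepts and with the ways to get from one node to another. If you read it carefully, you should come away with a solid understanding of the Aspose.Words DOM structure.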

Document Object Model

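Structure Diagrams

In the previous post we started from an example, got a rough idea of the DOM tree, and covered the inheritance hierarchy of the Node class. Here we use structure diagrams to look at the relationships between the node types from a higher level.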

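Document and Section

From the previous post we know that:

Document is the root node of the document tree and provides the entry point to the whole document.

A Section object corresponds to one section of the document.

In other words, a Document contains one or more Sections.

Let's look at a diagram of the relationship between the two.

The diagram clearly shows the roles that Document and Section play in a document. To list them: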

1. A Document contains one or more Sections.

2. A Section contains one Body node and zero or more HeaderFooter nodes.

3. Body and HeaderFooter nodes each contain zero or more block-level nodes.

4. A Document can contain zero or one GlossaryDocument.

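Generally, a Word document contains one or more sections. A section can define its own page size, margins, text direction, and number of text columns, as well as its own headers and footers. The sections of a document are separated by section breaks.

A section contains the main text plus headers and footers. Collectively these are called "stories". In Aspose.Words, a Section node contains Story nodes: Body and HeaderFooter. The main text is kept in Body, and headers and footers are kept in HeaderFooter nodes.

Any text story contains zero or more paragraphs and tables. These are called block-level nodes.

In addition, a Document can contain one glossary document, represented by a GlossaryDocument node. A GlossaryDocument holds zero or more BuildingBlock nodes. Each BuildingBlock contains zero or more Sections, and those Sections can be inserted, copied, and removed.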

Block-level Nodes

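As mentioned above, any text story contains zero or more paragraphs and tables. These are the block-level nodes.

Let's start with a diagram to see what this layer contains.

Here is what the diagram tells us: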

1. Block-level nodes can appear under many kinds of nodes in a DOM tree. Eight are listed here: Body, HeaderFooter, Footnote, Comment, Shape, Cell, CustomXmlMarkup, and StructuredDocumentTag.

2. The two most important block-level nodes are Table and Paragraph.

3. A Table contains zero or more row-level nodes.

4. A Paragraph contains zero or more inline-level nodes.

5. CustomXmlMarkup and StructuredDocumentTag can themselves nest further block-level nodes.

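Put more plainly, block-level content is a chunk of the document made up of tables and paragraphs. A table holds rows, and a paragraph holds inline content such as runs of text. A block can also be a markup tag (CustomXmlMarkup or StructuredDocumentTag), which in turn can wrap further block-level content.

Hopefully that makes things clearer. Still, to keep the code easy to read, we will stick to the proper terms; you get used to them. If something remains unclear, see the Node class descriptions in the previous post.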

Inline-level Nodes

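As we saw above, a Paragraph contains zero or more inline-level nodes. Let's look at those now.

As usual, a diagram first.

Wow, this layer has a lot of members! No problem, we will take them one at a time.

Again, here is what the diagram tells us: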

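1. Paragraph, SmartTag, CustomXmlMarkup, and StructuredDocumentTag can all contain inline-level nodes. Of these, Paragraph is by far the most common container.

2. A Paragraph can contain multiple Run nodes, and each Run can have its own formatting.

3. A Paragraph can contain bookmarks: BookmarkStart and BookmarkEnd.

4. A Paragraph can contain annotations: CommentRangeStart, CommentRangeEnd, Comment, and Footnote.

5. A Paragraph can contain Word fields (I am not very familiar with this part; it is easier to follow if you have used fields in Word): FieldStart, FieldSeparator, FieldEnd, and FormField.

6. A Paragraph can contain shapes, drawings, images, and so on, through Shape and GroupShape nodes.

7. A Paragraph can contain markup tags: SmartTag, CustomXmlMarkup, and StructuredDocumentTag.

A GroupShape can contain Shapes or further nested GroupShapes. Shape, Footnote, and Comment can contain block-level nodes.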

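So the inline level has many members. Some of them can nest further inline-level members, and some can even contain block-level content. The relationships are fairly intricate.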

Table, Row and Cell

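In the block-level section we mentioned that a Table contains zero or more row-level nodes. Let's look at how tables, rows, and cells relate.

As usual, a diagram first.

And again, reading from the diagram: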

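1. A Table can contain many Rows.

2. A Row can contain many Cells.

3. A Cell can in turn contain block-level nodes. Yes, nesting again.

4. CustomXmlMarkup and StructuredDocumentTag can appear at the block level, the row level, and the cell level, so they can keep nesting across these layers.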
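
The containment rules listed in the sections above can be summarized as a small lookup table. The sketch below is plain Java with no Aspose.Words dependency. The class name ContainmentSketch, its method names, and the simplified rule table (it leaves out bookmarks, fields, SmartTag, CustomXmlMarkup, and StructuredDocumentTag) are assumptions made for illustration only. It shows why nesting is possible: starting from a Table you can reach a Table again through Row and Cell.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a simplified rule table written by hand from the lists above,
// not something read from the Aspose.Words library.
class ContainmentSketch {
    static final Map<String, List<String>> ALLOWED_CHILDREN = new HashMap<>();

    static {
        List<String> blocks = Arrays.asList("Paragraph", "Table");
        ALLOWED_CHILDREN.put("Document", Arrays.asList("Section", "GlossaryDocument"));
        ALLOWED_CHILDREN.put("Section", Arrays.asList("Body", "HeaderFooter"));
        ALLOWED_CHILDREN.put("Body", blocks);
        ALLOWED_CHILDREN.put("HeaderFooter", blocks);
        ALLOWED_CHILDREN.put("Table", Arrays.asList("Row"));
        ALLOWED_CHILDREN.put("Row", Arrays.asList("Cell"));
        ALLOWED_CHILDREN.put("Cell", blocks);
        ALLOWED_CHILDREN.put("Paragraph",
                Arrays.asList("Run", "Shape", "GroupShape", "Comment", "Footnote"));
        ALLOWED_CHILDREN.put("GroupShape", Arrays.asList("Shape", "GroupShape"));
        ALLOWED_CHILDREN.put("Shape", blocks);
        ALLOWED_CHILDREN.put("Comment", blocks);
        ALLOWED_CHILDREN.put("Footnote", blocks);
    }

    // Child types allowed directly under the given type (empty for leaf types such as Run).
    static List<String> childrenOf(String type) {
        List<String> children = ALLOWED_CHILDREN.get(type);
        return children == null ? Collections.<String>emptyList() : children;
    }

    // True if 'child' may appear directly under 'parent'.
    static boolean canDirectlyContain(String parent, String child) {
        return childrenOf(parent).contains(child);
    }

    // True if 'descendant' may appear anywhere below 'ancestor' (breadth-first search).
    static boolean canNestWithin(String ancestor, String descendant) {
        Deque<String> pending = new ArrayDeque<>(childrenOf(ancestor));
        Set<String> seen = new HashSet<>();
        while (!pending.isEmpty()) {
            String type = pending.pop();
            if (type.equals(descendant)) {
                return true;
            }
            if (seen.add(type)) {
                pending.addAll(childrenOf(type));
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Table inside Table: " + canNestWithin("Table", "Table"));
        System.out.println("Paragraph directly in Document: "
                + canDirectlyContain("Document", "Paragraph"));
    }
}
```

The search keeps a set of visited types so that the cycles in the rules (Table, Row, Cell, Table again) do not make it loop forever.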

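Viewing the Document Tree

Document Explorer

Since Aspose.Words parses a document into a DOM tree, is there a tool that shows the DOM tree of a given Word document clearly?

We could write one ourselves by walking the tree and logging each node, which is a perfectly good idea. As it happens, though, a ready-made tool already exists.

The demos on the official site include a Document Explorer; see "Aspose.Words for Java libs & examples". Unfortunately, the only useful Android example at the moment is DocumentViewer, which just displays Word documents, so I had to look at the examples for other platforms. Both the .NET and the Java examples include Document Explorer, so I set up an environment and ran the Java one. It works reasonably well.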

Tree Nodes

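Let's look at another UML diagram:

As in any tree structure, every node sits somewhere in the tree and is related to the nodes around it.

A node that contains other nodes is their parent, and the nodes it contains are its children; children of the same parent are siblings. Note that the UML diagram itself shows class inheritance (Node is the base class of CompositeNode and Inline), which is a different relationship from parent and child in the document tree. Remember that the Document node is always the root of a document's DOM tree.

Sibling nodes are ordered. In the document parsed by Document Explorer above, for example, the Body has many Paragraph children, and they come in a definite order. Here is another diagram:

This diagram also shows the ordering of siblings. The Body has two children, a Paragraph and a Table, and their order is given by their index in the child collection.

Only classes derived from CompositeNode can act as parents and contain child nodes. Classes that derive from Node directly, without going through CompositeNode, cannot contain children.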
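
The split between nodes that can have children and nodes that cannot is the classic composite pattern. The following is a minimal plain-Java sketch of that idea; SketchNode, SketchComposite, SketchLeaf, and CompositeSketch are hypothetical names invented here, and none of this is Aspose.Words code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy classes that mirror the idea only; they are not Aspose.Words types.
abstract class SketchNode {
    SketchComposite parent; // null until the node is attached to a parent

    boolean isComposite() {
        return false;
    }
}

// Only this subclass owns a child list, so only it can act as a parent.
class SketchComposite extends SketchNode {
    final String name;
    final List<SketchNode> children = new ArrayList<>();

    SketchComposite(String name) {
        this.name = name;
    }

    @Override
    boolean isComposite() {
        return true;
    }

    // Appends a child at the end; siblings keep their insertion order.
    SketchComposite appendChild(SketchNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }
}

// A leaf has no child list at all, so it cannot contain other nodes.
class SketchLeaf extends SketchNode {
    final String text;

    SketchLeaf(String text) {
        this.text = text;
    }
}

class CompositeSketch {
    // Counts the leaf nodes at or below the given node.
    static int countLeaves(SketchNode node) {
        if (!node.isComposite()) {
            return 1;
        }
        int total = 0;
        for (SketchNode child : ((SketchComposite) node).children) {
            total += countLeaves(child);
        }
        return total;
    }

    // Builds Body -> [Paragraph -> [leaf, leaf], Table (empty)].
    static SketchComposite sampleBody() {
        SketchComposite paragraph = new SketchComposite("Paragraph")
                .appendChild(new SketchLeaf("Hello, "))
                .appendChild(new SketchLeaf("world"));
        SketchComposite table = new SketchComposite("Table");
        return new SketchComposite("Body").appendChild(paragraph).appendChild(table);
    }

    public static void main(String[] args) {
        SketchComposite body = sampleBody();
        System.out.println("Leaves under Body: " + countLeaves(body));
        System.out.println("First child: " + ((SketchComposite) body.children.get(0)).name);
    }
}
```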

Next, let's see how these node relationships show up in code.

Parent Node

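To find a node's parent, use the Node.ParentNode property. A node that has just been created and not yet added to the DOM tree, or one that has been removed from the tree, has no parent, so its Node.ParentNode is null. You can remove a node from its parent by calling Node.Remove on that node.

The root node, of course, has no parent.

The following code shows how to get the parent node: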

// Create a new empty document. It has one section.
Document doc = new Document();

// The section is the first child node of the document.
Node section = doc.getFirstChild();

// The section's parent node is the document.
System.out.println("Section parent is the document: " + (doc == section.getParentNode()));

Owner Document

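One point worth stressing: a node always belongs to some Document, whether it has only just been created or has already been removed from the DOM tree. The Node.Document property tells you which Document a node belongs to.

Let's look at an example:

// Create a new empty document.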
Document doc = new Document();

// Creating a new node of any type requires a document passed into the constructor.
Paragraph para = new Paragraph(doc);

// The new paragraph node does not yet have a parent.
System.out.println("Paragraph has no parent node: " + (para.getParentNode() == null));

// But the paragraph node knows its document.
System.out.println("Both nodes' documents are the same: " + (para.getDocument() == doc));

// The fact that a node always belongs to a document allows us to access and modify
// properties that reference the document-wide data such as styles or lists.
para.getParagraphFormat().setStyleName("Heading 1");

// Now add the paragraph to the main text of the first section.
doc.getFirstSection().getBody().appendChild(para);

// The paragraph node is now a child of the Body node.
System.out.println("Paragraph has a parent node: " + (para.getParentNode() != null));

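In the code above, the Document object is passed in as soon as the node is created, so the node holds a reference to its owner document from the start. At that point it still has no parent. It only gets one later, through doc.getFirstSection().getBody().appendChild(para);.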
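
The difference between "owner document" and "parent" can be modelled in a few lines of plain Java. This is a toy sketch under stated assumptions: OwnerDoc, OwnedNode, and OwnershipSketch are hypothetical names, and the rule that a child must share its parent's owner is a choice made for this model, not a statement about Aspose.Words internals. The owner is fixed at construction, while the parent changes as the node is attached and removed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; not Aspose.Words code.
class OwnerDoc {
    final String title;

    OwnerDoc(String title) {
        this.title = title;
    }
}

class OwnedNode {
    private final OwnerDoc document;   // set once, never changes
    private OwnedNode parent;          // changes as the node moves in the tree
    private final List<OwnedNode> children = new ArrayList<>();

    OwnedNode(OwnerDoc document) {
        this.document = document;
    }

    OwnerDoc getDocument() {
        return document;
    }

    OwnedNode getParentNode() {
        return parent;
    }

    int getChildCount() {
        return children.size();
    }

    // Attaches the child here, detaching it from any previous parent first.
    void appendChild(OwnedNode child) {
        if (child.document != this.document) {
            // Design choice of this toy model only.
            throw new IllegalArgumentException("child belongs to a different document");
        }
        child.remove();
        child.parent = this;
        children.add(child);
    }

    // Detaches this node from its parent; the owner document is kept.
    void remove() {
        if (parent != null) {
            parent.children.remove(this);
            parent = null;
        }
    }
}

class OwnershipSketch {
    public static void main(String[] args) {
        OwnerDoc doc = new OwnerDoc("demo");
        OwnedNode body = new OwnedNode(doc);
        OwnedNode para = new OwnedNode(doc);

        System.out.println("Detached, has parent: " + (para.getParentNode() != null));
        body.appendChild(para);
        System.out.println("Attached, has parent: " + (para.getParentNode() != null));
        para.remove();
        System.out.println("Removed, still owned: " + (para.getDocument() == doc));
    }
}
```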

Child Nodes

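The most efficient way to reach child nodes is through the CompositeNode.FirstChild and CompositeNode.LastChild properties; each returns null if there are no child nodes.

CompositeNode also provides the CompositeNode.ChildNodes collection, which makes it convenient to iterate over the children.

Keep in mind that CompositeNode.ChildNodes is a live collection: it is updated every time a child is added or removed. I will cover this in more detail in a later post.

To check directly whether a node has any children, use the CompositeNode.HasChildNodes property.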
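
To make "live collection" concrete, here is a small sketch that uses ordinary java.util lists rather than Aspose.Words classes. LiveCollectionSketch and its method names are invented for this illustration. A live view reflects later changes to the underlying list, whereas a snapshot copy does not, which is why a snapshot is a common, generic way to remove items safely while looping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Generic illustration with java.util lists; this is not the Aspose.Words NodeCollection.
class LiveCollectionSketch {
    // Returns {size seen through the live view, size of the snapshot} after one removal.
    static int[] sizesAfterRemoval() {
        List<String> children = new ArrayList<>(Arrays.asList("Run 1", "Run 2", "Run 3"));
        List<String> liveView = Collections.unmodifiableList(children); // a view, not a copy
        List<String> snapshot = new ArrayList<>(children);              // an independent copy

        children.remove("Run 2");
        return new int[] { liveView.size(), snapshot.size() };
    }

    // Removes every child while looping over a snapshot, so removals cannot disturb the loop.
    static int removeAllViaSnapshot() {
        List<String> children = new ArrayList<>(Arrays.asList("Run 1", "Run 2", "Run 3"));
        for (String child : new ArrayList<>(children)) {
            children.remove(child);
        }
        return children.size();
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterRemoval();
        System.out.println("Live view size: " + sizes[0]);
        System.out.println("Snapshot size: " + sizes[1]);
        System.out.println("Left after removal loop: " + removeAllViaSnapshot());
    }
}
```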

The following examples show two ways to iterate over child nodes: first with a for-each loop, then by index.

NodeCollection children = paragraph.getChildNodes();
for (Node child : (Iterable<Node>) children)
{
    // Paragraph may contain children of various types such as runs, shapes and so on.
    if (child.getNodeType() == NodeType.RUN)
    {
        // Say we found the node that we want, do something useful.
        Run run = (Run)child;
        System.out.println(run.getText());
    }
}

NodeCollection children = paragraph.getChildNodes();
for (int i = 0; i < children.getCount(); i++)
{
    Node child = children.get(i);

    // Paragraph may contain children of various types such as runs, shapes and so on.
    if (child.getNodeType() == NodeType.RUN)
    {
        // Say we found the node that we want, do something useful.
        Run run = (Run)child;
        System.out.println(run.getText());
    }
}

NodeType was introduced in the previous post, and this is exactly the kind of place where it is useful.

Sibling Nodes

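As mentioned earlier, sibling nodes are ordered, much like brothers and sisters in a family who are born one after another.

So how do you get to the sibling before or after a node? Use the Node.PreviousSibling and Node.NextSibling properties. For the last child, Node.NextSibling is null; likewise, for the first child, Node.PreviousSibling is null.

Note that Aspose.Words keeps sibling nodes internally in a singly linked list, so Node.NextSibling is much more efficient than Node.PreviousSibling.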
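
Why would a singly linked list make the backward direction slower? The sketch below is a generic plain-Java illustration, with the hypothetical names LinkedSibling and SiblingListSketch. How Aspose.Words actually implements PreviousSibling is not described in this post, so treat this only as the textbook reason: each node stores just a forward link, so finding the previous node means walking from the first sibling.

```java
// Generic singly linked list sketch; not Aspose.Words code.
class LinkedSibling {
    final String name;
    LinkedSibling next; // only the forward link is stored

    LinkedSibling(String name) {
        this.name = name;
    }
}

class SiblingListSketch {
    // Links the given names in order and returns the first sibling (null if none).
    static LinkedSibling link(String... names) {
        LinkedSibling first = null;
        LinkedSibling last = null;
        for (String name : names) {
            LinkedSibling node = new LinkedSibling(name);
            if (first == null) {
                first = node;
            } else {
                last.next = node;
            }
            last = node;
        }
        return first;
    }

    // Constant time: just follow the stored link.
    static LinkedSibling nextOf(LinkedSibling node) {
        return node.next;
    }

    // Linear time: walk from the first sibling until the target is reached.
    // Returns null if the target is the first sibling or is not in the list.
    static LinkedSibling previousOf(LinkedSibling first, LinkedSibling target) {
        LinkedSibling previous = null;
        for (LinkedSibling current = first; current != null; current = current.next) {
            if (current == target) {
                return previous;
            }
            previous = current;
        }
        return null;
    }

    public static void main(String[] args) {
        LinkedSibling first = link("Paragraph 1", "Table", "Paragraph 2");
        LinkedSibling table = nextOf(first);
        System.out.println("After the table: " + nextOf(table).name);
        System.out.println("Before the table: " + previousOf(first, table).name);
    }
}
```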

The following code shows how to walk through all the descendants of a node, starting from that node:

public void recurseAllNodes() throws Exception
{
    // Open a document.
    Document doc = new Document(getMyDir() + "Node.RecurseAllNodes.doc");

    // Invoke the recursive function that will walk the tree.
    traverseAllNodes(doc);
}

/**
 * A simple function that will walk through all children of a specified node recursively
 * and print the type of each node to the screen.
 */
public void traverseAllNodes(CompositeNode parentNode) throws Exception
{
    // This is the most efficient way to loop through immediate children of a node.
    for (Node childNode = parentNode.getFirstChild(); childNode != null; childNode = childNode.getNextSibling())
    {
        // Do some useful work.
        System.out.println(Node.nodeTypeToString(childNode.getNodeType()));

        // Recurse into the node if it is a composite node.
        if (childNode.isComposite())
            traverseAllNodes((CompositeNode)childNode);
    }
}

childNode.isComposite() checks whether the node is a CompositeNode, that is, whether it can contain child nodes.

Typed Access to Children and Parent

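As you can see, in the code examples above, whenever we reach a node during iteration we have to check its type through NodeType and then cast it.

If you would rather avoid that kind of manual checking and casting, there are alternatives:

1. Parent nodes expose typed FirstXXX and LastXXX properties for specific child types. For example, Document has Document.FirstSection and Document.LastSection, and Table has Table.FirstRow and Table.LastRow.

2. Parent nodes also expose typed collections of their children, for example Document.Sections, Body.Paragraphs, and so on.

3. Child nodes expose typed access to their parents, for example Run.ParentParagraph and Paragraph.ParentSection.

The following code shows this approach: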

// Quick typed access to the first child Section node of the Document.
Section section = doc.getFirstSection();

// Quick typed access to the Body child node of the Section.
Body body = section.getBody();

// Quick typed access to all Table child nodes contained in the Body.
TableCollection tables = body.getTables();

for (Table table : tables)
{
    // Quick typed access to the first row of the table.
    if (table.getFirstRow() != null)
        table.getFirstRow().remove();

    // Quick typed access to the last row of the table.
    if (table.getLastRow() != null)
        table.getLastRow().remove();
}
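
One way to think about typed accessors is that they perform the type check and the cast once, inside a helper, so the calling code does not have to. The plain-Java sketch below illustrates that idea with generics. DemoParagraph, DemoTable, and TypedAccessSketch are hypothetical stand-ins, and nothing here claims to be how Aspose.Words implements properties such as FirstSection.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for node classes; not Aspose.Words types.
class DemoParagraph {
    final String text;

    DemoParagraph(String text) {
        this.text = text;
    }
}

class DemoTable {
    final int rows;

    DemoTable(int rows) {
        this.rows = rows;
    }
}

class TypedAccessSketch {
    // Returns the first child that is an instance of the requested type, or null if none.
    static <T> T firstOfType(List<?> children, Class<T> type) {
        for (Object child : children) {
            if (type.isInstance(child)) {
                return type.cast(child);
            }
        }
        return null;
    }

    // Returns all children of the requested type, in their original order.
    static <T> List<T> allOfType(List<?> children, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Object child : children) {
            if (type.isInstance(child)) {
                result.add(type.cast(child));
            }
        }
        return result;
    }

    // A mixed child list: paragraph, table, paragraph.
    static List<Object> sampleChildren() {
        List<Object> children = new ArrayList<>();
        children.add(new DemoParagraph("intro"));
        children.add(new DemoTable(3));
        children.add(new DemoParagraph("outro"));
        return children;
    }

    public static void main(String[] args) {
        List<Object> body = sampleChildren();
        DemoTable firstTable = firstOfType(body, DemoTable.class);
        System.out.println("Rows in first table: " + firstTable.rows);
        System.out.println("Paragraph count: " + allOfType(body, DemoParagraph.class).size());
    }
}
```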

That wraps up the DOM structure of Aspose.Words. The next post, "Aspose.Words Programming Guide: Working with Document", takes a closer look at the most central Node class: Document.
