
Reading Notes on "Data Structure And Algorithm Analysis In C++", Part 4


Chapter 4 Trees

This chapter discusses a data structure for which the average running time of most operations is O(logN): the binary search tree. In the STL it underlies std::set and std::map (typically as a red-black tree). The chapter:
* Shows how trees are used to implement the file system of several popular operating systems.
* Shows how trees can be used to evaluate arithmetic expressions.
* Shows how to use trees to support searching operations in O(logN) average time, how to refine these ideas to obtain O(logN) worst-case bounds, and how to implement these operations when the data are stored on disk.
* Discusses and uses the std::set and std::map classes.

4.1 Preliminaries



From the recursive definition of a tree, we find that a tree is a collection of N nodes, one of which is the root, and N - 1 edges. That there are N - 1 edges follows from the fact that each edge connects some node to its parent, and every node except the root has exactly one parent. (Mathematical induction can also be used to prove that the number of edges is always N - 1.)


    A path from node n1 to nk is defined as a sequence of nodes n1, n2, ..., nk such that ni is the parent of ni+1 for 1 <= i < k.
    The length of this path is the number of edges on the path, namely, k - 1. There is a path of length zero from every node to itself. Notice that in a tree there is exactly one path from the root to each node.
    For any node ni, the depth of ni is the length of the unique path from the root to ni. The root is at depth 0.
    The height of ni is the length of the longest path from ni to a leaf. Thus all leaves are at height 0, and the height of a tree is the height of its root. For the tree in Figure 4.2, E is at depth 1 and height 2 (the longest path from E ends at leaf P or Q); F is at depth 1 and height 1 (from F to leaf K, L, or M); the height of the tree is 3 (from the root to its deepest leaf, P or Q). The depth of a tree (note: of the tree, not of the root) is the depth of its deepest leaf; this is always equal to the height of the tree.
    If there is a path from n1 to n2, then n1 is an ancestor of n2 and n2 is a descendant of n1. If n1 != n2, then n1 is a proper ancestor of n2 and n2 is a proper descendant of n1.

4.1.1 Implementation of Trees

template <typename Object>
struct TreeNode
{
    Object element;          // the data in the node
    TreeNode *firstChild;    // link to this node's leftmost child
    TreeNode *nextSibling;   // link to the next child of this node's parent
};
Since the number of children per node can vary and is not known in advance, we keep the children of each node in a linked list of tree nodes: each node holds a pointer to its first child and a pointer to its next sibling.





4.1.2 Tree Traversals with an Application

One of the popular uses for trees is the directory structure in many common operating systems, including UNIX and DOS. Refer to Figure 4.5 for a typical directory in the UNIX file system.





    This traversal strategy is known as a preorder traversal. In a preorder traversal, work at a node is performed before (pre) its children are processed.
    Time complexity analysis: lines 1 and 2 of the listing routine are executed exactly once per node, and line 4 is executed at most once for each child of each node; the total number of children is exactly one less than the number of nodes. Finally, the for loop iterates once per execution of line 4, plus once each time the loop ends. Thus the total amount of work is constant per node: if there are N file names to be output, the running time is O(N).
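The book's listing figure is not reproduced in these notes; the following is a minimal sketch of such a routine (my own names, assuming the first-child/next-sibling node from Section 4.1.1 with a string element):

#include <iostream>
#include <string>

// First-child/next-sibling node, with null defaults so the sketch is self-contained.
struct DirNode
{
    std::string name;
    DirNode *firstChild = nullptr;
    DirNode *nextSibling = nullptr;
};

// Preorder listing: print the node's own name first (indented by its
// depth), then list each child one level deeper. Siblings are reached
// through the nextSibling links.
void listAll(const DirNode *t, int depth = 0)
{
    if (t == nullptr)
        return;
    std::cout << std::string(2 * depth, ' ') << t->name << '\n';
    for (const DirNode *c = t->firstChild; c != nullptr; c = c->nextSibling)
        listAll(c, depth + 1);
}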
    Another common method of traversing a tree is the postorder traversal. The work at a node is performed after (post) its children are evaluated. Figure 4.8 represents the same directory structure as before, with the numbers in parentheses representing the number of disk blocks taken up by each file.
    Since the directories are themselves files, they have sizes too. Suppose we would like to calculate the total number of blocks used by all the files in the tree. The most natural way to do this is to find the number of blocks contained in the subdirectories /usr/mark (30), /usr/alex (9), and /usr/bill (32). The total number of blocks is then the total in the subdirectories (71) plus the one block used by /usr, for a total of 72. Refer to the pseudocode method size in Figure 4.9.



If the current object is not a directory, then size merely returns the number of blocks it uses. Otherwise, the number of blocks used by the directory is added to the number of blocks (recursively) found in all of its children. Figure 4.10 traces the size function (postorder).
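A hedged sketch of that postorder computation (my own node layout with a name and block count, not the book's code; for a plain file the child loop never runs, so size just returns the file's own blocks):

#include <string>

struct FileNode
{
    std::string name;
    int blocks;                    // disk blocks used by this file or directory itself
    FileNode *firstChild = nullptr;
    FileNode *nextSibling = nullptr;
};

// Postorder: a node's total is computed only after the totals of all of
// its children are known.
int size(const FileNode *t)
{
    int total = t->blocks;
    for (const FileNode *c = t->firstChild; c != nullptr; c = c->nextSibling)
        total += size(c);          // children first (post)
    return total;
}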



4.2 Binary Trees







4.2.1 Implementation

An implementation of a binary tree node:
template <typename Object>
struct BinaryNode
{
    Object element;     // the data in the node
    BinaryNode *left;   // left child
    BinaryNode *right;  // right child
};
We do not explicitly draw nullptr links when referring to trees, because every binary tree with N nodes would require N + 1 nullptr links (this can be shown by mathematical induction).
Binary trees are used not only for searching but also, for example, in compiler design to evaluate expressions, as the following example shows:

4.2.2 An Example: Expression Trees



Figure 4.14 shows an example of an expression tree. The leaves of an expression tree are operands, such as constants or variable names, and the other nodes contain operators. This particular tree happens to be binary because all the operators are binary, and although this is the simplest case, it is possible for nodes to have more than two children. It is also possible for a node to have only one child, as is the case with the unary minus operator.
    We can evaluate an expression tree, T, by applying the operator at the root to the values obtained by recursively evaluating the left and right subtrees. In our example, the left subtree evaluates to a + (b * c) and the right subtree evaluates to ((d * e) + f) * g. The entire tree represents (a + (b * c)) + (((d * e) + f) * g).
    An infix expression can be produced from the expression tree by the general strategy left, node, right; this is the inorder traversal.
    If we change the strategy to the postorder traversal (mentioned in Section 4.1 for calculating the size of a directory), a postfix expression is obtained, such as a b c * + d e * f + g * +.
    A third strategy is to print the operator first and then recursively print the left and right subtrees: + + a * b c * + * d e f g. This is the prefix notation, and the traversal is the preorder traversal (mentioned in Section 4.1 for printing the structure of a directory tree).
    In general: prefix notation is obtained by a preorder traversal (node, left, right); postfix notation is obtained by a postorder traversal (left, right, node); and infix notation is obtained by an inorder traversal (left, node, right).
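As a small illustration of the evaluation paragraph above (my own sketch, not the book's code; it assumes a well-formed tree of binary operators over numeric leaves):

#include <stdexcept>

struct ExprNode
{
    char op;                     // '+', '-', '*', '/' for operators; unused at leaves
    double value;                // meaningful only for operands (leaves)
    ExprNode *left = nullptr;
    ExprNode *right = nullptr;
};

// Postorder evaluation: evaluate both subtrees, then apply the operator
// at the root to the two results.
double evaluate(const ExprNode *t)
{
    if (t->left == nullptr && t->right == nullptr)   // leaf => operand
        return t->value;
    double lhs = evaluate(t->left);
    double rhs = evaluate(t->right);
    switch (t->op)
    {
        case '+': return lhs + rhs;
        case '-': return lhs - rhs;
        case '*': return lhs * rhs;
        case '/': return lhs / rhs;
    }
    throw std::invalid_argument("unknown operator");
}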

Constructing an Expression Tree

Consider the algorithm to convert a postfix expression into an expression tree. Since we already have an algorithm to convert infix to postfix, we can generate expression trees from the two common types of input. Sketch of the process:
1) Read the postfix expression one symbol at a time. If the symbol is an operand, create a one-node tree and push a pointer to it onto a stack.
2) If the symbol is an operator, pop two trees T1 and T2 from the stack (T1 is popped first) and form a new tree whose root is the operator and whose left and right children point to T2 and T1, respectively. A pointer to this new tree is then pushed onto the stack.
3) After all of the operands and operators in the input have been processed, pop the last tree from the stack; it is the required expression tree.
As an example, input: a b + c d e + * * (see the sketch below).
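A minimal sketch of this construction (my own version: single-character operands, no error handling, no memory cleanup):

#include <cctype>
#include <stack>
#include <string>

struct ExprTreeNode
{
    char symbol;
    ExprTreeNode *left;
    ExprTreeNode *right;
    ExprTreeNode(char s, ExprTreeNode *l = nullptr, ExprTreeNode *r = nullptr)
        : symbol{s}, left{l}, right{r} {}
};

// Build an expression tree from a postfix string such as "ab+cde+**".
ExprTreeNode *buildFromPostfix(const std::string &postfix)
{
    std::stack<ExprTreeNode *> st;
    for (char c : postfix)
    {
        if (std::isalnum(static_cast<unsigned char>(c)))
            st.push(new ExprTreeNode{c});           // operand: one-node tree
        else
        {
            ExprTreeNode *t1 = st.top(); st.pop();  // popped first => right child
            ExprTreeNode *t2 = st.top(); st.pop();  // popped second => left child
            st.push(new ExprTreeNode{c, t2, t1});
        }
    }
    ExprTreeNode *result = st.top();                // the finished expression tree
    st.pop();
    return result;
}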








4.3 The Search Tree ADT--Binary Search Trees

An important application of binary trees is their use in searching. Assume that each node in the tree stores an item. In our examples, we will assume for simplicity that these are integers, although arbitrarily complex items are easily handled in C++. We also assume that all the items are distinct (duplicates are discussed later).





In our implementation of the binary search tree, the items are Comparable objects (supporting operator<); Section 1.6.3 shows how a function object can be used instead to supply the comparison. The only data member is a pointer to the root node; this member is nullptr for empty trees. The public functions use the general technique of calling private recursive functions, e.g. for contains, insert, and remove.
The binary search tree class skeleton:
template <typename Comparable>
class BinarySearchTree
{
  public:
    BinarySearchTree();
    BinarySearchTree(const BinarySearchTree &rhs);
    BinarySearchTree(BinarySearchTree &&rhs);
    ~BinarySearchTree();

    const Comparable &findMin() const;
    const Comparable &findMax() const;
    bool contains(const Comparable &x) const;
    bool isEmpty() const;
    void printTree(std::ostream &out = std::cout) const;

    void makeEmpty();
    void insert(const Comparable &x);
    void insert(Comparable &&x);
    void remove(const Comparable &x);

    BinarySearchTree &operator=(const BinarySearchTree &rhs);
    BinarySearchTree &operator=(BinarySearchTree &&rhs);

  private:
    struct BinaryNode
    {
        Comparable element;
        BinaryNode *left;
        BinaryNode *right;

        BinaryNode(const Comparable &theElement, BinaryNode *lt, BinaryNode *rt)
            : element{theElement}, left{lt}, right{rt}
        {
        }

        BinaryNode(Comparable &&theElement, BinaryNode *lt, BinaryNode *rt)
            : element{std::move(theElement)}, left{lt}, right{rt}
        {
        }
    };

    BinaryNode *root;   // pointer to the root node of the BinarySearchTree

    // private members for recursive calls
    void insert(const Comparable &x, BinaryNode *&t);
    void insert(Comparable &&x, BinaryNode *&t);
    void remove(const Comparable &x, BinaryNode *&t);
    BinaryNode *findMin(BinaryNode *t) const;
    BinaryNode *findMax(BinaryNode *t) const;
    bool contains(const Comparable &x, BinaryNode *t) const;
    void makeEmpty(BinaryNode *&t);
    void printTree(BinaryNode *t, std::ostream &out) const;
    BinaryNode *clone(BinaryNode *t) const;
};
Some of the private member functions use the technique of passing a pointer variable by reference. This allows the public member functions to pass a pointer to the root to the private recursive member functions, which can then change the value of the root so that it points to another node.
Some of the private methods follow.

4.3.1 contains

Notice that the recursion in contains is tail recursion; it can be removed and replaced with a while loop (see the sketch below). Because the depth of a binary search tree is about logN on average, the recursive version uses only O(logN) stack space anyway. Figure 4.19 shows the trivial changes required to use a function object rather than requiring that the items be Comparable.
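A sketch of the loop version, consistent with the class skeleton above (a function-object variant would replace the two < comparisons with calls to the comparator):

// Tail recursion unrolled into a loop: walk down the tree, going left or
// right according to the comparison, until the item is found or we fall
// off the bottom.
template <typename Comparable>
bool BinarySearchTree<Comparable>::contains(const Comparable &x, BinaryNode *t) const
{
    while (t != nullptr)
    {
        if (x < t->element)
            t = t->left;
        else if (t->element < x)
            t = t->right;
        else
            return true;   // match
    }
    return false;          // not found
}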


4.3.2 findMin and findMax

These two routines return a pointer to the node containing the smallest and largest elements in the tree, respectively. For findMin, start at the root and go left as long as there is a left child; the stopping point is the smallest element. For findMax, go right as long as there is a right child; the stopping point is the largest element. A sketch of both routines follows.
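A sketch consistent with the skeleton above; as in the book, findMin is written recursively and findMax with a loop, to show both styles:

// Smallest item: follow left links to the end (recursive version).
template <typename Comparable>
typename BinarySearchTree<Comparable>::BinaryNode *
BinarySearchTree<Comparable>::findMin(BinaryNode *t) const
{
    if (t == nullptr)
        return nullptr;
    if (t->left == nullptr)
        return t;
    return findMin(t->left);
}

// Largest item: follow right links to the end (loop version).
template <typename Comparable>
typename BinarySearchTree<Comparable>::BinaryNode *
BinarySearchTree<Comparable>::findMax(BinaryNode *t) const
{
    if (t != nullptr)
        while (t->right != nullptr)
            t = t->right;
    return t;
}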

4.3.3 insert

Figure 4.22 shows the insertion process. To insert 5, we traverse the tree as though a contains were occurring. At the node with item 4, we need to go right, but there is no subtree, so 5 is not in the tree, and this is the correct spot to place it. Duplicates can be handled by keeping an extra field in the node record indicating the frequency of occurrence; this adds some space to the tree but is better than putting duplicates in the tree (which tends to make the tree very deep). A sketch of the recursive insert follows.
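A sketch consistent with the skeleton above; note how passing the pointer by reference lets the recursion attach a new leaf by simple assignment:

// Insert x into the subtree rooted at t. t is a reference to a pointer,
// so assigning to it at the nullptr base case links the new node into
// the tree. Duplicates fall through to the final branch and are ignored.
template <typename Comparable>
void BinarySearchTree<Comparable>::insert(const Comparable &x, BinaryNode *&t)
{
    if (t == nullptr)
        t = new BinaryNode{x, nullptr, nullptr};
    else if (x < t->element)
        insert(x, t->left);
    else if (t->element < x)
        insert(x, t->right);
    else
        ;   // duplicate: do nothing (or bump a frequency count)
}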


4.3.4 remove

The hardest operation is deletion. Once we have found the node to be deleted, we need to consider several possibilities. If the node is a leaf, it can be deleted immediately. If the node has one child, the node can be deleted after its parent adjusts a link to bypass the node (refer to Figure 4.24).


The complicated case is a node with two children. The general strategy is to replace the data of this node with the smallest data of the right subtree (easily found with findMin) and then recursively delete that node from the right subtree. Because the smallest node in the right subtree cannot have a left child, the second remove is an easy leaf or one-child deletion. Refer to Figure 4.25; a sketch follows.
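A sketch consistent with the skeleton above, covering all three cases:

// Remove x from the subtree rooted at t. The two-child case copies the
// smallest item of the right subtree into this node and then removes
// that item from the right subtree (now an easy one-child/leaf removal).
template <typename Comparable>
void BinarySearchTree<Comparable>::remove(const Comparable &x, BinaryNode *&t)
{
    if (t == nullptr)
        return;                       // item not found; do nothing
    if (x < t->element)
        remove(x, t->left);
    else if (t->element < x)
        remove(x, t->right);
    else if (t->left != nullptr && t->right != nullptr)   // two children
    {
        t->element = findMin(t->right)->element;
        remove(t->element, t->right);
    }
    else                              // zero or one child
    {
        BinaryNode *oldNode = t;
        t = (t->left != nullptr) ? t->left : t->right;
        delete oldNode;
    }
}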



4.3.5 Destructor and Copy Constructor



Finally, for a full implementation of the binary search tree, refer to: https://github.com/sesiria/Algs/blob/master/Lib/BinarySearchTree.h

4.3.6 Average-Case Analysis

Except for makeEmpty and copying, all operations are O(d), where d is the depth of the node containing the accessed item (in the case of remove, this may be the replacement node in the two-child case). The average depth over all nodes in a tree is O(logN), on the assumption that all insertion sequences are equally likely. The sum of the depths of all nodes in a tree is known as the internal path length. By calculating the average internal path length of a binary search tree, where the average is taken over all possible insertion sequences, one can prove that the average depth of a binary search tree is O(logN).


So the average running time of all operations (except makeEmpty and copying) is O(logN), but this is not entirely true. The reason is deletion: it is not clear that all binary search trees remain equally likely. In particular, the deletion algorithm described above favors making the left subtree deeper than the right, because we always replace a deleted node with a node from the right subtree. The exact effect of this strategy is still unknown, but it seems only to be a theoretical novelty. It has been shown that if we alternate insertions and deletions Θ(N^2) times, the trees will have an expected depth of Θ(√N). After a quarter-million random insert/remove pairs, the tree that was somewhat right-heavy in Figure 4.29 looks decidedly unbalanced (average depth = 12.51) in Figure 4.30.





We could try to eliminate the problem by randomly choosing between the smallest element in the right subtree and the largest in the left when replacing the deleted element. This apparently eliminates the bias and should keep the trees balanced, but nobody has actually proved this. In any event, this phenomenon appears to be mostly a theoretical novelty, because the effect does not show up at all for small trees and, stranger still, if o(N^2) insert/remove pairs are used, the tree seems to gain balance!
The main point of this discussion is that the "average" case excludes such extreme situations. In the absence of deletions, or when lazy deletion is used, we can conclude that the average running times of the operations above are O(logN). Except for strange cases like the one discussed above, this result is very consistent with observed behavior.
If the input comes presorted, then a series of inserts takes quadratic time and gives a very expensive implementation of a linked list, since the tree will consist only of nodes with no left children. One solution to the problem is to insist on an extra structural condition called balance: no node is allowed to get too deep.
There are quite a few general algorithms to implement balanced trees. Most are quite a bit more complicated than a standard binary search tree and take longer on average for updates, but they provide protection against the embarrassingly simple cases. Below we discuss one of the oldest forms of balanced search trees, the AVL tree.
A second method is to forgo the balance condition and allow the tree to be arbitrarily deep, but after every operation apply a restructuring rule that tends to make future operations efficient. These types of data structures are generally classified as self-adjusting. In the case of a binary search tree, we can no longer guarantee an O(logN) bound on any single operation but can show that any sequence of M operations takes total time O(MlogN) in the worst case. This data structure is the splay tree; its analysis is discussed in Chapter 11.

4.4 AVL Trees

An AVL (Adelson-Velskii and Landis) tree is a binary search tree with a balance condition, which ensures that the depth of the tree is O(logN). The simplest idea is to require that the left and right subtrees of the root have the same height, but as Figure 4.31 shows, this idea does not force the tree to be shallow.



Another balance condition would insist that every node must have left and right sub-trees of the same height. If the height of an empty subtree is defined to be -1(as is usual), then only perfectly balanced trees of 2^k - 1 nodes would satisfy this criterion. Thus, although this guarantees trees of small depth, the balance condition is too rigid to be useful and needs to be relaxed.


    An AVL tree is identical to a binary search tree, except that for every node in the tree, the heights of the left and right subtrees can differ by at most 1. (The height of an empty tree is defined to be -1.) In Figure 4.32, the tree on the left is an AVL tree but the tree on the right is not. Height information is kept for each node (in the node structure). It can be shown that the height of an AVL tree is at most roughly 1.44 log(N + 2) - 1.328, but in practice it is only slightly more than logN.


    As an example, the AVL tree of height 9 with the fewest nodes (143) is shown in Figure 4.33. Its left subtree is an AVL tree of height 7 of minimum size, and its right subtree is an AVL tree of height 8 of minimum size. This tells us that the minimum number of nodes, S(h), in an AVL tree of height h is given by S(h) = S(h - 1) + S(h - 2) + 1, with S(0) = 1 and S(1) = 2. The function S(h) is closely related to the Fibonacci numbers, from which the bound claimed above on the height of an AVL tree follows.


Suppose the node that must be rebalanced is called a. There are four cases in which a violation may occur:
1) Insertion into the left subtree of the left child of a.
2) Insertion into the right subtree of the left child of a.
3) Insertion into the left subtree of the right child of a.
4) Insertion into the right subtree of the right child of a.
Cases 1) and 4) are mirror-image symmetries and can be fixed by a single rotation; cases 2) and 3) require a double rotation. Chapter 12 describes other balanced-tree methods with an eye toward a more careful implementation.

4.4.1 Single Rotation

Suppose the node that must be rebalanced is k2, and the insertion was into subtree X, so that the left subtree of k2 is now two levels deeper than its right subtree Z. This is the only arrangement in which k2 becomes the violating node: Y cannot be at the same level as the new X, because then k2 would have been out of balance before the insertion; and Y cannot be at the same level as Z, because then k1 would have been the first node on the path to violate the AVL property.


To ideally rebalance the tree, we would like to move X up a level and Z down a level. To do this, we rearrange the nodes into an equivalent tree, as shown in the second part of Figure 4.34. Because k1 < k2, k2 becomes the right child of the new root k1. X and Z remain as the left child of k1 and the right child of k2, respectively. Subtree Y, which holds the items between k1 and k2, is placed as k2's left child, preserving k1 < Y < k2. As a result of this work, which requires only a few pointer changes, we have another binary search tree that is an AVL tree: X moves up one level, Y stays at the same level, and Z moves down one level. k2 and k1 not only satisfy the AVL requirements but also have subtrees of exactly the same height. Furthermore, the new height of the entire subtree is exactly the same as the height of the original subtree prior to the insertion that caused X to grow, so no further updating of heights on the path to the root is needed, and consequently no further rotations are needed. A sketch of this rotation follows.
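A sketch of the single rotation for case 1, assuming an AVL node with an element, child pointers, and a stored height, plus a height helper that returns -1 for nullptr (consistent with the conventions above):

#include <algorithm>

struct AvlNode
{
    int element;
    AvlNode *left;
    AvlNode *right;
    int height;
};

int height(AvlNode *t) { return t == nullptr ? -1 : t->height; }

// Case 1 (left-left): rotate k2 with its left child k1. k1 becomes the
// root of the subtree, k2 becomes k1's right child, and subtree Y moves
// across to become k2's left child. Heights are then recomputed.
void rotateWithLeftChild(AvlNode *&k2)
{
    AvlNode *k1 = k2->left;
    k2->left = k1->right;   // Y: k1 < Y < k2
    k1->right = k2;
    k2->height = std::max(height(k2->left), height(k2->right)) + 1;
    k1->height = std::max(height(k1->left), k2->height) + 1;
    k2 = k1;                // k1 is the new root of this subtree
}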


Case 4 is symmetric. Figure 4.36 shows how a single rotation is applied.



A case study: insert 3, 2, 1, and then 4 through 7 in sequential order. The first problem occurs when it is time to insert item 1, because the AVL property is violated at the root. We perform a single rotation between the root and its left child to fix the problem:



Next we insert 4, which causes no problem, but the insertion of 5 creates a violation at node 3 that is fixed by a single rotation. Besides the local change caused by the rotation, the programmer must remember that the rest of the tree has to be informed of this change: here, 2's right child must be reset to point to 4 instead of 3. Forgetting to do so is easy and would destroy the tree (4 would be inaccessible).


Next we insert 6. This causes a balance problem at the root, since its left subtree is of height 0 and its right subtree would be of height 2. Therefore, we perform a single rotation at the root between 2 and 4.


The rotation is performed by making 2 a child of 4 and making 4's original left subtree the new right subtree of 2. Every item in this subtree must lie between 2 and 4. The next item we insert is 7, which causes another rotation:


4.4.2 Double Rotation

A single rotation does not work for case 2 or case 3 (refer to Figure 4.37): the problem is that subtree Y is too deep, and a single rotation does not make it any less deep. The double rotation that solves the problem is shown in Figure 4.38.


The fact that subtree Y in Figure 4.37 had an item inserted into it guarantees that it is not empty. Thus we may assume that it has a root and two subtrees, and consequently the tree may be viewed as four subtrees connected by three nodes. As the diagram suggests, exactly one of subtrees B and C is two levels deeper than D, but we cannot be sure which one; it turns out not to matter, and in Figure 4.38 both B and C are drawn at 1.5 levels below D.
To rebalance, we place k2 as the new root (this forces k1 to be k2's left child and k3 to be its right child). Figure 4.39 shows that the symmetric case 3 can also be fixed by a double rotation. In code, a double rotation is just two single rotations, as sketched below.
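A sketch for case 2, reusing the AvlNode and rotateWithLeftChild from the single-rotation sketch above (the mirror rotation is included so the block is self-contained):

// Mirror of rotateWithLeftChild: rotate k1 with its right child.
void rotateWithRightChild(AvlNode *&k1)
{
    AvlNode *k2 = k1->right;
    k1->right = k2->left;
    k2->left = k1;
    k1->height = std::max(height(k1->left), height(k1->right)) + 1;
    k2->height = std::max(k1->height, height(k2->right)) + 1;
    k1 = k2;
}

// Case 2 (left-right): first rotate within k3's left subtree, bringing
// k2 up from below Y, then rotate at k3 itself; k2 ends up as the new
// root of the whole subtree with k1 and k3 as its children.
void doubleWithLeftChild(AvlNode *&k3)
{
    rotateWithRightChild(k3->left);
    rotateWithLeftChild(k3);
}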

Continue the example by inserting 10 through 16 in reverse order (that is, 16, 15, 14, 13, 12, 11, 10), followed by 8 and 9.



Inserting 16 is easy, since it does not destroy the balance property, but inserting 15 causes a height imbalance at node 7. This is case 3, which is solved by a right-left double rotation. (The rotation involves 7, 16, and 15: k1 is the node with item 7, k3 is the node with item 16, and k2 is the node with item 15; subtrees A, B, C, and D are all empty.)




If 13 is now inserted, there is an imbalance at the root. Since 13 is not between 4 and 7, we know that a single rotation will work.


Insertion of 12 will also require a single rotation:


To insert 11, a single rotation needs to be performed, and the same is true for the subsequent insertion of 10. We insert 8 without a rotation, creating an almost perfectly balanced tree:


Finally, we insert 9 to show the symmetric case of the double rotation.


For an implementation of the AVL tree, refer to: https://github.com/sesiria/Algs/blob/master/Lib/AVLTree.h

4.5 Splay Trees

    A splay tree guarantees that any M consecutive tree operations starting from an empty tree take at most O(MlogN) time. Although this does not preclude the possibility that any single operation might take Θ(N) time, and thus the bound is not as strong as an O(logN) worst-case bound per operation, the net effect is the same: there are no bad input sequences. Generally, when a sequence of M operations has total worst-case running time of O(Mf(N)), we say that the amortized running time is O(f(N)). Thus, a splay tree has an O(logN) amortized cost per operation.
    Splay trees are based on the fact that the O(N) worst-case time per operation for binary search trees is not bad, as long as it occurs relatively infrequently. A search tree data structure with O(N) worst-case time, but a guarantee of at most O(MlogN) for any M consecutive operations, is certainly satisfactory, because there are no bad sequences.
    If any particular operation is allowed to have an O(N) worst-case time bound, and we still want an O(logN) amortized time bound, then it is clear that whenever a node is accessed, it must be moved. Otherwise, once we find a deep node, we could keep performing accesses on it; if the node does not change location and each access costs Θ(N), then a sequence of M accesses will cost Θ(M·N).
    The basic idea of the splay tree is that after a node is accessed, it is pushed to the root by a series of AVL tree rotations. Notice that if a node is deep, there are many nodes on the path that are also relatively deep, and by restructuring we can make future accesses cheaper on all of these nodes. Thus, if the node is unduly deep, the restructuring has the side effect of balancing the tree (to some extent). In many applications, when a node is accessed, it is likely to be accessed again in the near future; studies have shown that this happens much more often than one would expect. Splay trees also do not require the maintenance of height or balance information, thus saving space and simplifying the code to some extent.

4.5.1 A Simple Idea(That Does Not Work)

One way of performing the restructuring described above is with single rotations, bottom up: we rotate every node on the access path with its parent. As an example, consider an access (a contains) on k1 in the following tree:






The rotations have the effect of pushing k1 all the way to the root, so that future accesses on k1 are easy. But the strategy has pushed another node (k3) almost as deep as k1 used to be, an access on that node will then push another node deep, and so on. So this strategy does not improve matters for the other nodes on the original access path. It can be shown that there are sequences of M operations requiring Ω(M·N) time. Suppose we build the tree by inserting the keys 1, 2, 3, ..., N; the insertions cost O(N) in total. Then we access the keys 1, 2, 3, ..., N in order. The access to key 1 takes N units of time (N rotations); after its rotations are complete, an access to key 2 takes N units, then key 3 takes N - 1 units, and so on. The total for accessing all the keys in order is N + Σ(i=2..N) i = Ω(N^2). After they are accessed, the tree reverts to its original state, and we can repeat the sequence.

4.5.2 Splaying

The splaying strategy is similar to the rotate-to-root idea above, except that we are a little more selective about how the rotations are performed. We still rotate bottom up along the access path. Let X be a node on the access path at which we are rotating. If the parent of X is the root of the tree, we merely rotate X and the root; this is the last rotation along the access path. Otherwise, X has both a parent (P) and a grandparent (G), and there are two cases, plus symmetries, to consider. The first is the zig-zag case (see Figure 4.48): X is a right child and P is a left child (or vice versa). In this case we perform a double rotation, exactly like an AVL double rotation.


The other is the zig-zig case: X and P are both left children (or, in the symmetric case, both right children). In that case, we transform the tree on the left of Figure 4.49 into the tree on the right, as sketched below.
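A sketch of the left-left zig-zig on plain child pointers (my own construction, with no height bookkeeping since splay trees keep none). The key difference from the naive bottom-up scheme is the order: splaying rotates P with G first, then X with P, which is exactly two successive single rotations applied at the top of the subtree:

struct SplayNode
{
    int element;
    SplayNode *left;
    SplayNode *right;
};

// Rotate t with its left child (no height fields to maintain).
void rotateLeftChildUp(SplayNode *&t)
{
    SplayNode *c = t->left;
    t->left = c->right;
    c->right = t;
    t = c;
}

// Zig-zig, left-left case: X and its parent P are both left children of
// the grandparent G. The first rotation promotes P over G, the second
// promotes X over P, leaving X on top with P and G strung down its
// right side, as in Figure 4.49.
void zigZigLeft(SplayNode *&g)
{
    rotateLeftChildUp(g);   // P-G rotation
    rotateLeftChildUp(g);   // X-P rotation
}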
Consider the tree from the last example, with a contains on k1:






Whereas the access on the node with item 1 takes N units as before, the subsequent access on the node with item 2 takes only about N/2 units instead of N units: there are no nodes quite as deep as before. An access on the node with item 2 then brings nodes to within N/4 of the root, and this is repeated until the depth becomes roughly logN. Figures 4.51 to 4.59 show the result of accessing items 1 through 9 in a 32-node tree that originally contains only left children. (A rather complicated proof shows that, for this example, the N accesses take a total of O(N) time.)










These figures highlight the fundamental and crucial property of splay trees. When access paths are long, leading to longer-than-normal search times, the rotations tend to be good for future operations; when accesses are cheap, the rotations are not as good and can be bad. The extreme case is the initial tree formed by the insertions: all the insertions were constant-time operations, leading to a bad initial tree, and then a run of expensive accesses restructures it into a nearly balanced tree. The main theorem, which we will prove in Chapter 11, is that we never fall behind a pace of O(logN) per operation: we are always on schedule, even though there are occasionally bad operations.
We can perform deletion by accessing the node to be deleted, which puts the node at the root. If it is deleted, we get two subtrees TL and TR (left and right). If we find the largest element in TL (which is easy), that element is rotated to the root of TL, and TL then has a root with no right child. We finish the deletion by making TR the right child of TL's root.
The analysis of splay trees is difficult, because it must take into account the ever-changing structure of the tree. On the other hand, splay trees are much simpler to program than most balanced search trees, since there are fewer cases to consider and no balance information to maintain. (Chapter 12 gives an implementation of a splay tree.)
Code for a bottom-up splay tree: https://github.com/sesiria/Algs/blob/master/Lib/SplayTree.h

4.6 Tree Traversals (Revisited)

Because of the ordering information in a binary search tree, it is simple to list all the items in sorted order with a recursive routine printTree (an inorder traversal): it prints the left subtree, then the current node, then the right subtree. The total running time is O(N), because constant work is performed at every node: each node is visited once, and the work at each node consists of testing against nullptr, making the recursive calls, and printing. Sometimes we need to process both subtrees before we can process a node. For instance, to compute the height of a node, we need to know the heights of its subtrees first (refer to Figure 4.61, sketched below). This is the postorder traversal, and the time complexity is again O(N) because of the constant work per node.
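A sketch of that height routine on a bare node struct (my own minimal layout, standing in for the book's figure):

#include <algorithm>

struct Node
{
    int element;
    Node *left;
    Node *right;
};

// Postorder: both subtree heights are needed before the node's own
// height can be computed. The height of an empty tree is -1.
int height(const Node *t)
{
    if (t == nullptr)
        return -1;
    return 1 + std::max(height(t->left), height(t->right));
}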


The third popular traversal scheme is the preorder traversal, in which the node is processed before its children; you would use it, for example, to label each node with its depth. The common idea in all of these routines is that you handle the nullptr case first and then the rest. Notice the lack of extraneous variables: these routines pass only the pointer to the node that roots the subtree and do not declare or pass any extra variables. The more compact the code, the less likely that a silly bug will turn up. A fourth, less often used, traversal is the level-order traversal, in which all nodes at depth d are processed before any node at depth d + 1. Level-order traversal differs from the other traversals in that it is not done recursively; a queue is used instead of the implied stack of recursion, as sketched below.
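A sketch reusing the Node struct from the height sketch above:

#include <iostream>
#include <queue>

// Level-order traversal: a FIFO queue replaces the implied stack of
// recursion, so nodes come out in increasing order of depth.
void printLevelOrder(const Node *root)
{
    std::queue<const Node *> q;
    if (root != nullptr)
        q.push(root);
    while (!q.empty())
    {
        const Node *t = q.front();
        q.pop();
        std::cout << t->element << '\n';
        if (t->left != nullptr)  q.push(t->left);
        if (t->right != nullptr) q.push(t->right);
    }
}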

4.7 B-Trees

    If we have a massive amount of data that cannot be loaded into main memory all at once, we must store it on disk. But the access rate of a hard disk is limited (a 7200 rpm HDD allows roughly 120 accesses per second), so a disk access is expensive.
    Consider a typical search tree performing on disk: suppose we want to access the driving records of the citizens of the state of Florida. Assume we have 10,000,000 items, that each key is 32 bytes (representing a name), and that a record is 256 bytes. Assume this does not fit in main memory and that we are 1 of 20 users on the system (so we have 1/20 of the resources). Thus, in 1 second we can execute many millions of instructions or perform six disk accesses. An unbalanced binary search tree is a disaster: in the worst case it has linear depth and could thus require 10,000,000 disk accesses. On average, a successful search requires 1.38 logN disk accesses; since log 10,000,000 ≈ 24, an average search requires about 32 disk accesses, or 5 seconds. In a typical randomly constructed tree, we would expect a few nodes to be three times deeper; these would require about 100 disk accesses, or 16 seconds. An AVL tree is somewhat better: the worst case of 1.44 logN is unlikely to occur, and the typical case is very close to logN, so an AVL tree uses about 25 disk accesses on average, requiring 4 seconds.
    So we want a data structure that minimizes disk accesses, even at the cost of some complication; that means a search tree with low depth. We need an M-ary search tree that allows M-way branching, together with some restriction that prevents it from degenerating into a binary search tree or even a linear linked list.
    One way to achieve this is the B-tree. The basic B-tree is described here; many variations and improvements are known, and an implementation is somewhat complex because there are quite a few cases. However, it is easy to see that, in principle, a B-tree guarantees only a few disk accesses.




    An example of a B-tree of order 5 (5-ary) is shown in Figure 4.63. All nonleaf nodes have between 3 and 5 children (and thus between two and four keys); the root may possibly have only two children. Here we have L = 5; it happens that L and M are the same in this example, but this is not necessary. Since L is 5, each leaf has between three and five data items. Requiring nodes to be half full guarantees that the B-tree does not degenerate into a simple binary tree.
    Each node represents a disk block, so we choose M and L on the basis of the size of the items being stored. For example, suppose one block holds 8,192 bytes, and in our Florida example each key uses 32 bytes. A B-tree node of order M has M - 1 keys, for a total of 32M - 32 bytes, plus M branches. Since each branch is essentially the block number of another disk block, we can assume a branch takes 4 bytes, so the branches use 4M bytes and the total memory requirement for a nonleaf node is 36M - 32 bytes. The largest value of M for which this is no more than 8,192 is 228 (solve 36M - 32 <= 8192), so we choose M = 228. Since each data record is 256 bytes, we can fit 32 records in a block (8192 / 256 = 32), so we choose L = 32. Each leaf then has between 16 and 32 data records, and each internal node (except the root) branches in at least 114 ways. Since there are 10,000,000 records, there are at most 625,000 leaves (in the worst case each leaf holds only L/2 = 16 records, and 10,000,000 / 16 = 625,000). Consequently, in the worst case (every nonleaf node branching only 114 ways and every leaf only half full), leaves would be on level 4. In more concrete terms, the worst-case number of accesses is given approximately by log_{M/2} N, give or take 1. (For example, the root and the next level could be cached in main memory, so that over the long run, disk accesses would be needed only for level 3 and deeper.)
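The arithmetic in this example can be checked mechanically; the following throwaway sketch hard-codes the section's numbers (my own code, nothing from the book):

#include <cmath>
#include <iostream>

int main()
{
    const double block = 8192, keyBytes = 32, branchBytes = 4, recordBytes = 256;
    const double n = 10000000;

    // Nonleaf node: (M - 1) keys plus M branches must fit in one block.
    // 32(M - 1) + 4M <= 8192  =>  36M - 32 <= 8192  =>  M = 228.
    int M = static_cast<int>((block + keyBytes) / (keyBytes + branchBytes));
    int L = static_cast<int>(block / recordBytes);   // records per leaf

    std::cout << "M = " << M << ", L = " << L << '\n';                        // M = 228, L = 32
    std::cout << "worst-case leaves: " << std::ceil(n / (L / 2)) << '\n';     // 625,000
    std::cout << "approx. accesses:  " << std::log(n) / std::log(M / 2.0)     // about 3.4
              << '\n';
    return 0;
}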
What remains is how to add and remove items from the B-tree. For insertion, suppose we want to insert 57 into the B-tree in Figure 4.63. A search down the tree reveals that it is not already in the tree, and we can add it to the leaf as a fifth item. Note that we may have to reorganize all the data in the leaf to do this, but the cost of doing so is negligible compared to that of the disk access, which in this case also includes a disk write. That was relatively painless, because the leaf was not already full. Suppose we now want to insert 55. Figure 4.64 shows the problem: the leaf where 55 wants to go is already full.


The solution: since we now have L + 1 items, we split them into two leaves, both guaranteed to have the minimum number of data records needed; here we form two leaves with three items each. Two disk accesses are required to write these leaves, and a third disk access is required to update the parent. Note that in the parent, both keys and branches change, but they do so in a controlled way that is easily calculated. The resulting B-tree is shown in Figure 4.65.


Although splitting nodes is time-consuming because it requires at least two additional disk writes, it is a relatively rare occurrence. If L is 32, for example, then when a node is split, two leaves with 16 and 17 items, respectively, are created; for the leaf with 17 items, we can perform 15 more insertions without another split. Put another way, for every split there are roughly L/2 nonsplits. Suppose we insert 40 into the B-tree in Figure 4.65. We must split the leaf containing the keys 35 through 39, and now 40, into two leaves. But doing so would give the parent six children, and it is allowed only five. The solution is to split the parent. The result is shown in Figure 4.66.


When the parent is split, we must update the values of the keys and also the parent's parent, incurring an additional two disk writes (so this insertion costs five disk writes: the two leaves from the split, the two nodes from the parent's split, and the updated grandparent). However, the keys change in a very controlled manner, although the code is certainly not simple because of a host of cases.
When a nonleaf node is split, as is the case here, its parent gains a child. What if the parent has already reached its limit of children? Then we continue splitting nodes up the tree until either we find a parent that does not need to be split or we reach the root. If we split the root, we have two roots; obviously this is unacceptable, but we can create a new root with the two halves as children, which is why the root was granted the special two-child minimum exemption. This is also the only way a B-tree gains height. Splitting all the way up to the root is an exceptionally rare event, since a tree with four levels indicates that the root has been split only three times throughout the entire sequence of insertions (assuming no deletions have occurred). In fact, splitting any nonleaf node is quite rare.
There are other ways to handle overflowing children. One technique is to put a child up for adoption should a neighbor have room. To insert 29 into the B-tree in Figure 4.66, for example, we could make room by moving 32 to the next leaf. This technique requires a modification of the parent, because the keys are affected; however, it tends to keep nodes fuller and saves space in the long run.
We can perform deletion by finding the item that needs to be removed and then removing it. The problem is that if the leaf it was in had the minimum number of data items, it is now below the minimum. We can rectify this situation by adopting a neighboring item, if the neighbor is not itself at its minimum; if it is, we combine with the neighbor to form a full leaf. Unfortunately, this means the parent has lost a child. If this causes the parent to fall below its minimum, we follow the same strategy, and this process could percolate all the way up to the root. The root cannot have just one child (and even if this were allowed, it would be silly). If the root is left with one child as a result of the adoption process, we remove the root and make its child the new root of the tree; this is the only way a B-tree loses height. Suppose we want to remove 99 from the B-tree in Figure 4.66. Since its leaf has only two items and its neighbor is already at its minimum of three, we combine the items into a new leaf of five items. As a result, the parent has only two children; however, it can adopt from a neighbor, because the neighbor has four children. After the adoption, both have three children. The result is shown in Figure 4.67.


A more practical B-tree implementation is described in CLRS.
An implementation of the B-tree from CLRS: https://github.com/sesiria/Algs/blob/master/Lib/BTree.h


4.8 Sets and Maps in the Standard Library

The STL containers std::vector and std::list are inefficient for searching. Consequently, the STL provides two additional containers, std::set and std::map, that guarantee logarithmic cost for basic operations such as insertion, deletion, and searching.

4.8.1 Sets

    The set is an ordered container that does not allow duplicates. Many of the idioms used to access items in vector and list also work for a set. Specifically, nested in the set are iterator and const_iterator types that allow traversal of the set, and several methods from vector and list are identically named in set, including begin, end, size, and empty. The print function discussed in Figure 3.6 also works for a set.
    The operations unique to set are insert, erase, and an efficient basic search.
    insert returns an iterator that represents where x is when insert returns. This iterator represents either the newly inserted item or the existing item that caused the insert to fail; knowing the position of the item can make removing it more efficient by avoiding the search and going directly to the node containing the item.
    The STL defines a class template called pair that is little more than a struct with members first and second to access the two items in the pair. There are two different insert routines:
std::pair<iterator, bool> insert(const Object &x);
std::pair<iterator, bool> insert(iterator hint, const Object &x);
The one-parameter insert simply inserts an element into the set. The two-parameter insert allows the specification of a hint, which represents the position where x should go. If the hint is accurate, the insertion is fast, often O(1); if not, the insertion is done using the normal insertion algorithm and performs comparably with the one-parameter insert. For instance, the code below is faster with the two-parameter insert than it would be with the one-parameter insert:
std::set<int> s;
for (int i = 0; i < 1000000; ++i)
    s.insert(s.end(), i);
There are several versions of erase:
int erase (const Object &x);
iterator erase(iterator itr);
iterator erase(iterator start, iterator end);

The first one-parameter erase removes x and returns the number of items actually removed, which is either 0 or 1. The second one-parameter erase removes the object at the position given by the iterator, returns an iterator representing the element that followed itr immediately prior to the call, and invalidates itr, which becomes stale. The two-parameter erase removes all the items starting at start, up to but not including the item at end.
For searching, rather than a contains routine that returns a Boolean, the set provides a find routine that returns an iterator representing the location of the item (or the endmarker if the search fails).
iterator find(const Object & x) const;


By default, ordering uses the less<Object> function object, which itself is implemented by invoking operator< on the Object. An alternative ordering can be specified by instantiating the set template with a function object type. For example, we can create a set that stores string objects and ignores case distinctions by using a CaseInsensitiveCompare function object like the one coded in Figure 1.25:
#include <iostream>
#include <set>
#include <string>
#include <string.h>

class CaseInsensitiveCompare
{
  public:
    bool operator()(const std::string &lhs, const std::string &rhs) const
    {
        // _stricmp is MSVC-specific; on POSIX systems use strcasecmp instead.
        return _stricmp(lhs.c_str(), rhs.c_str()) < 0;
    }
};

int main(int argc, char **argv)
{
    std::set<std::string, CaseInsensitiveCompare> s;
    s.insert("Hello");
    s.insert("HeLLo");
    std::cout << "The size is: " << s.size() << std::endl;   // prints 1
    return 0;
}

4.8.2 Maps

A map is used to store a collection of ordered entries consisting of keys and their values. Keys must be unique, but several keys can map to the same value. The map behaves like a set instantiated with a pair whose comparison function refers only to the key. The map supports begin, end, size, and empty, but the underlying iterator is a key-value pair: for an iterator itr, *itr is of type pair<KeyType, ValueType>. The map also supports insert, find, and erase. For insert, one must provide a pair<KeyType, ValueType> object; although find requires only a key, the iterator it returns references a pair. Using only these operations is often not worthwhile because the syntactic baggage can be expensive, so the map also overloads the array-indexing operator:
ValueType & operator[] (const KeyType & key);

If the key is present in the map, a reference to the corresponding value is returned. If the key is not present, it is inserted with a default value into the map, and a reference to that inserted default value is returned; the default value is obtained by applying a zero-parameter constructor, or is zero for the primitive types. These semantics do not allow an accessor version of operator[], so operator[] cannot be used on a map that is const: if a map is passed by constant reference, operator[] is unusable inside the routine. Figure 4.68 illustrates two techniques to access items in a map.


If it is important to distinguish between items that are in the map and those that are not, use the find-based technique (line 7 of Figure 4.68), since find never inserts. A sketch of both techniques follows.
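Figure 4.68 is not reproduced in these notes; the two techniques amount to something like the following sketch (my own code and map contents):

#include <iostream>
#include <map>
#include <string>

int main()
{
    std::map<std::string, double> salaries;
    salaries["Pat"] = 75000.00;            // operator[]: inserts or overwrites

    // Technique 1: operator[] is concise, but if "Jan" is absent this
    // inserts a default-constructed 0.0 into the map as a side effect.
    std::cout << salaries["Pat"] << '\n';
    std::cout << salaries["Jan"] << '\n';  // prints 0; "Jan" is now in the map

    // Technique 2: find does not modify the map, so it distinguishes
    // "absent" from "present with a default value".
    auto itr = salaries.find("Chris");
    if (itr == salaries.end())
        std::cout << "Not an employee of this company!\n";
    else
        std::cout << itr->second << '\n';
    return 0;
}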

4.8.3 Implementation of set and map

C++ requires that set and map support the basic insert, erase, and find operations in logarithmic worst-case time. Consequently, the underlying implementation is a balanced binary search tree; typically an AVL tree is not used, and top-down red-black trees (discussed in Section 12.2) are often used instead.
    An important issue in implementing set and map is providing for the iterator classes. Internally, the iterator maintains a pointer to the "current" node in the iteration; the hard part is efficiently advancing to the next node. There are several possible solutions, some of which are listed here:


Implementation for Set: https://github.com/sesiria/Algs/blob/master/cp4/ex4_11.cpp
Map: https://github.com/sesiria/Algs/blob/master/cp4/ex4_12.cpp

4.8.4 An Example That Uses Several Maps



The most straightforward strategy is to use a map in which the keys are words and the values are vectors containing the words that can be formed from the key by a one-character substitution. Refer to the code below:
void printHighChangeables(const std::map<std::string, std::vector<std::string>> &adjacentWords,
                          int minWords = 15)
{
    for (auto &entry : adjacentWords)
    {
        const std::vector<std::string> &words = entry.second;

        if (words.size() >= minWords)
        {
            std::cout << entry.first << " (" << words.size() << "):";
            for (auto &str : words)
                std::cout << " " << str;
            std::cout << std::endl;
        }
    }
}

The main issue is how to construct the map from an array that contains the 89,000 words. The following routine is a straightforward function to test whether two words are identical except for a one-character substitution.
/**
 * Returns true if word1 and word2 are the same length
 * and differ in only one character.
 */
bool oneCharOff(const std::string &word1, const std::string &word2)
{
    if (word1.length() != word2.length())
        return false;

    int diffs = 0;

    for (int i = 0; i < word1.length(); ++i)
        if (word1[i] != word2[i])
            if (++diffs > 1)
                return false;

    return diffs == 1;
}
Then we use this function to construct the map with a brute-force test of all pairs of words. The algorithm is as follows:
/**
 * Computes a map in which the keys are words and values are vectors of words
 * that differ in only one character from the corresponding key.
 * Uses a quadratic algorithm, O(N^2).
 * This routine runs in about 1.5 minutes on an 89,000-word dictionary.
 */
std::map<std::string, std::vector<std::string>>
computeAdjacentWords(const std::vector<std::string> &words)
{
    std::map<std::string, std::vector<std::string>> adjWords;

    for (int i = 0; i < words.size(); ++i)
        for (int j = i + 1; j < words.size(); ++j)
            if (oneCharOff(words[i], words[j]))
            {
                adjWords[words[i]].push_back(words[j]);
                adjWords[words[j]].push_back(words[i]);
            }

    return adjWords;
}

When we find a pair of words that differ in only one character, we update the map with each word as a key; see the inner loop of computeAdjacentWords. The idiom here is that adjWords[str] represents the vector of words that are identical to str except for one character. If we have previously seen str, then it is in the map, and we need only add the new word to the vector in the map, which we do by calling push_back. If we have never seen str before, then the act of using operator[] places it in the map with a vector of size 0 and returns this vector, so push_back grows the vector to size 1. (This is a super-slick idiom for maintaining a map in which the value is a collection.)
    The problem with this algorithm is that it takes 97 seconds. An obvious improvement is to avoid comparing words of different lengths. We can do this by grouping words by their length and then running the previous algorithm on each separate group.
    To do that, we use a second map! Here the key is an integer representing a word length, and the value is the collection of all words of that length; a vector stores each collection, and the same idiom applies. The code follows:
/**
 * Computes a map in which the keys are words and values are vectors of words
 * that differ in only one character from the corresponding key.
 * Still uses a quadratic algorithm, O(N^2), but speeds things up a little by
 * maintaining an additional map that groups words by their length.
 * This routine runs in about 18 seconds on an 89,000-word dictionary.
 */
std::map<std::string, std::vector<std::string>>
computeAdjacentWords1(const std::vector<std::string> &words)
{
    std::map<std::string, std::vector<std::string>> adjWords;
    std::map<int, std::vector<std::string>> wordsByLength;

    // Group the words by their length.
    for (auto &thisWord : words)
        wordsByLength[thisWord.length()].push_back(thisWord);

    // Work on each group separately.
    for (auto &entry : wordsByLength)
    {
        const std::vector<std::string> &groupWords = entry.second;

        for (int i = 0; i < groupWords.size(); ++i)
            for (int j = i + 1; j < groupWords.size(); ++j)
                if (oneCharOff(groupWords[i], groupWords[j]))   // compare within the group only
                {
                    adjWords[groupWords[i]].push_back(groupWords[j]);
                    adjWords[groupWords[j]].push_back(groupWords[i]);
                }
    }
    return adjWords;
}
The second implementation runs in 18 seconds, more than five times faster.




The following is an implementation of the algorithm; the running time improves to two seconds. It is interesting that although the use of the additional maps makes the algorithm faster and the syntax is relatively clean, the code makes no use of the fact that the keys of the map are maintained in sorted order.
/**
 * Computes a map in which the keys are words and values are vectors of words
 * that differ in only one character from the corresponding key.
 * Uses an efficient algorithm that is O(N log N) with a map.
 * This routine runs in about 2 seconds on an 89,000-word dictionary.
 */
std::map<std::string, std::vector<std::string>>
computeAdjacentWords2(const std::vector<std::string> &words)
{
    std::map<std::string, std::vector<std::string>> adjWords;
    std::map<int, std::vector<std::string>> wordsByLength;

    // Group the words by their length.
    for (auto &thisWord : words)
        wordsByLength[thisWord.length()].push_back(thisWord);

    // Work on each group separately.
    for (auto &entry : wordsByLength)
    {
        const std::vector<std::string> &groupWords = entry.second;
        int groupNum = entry.first;   // the common word length in this group

        // Work on each character position in each group.
        for (int i = 0; i < groupNum; ++i)
        {
            // Remove one character in the specified position, computing a representative.
            // Words with the same representative are adjacent, so populate a map...
            std::map<std::string, std::vector<std::string>> repToWord;
            for (auto &str : groupWords)
            {
                std::string rep = str;
                rep.erase(i, 1);
                repToWord[rep].push_back(str);
            }

            // ...and then look for map values with more than one string.
            for (auto &repEntry : repToWord)
            {
                const std::vector<std::string> &clique = repEntry.second;
                if (clique.size() >= 2)
                    for (int p = 0; p < clique.size(); ++p)
                        for (int q = p + 1; q < clique.size(); ++q)
                        {
                            adjWords[clique[p]].push_back(clique[q]);
                            adjWords[clique[q]].push_back(clique[p]);
                        }
            }
        }
    }
    return adjWords;
}
As such, it is possible that a data structure that supports the map operations but does not guarantee sorted order can perform better, since it is being asked to do less. Chapter 5 explores this possibility and discusses the ideas behind the alternative map implementation that C++11 adds to the standard library, known as unordered_map. An unordered map reduces the running time of this implementation from 2 seconds to 1.5 seconds.

Summary


