
Oracle Basics Study Summary: Databases and Instances

2013-12-20 15:20 585 views
1.  Problem Definition of Clustering:

    Informal goal: Given n "points" [Web pages, images, genome fragments, etc.], classify them into "coherent groups", i.e., clusters.

    Assumptions:

        (1) As input, given a (dis)similarity measure -- a distance d(p , q) between each point pair.

        (2) d is symmetric [i.e., d(p , q) = d(q , p)]. (Examples: Euclidean distance, genome similarity, etc.)

    Points in the same cluster should be "nearby" (at small distance from each other).

 

2.  Max-Spacing k-Clusterings

    k-clustering : a partition of the points into k clusters, where k is the desired number of clusters.

    Separated pair : Points p & q are separated if they are assigned to different clusters.

    Spacing : The spacing of a k-clustering is the minimum of d(p , q) over all separated pairs p , q. (The bigger the better.)

    Max-Spacing k-Clustering problem : Given a distance measure d and k, compute the k-clustering with maximum spacing.
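
    The spacing definition above can be sketched in a few lines of Python. This is only an illustration: the function name spacing, the labels list (one cluster id per point), and the use of math.dist as the distance d are assumptions, not part of the original notes.

```python
from itertools import combinations
import math


def spacing(points, labels, dist):
    """Return the minimum distance over all separated pairs.

    points: list of points
    labels: labels[i] is the (hypothetical) cluster id of points[i]
    dist:   symmetric distance function d(p, q)
    """
    best = math.inf  # no separated pair seen yet
    for i, j in combinations(range(len(points)), 2):
        if labels[i] != labels[j]:  # p and q are separated
            best = min(best, dist(points[i], points[j]))
    return best


# Two tight pairs, far apart from each other.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(spacing(points, [0, 0, 1, 1], math.dist))  # 5.0
print(spacing(points, [0, 1, 0, 1], math.dist))  # 1.0
```

    The first labeling keeps nearby points together and has spacing 5.0; the second splits the tight pairs, so its spacing drops to 1.0.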

 

3.  A Greedy Algorithm

    --  Initially, each point is in its own cluster

    --  Repeat until only k clusters remain:

        -- Let p , q = closest pair of separated points (determines the current spacing)

        -- Merge the clusters containing p & q into a single cluster.

 

    Note: Just like Kruskal's MST algorithm, but stopped early.
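
    A minimal sketch of the greedy algorithm, written as Kruskal's algorithm stopped early. It assumes the points fit in memory, that all O(n^2) pairwise distances can be sorted, and that 1 <= k <= n. The function name max_spacing_clustering and the small union-find inside it are illustrative choices, not taken from the original notes.

```python
from itertools import combinations
import math


def max_spacing_clustering(points, k, dist):
    """Greedy k-clustering; returns (labels, spacing)."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of points")

    parent = list(range(n))  # union-find: each point starts as its own cluster

    def find(x):
        # Follow parent pointers to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All point pairs, closest first (the Kruskal edge order).
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    clusters = n
    spacing = math.inf  # stays infinite if no separated pair remains (k = 1)
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # same cluster already: not a separated pair
        if clusters == k:
            spacing = d  # closest separated pair of the final clustering
            break
        parent[ri] = rj  # merge the two clusters
        clusters -= 1

    labels = [find(i) for i in range(n)]
    return labels, spacing


points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels, s = max_spacing_clustering(points, 2, math.dist)
print(s)  # 5.0
```

    Sorting every pair costs O(n^2 log n) time, which is fine for a sketch; the union-find keeps each "same cluster?" check cheap.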

 

4.  Correctness of Greedy Clustering

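    -- Let C1, ... , Ck = greedy clustering with spacing S. Let C1', ... , Ck' = an arbitrary other k-clustering.

       Need to show : spacing of C1', ... , Ck' <= S

    -- Case 1: The Ci' are the same as the Ci (maybe after renaming) ==> same spacing S.

    -- Case 2: Otherwise, can find a point pair p , q such that:

            (A) p , q are in the same greedy cluster Ci

            (B) p , q are in different clusters of C1', ... , Ck'

       (Such a pair exists: otherwise every Ci would lie inside a single Cj', and with k non-empty clusters on each side the two clusterings would coincide.)

    -- Key fact: Greedy always merges the closest separated pair, so the distance between directly merged point pairs only goes up and never exceeds the final spacing S. Hence every directly merged pair is at distance <= S.

    -- Easy case: If p , q were directly merged at some point while building Ci, then S >= d(p , q) >= spacing of C1', ... , Ck' (since p , q are separated in C1', ... , Ck').

    -- Tricky case: p , q were "indirectly merged" through multiple direct merges. Let p, a1, ... , al, q be the path of direct greedy merges connecting p & q. Let C' be the cluster of C1', ... , Ck' containing p. Since p is in C' and q is not in C' ==> there exists a consecutive pair aj , aj+1 on the path with aj in C' and aj+1 not in C' ==> S >= d(aj , aj+1) >= spacing of C1', ... , Ck'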
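
    Both cases of the argument come down to one chain of inequalities, written out here in LaTeX. The notation a_0 = p and a_{l+1} = q for the endpoints of the merge path is an assumption added for this summary; in the easy case the path is just the single pair p , q.

```latex
\[
  \operatorname{spacing}(C_1', \dots, C_k')
  \;\le\; d(a_j, a_{j+1})
  \;\le\; S ,
  \qquad a_0 = p,\quad a_{l+1} = q .
\]
```

    The first inequality holds because a_j and a_{j+1} are separated in C1', ... , Ck', and the spacing is the minimum over all separated pairs. The second holds because a_j and a_{j+1} were directly merged by the greedy algorithm, and every directly merged pair is at distance at most S.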



 