
Elasticsearch: modeling your data (not flat) -- Parent-child relationship

2014-11-15 20:44

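The parent-child relationship is similar in nature to nested objects: both associate one entity with another. The difference is that with nested objects the related entities all live within the same document, whereas with parent-child the parent and its children are completely separate documents.

Parent-child lets you associate one entity with many related entities, a one-to-many relationship. Compared with nested objects it has these advantages:

First, the parent document can be updated without reindexing the children.

Second, child documents can be added, changed, or deleted without affecting either the parent or the other children. This is especially useful when there are many child documents and they are updated frequently.

Third, child documents can be returned on their own as the results of a search request.

Elasticsearch maintains a map between parents and their children. Thanks to this map, query-time joins are fast, but it also imposes a limitation: a parent and all of its children must live on the same shard; they cannot span shards.

1：parent-child mapping

To establish the parent-child relationship, you need to specify which type is the parent and which is the child. This must be done either when the index is created, or with the update-mapping API before the child type has been created.

Suppose we have a company with branches in several cities, and each branch has associated employees. We want to search for branches, for individual employees, and for employees who work for a particular branch. The nested model does not fit this case. We could use application-side joins or data denormalization instead, but here we use parent-child to show how it is done.

The first thing we need to tell Elasticsearch is that the parent type of employee is branch, so we set up the mapping as follows: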

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch" 
      }
    }
  }
}

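The mapping above declares that the parent type of employee is branch.

2：indexing parent and children

Indexing a parent document is no different from indexing any other document; parents do not need to know anything about their children: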

POST /company/branch/_bulk
{ "index": { "_id": "london" }}
{ "name": "London Westminster", "city": "London", "country": "UK" }
{ "index": { "_id": "liverpool" }}
{ "name": "Liverpool Central", "city": "Liverpool", "country": "UK" }
{ "index": { "_id": "paris" }}
{ "name": "Champs Élysées", "city": "Paris", "country": "France" }

When indexing a child document, you must specify the ID of its parent to establish the relationship:

PUT /company/employee/1?parent=london 
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}

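This indicates that the employee works at the London branch. The parent ID serves two purposes: it creates the link between parent and child, and it ensures that the child is stored on the same shard as the parent. Elasticsearch decides which shard a document belongs to with this formula: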

shard = hash(routing) % number_of_primary_shards

The routing value defaults to the document's _id.

When a parent ID is specified, it is used as the routing value instead of the default _id. In other words, parent and child share the same routing value, so they end up on the same shard.
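The routing rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib.crc32` stands in for whatever hash function Elasticsearch uses internally, and the helper name `shard_for` and the shard count of 5 are made up for the example.

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 5  # assumed value, for illustration only


def shard_for(doc_id, parent=None, shards=NUMBER_OF_PRIMARY_SHARDS):
    """Return the shard number a document would be routed to.

    The routing value is the parent ID when one is given, otherwise the
    document's own _id. crc32 is a stand-in hash, not the real one.
    """
    routing = parent if parent is not None else doc_id
    return zlib.crc32(str(routing).encode("utf-8")) % shards


branch_shard = shard_for("london")                # parent, routed by its own _id
employee_shard = shard_for("1", parent="london")  # child, routed by the parent ID
print(branch_shard == employee_shard)  # True: both hash the routing value "london"
```

Whatever the hash function is, the point is the same: two documents that share a routing value always map to the same shard.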

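The parent ID must be specified on all single-document requests: when retrieving a child document with a GET request, and when indexing, deleting, or updating a child document. Unlike a search request, which is sent to every shard in the index, these requests go only to the shard that holds the document. If the parent ID is not specified, the request may be routed to the wrong shard. The parent ID must also be specified when using the bulk API: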

POST /company/employee/_bulk
{ "index": { "_id": 2, "parent": "london" }}
{ "name": "Mark Thomas", "dob": "1982-05-16", "hobby": "diving" }
{ "index": { "_id": 3, "parent": "liverpool" }}
{ "name": "Barry Smith", "dob": "1979-04-01", "hobby": "hiking" }
{ "index": { "_id": 4, "parent": "paris" }}
{ "name": "Adrien Grand", "dob": "1987-05-11", "hobby": "horses" }

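Warning: if you want to change the parent of a child document (its parent ID), it is not enough to update the value on the child, because the new parent may live on a different shard. The correct approach is to delete the old child document completely and then index the new child.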
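As a rough sketch of the delete-then-index approach, the Python helper below builds a bulk request body that removes a child under its old parent and indexes it again under the new one. The function name `reparent_actions` is hypothetical, and the sketch assumes the bulk delete action accepts the same `parent` metadata key that the index action uses in this article; check this against your Elasticsearch version before relying on it.

```python
import json


def reparent_actions(doc_id, old_parent, new_parent, source):
    """Build an NDJSON bulk body: delete the old child, then index the new one.

    Assumption: the delete action takes a "parent" key just like "index" does,
    so that the delete is routed to the shard of the old parent.
    """
    lines = [
        {"delete": {"_id": doc_id, "parent": old_parent}},
        {"index": {"_id": doc_id, "parent": new_parent}},
        source,  # the full child document, since it is indexed from scratch
    ]
    return "\n".join(json.dumps(line) for line in lines) + "\n"


body = reparent_actions(
    2, "london", "paris",
    {"name": "Mark Thomas", "dob": "1982-05-16", "hobby": "diving"},
)
print(body)
```

The resulting body would be sent to `POST /company/employee/_bulk`, the same endpoint used above.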

3：finding parents by their children

The has_child query and filter find parent documents based on the contents of their children. For example, we can find all branches that have an employee born in 1980 or later:

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01"
          }
        }
      }
    }
  }
}
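To make the semantics concrete, here is a small in-memory imitation of has_child in Python, using the four employees indexed earlier. It is a sketch of the logic only, not of how Elasticsearch executes the join; the function name `has_child` and the `parent` key in each record are assumptions for the example.

```python
from datetime import date

# The four employees from the indexing examples, with their parent branch IDs.
EMPLOYEES = [
    {"name": "Alice Smith", "dob": date(1970, 10, 24), "parent": "london"},
    {"name": "Mark Thomas", "dob": date(1982, 5, 16), "parent": "london"},
    {"name": "Barry Smith", "dob": date(1979, 4, 1), "parent": "liverpool"},
    {"name": "Adrien Grand", "dob": date(1987, 5, 11), "parent": "paris"},
]


def has_child(children, predicate):
    """Return the IDs of parents that have at least one matching child."""
    return sorted({child["parent"] for child in children if predicate(child)})


# Equivalent of the range query: dob >= 1980-01-01.
print(has_child(EMPLOYEES, lambda e: e["dob"] >= date(1980, 1, 1)))
# ['london', 'paris']
```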

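As with nested objects, has_child can match several child documents for one parent, each with its own score. score_mode controls how those separate scores are combined into a single score for the parent document. The default is none, which ignores the child scores and gives every parent a score of 1.0; the other options are avg, min, max, and sum.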
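The reduction that score_mode describes can be written out as a short Python function. This is an illustrative sketch of the arithmetic only; the function name `combine_child_scores` is hypothetical and the child scores are made-up numbers.

```python
def combine_child_scores(child_scores, score_mode="none"):
    """Reduce the scores of a parent's matching children to one parent score."""
    if not child_scores:
        raise ValueError("a parent matches only if at least one child matched")
    if score_mode == "none":
        return 1.0  # child scores are ignored entirely
    if score_mode == "avg":
        return sum(child_scores) / len(child_scores)
    if score_mode == "min":
        return min(child_scores)
    if score_mode == "max":
        return max(child_scores)
    if score_mode == "sum":
        return sum(child_scores)
    raise ValueError("unknown score_mode: %r" % score_mode)


scores = [0.5, 1.5, 1.0]  # made-up scores of three matching children
for mode in ("none", "avg", "min", "max", "sum"):
    print(mode, combine_child_scores(scores, mode))
```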

The following query returns both london and liverpool, but london gets the higher score because Alice Smith is a better match:

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":       "employee",
      "score_mode": "max",
      "query": {
        "match": {
          "name": "Alice Smith"
        }
      }
    }
  }
}

Tip: the default score_mode of none is faster than the other modes, because Elasticsearch does not need to compute a score for each child document and simply assigns 1.0.

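The has_child query and filter also accept two parameters, min_children and max_children. A parent is returned only if the number of its matching children falls within that range.

The following returns only branches that have at least two employees: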

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":         "employee",
      "min_children": 2, 
      "query": {
        "match_all": {}
      }
    }
  }
}

A has_child query with min_children or max_children performs about the same as a has_child query without them but with scoring enabled.
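The effect of min_children and max_children can be imitated by counting matching children per parent. The Python sketch below uses the employee-to-branch assignments from the indexing examples; the function name `parents_with_children` is an assumption for illustration.

```python
from collections import Counter

# employee _id -> parent branch _id, as indexed earlier
EMPLOYEE_PARENTS = {1: "london", 2: "london", 3: "liverpool", 4: "paris"}


def parents_with_children(child_to_parent, min_children=1, max_children=None):
    """Return parents whose number of matching children is within the bounds."""
    counts = Counter(child_to_parent.values())
    return sorted(
        parent
        for parent, n in counts.items()
        if n >= min_children and (max_children is None or n <= max_children)
    )


print(parents_with_children(EMPLOYEE_PARENTS, min_children=2))  # ['london']
```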

The has_child filter works almost exactly like the query, except that it does not support score_mode.

4：finding children by their parents

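A nested query can only return the root document as a result, whereas parent and child documents are independent and each can be queried on its own. has_child returns parents based on their children; has_parent returns children based on their parents.

The following returns employees who work in the UK: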

GET /company/employee/_search
{
  "query": {
    "has_parent": {
      "type": "branch", 
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}

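has_parent also supports score_mode, but with only two values: none (the default) and score. Because each child can have only one parent, there are no multiple scores to reduce to one, so the choice is simply whether to use the parent's score (score) or not (none).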
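The direction of a has_parent lookup can be illustrated with the same sample data held in memory. This Python sketch shows only the logic (match the parents first, then return their children); the function name `has_parent` and the `parent` key are assumptions for the example.

```python
BRANCHES = {
    "london": {"name": "London Westminster", "country": "UK"},
    "liverpool": {"name": "Liverpool Central", "country": "UK"},
    "paris": {"name": "Champs Élysées", "country": "France"},
}

EMPLOYEES = [
    {"name": "Alice Smith", "parent": "london"},
    {"name": "Mark Thomas", "parent": "london"},
    {"name": "Barry Smith", "parent": "liverpool"},
    {"name": "Adrien Grand", "parent": "paris"},
]


def has_parent(children, parents, predicate):
    """Return the children whose single parent satisfies the predicate."""
    matching = {pid for pid, doc in parents.items() if predicate(doc)}
    return [child for child in children if child["parent"] in matching]


uk_staff = has_parent(EMPLOYEES, BRANCHES, lambda b: b["country"] == "UK")
print([e["name"] for e in uk_staff])
# ['Alice Smith', 'Mark Thomas', 'Barry Smith']
```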

The has_parent filter works the same way as the query, except that it does not support score_mode.

5：children aggregation

Parent-child supports a children aggregation, but there is no parent aggregation (the counterpart of reverse_nested).

The following finds the most popular employee hobbies by country:

GET /company/branch/_search?search_type=count
{
  "aggs": {
    "country": {
      "terms": {
        "field": "country"
      },
      "aggs": {
        "employees": {
          "children": {
            "type": "employee"
          },
          "aggs": {
            "hobby": {
              "terms": {
                "field": "employee.hobby"
              }
            }
          }
        }
      }
    }
  }
}

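(1) The country terms aggregation buckets the branches by their country field.

(2) The employees children aggregation joins the parent documents to their children of type employee.

(3) The hobby terms aggregation buckets the employees by the employee.hobby field.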
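The three aggregation steps amount to a group-by over the joined data. The Python sketch below reproduces the result for the sample documents indexed earlier; it imitates the outcome only, and the names `BRANCH_COUNTRY` and `hobbies_by_country` are made up for the example.

```python
from collections import Counter, defaultdict

BRANCH_COUNTRY = {"london": "UK", "liverpool": "UK", "paris": "France"}

# (parent branch _id, hobby) for each of the four employees
EMPLOYEES = [
    ("london", "hiking"),
    ("london", "diving"),
    ("liverpool", "hiking"),
    ("paris", "horses"),
]


def hobbies_by_country(branch_country, employees):
    """Bucket by country, join to child employees, then count hobbies."""
    buckets = defaultdict(Counter)
    for parent_id, hobby in employees:
        buckets[branch_country[parent_id]][hobby] += 1
    return {country: dict(counts) for country, counts in buckets.items()}


result = hobbies_by_country(BRANCH_COUNTRY, EMPLOYEES)
print(result["UK"])      # {'hiking': 2, 'diving': 1}
print(result["France"])  # {'horses': 1}
```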

6：grandparents and grandchildren

The parent-child relationship can extend over more than one generation, to grandparents and grandchildren, but all generations must still be indexed on the same shard.

mapping：

PUT /company
{
  "mappings": {
    "country": {},
    "branch": {
      "_parent": {
        "type": "country" 
      }
    },
    "employee": {
      "_parent": {
        "type": "branch" 
      }
    }
  }
}

indexing data：

POST /company/country/_bulk
{ "index": { "_id": "uk" }}
{ "name": "UK" }
{ "index": { "_id": "france" }}
{ "name": "France" }

POST /company/branch/_bulk
{ "index": { "_id": "london", "parent": "uk" }}
{ "name": "London Westminster" }
{ "index": { "_id": "liverpool", "parent": "uk" }}
{ "name": "Liverpool Central" }
{ "index": { "_id": "paris", "parent": "france" }}
{ "name": "Champs Élysées" }

Here london is routed by its parent, uk, so it lands on the same shard as its parent.

PUT /company/employee/1?parent=london
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}

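Now there is a problem: the employee is routed by london, so it may well end up on a different shard from its parent and grandparent!

So we need to pass an extra routing parameter to make sure the employee lands on the same shard as its parent and grandparent: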

PUT /company/employee/1?parent=london&routing=uk 
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}

Here the routing value overrides the parent value.
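The precedence between _id, parent, and routing can be summarised in a small Python function. It is a sketch of the rule as described in this article, with a hypothetical helper name `routing_value`; it does not compute real shard numbers.

```python
def routing_value(doc_id, parent=None, routing=None):
    """Pick the value that decides the shard: routing, else parent, else _id."""
    if routing is not None:
        return routing
    if parent is not None:
        return parent
    return doc_id


country = routing_value("uk")                              # -> "uk"
branch = routing_value("london", parent="uk")              # -> "uk"
employee_wrong = routing_value("1", parent="london")       # -> "london"
employee_right = routing_value("1", parent="london", routing="uk")  # -> "uk"

# All three generations share one routing value only when routing=uk is given.
print(country == branch == employee_right)  # True
print(employee_wrong)                       # london
```

Because the shard is derived from the routing value alone, the employee indexed with only parent=london is hashed on a different string than its grandparent and is not guaranteed to share its shard.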

Queries work as usual. For example, to find the countries where employees enjoy hiking, we need to join country with branch, and branch with employee:

GET /company/country/_search
{
  "query": {
    "has_child": {
      "type": "branch",
      "query": {
        "has_child": {
          "type": "employee",
          "query": {
            "match": {
              "hobby": "hiking"
            }
          }
        }
      }
    }
  }
}
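If such multi-generation queries are generated by application code, the nesting can be built programmatically. The Python helper below is a sketch with a hypothetical name, `has_child_chain`; it only constructs the request body shown above and does not talk to Elasticsearch.

```python
import json


def has_child_chain(child_types, leaf_query):
    """Wrap leaf_query in one has_child clause per generation.

    child_types is ordered from the nearest generation to the farthest,
    for example ["branch", "employee"] when searching on country.
    """
    query = leaf_query
    for child_type in reversed(child_types):
        query = {"has_child": {"type": child_type, "query": query}}
    return query


body = {
    "query": has_child_chain(
        ["branch", "employee"],
        {"match": {"hobby": "hiking"}},
    )
}
print(json.dumps(body, indent=2))
```

Each extra entry in `child_types` adds another join, which is worth keeping in mind given the cost of multi-generation joins.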

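7：practical considerations

Parent-child joins are useful for managing relationships when index-time performance matters more than search-time performance, but they come at a significant cost: parent-child queries can be 5 to 10 times slower than the equivalent nested query.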

memory use：

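At the time of writing, the parent-child ID map is still held in memory. There are plans to change the map to use doc values, which would save a lot of memory, but that work is not finished yet. Until then, keep the following in mind: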

The string _id of every parent document is held in memory, and every child document needs 8 bytes (as little as 1 byte with compression).
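These two figures allow a back-of-the-envelope estimate. The Python sketch below simply adds the UTF-8 length of each parent _id to 8 bytes per child; it ignores JVM object overhead and compression, so treat the result as an order-of-magnitude guide rather than a measurement. The function name `estimate_id_cache_bytes` is made up for the example.

```python
def estimate_id_cache_bytes(parent_ids, child_count, bytes_per_child=8):
    """Rough estimate: parent _id strings plus a fixed cost per child document."""
    id_bytes = sum(len(pid.encode("utf-8")) for pid in parent_ids)
    return id_bytes + child_count * bytes_per_child


# Three branches and four employees from the examples above.
print(estimate_id_cache_bytes(["london", "liverpool", "paris"], 4))  # 52
```

The estimate also shows why short parent IDs help: the _id strings are counted in full, once per parent.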

You can check how much memory the parent-child cache is using with the indices-stats API for index-level figures, or the nodes-stats API for node-level figures:

GET /_nodes/stats/indices/id_cache?human

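This returns the memory used by the ID cache on each node, in a human-readable format (human).

global ordinals and latency：

Parent-child uses global ordinals to speed up joins. Regardless of whether the parent-child map is held in an in-memory cache or in on-disk doc values, global ordinals have to be rebuilt whenever the index changes.

The more parent documents a shard holds, the longer global ordinals take to build. Parent-child is best suited to cases where each parent has many children, rather than many parents with few children.

Global ordinals are built lazily: the first parent-child query or aggregation after a refresh triggers the build, which can add a noticeable latency spike. You can use eager_global_ordinals to shift that cost from query time to refresh time: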

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch",
        "fielddata": {
          "loading": "eager_global_ordinals"
        }
      }
    }
  }
}

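With this setting, global ordinals for the _parent field are built before a new segment becomes visible to search.

With many parents, global ordinals can take a long time to build. In that case you can increase refresh_interval so that refreshes happen less often and global ordinals stay valid for longer. This reduces the CPU cost of rebuilding global ordinals every second.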

multi-generations and concluding thoughts：

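Joining multiple generations looks attractive, but keep the costs in mind:

The more joins you have, the worse performance becomes.

Each generation of parents needs its IDs held in memory, which can consume a lot of RAM.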

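Look at the relationships in your data. If parent-child fits, consider the following advice:

Make sure there are few parents and many children.

Avoid running multiple parent-child joins in a single query.

Avoid scoring by setting score_mode to none.

Keep parent IDs short to reduce memory use.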
