HAWQ取代传统数仓实践(十一)——维度表技术之维度合并
2017-05-31 15:03
721 查看
有一种合并维度的情况,就是本来属性相同的维度,因为某种原因被设计成重复的维度属性。例如,在销售订单示例中,随着数据仓库中维度的增加,我们会发现有些通用的数据存在于多个维度中。客户维度的客户地址相关信息、送货地址相关信息里都有邮编、城市和省份。下面说明如何把客户维度里的两个邮编相关信息合并到一个新的维度中。
图1
zip_code_dim维度表与销售订单事实表相关联。这个关系替换了事实表与客户维度的关系。sales_order_fact表需要两个关系,一个关联到客户地址邮编,另一个关联到送货地址邮编,相应的增加了两个外键字段。假设邮编相关信息不会修改,因此zip_code_dim表中没有是否删除、版本号、生效日期等SCD属性。
下面的脚本用于修改数据仓库模式,所做的修改如下。
创建邮编维度表zip_code_dim。
初始装载邮编相关数据。
基于zip_code_dim表创建v_customer_zip_code_dim和v_shipping_zip_code_dim视图。
在sales_order_fact表上增加customer_zip_code_sk和shipping_zip_code_sk列。
基于已有的客户邮编和送货邮编初始装载两个邮编代理键。
在customer_dim表上删除客户和送货邮编及其它们的城市和州列。
在pa_customer_dim上删除客户的城市、州和邮编列。
邮编维度的初始数据是从客户维度表中来,这只是为了演示数据装载的过程。客户的邮编信息很可能覆盖不到所有邮编,所以更好的方法是装载一个完整的邮编信息表。由于客户地址和送货地址可能存在交叉的情况,因此使用distinct去重。送货地址的三个字段是后加的,在此之前数据的送货地址为空,邮编维度表中不能含有NULL值,所以要加上where shipping_zip_code is not null过滤条件去除邮编信息为NULL的数据行。
基于邮编维度表创建客户邮编和送货邮编视图,分别用作两个地理信息的角色扮演维度。
把数据备份表sales_order_fact_bak中的数据装载回销售订单事实表,同时需要关联两个邮编角色维度视图,查询出两个代理键,装载到事实表中。注意老的事实表与新的邮编维度表是通过客户维度表关联起来的,所以在子查询中需要三表连接,然后用两个左外连接查询出所有原事实表数据,装载到新的增加了邮编维度代理键的事实表中。
在customer_dim表上删除列时,需要使用cascade子句同时删除依赖它的视图,之后重建相关视图。
删除客户维度装载里所有邮编信息相关的列,因为客户维度里不再有客户邮编和送货邮编相关信息。
在事实表中引用客户邮编视图和送货邮编视图中的代理键。
修改pa_customer_dim装载,因为需要从销售订单事实表的customer_zip_code_sk获取客户邮编。
修改后的fn_regular_load函数如下。
对源数据的客户邮编相关信息做一些修改。
装载新的客户数据前,查询最后的客户和送货邮编,后面可以用改变后的信息和此查询的输出作对比。
新增销售订单源数据。
执行定期装载。
查询客户维度表、售订单事实表和PA子维度表,确认数据已经正确装载。
执行下面的语句,对源数据的客户信息做以下两处修改:客户编号4的客户和送货邮编信息;新增一个编号15的客户。
图2
查询sales_order_fact表里的两条新销售订单,确认邮编已经正确装载。
图3
查询v_pa_customer_dim_his视图,确认PA客户正确装载。
图4
一、修改数据仓库表结构
为了合并维度,需要改变数据仓库表结构。图1显示了修改后的结构。新增了一个zip_code_dim邮编信息维度表,sales_order_fact事实表的结构也做了相应的修改。图1
zip_code_dim维度表与销售订单事实表相关联。这个关系替换了事实表与客户维度的关系。sales_order_fact表需要两个关系,一个关联到客户地址邮编,另一个关联到送货地址邮编,相应的增加了两个外键字段。假设邮编相关信息不会修改,因此zip_code_dim表中没有是否删除、版本号、生效日期等SCD属性。
下面的脚本用于修改数据仓库模式,所做的修改如下。
创建邮编维度表zip_code_dim。
初始装载邮编相关数据。
基于zip_code_dim表创建v_customer_zip_code_dim和v_shipping_zip_code_dim视图。
在sales_order_fact表上增加customer_zip_code_sk和shipping_zip_code_sk列。
基于已有的客户邮编和送货邮编初始装载两个邮编代理键。
在customer_dim表上删除客户和送货邮编及其它们的城市和州列。
在pa_customer_dim上删除客户的城市、州和邮编列。
set search_path=tds; -- 建立邮编维度表 create table zip_code_dim ( zip_code_sk serial, zip_code int, city varchar(30), state varchar(2) ); comment on table zip_code_dim is '邮编维度表'; comment on column zip_code_dim.zip_code_sk is '邮编维度代理键'; comment on column zip_code_dim.zip_code is '邮编'; comment on column zip_code_dim.city is '城市'; comment on column zip_code_dim.state is '省份'; -- 初始装载邮编相关数据 insert into zip_code_dim (zip_code, city, state) select distinct * from (select customer_zip_code, customer_city, customer_state from customer_dim where customer_zip_code is not null union all select shipping_zip_code, shipping_city, shipping_state from customer_dim where shipping_zip_code is not null) t1; -- 创建视图 create view v_customer_zip_code_dim (customer_zip_code_sk, customer_zip_code, customer_city, customer_state) as select * from zip_code_dim; create view v_shipping_zip_code_dim (shipping_zip_code_sk, shipping_zip_code, shipping_city, shipping_state) as select * from zip_code_dim; -- 添加邮编代理键 alter table sales_order_fact add column customer_zip_code_sk int default null; alter table sales_order_fact add column shipping_zip_code_sk int default null; comment on column sales_order_fact.customer_zip_code_sk is '客户邮编代理键'; comment on column sales_order_fact.shipping_zip_code_sk is '送货邮编代理键'; -- 初始装载两个邮编代理键 create table sales_order_fact_bak as select * from sales_order_fact; truncate table sales_order_fact; insert into sales_order_fact select t1.order_number, t1.customer_sk, t1.product_sk, t1.order_date_sk, t1.year_month, t1.order_amount, t1.order_quantity, t1.request_delivery_date_sk, t1.sales_order_attribute_sk, t2.customer_zip_code_sk, t3.shipping_zip_code_sk from sales_order_fact_bak t1 left join (select a.order_number order_number,c.customer_zip_code_sk customer_zip_code_sk from sales_order_fact_bak a, customer_dim b, v_customer_zip_code_dim c where a.customer_sk = b.customer_sk and b.customer_zip_code = c.customer_zip_code) t2 on t1.order_number = t2.order_number left join (select a.order_number order_number,c.shipping_zip_code_sk shipping_zip_code_sk from sales_order_fact_bak a, customer_dim b, v_shipping_zip_code_dim c where a.customer_sk = b.customer_sk and b.shipping_zip_code = c.shipping_zip_code) t3 on t1.order_number = t3.order_number; drop table sales_order_fact_bak; -- 在customer_dim表上删除客户和送货邮编及其它们的城市和州列。 alter table customer_dim drop column customer_zip_code cascade; alter table customer_dim drop column customer_city; alter table customer_dim drop column customer_state; alter table customer_dim drop column shipping_zip_code; alter table customer_dim drop column shipping_city; alter table customer_dim drop column shipping_state; alter table pa_customer_dim drop column customer_zip_code; alter table pa_customer_dim drop column customer_city; alter table pa_customer_dim drop column customer_state; alter table pa_customer_dim drop column shipping_zip_code; alter table pa_customer_dim drop column shipping_city; alter table pa_customer_dim drop column shipping_state; -- 重建相关视图 create or replace view v_customer_dim_latest as select customer_sk, customer_number, customer_name, customer_street_address, version, effective_date, shipping_address from (select distinct on (customer_number) customer_number, customer_sk, customer_name, customer_street_address, isdelete, version, effective_date, shipping_address from customer_dim order by customer_number, customer_sk desc) as latest where isdelete is false; create or replace view v_customer_dim_his as select *, date(lead(effective_date,1,date '2200-01-01') over (partition by customer_number order by effective_date)) expiry_date from customer_dim; create or replace view v_pa_customer_dim_latest as select customer_sk, customer_number, customer_name, customer_street_address, version, effective_date, shipping_address from (select distinct on (customer_number) customer_number, customer_sk, customer_name, customer_street_address, isdelete, version, effective_date, shipping_address from pa_customer_dim order by customer_number, customer_sk desc) as latest where isdelete is false; create or replace view v_pa_customer_dim_his as select *, date(lead(effective_date,1,date '2200-01-01') over (partition by customer_number order by effective_date)) expiry_date from pa_customer_dim;说明:
邮编维度的初始数据是从客户维度表中来,这只是为了演示数据装载的过程。客户的邮编信息很可能覆盖不到所有邮编,所以更好的方法是装载一个完整的邮编信息表。由于客户地址和送货地址可能存在交叉的情况,因此使用distinct去重。送货地址的三个字段是后加的,在此之前数据的送货地址为空,邮编维度表中不能含有NULL值,所以要加上where shipping_zip_code is not null过滤条件去除邮编信息为NULL的数据行。
基于邮编维度表创建客户邮编和送货邮编视图,分别用作两个地理信息的角色扮演维度。
把数据备份表sales_order_fact_bak中的数据装载回销售订单事实表,同时需要关联两个邮编角色维度视图,查询出两个代理键,装载到事实表中。注意老的事实表与新的邮编维度表是通过客户维度表关联起来的,所以在子查询中需要三表连接,然后用两个左外连接查询出所有原事实表数据,装载到新的增加了邮编维度代理键的事实表中。
在customer_dim表上删除列时,需要使用cascade子句同时删除依赖它的视图,之后重建相关视图。
二、修改定期数据装载函数
定期装载函数有三个地方的修改:删除客户维度装载里所有邮编信息相关的列,因为客户维度里不再有客户邮编和送货邮编相关信息。
在事实表中引用客户邮编视图和送货邮编视图中的代理键。
修改pa_customer_dim装载,因为需要从销售订单事实表的customer_zip_code_sk获取客户邮编。
修改后的fn_regular_load函数如下。
create or replace function fn_regular_load () returns void as $$ declare -- 设置scd的生效时间 v_cur_date date := current_date; v_pre_date date := current_date - 1; v_last_load date; begin -- 分析外部表 analyze ext.customer; analyze ext.product; analyze ext.sales_order; -- 将外部表数据装载到原始数据表 truncate table rds.customer; truncate table rds.product; insert into rds.customer select * from ext.customer; insert into rds.product select * from ext.product; insert into rds.sales_order select order_number, customer_number, product_code, order_date, entry_date, order_amount, order_quantity, request_delivery_date, verification_ind, credit_check_flag, new_customer_ind, web_order_flag from ext.sales_order; -- 分析rds模式的表 analyze rds.customer; analyze rds.product; analyze rds.sales_order; -- 设置cdc的上限时间 select last_load into v_last_load from rds.cdc_time; truncate table rds.cdc_time; insert into rds.cdc_time select v_last_load, v_cur_date; -- 装载客户维度 insert into tds.customer_dim (customer_number, customer_name, customer_street_address, shipping_address, isdelete, version, effective_date) select case flag when 'D' then a_customer_number else b_customer_number end customer_number, case flag when 'D' then a_customer_name else b_customer_name end customer_name, case flag when 'D' then a_customer_street_address else b_customer_street_address end customer_street_address, case flag when 'D' then a_shipping_address else b_shipping_address end shipping_address, case flag when 'D' then true else false end isdelete, case flag when 'D' then a_version when 'I' then 1 else a_version + 1 end v, v_pre_date from (select a.customer_number a_customer_number, a.customer_name a_customer_name, a.customer_street_address a_customer_street_address, a.shipping_address a_shipping_address, a.version a_version, b.customer_number b_customer_number, b.customer_name b_customer_name, b.customer_street_address b_customer_street_address, b.shipping_address b_shipping_address, case when a.customer_number is null then 'I' when b.customer_number is null then 'D' else 'U' end flag from v_customer_dim_latest a full join rds.customer b on a.customer_number = b.customer_number where a.customer_number is null -- 新增 or b.customer_number is null -- 删除 or (a.customer_number = b.customer_number and not (coalesce(a.customer_name,'') = coalesce(b.customer_name,'') and coalesce(a.customer_street_address,'') = coalesce(b.customer_street_address,'') and coalesce(a.shipping_address,'') = coalesce(b.shipping_address,'') ))) t order by coalesce(a_customer_number, 999999999999), b_customer_number limit 999999999999; -- 装载产品维度 insert into tds.product_dim (product_code, product_name, product_category, isdelete, version, effective_date) select case flag when 'D' then a_product_code else b_product_code end product_code, case flag when 'D' then a_product_name else b_product_name end product_name, case flag when 'D' then a_product_category else b_product_category end product_category, case flag when 'D' then true else false end isdelete, case flag when 'D' then a_version when 'I' then 1 else a_version + 1 end v, v_pre_date from (select a.product_code a_product_code, a.product_name a_product_name, a.product_category a_product_category, a.version a_version, b.product_code b_product_code, b.product_name b_product_name, b.product_category b_product_category, case when a.product_code is null then 'I' when b.product_code is null then 'D' else 'U' end flag from v_product_dim_latest a full join rds.product b on a.product_code = b.product_code where a.product_code is null -- 新增 or b.product_code is null -- 删除 or (a.product_code = b.product_code and not (a.product_name = b.product_name and a.product_category = b.product_category))) t order by coalesce(a_product_code, 999999999999), b_product_code limit 999999999999; -- 装载销售订单事实表 insert into sales_order_fact select a.order_number, customer_sk, product_sk, e.date_sk, e.year * 100 + e.month, order_amount, order_quantity, f.date_sk, g.sales_order_attribute_sk, h.customer_zip_code_sk, i.shipping_zip_code_sk from rds.sales_order a, v_customer_dim_his c, v_product_dim_his d, date_dim e, date_dim f, sales_order_attribute_dim g, v_customer_zip_code_dim h, v_shipping_zip_code_dim i, rds.customer j, rds.cdc_time k where a.customer_number = c.customer_number and a.order_date >= c.effective_date and a.order_date < c.expiry_date and a.product_code = d.product_code and a.order_date >= d.effective_date and a.order_date < d.expiry_date and date(a.order_date) = e.date and date(a.request_delivery_date) = f.date and a.verification_ind = g.verification_ind and a.credit_check_flag = g.credit_check_flag and a.new_customer_ind = g.new_customer_ind and a.web_order_flag = g.web_order_flag and a.customer_number = j.customer_number and j.customer_zip_code = h.customer_zip_code and j.shipping_zip_code = i.shipping_zip_code and a.entry_date >= k.last_load and a.entry_date < k.current_load; -- 重载PA客户维度 truncate table pa_customer_dim; insert into pa_customer_dim select distinct a.* from customer_dim a, sales_order_fact b, v_customer_zip_code_dim c where c.customer_state = 'pa' and b.customer_zip_code_sk = c.customer_zip_code_sk and a.customer_sk = b.customer_sk; -- 分析tds模式的表 analyze customer_dim; analyze product_dim; analyze sales_order_fact; analyze pa_customer_dim; -- 更新时间戳表的last_load字段 truncate table rds.cdc_time; insert into rds.cdc_time select v_cur_date, v_cur_date; end; $$ language plpgsql;上面的函数需要注意两个地方。装载事实表数据时,除了关联两个邮编维度视图外,还要关联过渡区的rds.customer表。这是因为要取得邮编维度代理键,必须连接邮编代码字段,而邮编代码已经从客户维度表中删除,只有在源数据的客户表中保留。第二个改变是PA子维度的装载。州代码已经从客户维度表删除,被放到了新的邮编维度表中,而客户维度和邮编维度并没有直接关系,它们是通过事实表的客户代理键和邮编代理键产生联系,因此必须关联事实表、客户维度表、邮编维度表三个表才能取出PA子维度数据。这也就是把PA子维度的装载放到了事实表装载之后的原因。
三、测试
按照以下步骤测试修改后的定期装载脚本。对源数据的客户邮编相关信息做一些修改。
装载新的客户数据前,查询最后的客户和送货邮编,后面可以用改变后的信息和此查询的输出作对比。
新增销售订单源数据。
执行定期装载。
查询客户维度表、售订单事实表和PA子维度表,确认数据已经正确装载。
执行下面的语句,对源数据的客户信息做以下两处修改:客户编号4的客户和送货邮编信息;新增一个编号15的客户。
update source.customer set customer_street_address = '9999 louise dr.', customer_zip_code = 17055, customer_city = 'pittsburgh', shipping_address = '9999 louise dr.', shipping_zip_code = 17055, shipping_city = 'pittsburgh' where customer_number = 4; insert into source.customer values(15, 'super stores', '1000 woodland st.', 17055, 'pittsburgh', 'pa', '1000 woodland st.', 17055, 'pittsburgh', 'pa'); commit;现在在装载新的客户数据前查询最后的客户和送货邮编。后面可以用改变后的信息和此查询的输出作对比。查询语句如下。
select order_date_sk odsk, customer_number cn, customer_zip_code czc, shipping_zip_code szc from v_customer_zip_code_dim a, v_shipping_zip_code_dim b, sales_order_fact c, customer_dim d where a.customer_zip_code_sk = c.customer_zip_code_sk and b.shipping_zip_code_sk = c.shipping_zip_code_sk and d.customer_sk = c.customer_sk; order by odsk;然后使用下面的语句新增两条销售订单。
set @order_date := from_unixtime(unix_timestamp('2017-05-30 00:00:01') + rand() * (unix_timestamp('2017-05-30 12:00:00') - unix_timestamp('2017-05-30 00:00:01'))); set @request_delivery_date := from_unixtime(unix_timestamp(date_add(current_date, interval 5 day)) + rand() * 86400); set @amount := floor(1000 + rand() * 9000); set @quantity := floor(10 + rand() * 90); insert into source.sales_order values (null, 4, 3, 'y', 'y', 'y', 'n', @order_date, @request_delivery_date, @order_date, @amount, @quantity); set @order_date := from_unixtime(unix_timestamp('2017-05-30 12:00:00') + rand() * (unix_timestamp('2017-05-31 00:00:00') - unix_timestamp('2017-05-30 12:00:00'))); set @request_delivery_date := from_unixtime(unix_timestamp(date_add(current_date, interval 5 day)) + rand() * 86400); set @amount := floor(1000 + rand() * 9000); set @quantity := floor(10 + rand() * 90); insert into source.sales_order values (null, 15, 4, 'y', 'n', 'y', 'n', @order_date, @request_delivery_date, @order_date, @amount, @quantity); commit;执行下面的命令定期装载。
~/regular_etl.sh查询customer_dim表,确认两个改变的客户,即编号4和15的客户,已经正确装载。
select customer_sk csk, customer_number cnum, customer_name cnam, customer_street_address csd, shipping_address sd, version, effective_date, expiry_date from v_customer_dim_his where customer_number in (4, 15);查询结果如图2所示。
图2
查询sales_order_fact表里的两条新销售订单,确认邮编已经正确装载。
select a.order_number onum, e.customer_number cnum, b.customer_zip_code czc, c.shipping_zip_code szc, f.product_code pc, d.order_date od, a.order_amount, a.order_quantity from sales_order_fact a, v_customer_zip_code_dim b, v_shipping_zip_code_dim c, v_order_date_dim d, customer_dim e, product_dim f where a.customer_sk = e.customer_sk and a.product_sk = f.product_sk and a.customer_zip_code_sk = b.customer_zip_code_sk and a.shipping_zip_code_sk = c.shipping_zip_code_sk and a.order_date_sk = d.order_date_sk order by a.order_number desc limit 2;查询结果如图3所示。
图3
查询v_pa_customer_dim_his视图,确认PA客户正确装载。
select customer_sk csk, customer_number cnum, customer_name cnam, customer_street_address csa, shipping_address sad, version, effective_date, expiry_date from v_pa_customer_dim_his order by customer_sk;查询结果如图4所示。
图4
相关文章推荐
- HAWQ取代传统数仓实践(八)——维度表技术之角色扮演维度
- HAWQ取代传统数仓实践(九)——维度表技术之退化维度
- HAWQ取代传统数仓实践(十)——维度表技术之杂项维度
- HAWQ取代传统数仓实践(七)——维度表技术之维度子集
- HAWQ取代传统数仓实践(十二)——维度表技术之分段维度
- HAWQ取代传统数仓实践(十三)——事实表技术之周期快照
- HAWQ取代传统数仓实践(十七)——事实表技术之累积度量
- HAWQ取代传统数仓实践(十六)——事实表技术之迟到的事实
- HAWQ取代传统数仓实践(十四)——事实表技术之累积快照
- HAWQ取代传统数仓实践(十五)——事实表技术之无事实的事实表
- HAWQ取代传统数仓实践(十八)——层次维度
- HAWQ取代传统数仓实践(五)——自动调度工作流(Oozie、Falcon)
- HAWQ取代传统数仓实践(二)——搭建示例模型(MySQL、HAWQ)
- HAWQ取代传统数仓实践(一)——为什么选择HAWQ
- HAWQ取代传统数仓实践(十九)——OLAP
- HAWQ取代传统数仓实践(三)——初始ETL(Sqoop、HAWQ)
- HAWQ取代传统数仓实践(四)——定期ETL(Sqoop、HAWQ)
- HAWQ取代传统数仓实践(六)——增加列
- Activiti(Flowable)流程克隆(运行时实例克隆)解决方案 一种订单制造中分裂拆单与合并的核心技术实践
- Activiti(Flowable)流程克隆(运行时实例克隆)解决方案 一种订单制造中分裂与合并的核心技术实践