您的位置:首页 > 数据库 > Mongodb

Mongodb数据统计分析:group by、count(distinct userId)

yown 2016-12-30 22:31 127 查看
最近在做mongodb数据统计查询,需求:统计一段时间内每天的分享次数和分享人数? 查了众多资料,居然未找到好的方案,最终还是自己写函数解决了,现分享出来(若有好的方法请指出):

表t_share_log结构(习惯这么叫,虽mongodb中为collection):

字段名类型说明
idstring记录id
gmtCreatedate创建时间
userIdint用户id
urlstring分享地址
2. 那mongodb中如何写呢?我们先看mysql写法:

select
date_format(gmtCreate,'%Y%m%d') day,
count(1) shareCount,
count(distinct userId) shareUserCount
from t_share_log
where gmtCreate between str_to_date(20161001, '%Y%m%d') and str_to_date(20161231, '%Y%m%d');


那么在mongodb中如何表达呢? 几经调试,先通过定义keyf的function格式化group by的日期值,再通过定义reduce:function函数借助userIdMap去重userId(相当于dintinct userId),最后完整的命令如下:

mongodb命令行:
db.runCommand({group:
{
ns:"t_share_log",
cond : { "$and":[{"gmtCreate":{"$gt":new ISODate("2016-10-01T00:00:00.000Z")}}, {"gmtCreate":{"$lt":new ISODate("2016-12-31T23:59:59.999Z")}}]},
$keyf:function(doc){
var myDate = new Date(doc.gmtCreate);
var mm = '0'+(myDate.getMonth()+1);
var dd = '0'+myDate.getDate();
return {day:myDate.getFullYear()+''+mm.substring(mm.length-2)+''+dd.substring(dd.length-2)};
},
initial:{"shareCount" : 0 , "shareUserCount" : 0 , "userIdMap" : {}},
$reduce:function(doc, prev){
if(doc.userId != null){
prev.shareCount ++;
if(prev.userIdMap[doc.userId] == null) {
prev.shareUserCount ++;
prev.userIdMap[doc.userId] = 1;
}
}
},
finalize: function(doc){ delete doc.userIdMap; }
}
});


运行以上语句,得到如下结果(正确无误,是不是很有意思呢):

{
"retval" : [
{
"day" : "20161129",
"shareCount" : 12,
"shareUserCount" : 4
},
{
"day" : "20161130",
"shareCount" : 21,
"shareUserCount" : 9
}
],
"count" : NumberLong(174),
"keys" : NumberLong(8),
"ok" : 1
}


3.最后我们用java来实现,代码如下:

private GroupByResults<Map> queryShareStatistics(String beginTime, String endTime)
{
Criteria criteria = null;

if(!StringUtils.isEmpty(beginTime)) {
try
{
criteria = Criteria.where("gmtCreate").gte(Utils.SDF_FULLTIME_FORMAT.parse(beginTime));
}
catch (ParseException e)
{
logger.error(e);
}
}
if(!StringUtils.isEmpty(endTime)) {
try{
if(criteria == null) {
criteria = Criteria.where("gmtCreate").lte(Utils.SDF_FULLTIME_FORMAT.parse(endTime));
}else {
criteria.lte(Utils.SDF_FULLTIME_FORMAT.parse(endTime));
}
}
catch (ParseException e)
{
logger.error(e);
}
}

if(criteria == null) {
criteria = new Criteria();
}

//按日统计分享次数与人数
GroupBy groupBy = GroupBy.keyFunction("function(doc){"

+ "var myDate = new Date(doc.gmtCreate);"
+ "          var mm = '0'+(myDate.getMonth()+1); "
+ "          var dd = '0'+myDate.getDate();"
+ "          return {day:myDate.getFullYear()+''+mm.substring(mm.length-2)+''+dd.substring(dd.length-2)};"
+ "}");

groupBy.initialDocument("{shareCount:0, shareUserCount:0, userIdMap:{}}")
.reduceFunction("function(doc, prev){ "
+ "if(doc.userId != null){"
+ "  prev.shareCount ++;"
+ "  if(prev.userIdMap[doc.userId] == null) { "
+ "      prev.shareUserCount ++; "
+ "      prev.userIdMap[doc.userId] = 1;"
+ "  }"
+ "}"
+ "}")
.finalizeFunction("function(prev){delete prev.userIdMap}");

GroupByResults<Map> r = statisticsTemplate.group(criteria, "t_share_log", groupBy, Map.class);

return r;
}


综上所述,先通过定义keyf的function格式化group by的日期值,再通过定义reduce:function函数借助userIdMap去重userId(相当于dintinct userId)。欢迎大家提出更好的办法!