
Apache Mahout source reading notes: FileDataModel, a DataModel implementation

2014-11-18 18:18
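User behavior data is the foundation of any recommender.

Which fields does that data contain?

Mahout's DataModel supports four fields: user ID and item ID are required, while the preference value (the user's rating of the item) and the timestamp are optional.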

userID,itemID[,preference[,timestamp]]
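To make the line format concrete, here is a minimal sketch of parsing one such line. It is not Mahout's parser: the class name PrefLine is hypothetical, and the sketch assumes a comma delimiter and treats a blank optional field as absent.

```java
/** Hypothetical holder for one parsed line; this is not a Mahout class. */
final class PrefLine {
    final long userID;
    final long itemID;
    final Float preference; // null when the column is missing or blank
    final Long timestamp;   // null when the column is missing or blank

    private PrefLine(long userID, long itemID, Float preference, Long timestamp) {
        this.userID = userID;
        this.itemID = itemID;
        this.preference = preference;
        this.timestamp = timestamp;
    }

    /** Parses "userID,itemID[,preference[,timestamp]]" (comma delimiter assumed). */
    static PrefLine parse(String line) {
        // limit -1 keeps trailing empty fields instead of silently dropping them
        String[] parts = line.split(",", -1);
        if (parts.length < 2 || parts.length > 4) {
            throw new IllegalArgumentException("Expected 2 to 4 fields: " + line);
        }
        long userID = Long.parseLong(parts[0].trim());
        long itemID = Long.parseLong(parts[1].trim());
        Float preference = null;
        if (parts.length >= 3 && !parts[2].trim().isEmpty()) {
            preference = Float.valueOf(parts[2].trim());
        }
        Long timestamp = null;
        if (parts.length == 4 && !parts[3].trim().isEmpty()) {
            timestamp = Long.valueOf(parts[3].trim());
        }
        return new PrefLine(userID, itemID, preference, timestamp);
    }
}
```

Calling PrefLine.parse("1,10") yields a record with no preference and no timestamp, while PrefLine.parse("1,10,4.5,1416305880") fills in all four fields.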

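Mahout data sources can be read from files or from a database.

Judging from the Javadoc in FileDataModel.java, the class does quite a lot of work:

1) The data is reloaded only once a certain interval has passed since the source file was updated.

2) Incremental updates are supported, so the full data set does not have to be re-copied every time.

3) Depending on the number of fields (with or without a preference value), a different storage structure is chosen to save memory.

In addition:

4) Mahout implements its own collections for primitive types to save memory.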

~/mahout-core/src/main/java/org/apache/mahout/cf/taste/impl/common/FastIDSet.java

~/mahout-core/src/main/java/org/apache/mahout/cf/taste/impl/common/FastByIDMap.java

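Both of these custom data types use hashing for fast lookup, and both avoid Java's Long wrapper class by storing primitive long values directly, which saves memory.

FastMap.java is a related class of the same kind.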
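The idea behind a hash set keyed on primitive long values can be illustrated with a small sketch. This is a simplified illustration, not Mahout's FastIDSet: the class name LongHashSetSketch is hypothetical, it uses plain linear probing, it reserves Long.MIN_VALUE as an "empty" sentinel, and it leaves out removal.

```java
import java.util.Arrays;

/** Simplified sketch (not Mahout's FastIDSet): open-addressing set of primitive longs. */
final class LongHashSetSketch {
    // Sentinel marking an unused slot; as a consequence this value cannot be stored.
    private static final long EMPTY = Long.MIN_VALUE;

    private long[] keys; // one flat long[]: no Long objects, no per-entry nodes
    private int size;

    LongHashSetSketch(int expectedSize) {
        int capacity = 16;
        while (capacity < expectedSize * 2) {
            capacity <<= 1; // keep capacity a power of two so masking works
        }
        keys = new long[capacity];
        Arrays.fill(keys, EMPTY);
    }

    /** Adds the key; returns false if it was already present. */
    boolean add(long key) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved");
        }
        if ((size + 1) * 2 > keys.length) {
            grow(); // keep the load factor at or below 0.5
        }
        int i = slot(key, keys);
        if (keys[i] == key) {
            return false;
        }
        keys[i] = key;
        size++;
        return true;
    }

    boolean contains(long key) {
        return key != EMPTY && keys[slot(key, keys)] == key;
    }

    int size() {
        return size;
    }

    /** Linear probing: returns the slot holding the key, or the first empty slot. */
    private static int slot(long key, long[] table) {
        int mask = table.length - 1;
        int i = (int) (key ^ (key >>> 32)) & mask;
        while (table[i] != EMPTY && table[i] != key) {
            i = (i + 1) & mask;
        }
        return i;
    }

    private void grow() {
        long[] bigger = new long[keys.length * 2];
        Arrays.fill(bigger, EMPTY);
        for (long k : keys) {
            if (k != EMPTY) {
                bigger[slot(k, bigger)] = k;
            }
        }
        keys = bigger;
    }
}
```

Compared with a HashSet<Long>, each element here costs one 8-byte array slot (plus the spare capacity) rather than a boxed Long and an entry object, which is the kind of saving the note above refers to.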

* <p>
* Incremental update mechanism: This class will also look for update "delta" files in the same
* directory, with file names that start the same way (up to the first period).
* These files have the same format, and provide updated data that supersedes
* what is in the main data file. This is a mechanism that allows an application
* to push updates to {@link FileDataModel} without re-copying the entire data
* file.
*
* Delta files live in the same directory and are told apart by a number:
* Finds update delta files in the same directory as the data file. This finds
* any file whose name starts the same way as the data file (up to first period)
* but isn't the data file itself. For example, if the data file is
* /foo/data.txt.gz, you might place update files at /foo/data.1.txt.gz,
* /foo/data.2.txt.gz, etc.
* </p>
*
* <p>
* Delete syntax (blank preference): One small format difference exists. Update files must also be
* able to express deletes. This is done by ending with a blank preference
* value, as in "123,456,".
* </p>
*
* <p>
* Deletes and updates cannot be mixed within one delta file: Note that it's all-or-nothing -- all of the items in the
* file must express no preference, or the all must. These cannot be mixed. Put
* another way there will always be the same number of delimiters on every line
* of the file!
* </p>
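The two rules quoted above, delta-file naming and delete-by-blank-preference, can be sketched as follows. This is an illustration under stated assumptions, not the code in FileDataModel: DeltaSketch and its methods are hypothetical names, the file selection works on plain file names rather than on the file system, the order in which deltas are applied is left out, and a boxed HashMap is used purely for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating the delta-file rules; not Mahout code. */
final class DeltaSketch {

    /** Picks names that start like the data file (up to its first period) but are not the data file. */
    static List<String> selectDeltas(String dataFileName, List<String> namesInDirectory) {
        int period = dataFileName.indexOf('.');
        String prefix = period < 0 ? dataFileName : dataFileName.substring(0, period);
        List<String> deltas = new ArrayList<>();
        for (String name : namesInDirectory) {
            if (name.startsWith(prefix) && !name.equals(dataFileName)) {
                deltas.add(name);
            }
        }
        return deltas;
    }

    /** Applies one delta line: "user,item,value" upserts, "user,item," (blank value) deletes. */
    static void applyLine(Map<Long, Map<Long, Float>> prefs, String line) {
        String[] parts = line.split(",", -1); // -1 keeps the trailing blank field
        if (parts.length < 3) {
            throw new IllegalArgumentException("This sketch needs a preference column: " + line);
        }
        long userID = Long.parseLong(parts[0].trim());
        long itemID = Long.parseLong(parts[1].trim());
        String value = parts[2].trim();
        if (value.isEmpty()) {
            // Blank preference means: delete this user's preference for the item.
            Map<Long, Float> items = prefs.get(userID);
            if (items != null) {
                items.remove(itemID);
                if (items.isEmpty()) {
                    prefs.remove(userID);
                }
            }
        } else {
            // Otherwise the delta value supersedes whatever the main file said.
            prefs.computeIfAbsent(userID, u -> new HashMap<>()).put(itemID, Float.valueOf(value));
        }
    }
}
```

With a data file named data.txt.gz, selectDeltas keeps data.1.txt.gz and data.2.txt.gz and skips both data.txt.gz itself and unrelated names such as other.txt. Applying "123,456,5.0" and then "123,456," first overwrites and then removes the preference of user 123 for item 456.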


FileDataModel encapsulates reading from files; the actual in-memory storage is delegated to GenericDataModel.



How the data is held in memory is covered in detail in a separate article, so I won't dwell on it here.