
Latent Dirichlet Allocation (LDA): Topic Model Algorithm Implementation and Source Code Walkthrough

2015-04-14 01:10 · 537 views
Variable descriptions: (figure not preserved in this copy)

The overall program flow is shown in a flow chart. (figure not preserved in this copy)

Code Walkthrough

1. Reading the documents

First, the documents in the corpus are read in. Each line of the file begins with a number giving how many distinct words the document contains, followed by entries of the form id:count, where id is the word's id and count is the number of times that word occurs in the document. The documents read in are stored in the document struct:

typedef struct {
    int len;      /* document length (number of distinct words) */
    int *id;      /* word id array */
    double *cnt;  /* per-word occurrence counts */
} document;

The file is opened in main():

if ((data = feature_matrix(argv[optind], &nlex, &dlenmax)) == NULL) {
    fprintf(stderr, "lda:: cannot open training data.\n");
    exit(1);
}
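To make the line format above concrete, here is a minimal sketch of parsing one corpus line into a document. The helper name parse_document_line is hypothetical and this is not the original feature_matrix implementation, only an illustration of the "len id:count id:count ..." format it reads:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int len;      /* number of distinct words in the document */
    int *id;      /* word id array */
    double *cnt;  /* per-word occurrence counts */
} document;

/* Hypothetical helper (not part of the original source): parse one corpus
   line of the form "len id:count id:count ..." into a document.
   Returns 0 on success, -1 on malformed input. */
int parse_document_line(const char *line, document *d)
{
    char *buf = malloc(strlen(line) + 1);
    char *tok;
    int i;
    if (buf == NULL)
        return -1;
    strcpy(buf, line);
    tok = strtok(buf, " \t\n");
    if (tok == NULL) { free(buf); return -1; }
    d->len = atoi(tok);
    d->id  = malloc(d->len * sizeof(int));
    d->cnt = malloc(d->len * sizeof(double));
    for (i = 0; i < d->len; i++) {
        tok = strtok(NULL, " \t\n");
        if (tok == NULL || sscanf(tok, "%d:%lf", &d->id[i], &d->cnt[i]) != 2) {
            free(buf); free(d->id); free(d->cnt);
            return -1;
        }
    }
    free(buf);
    return 0;
}
```

For example, the line "3 0:2 5:1 9:4" describes a document with 3 distinct words: word 0 appearing twice, word 5 once, and word 9 four times.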

2. Calling lda_learn to start learning

void lda_learn (document *data, double *alpha, double **beta,
                int nclass,       /* number of topics */
                int nlex,         /* vocabulary size */
                int dlenmax,      /* maximum document length */
                int emmax,        /* maximum number of EM iterations */
                int demmax,       /* number of inference iterations */
                double epsilon)   /* convergence threshold */
{ ... }

1) First alpha is randomly initialized and normalized, and beta is uniformly initialized and normalized. Normalization here means making the rows or columns sum to 1.

for (i = 0; i < nclass; i++)
    alpha[i] = RANDOM;
for (i = 0, z = 0; i < nclass; i++)
    z += alpha[i];
for (i = 0; i < nclass; i++)
    alpha[i] = alpha[i] / z;
qsort(alpha, nclass, sizeof(double),   /* sort alpha */
      (int (*)(const void *, const void *))doublecmp);

/* initialize beta */
for (i = 0; i < nlex; i++)
    for (j = 0; j < nclass; j++)
        beta[i][j] = (double) 1 / nlex;

2) Initialize the sufficient statistics and the variables needed for variational inference.

gammas and betas are essentially the sums of gamma and beta: gamma and beta here describe a single document, while gammas and betas aggregate over all documents.

if ((gammas = dmatrix(n, nclass)) == NULL) {
    fprintf(stderr, "lda_learn:: cannot allocate gammas.\n");
    return;
}
if ((betas = dmatrix(nlex, nclass)) == NULL) {
    fprintf(stderr, "lda_learn:: cannot allocate betas.\n");
    return;
}
/* initialize buffers */
if ((q = dmatrix(dlenmax, nclass)) == NULL) {
    fprintf(stderr, "lda_learn:: cannot allocate q.\n");
    return;
}
if ((gamma = (double *)calloc(nclass, sizeof(double))) == NULL) {
    fprintf(stderr, "lda_learn:: cannot allocate gamma.\n");
    return;
}
if ((ap = (double *)calloc(nclass, sizeof(double))) == NULL) {
    fprintf(stderr, "lda_learn:: cannot allocate ap.\n");
    return;
}
if ((nt = (double *)calloc(nclass, sizeof(double))) == NULL) {
    fprintf(stderr, "lda_learn:: cannot allocate nt.\n");
    return;
}
if ((pnt = (double *)calloc(nclass, sizeof(double))) == NULL) {
    fprintf(stderr, "lda_learn:: cannot allocate pnt.\n");
    return;
}

3) Start the EM iterations; the loop exits once the iteration count exceeds the configured maximum.

for (t = 0; t < emmax; t++)
{
    printf("iteration %d/%d..\t", t + 1, emmax);
    fflush(stdout);
    /* VB-E step */
    for (dp = data, i = 0; (dp->len) != -1; dp++, i++)
    {
        vbem(dp, gamma, q, nt, pnt, ap, alpha,
             (const double **)beta, dp->len, nclass, demmax);
        /* Accumulate the resulting gamma and phi (the variable q) into
           gammas and betas: gamma and q describe only the current
           document, while gammas and betas aggregate over all documents
           and are later used to re-estimate alpha and beta. */
        accum_gammas(gammas, gamma, i, nclass);
        accum_betas(betas, q, nclass, dp);
    }

    /* VB-M step */
    /* Estimating beta is really just a normalization step, while alpha
       is estimated with Newton's method. */
    newton_alpha(alpha, gammas, n, nclass, 0);

    /* Normalize the columns of the matrix: sum each column, then divide
       every element by that sum. */
    normalize_matrix_col(beta, betas, nlex, nclass);
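Based on what the surrounding text says these helpers do, they can be sketched as follows. These are hypothetical reconstructions (the _sketch suffix marks them as assumptions), not the original implementations:

```c
typedef struct {
    int len;
    int *id;
    double *cnt;
} document;

/* Store document n's variational gamma as row n of gammas. */
void accum_gammas_sketch(double **gammas, const double *gamma,
                         int n, int nclass)
{
    int k;
    for (k = 0; k < nclass; k++)
        gammas[n][k] = gamma[k];
}

/* Add cnt[i] * phi[i][k] into the row of betas for each word id,
   building the unnormalized topic-word counts. */
void accum_betas_sketch(double **betas, double **q,
                        int nclass, const document *dp)
{
    int i, k;
    for (i = 0; i < dp->len; i++)
        for (k = 0; k < nclass; k++)
            betas[dp->id[i]][k] += dp->cnt[i] * q[i][k];
}

/* Sum each column of src, then divide every element by that sum,
   writing the result to dst, so each column of dst sums to 1. */
void normalize_matrix_col_sketch(double **dst, double **src,
                                 int rows, int cols)
{
    int i, j;
    for (j = 0; j < cols; j++) {
        double z = 0.0;
        for (i = 0; i < rows; i++)
            z += src[i][j];
        for (i = 0; i < rows; i++)
            dst[i][j] = src[i][j] / z;
    }
}
```

After column normalization, each column of beta is a proper distribution over the vocabulary for one topic, which is exactly the closed-form M-step for beta.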

4) Compute the likelihood and check for convergence.

    lik = lda_lik(data, beta, gammas, n, nclass);

Original post: http://blog.sina.com.cn/s/blog_8eee7fb60101d06p.html