
scikit-learn: Implementing LSA (Latent Semantic Analysis) with TruncatedSVD

2015-07-13 21:10 453 views

http://scikit-learn.org/stable/modules/decomposition.html#lsa

Section 2.5.2:

2.5.2. Truncated singular value decomposition and latent semantic analysis (truncated SVD and LSA/LSI)

A note first: latent semantic indexing (LSI) and latent semantic analysis (LSA) are essentially the same thing.

TruncatedSVD is a variant of SVD that computes only the k largest singular values, where k is specified by the user.
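As a rough illustration of "only the k largest singular values", the sketch below compares TruncatedSVD with a full SVD from NumPy. The random toy matrix and the choice of k = 3 are assumptions made for the example. The "arpack" solver is chosen here only because it gives near-exact values on a small matrix; the default solver is "randomized".

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy data (an assumption for illustration): 20 samples, 8 features.
rng = np.random.RandomState(0)
X = rng.rand(20, 8)

k = 3  # number of singular values/vectors to keep
svd = TruncatedSVD(n_components=k, algorithm="arpack", random_state=0)
X_reduced = svd.fit_transform(X)  # shape (20, k)

# A full SVD returns all 8 singular values in descending order.
full_singular_values = np.linalg.svd(X, compute_uv=False)

print(svd.singular_values_)      # the k values kept by TruncatedSVD
print(full_singular_values[:k])  # the k largest values of the full SVD
```

The two printed arrays should agree up to numerical precision, while TruncatedSVD never computes the remaining singular values.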

Truncated SVD applied to term-document matrices (as returned by CountVectorizer or TfidfVectorizer) is known as latent semantic analysis (LSA), because it transforms such matrices into a low-dimensional "semantic" space.
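A minimal sketch of this transformation, assuming a made-up five-document corpus and an arbitrary choice of 2 components. Note that scikit-learn's vectorizers return a matrix with one row per document and one column per term.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
]

# Sparse count matrix: one row per document, one column per term.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Project the documents into a 2-dimensional "semantic" space.
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)

print(X.shape)      # (number of documents, vocabulary size)
print(X_lsa.shape)  # (number of documents, 2)
```

Each row of X_lsa is a dense, low-dimensional representation of one document. On a real corpus the number of components would usually be much larger than 2.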

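One more point: TruncatedSVD is similar to PCA. The difference is that TruncatedSVD works directly on the sample matrix X and does not center it, whereas PCA works on the centered data, which is equivalent to decomposing the covariance matrix of X. (If the per-feature means are subtracted first, TruncatedSVD and PCA give the same result. In practice, centering a high-dimensional sparse matrix X turns it into a dense one, which can easily exhaust memory even for a medium-sized text collection. TruncatedSVD accepts scipy.sparse matrices directly without any densify step, so for sparse text data it is the recommended choice over PCA.)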
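The relationship to PCA can be checked numerically. The sketch below assumes random toy data, and it compares absolute values because each component is only defined up to sign. The second part shows TruncatedSVD fitting a scipy.sparse matrix directly.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD

# Toy dense data (an assumption for illustration).
rng = np.random.RandomState(0)
X = rng.rand(30, 6)

# Subtract the per-feature (column) means by hand.
X_centered = X - X.mean(axis=0)

pca = PCA(n_components=2, svd_solver="full")
svd = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0)

Z_pca = pca.fit_transform(X)           # PCA centers X internally
Z_svd = svd.fit_transform(X_centered)  # TruncatedSVD on pre-centered X

# Compare absolute values, since component signs are arbitrary.
same = np.allclose(np.abs(Z_pca), np.abs(Z_svd), atol=1e-6)
print(same)

# TruncatedSVD takes a sparse matrix as-is, with no densify step.
X_sparse = sparse.random(100, 50, density=0.05, format="csr", random_state=0)
Z_sparse = TruncatedSVD(n_components=5, random_state=0).fit_transform(X_sparse)
print(Z_sparse.shape)
```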

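When using TruncatedSVD for LSA/document processing, a tf-idf matrix is recommended over a raw tf (term frequency) matrix. In particular, set sublinear_tf=True and use_idf=True so that the feature values are closer to a Gaussian distribution, which compensates for LSA's erroneous assumptions about textual data.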
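To make the effect of sublinear_tf concrete, the sketch below uses a made-up four-document corpus. The first part switches off idf weighting and normalization (an assumption made purely to isolate the effect), so a raw count tf is replaced by 1 + ln(tf). The second part uses the recommended settings and feeds the result to TruncatedSVD.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
docs = [
    "apple apple apple banana",
    "banana cherry cherry",
    "apple cherry date date date date",
    "date banana apple",
]

# use_idf=False and norm=None isolate the effect of sublinear_tf.
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs)
sub_tf = TfidfVectorizer(use_idf=False, norm=None,
                         sublinear_tf=True).fit_transform(docs)

# Columns are ordered alphabetically: apple, banana, cherry, date.
print(raw_tf.toarray()[0])  # raw counts for the first document
print(sub_tf.toarray()[0])  # each nonzero count tf becomes 1 + ln(tf)

# Settings recommended in the text, followed by LSA.
tfidf = TfidfVectorizer(sublinear_tf=True, use_idf=True)
X_tfidf = tfidf.fit_transform(docs)
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)
print(X_lsa.shape)
```

Sublinear scaling dampens the influence of terms that repeat many times within one document, which pulls the heavily skewed count distribution toward something less extreme.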

Finally, an example: Clustering text documents using k-means
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py
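For a quick feel of the idea without opening the link, here is a much smaller sketch. It is not the linked example: the corpus, the 2 components, the 2 clusters, and the Normalizer step (which puts the LSA vectors on the unit sphere before k-means) are all assumptions made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus with two rough topics.
docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
    "markets and investors watch stock prices",
]

# tf-idf -> truncated SVD (LSA) -> unit-length rows.
lsa = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, use_idf=True),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(docs)

# Cluster the documents in the reduced space.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X_lsa)
print(labels)  # one cluster id per document
```

Cluster ids are arbitrary, and on a corpus this small the grouping may not match the intended topics. The linked example runs on a real dataset and reports clustering metrics.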
