
[Repost] Roundup: LDA theory, variants, optimization, applications, and toolkits

2015-12-14 20:16
Original post: http://site.douban.com/204776/widget/notes/12599608/note/287085506/

2013-07-08 19:22:18

http://www.douban.com/note/287085419/

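Enough said: I have been completely obsessed these past few days.

My own LDA framework is now in order. Once I go back over everything collected here, it should all finally click!

#LDA Theory

Roundup of topic model papers: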

http://site.douban.com/204776/widget/notes/12599608/note/286839088/

##Survey:

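1. 基于文档主题结构的关键词抽取方法研究 (Research on Keyword Extraction Based on Document Topic Structure)

Liu Zhiyuan's PhD thesis. As I recall, he was the author of the Weibo keyword application at the time.

It also proposes some improvements for short texts.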

2. Parameter estimation for text analysis

This one is definitely a heavyweight.

##Short-Text:

1. Automatic Keyphrase Extraction by Bridging Vocabulary Gap

##Practice / In Action (especially in Chinese)

1. A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese

2. A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora

Statistical Substring Reduction in Linear Time

3. The Mathematics of Statistical Machine Translation: Parameter Estimation

##Anecdote:

LDA数学八卦 (LDA Math Gossip)

Written by rickjin and serialized on 统计之都 (Capital of Statistics).

http://vdisk.weibo.com/s/qghK5

##LDA variation:

One researcher has recently done very strong work summarizing the various LDA variants.

See two of her recent papers:

1. On the design of LDA models for aspect-based opinion mining

2. The FLDA model for aspect-based opinion mining: addressing the cold start problem (WWW'13)

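##An archive of almost all the LDA papers I have read

Some of them are annotated with highlights (marked -noted):

It includes some of the papers mentioned above, plus many more.

You can go straight to the noted folder; I did not find the papers I left unannotated useful.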

http://vdisk.weibo.com/s/BA3xC

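#LDA Optimization

Roundup of papers on optimized LDA implementations:

http://site.douban.com/204776/widget/notes/12599608/note/286923972/

I find these the most valuable in practice: corpora can be very large, so implementation-level optimization becomes necessary.

Fast inference algorithms: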

Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation

Online learning:

Online Learning for Latent Dirichlet Allocation

http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf

http://videolectures.net/nips2010_hoffman_oll/

www.ece.duke.edu/~lcarin/Lingbo4.15.2011.pptx

Inference algorithms for text streams:

Topic models over text streams: a study of batch and online unsupervised learning

Efficient Methods for Topic Model Inference on Streaming Document Collections

Distributed learning:

Distributed Inference for Latent Dirichlet Allocation

PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing

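#LDA Applications

LDA variants for applications:

http://site.douban.com/204776/widget/notes/12599608/note/286930572/

Here are several LDA variants for different applications. Each makes subtle adjustments to the model, and each brings new problems with it.

##Sentiment Analysis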

Opinion Integration Through Semi-supervised Topic Modeling

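Takes the traditional topic model, a typical unsupervised method, and extends it to a semi-supervised one by adding prior information to the model. For some automotive products, descriptions of each product feature are extracted from Wikipedia and then turned into priors.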

Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid

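Jointly extracts topics (aspects) and opinions. A supervised learning component is introduced to separate topic words from sentiment words, and LDA is then used for clustering.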

##Academic Mining

For example author modeling, which also appeared at KDD 2013 this year, or the detection of academic hot topics...

The author-topic model for authors and documents

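Models authors and topics jointly. Each word in a document is attributed to exactly one of the document's authors and to one topic. Each author is a distribution over topics, and this author-topic distribution replaces the document-topic distribution of standard LDA.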
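The generative process of the author-topic model can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: the function names, the toy parameters (2 authors, 2 topics, 4 vocabulary words), and the choice to pick the author uniformly among the co-authors are made up for the example and are not taken from any particular implementation.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index i with probability proportional to probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(authors, author_topic, topic_word, n_words, rng):
    """Generate one document under the author-topic generative process.

    authors      : list of author ids for this document
    author_topic : author id -> list of topic probabilities
    topic_word   : topic id  -> list of word probabilities
    Returns a list of (author, topic, word) triples, one per token.
    """
    tokens = []
    for _ in range(n_words):
        a = rng.choice(authors)                       # one co-author, chosen uniformly
        z = sample_categorical(author_topic[a], rng)  # topic from that author's mixture
        w = sample_categorical(topic_word[z], rng)    # word from that topic
        tokens.append((a, z, w))
    return tokens

# Hypothetical toy parameters, for illustration only.
author_topic = {0: [0.9, 0.1], 1: [0.2, 0.8]}
topic_word = [[0.5, 0.5, 0.0, 0.0],   # topic 0 only emits words 0 and 1
              [0.0, 0.0, 0.5, 0.5]]   # topic 1 only emits words 2 and 3
rng = random.Random(0)
doc = generate_document([0, 1], author_topic, topic_word, n_words=20, rng=rng)
```

Note that there is no per-document topic mixture anywhere in the sketch: a document's topical content comes entirely from the mixtures of its authors.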

Joint latent topic models for text and citations

Models topics and citations jointly, establishing citation links.

Detecting Topic Evolution in Scientific Literature: How Can Citations Help?

Builds a topic evolution model from citation information.

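##Social Media Topics

There is far too much research on Twitter, and the SNA section of this site has already summarized a lot of it, so I will not write more here.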

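#LDA Toolkits

Roundup of LDA toolkits:

http://site.douban.com/204776/widget/notes/12599608/note/287084873/

(This section is still missing R packages; I will comment on them once I have used them myself.)

First, a nicely formatted link (though the list there is incomplete):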

http://mengjunxie.github.io/ae-lda/topic-modeling.html


####

Latent Dirichlet allocation

http://www.cs.princeton.edu/~blei/lda-c/

This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze a corpus and extract the topics that combined to form its documents. As an example, the project page shows the topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003).

####

Discrete Component Analysis

http://www.nicta.com.au/people/buntinew/discrete_component_analysis

The Discrete Component Analysis (DCA) software is being developed as a stand-alone package, and as a plug-in to the Elefant system, a machine learning toolbox from NICTA. Currently the software is being run in stand-alone mode using the data streaming libraries from the older and now unsupported MPCA system, developed at Helsinki Institute for IT. The software itself is written in the C language and compiles on a Linux and a Mac OS X environment.

The models presented here are known under many names, such as latent Dirichlet allocation, multi-aspect models, multinomial PCA, and non-negative matrix factorisation.

####

Infinite LDA

http://www.arbylon.net/projects/knowceans-ilda/readme.txt

https://bitbucket.org/gchrupala/colada/wiki/Resources

Implementations of Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Processes (HDP)

@author Gregor Heinrich, gregor :: arbylon : net

@version 0.96

@date 1 Mar 2011

- History: ILDA version 0.1: May 2008, LDA version 0.1: Feb. 2005, based on http://arbylon.net/projects/LdaGibbsSampler.java

- Simple implementations of Gibbs sampling for LDA and HDP

- Scientific documentation: see texts lda.pdf and ilda.pdf

- Technical documentation: see Javadoc and source (packages *.corpus and *.utils are from knowceans-tools on SourceForge)

- Data documentation: see nips/readme.txt including source references

- License: All code is licensed under GPL v3.0.

- If the code is used in scientific work, please refer to its source via the URL: http://arbylon.net/projects/knowceans-ilda.zip or the documentation of the ILDA or LDA implementations:

G. Heinrich. "Infinite LDA" -- implementing the HDP with minimum code complexity. TN2011/1, http://arbylon.net/publications/ilda.pdf, 2011

G. Heinrich. Parameter estimation for text analysis. Technical report, No. 09RP008-FIGD, Fraunhofer IGD, 2009

TODO:

- Diverse checks, e.g., Antoniak distribution sampling, hyperparameter estimators, general quantitative validation of HDP model

- Output formatting

- Visual matrix implementation for HDP / IldaGibbs

####

MAchine Learning for LanguagE Toolkit

http://mallet.cs.umass.edu/

MALLET is open source software. For research use, please remember to cite MALLET.

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

####

Multithreaded LDA

https://sites.google.com/site/rameshnallapati/software

Multithreaded extension of Blei's LDA implementation, written in C by Ramesh Nallapati. Speeds up the computation by orders of magnitude depending on the number of processors.

####

GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation

http://gibbslda.sourceforge.net/

GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference. It is very fast and is designed to analyze hidden/latent topic structures of large-scale datasets including large collections of text/Web documents. LDA was first introduced by David Blei et al [Blei03]. There have been several implementations of this model in C (using Variational Methods), Java, and Matlab. We decided to release this implementation of LDA in C/C++ using Gibbs Sampling to provide an alternative to the topic-model community.

GibbsLDA++ is useful for the following potential application areas:

Information retrieval and search (analyzing semantic/latent topic/concept structures of large text collection for a more intelligent information search).

Document classification/clustering, document summarization, and text/web mining community in general.

Content-based image clustering, object recognition, and other applications of computer vision in general.

Other potential applications in biological data.
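To make "Gibbs sampling for parameter estimation" concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in pure Python. It is not GibbsLDA++'s code or interface: the function name `lda_gibbs`, the default hyperparameters, and the toy corpus are assumptions chosen for illustration, and the sketch ignores every optimization a real tool would need.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # total tokens per topic
    z = []                                              # topic of every token

    # Random initialization of all topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional p(z = t | all other assignments), unnormalized.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and add the token back.
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw

# Hypothetical toy corpus: words 0/1 co-occur, as do words 2/3.
docs = [[0, 1, 0, 1, 0], [1, 0, 1, 1], [2, 3, 2, 3], [3, 2, 3, 3, 2]]
n_dk, n_kw = lda_gibbs(docs, n_topics=2, vocab_size=4)
```

The returned counts can be smoothed with `alpha` and `beta` and normalized to estimate the document-topic and topic-word distributions. The sampler is "collapsed" in the sense that those distributions are integrated out and only the per-token topic assignments are sampled.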

####

Gensim

http://radimrehurek.com/gensim/

Gensim is a FREE Python library

Scalable statistical semantics

Analyze plain-text documents for semantic structure

Retrieve semantically similar documents

####

Stanford Topic Modeling Toolbox

http://nlp.stanford.edu/software/tmt/tmt-0.4/

The Stanford Topic Modeling Toolbox (TMT) brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features the ability to:

Import and manipulate text from cells in Excel and other spreadsheets.

Train topic models (LDA, Labeled LDA, and PLDA) to create summaries of the text.

Select parameters (such as the number of topics) via a data-driven process.

Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data.

The Stanford Topic Modeling Toolbox was written at the Stanford NLP group by Daniel Ramage and Evan Rosen, first released in September 2009.

####

Matlab Topic Modeling Toolbox 1.4

http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm

Installation & Licensing

Download the zipped toolbox (18Mb).

NOTE: this toolbox now works with 64-bit compilers. The old version of the toolbox, with the code for 32-bit compilers, is available as a separate download.

The program is free for scientific use. Please contact the authors if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. By using this software, you are agreeing to this license statement.

Type 'help function' at the command prompt for more information on each function.

Read the notes on data format for a description of the input and output formats for the different topic models.

Note for Mac and Linux users: some of the Matlab functions are implemented with mex code (C code linked to Matlab). For Windows-based platforms, the DLLs are already provided in the distribution package. For other platforms, please compile the mex functions by executing "compilescripts" at the Matlab prompt.

#

Last of all,

here is a Topic Modeling Bibliography:

http://www.cs.princeton.edu/~mimno/topics.html
