Lemur: a system for language modeling and information retrieval research
2012-04-09 12:13
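Introduction: Lemur is a system for language modeling and information retrieval research, developed jointly by CMU and UMass. It supports ad hoc and distributed retrieval based on language models, the traditional vector space model, and Okapi, as well as structured queries, cross-language retrieval, filtering, clustering, and more. The description this introduction is based on names 3.0 as the latest version and says that CMU and UMass would release a new version, Indri, in September. Indri adds support for terabyte-scale collections (1 TB = 1000 GB) and for queries over structured documents (for example, parsing an HTML document into different document representations using its structural markup: tags, title, meta, and so on).

What do you need to run Lemur? Lemur runs on both Windows and Unix, so it can be used directly on Windows. However, Lemur ships shell scripts that demonstrate a complete retrieval run, so on Windows you need to install cygwin to emulate a Unix environment in order to use them. Lemur also provides a GUI program and a CGI front end for interactive use. The GUI is a Java program that displays retrieval results directly, so a Java virtual machine is required; the CGI program requires a Perl interpreter.

Download: http://www.lemurproject.org/ Double-click lemur there to see the releases from 4.3 up to the latest version.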
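To give a feel for what Okapi-style retrieval computes, here is a minimal sketch of BM25 scoring in plain Python. It does not use the Lemur API. The function name `bm25_scores`, the toy documents, and the parameter defaults (`k1=1.2`, `b=0.75`) are assumptions chosen for illustration, and the idf variant shown is one common form rather than necessarily the one Lemur implements.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            # Smoothed idf (always positive), a common BM25 variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy collection, already tokenized.
docs = [
    "lemur toolkit language modeling".split(),
    "indri query language".split(),
    "cygwin unix environment".split(),
]
scores = bm25_scores(["lemur", "language"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

The first document matches both query terms and ranks highest; the third matches none and scores zero. A real system computes the same quantities from an inverted index instead of scanning every document.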
Directory layout: ..\Lemur 4.9\

bin\ Lemur Toolkit applications: executables and scripts that can be called directly from the command line. See windoc\lemur-applications.html for details.
include\ The Lemur include files.
lib\ The Lemur library.
windoc\ Overview of the Lemur Toolkit:
    Overview of the Lemur Toolkit Installed Applications
    Using the Lemur Toolkit API
  Indexing:
    Indexing Overview
    Document Formats
  Retrieval:
    Batch Retrieval Methods
    The Indri Query Language for Retrieval
    The InQuery Query Language for Retrieval
src_vs_2005\ The complete Lemur Toolkit source code for the Microsoft platform.
javadoc\ Java API documentation.
GUI\ RetUI.jar provides a basic document retrieval GUI for interactive queries, using the Indri API. IndexUI.jar provides a basic collection indexing GUI for building an Indri repository. LemurRet.jar provides a basic document retrieval GUI for interactive queries using the Lemur API. LemurIndex.jar provides a basic collection indexing GUI for building Lemur indexes. lemur.jar and indri.jar are for the Lemur and Indri APIs.
doc\ Lemur Toolkit Documentation, for example: Namespace List | Class Hierarchy | Alphabetical List | Class List | Directories | File List | Namespace Members | Class Members | File Members | Related Pages
CSharp\ The C# wrapper classes assembly is LemurCsharp.dll. This assembly should be referenced by your C# program.

Ways to use Lemur:

(1) Use the Lemur programs as they are, that is, the executables under bin\.

(2) Building applications using Visual Studio .NET, that is, calling the Lemur library directly from your own project.

After installing the Lemur toolkit, you can use the library by adding the subfolder include of the target directory to the "C/C++ / General / Additional Include Directories" property for your project. Next, add the subfolder lib of the target directory to the "Linker / General / Additional Library Directories" property for your project. Next, add lemur.lib and wsock32.lib to the "Linker / Input / Additional Dependencies" property for your project. Also, if your project is configured as "Debug", you should choose the "Multi-threaded Debug DLL (/MDd)" runtime library. If your project is configured as "Release", you should choose the "Multi-threaded DLL (/MD)" runtime library. The installable Lemur Library and applications were built in Release / Multi-Threaded mode. Finally, you should have "C/C++ / Language / Enable Run-Time Type Info" set to yes.

(3) Compiling the Lemur Toolkit with Visual Studio .NET, that is, modifying Lemur to suit your own needs, then recompiling it and calling the result.

The installer can optionally install the full Lemur Toolkit source tree, placing it in the "src_vs_2003" subfolder and/or the "src_vs_2005" subfolder of the target directory, depending on which version(s) of Visual Studio you have installed. That folder contains the Visual Studio solution file "Lemur.sln". There is a separate project file for each library and for each application in Lemur. By default the project configurations are built in "Debug" mode. To change this so that it compiles with fewer warnings and runs at higher efficiency, open the "Build" menu, choose "Configuration Manager", and in the menu for "Active Solution Configuration" choose "Release". When built from source, there is a separate library for each of the sub-libraries that are compiled into "lemur.lib". The combined library, "lemur.lib", is built in the lemur subfolder, with output in either Release or Debug, depending on configuration.

Important notes:

1. Before compiling the toolkit from source, you must set the proper include path for the Java library. To do so, in the Solution Explorer view, right-click on the "lemur_jni" project and choose "Properties". Set the "Configuration" drop-down box (at the top of the dialog box) to "All Configurations". Next, in the "Additional Include Directories" field, set the appropriate paths to your Java JDK installation's include directory and include/win32 directory. Press the "OK" button when finished, and rebuild. [If the file 'jni.h' still cannot be found, add the JDK's include and win32 directories to Additional Include Directories separately.]

2. To avoid errors such as error PRJ0008: could not delete file "e:\lemur 4.8\src_vs_2005\app\obj\vc80.pdb", or failures to open that file, change the parallel project builds setting: set "maximum number of parallel project builds" to 1. (Possibly a problem specific to CPUs with two or more cores?)

3. Lemur includes support for Arabic, which can cause character-encoding problems on a Chinese-language system, so the modules that handle Arabic need to be disabled. Find Arabic_Stemmer.cpp in the parsing module and blank out the functions in it. For functions that return void, comment out the entire function body; for functions with a return type, comment out the whole function. Note that you must not delete the module itself, because other modules call its interfaces, and removing those interfaces will make the build fail.

Reference documentation: Lemur Toolkit and Indri Search Engine Documentation
http://www.lemurproject.org/docs/index.php/Main_Page

Main contents:

Where to Begin...
    Overview
    Compiling and Installing
    Technical Details
Using the Toolkit
    Toolkit Usage Overview
    Building Indexes
    Retrieval Tasks
    Lemur Toolkit Utilities
    The Indri Query Language
    The Lemur CGI
Application Programming with the Toolkit
    Using the Lemur Toolkit with C/C++
    Using the Lemur Toolkit with C Sharp
    Using the Lemur Toolkit with Java
    Extending the Toolkit Libraries
Lemur and Indri for Multilingual Tasks
    Multilingual Overview
    Lemur/Indri and Chinese Text
    Lemur/Indri and Arabic Text
Reference
    Table of Contents
    The Lemur Toolkit API documentation
    Site Index

from: http://hi.baidu.com/gengshenspirit/blog
reference: http://blog.csdn.net:80/NewNebuladream