
Getting to Know Lucene (6): Entity Linking with Lucene and LingPipe - Building the Entity Index with LingPipe

2016-01-13 22:53
The previous post used Lucene to build two indexes: the ambiguous-entity mapping index and the ambiguous-entity context index.

What is still missing is an index of the entities themselves; without it, there is no way to look entities up.

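LingPipe is a natural fit for entity recognition and can be used in many ways; see the official site for details (the link is given at the end of this post).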

There is not much to explain about building the entity index with LingPipe, so here is the code:

// Imports required by this method
import java.io.BufferedReader;
import java.io.FileReader;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.ExactDictionaryChunker;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

// Builds entityDictionaryChunkerFF, a LingPipe dictionary chunker over all entities.
// entityDictionaryChunkerFF is a static field declared elsewhere in the enclosing class.
public static void BuildEntityDictionary() throws Exception
{
    double CHUNK_SCORE = 1.0;
    //String entityPath="E:/LuceneDocument/long_abstracts_preprocessing_entity(file_contents_examples).txt";
    String entityPath="E:/LuceneDocument/long_abstracts_preprocessing_entity.txt";

    MapDictionary<String> dictionary = new MapDictionary<String>();

    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader br=new BufferedReader(new FileReader(entityPath)))
    {
        String entity;
        int i=0;
        while ((entity=br.readLine())!=null)
        {
            i++;
            if(i>500000) // about 4.63 million entities in total; only the first 500,000 are loaded here
            {
                break;
            }
            System.out.println(i+"=>"+entity);
            dictionary.addEntry(new DictionaryEntry<String>(entity,"DBpedia_entity",CHUNK_SCORE));
        }
    }

    // "FF": returnAllMatches is false, caseSensitive is false
    entityDictionaryChunkerFF = new ExactDictionaryChunker(dictionary,
                                     IndoEuropeanTokenizerFactory.INSTANCE,
                                     false,false);
    // For the difference between the settings, see http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html
    // FF can recognize "German Empire", but TF can't

    System.out.println("dictionary size:\n" + dictionary.size());
}
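
To make the two flags passed to the chunker more concrete, here is a rough plain-Java sketch of what case-insensitive, longest-match dictionary lookup looks like. It is only an illustration and not LingPipe's implementation: the class name `LongestMatchSketch`, the method `findEntities`, the `maxTokens` limit, and the whitespace tokenization are all assumptions made for this example (the code above uses `IndoEuropeanTokenizerFactory` instead), and it assumes the non-overlapping, longest-match behaviour that the LingPipe tutorial describes for the "all matches is false" setting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of LingPipe.
class LongestMatchSketch {

    // Scans the text left to right and, at each position, keeps only the
    // longest dictionary phrase that starts there (case-insensitive).
    static List<String> findEntities(String text, Set<String> dictionary, int maxTokens) {
        // Lower-case the dictionary once so lookups ignore case.
        Set<String> lowered = new HashSet<>();
        for (String phrase : dictionary) {
            lowered.add(phrase.toLowerCase(Locale.ROOT));
        }

        // Assumption: tokens are separated by whitespace only.
        String[] tokens = text.trim().split("\\s+");
        List<String> matches = new ArrayList<>();

        int i = 0;
        while (i < tokens.length) {
            int matchedLength = 0;
            // Try the longest candidate first, then progressively shorter ones.
            for (int len = Math.min(maxTokens, tokens.length - i); len >= 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + len));
                if (lowered.contains(candidate.toLowerCase(Locale.ROOT))) {
                    matches.add(candidate);
                    matchedLength = len;
                    break;
                }
            }
            // Skip past a match so that matches never overlap.
            i += (matchedLength > 0) ? matchedLength : 1;
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("German Empire", "German", "Berlin"));
        List<String> found = findEntities(
                "Berlin was the capital of the german empire", dictionary, 3);
        System.out.println(found); // prints [Berlin, german empire]
    }
}
```

In this sketch "german empire" is returned as a single entity even though "German" is also in the dictionary, because the longer phrase is tried first, and the lower-case input still matches because both sides are lower-cased before comparison.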


With the entity index in place, entity linking becomes possible; see the next post.

References:

[1] Mendes P N, Jakob M, García-Silva A, et al. DBpedia Spotlight: Shedding light on the web of documents[C]// Proceedings of the 7th International Conference on Semantic Systems. ACM, 2011: 1-8.

[2] Han X, Sun L. A Generative Entity-Mention Model for Linking Entities with Knowledge Base[J]. Proceedings of ACL, 2011: 945-954.

[3] http://lucene.apache.org/

[4] http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html

[5] http://wiki.dbpedia.org/Downloads2014

[6] http://www.oschina.net/p/jieba (Jieba Chinese word segmentation)