Exploring Full-Text Search with Lucene.Net (Continued)
2015-01-07 10:53
For the past few days I have been digging into index building and sorting with Lucene.Net.
Requirements:
1. Build an index from a generic List<T>, tokenizing a field (Field.Index.ANALYZED) only when the corresponding property of T is a String, and leaving every other field untokenized (Field.Index.NOT_ANALYZED). This is done to make sorting easier.
2. Build an index from a DataTable under the same rule: tokenize (Field.Index.ANALYZED) only String columns, leave everything else untokenized (Field.Index.NOT_ANALYZED), again for the sake of sorting.
3. Exact single-field paged search, returning a List<T>.
4. Exact multi-field paged search, returning a List<T>.
5. Multi-field search supporting both exact and fuzzy (wildcard) matching, returning a List<T>.
6. Multi-field fuzzy search with a date-range filter and sorting on any field, returning a List<T>.
A note on why generics are used for both input and output.
The main reason is extensibility:
1. Indexing is easy to extend: if callers need more fields, they only have to define their own model class; the library itself needs no changes.
2. Queries are equally easy to extend.
That's enough preamble. On to the code.
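Before the full listings, the core of requirement 1 can be sketched in isolation. The snippet below is illustrative only: the `Article` model and the `FieldIndexMode` enum are stand-ins invented for this example; the real library maps the same per-property decision onto `Field.Index.ANALYZED` / `Field.Index.NOT_ANALYZED`.

```csharp
using System;
using System.Reflection;

// Stand-in for Field.Index.ANALYZED / Field.Index.NOT_ANALYZED
enum FieldIndexMode { Analyzed, NotAnalyzed }

// Hypothetical caller-defined model: adding fields to the index
// means adding properties here, not touching the library
class Article
{
    public string Title { get; set; }
    public int Id { get; set; }
    public DateTime CreateTime { get; set; }
}

static class IndexRule
{
    // String properties get tokenized; everything else is indexed
    // whole so it remains sortable as a single term
    public static FieldIndexMode ModeFor(PropertyInfo p) =>
        p.PropertyType == typeof(string)
            ? FieldIndexMode.Analyzed
            : FieldIndexMode.NotAnalyzed;
}

class Demo
{
    static void Main()
    {
        foreach (PropertyInfo p in typeof(Article).GetProperties())
        {
            Console.WriteLine(p.Name + ": " + IndexRule.ModeFor(p));
        }
    }
}
```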
LuceneSearch.cs — the Lucene operations library
DataIndexer.cs — the data indexing library
Note: these were adapted from Java Lucene 3.0 code. They are not especially polished, but they get the job done.
MySortComparatorSource.cs
DateValComparator.cs
LuceneSearch.cs — the Lucene operations library
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Search;
using Lucene.Net.QueryParsers;
using System.Data;
using Lucene.Net.Search.Highlight;
using Lucene.Net.Analysis;
using System.IO;
using System.Reflection;

namespace QueryLucene
{
    /// <summary>
    /// Lucene search wrapper
    /// mofijeck
    /// 20150104
    /// </summary>
    public class LuceneSearch
    {
        private string indexDirectory = System.Web.HttpContext.Current.Server.MapPath("~/App_Data/index"); // default location

        /// <summary>
        /// Default constructor
        /// </summary>
        public LuceneSearch() { }

        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="filePath">falls back to ~/App_Data/index when empty</param>
        public LuceneSearch(string filePath)
        {
            if (!string.IsNullOrEmpty(filePath))
            {
                indexDirectory = System.Web.HttpContext.Current.Server.MapPath(filePath);
            }
        }

        #region Index operations

        /// <summary>
        /// Index files on disk.
        /// </summary>
        /// <param name="url">directory path</param>
        /// <param name="pattenRegex">search pattern, e.g. *.htm*</param>
        public void CreatIndex(string url, string pattenRegex)
        {
            string dataDirectory = System.Web.HttpContext.Current.Server.MapPath(url);
            IntranetIndexer indexer = new IntranetIndexer(indexDirectory);
            indexer.AddDirectory(new DirectoryInfo(dataDirectory), pattenRegex);
            indexer.Close();
        }

        /// <summary>
        /// Create an index from a List&lt;T&gt;.
        /// </summary>
        public void CreatIndexByData<T>(List<T> list)
        {
            DataIndexer indexer = new DataIndexer(indexDirectory);
            indexer.AddHtmlData(list);
            indexer.Close();
        }

        /// <summary>
        /// Create an index from a DataTable.
        /// </summary>
        public void CreatIndexByData(DataTable dt)
        {
            DataIndexer indexer = new DataIndexer(indexDirectory);
            indexer.AddHtmlData(dt);
            indexer.Close();
        }

        /// <summary>
        /// Append to an existing index from a List&lt;T&gt;.
        /// </summary>
        public void UpdateIndexByData<T>(List<T> list)
        {
            DataIndexer indexer = new DataIndexer(indexDirectory, false);
            indexer.AddHtmlData(list);
            indexer.Close();
        }

        /// <summary>
        /// Append to an existing index from a DataTable.
        /// </summary>
        public void UpdateIndexByData(DataTable dt)
        {
            DataIndexer indexer = new DataIndexer(indexDirectory, false);
            indexer.AddHtmlData(dt);
            indexer.Close();
        }

        /// <summary>
        /// Delete a document by its unique id value.
        /// </summary>
        /// <param name="id"></param>
        public void deleteHtmlDocument(int id)
        {
            DataIndexer indexer = new DataIndexer(indexDirectory, false);
            indexer.deleteHtmlDocument(id);
            indexer.Close();
        }

        /// <summary>
        /// Update a document matched by field value -- ideally a unique id.
        /// </summary>
        /// <typeparam name="T"></typeparam>
        /// <param name="lt"></param>
        /// <param name="colName"></param>
        /// <param name="colValue"></param>
        public void updateHtmlDocument<T>(T lt, string colName, string colValue)
        {
            DataIndexer indexer = new DataIndexer(indexDirectory, false);
            indexer.updateHtmlDocument<T>(lt, colName, colValue);
            indexer.Close();
        }

        /// <summary>
        /// Delete the whole index.
        /// </summary>
        public void DeleteIndex()
        {
            DataIndexer indexer = new DataIndexer(indexDirectory);
            indexer.Delete();
            indexer.Close();
        }

        #endregion

        #region Search

        #region Single-field search

        /// <summary>
        /// Single-field search; best suited to exact queries.
        /// Example: Search&lt;Model&gt;("元旦到了", "title", 10, 1);
        /// </summary>
        /// <typeparam name="T">result model</typeparam>
        /// <param name="q">query text</param>
        /// <param name="colname">field to search</param>
        /// <param name="pageSize">page size</param>
        /// <param name="page">current page (1-based)</param>
        /// <returns></returns>
        public List<T> Search<T>(string q, string colname, int pageSize, int page)
        {
            List<T> list = new List<T>();
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            // index is placed in the "index" subdirectory
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            // parse the query against the given field
            var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, colname, analyzer);
            Query query = parser.Parse(q);
            TopDocs hits = searcher.Search(query, 200);
            int count = hits.TotalHits;

            // highlighter for matched fragments
            IFormatter formatter = new SimpleHTMLFormatter("<span style=\"font-weight:bold;\">", "</span>");
            SimpleFragmenter fragmenter = new SimpleFragmenter(80);
            QueryScorer scorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(formatter, scorer);
            highlighter.TextFragmenter = fragmenter;

            // paging; the original computed these bounds but then looped over every hit,
            // so the page arguments had no effect -- fixed here
            int startRecord = (page - 1) * pageSize;
            int endRecord = page * pageSize > count ? count : page * pageSize;
            if (endRecord > hits.ScoreDocs.Length) { endRecord = hits.ScoreDocs.Length; }
            for (int i = startRecord; i < endRecord; i++)
            {
                ScoreDoc sd = hits.ScoreDocs[i];
                Document doc = searcher.Doc(sd.Doc);
                TokenStream stream = analyzer.TokenStream("", new StringReader(doc.Get(colname)));
                String highText = highlighter.GetBestFragments(stream, doc.Get(colname), 2, "...");
                Type type = typeof(T);
                T t = Activator.CreateInstance<T>();
                foreach (PropertyInfo p in type.GetProperties())
                {
                    if (p.Name == colname)
                    {
                        p.SetValue(t, highText, null);
                    }
                    else
                    {
                        p.SetValue(t, doc.Get(p.Name), null);
                    }
                }
                list.Add(t);
            }
            searcher.Dispose();
            return list;
        }

        public int getSearchCount(string q, string colname)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            var parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_30, colname, analyzer);
            Query query = parser.Parse(q);
            TopDocs hits = searcher.Search(query, 200);
            return hits.TotalHits;
        }

        #endregion

        #region Multi-field exact search

        /// <summary>
        /// Multi-field exact search.
        /// </summary>
        /// <typeparam name="T">result model</typeparam>
        /// <param name="q">query terms</param>
        /// <param name="colname">fields to search</param>
        /// <param name="pageSize">page size</param>
        /// <param name="page">current page (1-based)</param>
        /// <returns></returns>
        public List<T> Search<T>(string[] q, string[] colname, int pageSize, int page)
        {
            List<T> list = new List<T>();
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            // every field must match
            Occur[] occ = new Occur[colname.Length];
            for (int i = 0; i < colname.Length; i++) { occ[i] = Occur.MUST; }
            Query query = MultiFieldQueryParser.Parse(Lucene.Net.Util.Version.LUCENE_30, q, colname, occ, analyzer);
            TopDocs hits = searcher.Search(query, 200);
            int count = hits.TotalHits;

            // highlighter for matched fragments
            IFormatter formatter = new SimpleHTMLFormatter("<span style=\"font-weight:bold;\">", "</span>");
            SimpleFragmenter fragmenter = new SimpleFragmenter(80);
            QueryScorer scorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(formatter, scorer);
            highlighter.TextFragmenter = fragmenter;

            // paging; same fix as the single-field overload -- the original ignored these bounds
            int startRecord = (page - 1) * pageSize;
            int endRecord = page * pageSize > count ? count : page * pageSize;
            if (endRecord > hits.ScoreDocs.Length) { endRecord = hits.ScoreDocs.Length; }
            for (int i = startRecord; i < endRecord; i++)
            {
                ScoreDoc sd = hits.ScoreDocs[i];
                Document doc = searcher.Doc(sd.Doc);
                Type type = typeof(T);
                T t = Activator.CreateInstance<T>();
                foreach (PropertyInfo p in type.GetProperties())
                {
                    if (Array.IndexOf(colname, p.Name) < 0)
                    {
                        p.SetValue(t, doc.Get(p.Name), null);
                    }
                    else
                    {
                        TokenStream stream = analyzer.TokenStream("", new StringReader(doc.Get(p.Name)));
                        String highText = highlighter.GetBestFragments(stream, doc.Get(p.Name), 2, "...");
                        p.SetValue(t, highText, null);
                    }
                }
                list.Add(t);
            }
            searcher.Dispose();
            return list;
        }

        public int getSearchCount(string[] q, string[] colname)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            Occur[] occ = new Occur[colname.Length];
            for (int i = 0; i < colname.Length; i++) { occ[i] = Occur.MUST; }
            Query query = MultiFieldQueryParser.Parse(Lucene.Net.Util.Version.LUCENE_30, q, colname, occ, analyzer);
            TopDocs hits = searcher.Search(query, 200);
            return hits.TotalHits;
        }

        #endregion

        #region Multi-field fuzzy search

        /// <summary>
        /// Multi-field fuzzy search.
        /// A term wrapped in * is treated as a wildcard, otherwise it is matched exactly:
        /// string[] stringQuery = { "test", "*旦*" };
        /// string[] fields = { "text", "title" };
        /// SearchLike&lt;Model&gt;(stringQuery, fields, 10, 1);
        /// </summary>
        /// <typeparam name="T">result model</typeparam>
        /// <param name="q">query terms; fuzzy: {"*text*"}, exact: {"text"}</param>
        /// <param name="colname">field names, e.g. {"field"}</param>
        /// <param name="pageSize">page size</param>
        /// <param name="page">current page (1-based)</param>
        /// <returns></returns>
        public List<T> SearchLike<T>(string[] q, string[] colname, int pageSize, int page)
        {
            List<T> list = new List<T>();
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            BooleanQuery query = new BooleanQuery();
            int coli = 0;
            foreach (string col in colname)
            {
                query.Add(new WildcardQuery(new Term(col, q[coli])), Occur.MUST);
                coli++;
            }
            TopDocs hits = searcher.Search(query, searcher.MaxDoc);
            int count = hits.TotalHits;

            // highlighter for matched fragments
            IFormatter formatter = new SimpleHTMLFormatter("<span style=\"font-weight:bold;\">", "</span>");
            SimpleFragmenter fragmenter = new SimpleFragmenter(80);
            QueryScorer scorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(formatter, scorer);
            highlighter.TextFragmenter = fragmenter;

            int startRecord = (page - 1) * pageSize; // first record
            int endRecord = page * pageSize;         // last record (exclusive)
            if (endRecord > count) { endRecord = count; }
            for (int i = startRecord; i < endRecord; i++)
            {
                ScoreDoc sd = hits.ScoreDocs[i];
                Document doc = searcher.Doc(sd.Doc);
                Type type = typeof(T);
                T t = Activator.CreateInstance<T>();
                foreach (PropertyInfo p in type.GetProperties())
                {
                    if (Array.IndexOf(colname, p.Name) < 0)
                    {
                        p.SetValue(t, doc.Get(p.Name), null);
                    }
                    else
                    {
                        TokenStream stream = analyzer.TokenStream("", new StringReader(q[Array.IndexOf(colname, p.Name)]));
                        String highText = highlighter.GetBestFragments(stream, doc.Get(p.Name), 2, "...");
                        p.SetValue(t, highText, null);
                    }
                }
                list.Add(t);
            }
            searcher.Dispose();
            return list;
        }

        public int getSearchCountLike(string[] q, string[] colname)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            BooleanQuery query = new BooleanQuery();
            int i = 0;
            foreach (string col in colname)
            {
                query.Add(new WildcardQuery(new Term(col, q[i])), Occur.MUST);
                i++;
            }
            // TotalHits is accurate even though only one document is retrieved
            TopDocs hits = searcher.Search(query, 1);
            return hits.TotalHits;
        }

        #endregion

        #region Multi-field fuzzy search with date range and sorting

        /// <summary>
        /// Multi-field fuzzy search, filtered to a date range and sorted on any field.
        /// </summary>
        /// <typeparam name="T">result model</typeparam>
        /// <param name="q">query terms</param>
        /// <param name="colname">fields to search</param>
        /// <param name="rangeCol">date field for the range filter</param>
        /// <param name="startTime"></param>
        /// <param name="endTime"></param>
        /// <param name="sortcol">field to sort on</param>
        /// <param name="sortType">BYTE CUSTOM DOC DOUBLE FLOAT INT LONG SCORE SHORT STRING STRING_VAL</param>
        /// <param name="desc">true sorts in reverse</param>
        /// <param name="pageSize">page size</param>
        /// <param name="page">current page (1-based)</param>
        /// <returns></returns>
        public List<T> SearchLike<T>(string[] q, string[] colname, string rangeCol, DateTime? startTime, DateTime? endTime, string sortcol, string sortType, bool desc, int pageSize, int page)
        {
            List<T> list = new List<T>();
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            BooleanQuery query = new BooleanQuery();

            // date range; note this only matches if rangeCol was indexed as yyyy-MM-dd strings
            bool isstartdate = startTime.HasValue, isenddate = endTime.HasValue;
            if (isstartdate || isenddate)
            {
                string tstart = isstartdate ? Convert.ToDateTime(startTime).ToString("yyyy-MM-dd") : DateTime.MinValue.ToString("yyyy-MM-dd");
                string tend = isenddate ? Convert.ToDateTime(endTime).ToString("yyyy-MM-dd") : DateTime.MaxValue.ToString("yyyy-MM-dd");
                Query TimeRangequery = new TermRangeQuery(rangeCol, tstart, tend, true, true);
                query.Add(TimeRangequery, Occur.MUST);
            }

            int coli = 0;
            foreach (string col in colname)
            {
                query.Add(new WildcardQuery(new Term(col, q[coli])), Occur.MUST);
                coli++;
            }

            // sort condition; desc = true means reverse order
            Sort sort = new Sort();
            SortField sf = new SortField(sortcol, getSortField(sortType), desc);
            sort.SetSort(sf);

            TopDocs hits = searcher.Search(query, null, searcher.MaxDoc, sort);
            int count = hits.TotalHits;

            // highlighter for matched fragments
            IFormatter formatter = new SimpleHTMLFormatter("<span style=\"font-weight:bold;\">", "</span>");
            SimpleFragmenter fragmenter = new SimpleFragmenter(80);
            QueryScorer scorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(formatter, scorer);
            highlighter.TextFragmenter = fragmenter;

            int startRecord = (page - 1) * pageSize; // first record
            int endRecord = page * pageSize;         // last record (exclusive)
            if (endRecord > count) { endRecord = count; }
            for (int i = startRecord; i < endRecord; i++)
            {
                ScoreDoc sd = hits.ScoreDocs[i];
                Document doc = searcher.Doc(sd.Doc);
                Type type = typeof(T);
                T t = Activator.CreateInstance<T>();
                foreach (PropertyInfo p in type.GetProperties())
                {
                    if (Array.IndexOf(colname, p.Name) < 0)
                    {
                        p.SetValue(t, doc.Get(p.Name), null);
                    }
                    else
                    {
                        TokenStream stream = analyzer.TokenStream("", new StringReader(q[Array.IndexOf(colname, p.Name)]));
                        String highText = highlighter.GetBestFragments(stream, doc.Get(p.Name), 2, "...");
                        p.SetValue(t, highText, null);
                    }
                }
                list.Add(t);
            }
            searcher.Dispose();
            return list;
        }

        public int getSearchCountLike(string[] q, string[] colname, string rangeCol, DateTime? startTime, DateTime? endTime)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            BooleanQuery query = new BooleanQuery();

            // date range
            bool isstartdate = startTime.HasValue, isenddate = endTime.HasValue;
            if (isstartdate || isenddate)
            {
                string tstart = isstartdate ? Convert.ToDateTime(startTime).ToString("yyyy-MM-dd") : DateTime.MinValue.ToString("yyyy-MM-dd");
                string tend = isenddate ? Convert.ToDateTime(endTime).ToString("yyyy-MM-dd") : DateTime.MaxValue.ToString("yyyy-MM-dd");
                Query TimeRangequery = new TermRangeQuery(rangeCol, tstart, tend, true, true);
                query.Add(TimeRangequery, Occur.MUST);
            }

            int i = 0;
            foreach (string col in colname)
            {
                query.Add(new WildcardQuery(new Term(col, q[i])), Occur.MUST);
                i++;
            }
            TopDocs hits = searcher.Search(query, 1);
            return hits.TotalHits;
        }

        #endregion

        #region Multi-field search with date range and sorting

        /// <summary>
        /// Multi-field search, filtered to a date range and sorted on any field.
        /// </summary>
        /// <typeparam name="T">result model</typeparam>
        /// <param name="q">query terms</param>
        /// <param name="colname">fields to search</param>
        /// <param name="rangeCol">date field for the range filter</param>
        /// <param name="startTime"></param>
        /// <param name="endTime"></param>
        /// <param name="sortcol">field to sort on</param>
        /// <param name="sortType">BYTE CUSTOM DOC DOUBLE FLOAT INT LONG SCORE SHORT STRING STRING_VAL</param>
        /// <param name="desc">true sorts in reverse</param>
        /// <param name="pageSize">page size</param>
        /// <param name="page">current page (1-based)</param>
        /// <returns></returns>
        public List<T> Search<T>(string[] q, string[] colname, string rangeCol, DateTime? startTime, DateTime? endTime, string sortcol, string sortType, bool desc, int pageSize, int page)
        {
            List<T> list = new List<T>();
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            BooleanQuery Booleanquery = new BooleanQuery();
            Occur[] occ = new Occur[colname.Length];
            for (int i = 0; i < colname.Length; i++) { occ[i] = Occur.MUST; }
            Query query = MultiFieldQueryParser.Parse(Lucene.Net.Util.Version.LUCENE_30, q, colname, occ, analyzer);
            Booleanquery.Add(query, Occur.MUST);

            // date range; note this only matches if rangeCol was indexed as yyyy-MM-dd strings
            if (!string.IsNullOrEmpty(rangeCol))
            {
                bool isstartdate = startTime.HasValue, isenddate = endTime.HasValue;
                if (isstartdate || isenddate)
                {
                    string tstart = isstartdate ? Convert.ToDateTime(startTime).ToString("yyyy-MM-dd") : DateTime.MinValue.ToString("yyyy-MM-dd");
                    string tend = isenddate ? Convert.ToDateTime(endTime).ToString("yyyy-MM-dd") : DateTime.MaxValue.ToString("yyyy-MM-dd");
                    Query TimeRangequery = new TermRangeQuery(rangeCol, tstart, tend, true, true);
                    Booleanquery.Add(TimeRangequery, Occur.MUST);
                }
            }

            // sort condition; desc = true means reverse order
            Sort sort = new Sort();
            SortField sf = new SortField(sortcol, getSortField(sortType), desc);
            sort.SetSort(sf);

            TopDocs hits = searcher.Search(Booleanquery, null, searcher.MaxDoc, sort);
            int count = hits.TotalHits;

            // highlighter for matched fragments
            IFormatter formatter = new SimpleHTMLFormatter("<span style=\"font-weight:bold;\">", "</span>");
            SimpleFragmenter fragmenter = new SimpleFragmenter(80);
            QueryScorer scorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(formatter, scorer);
            highlighter.TextFragmenter = fragmenter;

            int startRecord = (page - 1) * pageSize; // first record
            int endRecord = page * pageSize;         // last record (exclusive)
            if (endRecord > count) { endRecord = count; }
            for (int i = startRecord; i < endRecord; i++)
            {
                ScoreDoc sd = hits.ScoreDocs[i];
                Document doc = searcher.Doc(sd.Doc);
                Type type = typeof(T);
                T t = Activator.CreateInstance<T>();
                foreach (PropertyInfo p in type.GetProperties())
                {
                    if (Array.IndexOf(colname, p.Name) < 0)
                    {
                        p.SetValue(t, doc.Get(p.Name), null);
                    }
                    else
                    {
                        TokenStream stream = analyzer.TokenStream("", new StringReader(doc.Get(p.Name)));
                        String highText = highlighter.GetBestFragments(stream, doc.Get(p.Name), 2, "...");
                        p.SetValue(t, highText, null);
                    }
                }
                list.Add(t);
            }
            searcher.Dispose();
            return list;
        }

        public int getSearchCount(string[] q, string[] colname, string rangeCol, DateTime? startTime, DateTime? endTime)
        {
            var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
            IndexSearcher searcher = new IndexSearcher(FSDirectory.Open(indexDirectory));
            BooleanQuery Booleanquery = new BooleanQuery();
            Occur[] occ = new Occur[colname.Length];
            for (int i = 0; i < colname.Length; i++) { occ[i] = Occur.MUST; }
            Query query = MultiFieldQueryParser.Parse(Lucene.Net.Util.Version.LUCENE_30, q, colname, occ, analyzer);
            Booleanquery.Add(query, Occur.MUST);

            // date range
            if (!string.IsNullOrEmpty(rangeCol))
            {
                bool isstartdate = startTime.HasValue, isenddate = endTime.HasValue;
                if (isstartdate || isenddate)
                {
                    string tstart = isstartdate ? Convert.ToDateTime(startTime).ToString("yyyy-MM-dd") : DateTime.MinValue.ToString("yyyy-MM-dd");
                    string tend = isenddate ? Convert.ToDateTime(endTime).ToString("yyyy-MM-dd") : DateTime.MaxValue.ToString("yyyy-MM-dd");
                    Query TimeRangequery = new TermRangeQuery(rangeCol, tstart, tend, true, true);
                    Booleanquery.Add(TimeRangequery, Occur.MUST);
                }
            }
            TopDocs hits = searcher.Search(Booleanquery, 200);
            return hits.TotalHits;
        }

        #endregion

        #region SortField type constants

        // SortField constants in Lucene 3.0:
        // SCORE = 0, DOC = 1, STRING = 3, INT = 4, FLOAT = 5, LONG = 6,
        // DOUBLE = 7, SHORT = 8, CUSTOM = 9, BYTE = 10, STRING_VAL = 11
        protected int getSortField(string type)
        {
            switch (type.ToUpper())
            {
                case "BYTE": return 10;
                case "CUSTOM": return 9;
                case "DOC": return 1;
                case "DOUBLE": return 7;
                case "FLOAT": return 5;
                case "INT": return 4;
                case "LONG": return 6;
                case "SCORE": return 0;
                case "SHORT": return 8;
                case "STRING": return 3;
                case "STRING_VAL": return 11;
                default: return 3;
            }
        }

        #endregion

        #endregion
    }
}
DataIndexer.cs — the data indexing library
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.IO;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;
using Lucene.Net.QueryParsers;
using System.Reflection;
using System.Data;

namespace QueryLucene
{
    public class DataIndexer
    {
        private IndexWriter writer;

        public DataIndexer(string directory)
        {
            writer = new IndexWriter(FSDirectory.Open(directory), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), true, IndexWriter.MaxFieldLength.LIMITED);
            writer.UseCompoundFile = true;
        }

        /// <summary>
        /// Open the index; create = true rebuilds it, create = false appends to the existing index.
        /// </summary>
        /// <param name="directory"></param>
        /// <param name="create">true rebuilds the index, false appends</param>
        public DataIndexer(string directory, bool create)
        {
            writer = new IndexWriter(FSDirectory.Open(directory), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), create, IndexWriter.MaxFieldLength.LIMITED);
            writer.UseCompoundFile = true;
        }

        public void AddHtmlData(DataTable dt)
        {
            foreach (DataRow dr in dt.Rows)
            {
                AddHtmlDocument(dt, dr);
            }
        }

        public void AddHtmlData<T>(List<T> list)
        {
            foreach (T t in list)
            {
                AddHtmlDocument(t);
            }
        }

        /// <summary>
        /// Builds a document from the public properties of T and indexes it.
        /// </summary>
        public void AddHtmlDocument<T>(T lt)
        {
            Document doc = new Document();
            // walk the property collection
            Type type = typeof(T);
            foreach (PropertyInfo p in type.GetProperties())
            {
                string typep = p.PropertyType.Name;
                if (typep == "String")
                {
                    doc.Add(new Field(p.Name, p.GetValue(lt, null).ToString(), Field.Store.YES, Field.Index.ANALYZED));
                }
                else
                {
                    // non-string fields stay untokenized so they sort correctly
                    doc.Add(new Field(p.Name, p.GetValue(lt, null).ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                }
            }
            writer.AddDocument(doc);
        }

        public void AddHtmlDocument(DataTable dt, DataRow dr)
        {
            Document doc = new Document();
            for (int i = 0; i < dt.Columns.Count; i++)
            {
                string colName = dt.Columns[i].ColumnName;
                string colType = dt.Columns[i].DataType.Name;
                if (colType == "String")
                {
                    doc.Add(new Field(colName, dr[colName].ToString(), Field.Store.YES, Field.Index.ANALYZED));
                }
                else
                {
                    doc.Add(new Field(colName, dr[colName].ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                }
            }
            writer.AddDocument(doc);
        }

        /// <summary>
        /// Delete by the unique id field.
        /// </summary>
        /// <param name="id">unique id value</param>
        public void deleteHtmlDocument(int id)
        {
            Term term = new Term("id", id.ToString());
            writer.DeleteDocuments(term);
        }

        public void updateHtmlDocument<T>(T lt, string colName, string colValue)
        {
            Term term = new Term(colName, colValue);
            Document doc = new Document();
            Type type = typeof(T);
            foreach (PropertyInfo p in type.GetProperties())
            {
                doc.Add(new Field(p.Name, p.GetValue(lt, null).ToString(), Field.Store.YES, Field.Index.ANALYZED));
            }
            // UpdateDocument already deletes the documents matching term before adding
            // the new one; the original also called DeleteDocuments(term) afterwards,
            // which would wipe out the freshly added document -- removed here
            writer.UpdateDocument(term, doc);
        }

        /// <summary>
        /// Optimizes and saves the index.
        /// </summary>
        public void Close()
        {
            writer.Optimize();
            writer.Dispose();
        }

        public void Delete()
        {
            writer.DeleteAll();
        }

        ///// <summary>
        ///// Usage example: updating the Lucene index from a DataTable
        ///// </summary>
        //protected void updateTableLucene()
        //{
        //    LuceneSearch ls = new LuceneSearch();
        //    DataTable t = new DataTable("Students");
        //    t.Columns.Add("text", System.Type.GetType("System.String"));
        //    t.Columns.Add("title", System.Type.GetType("System.String"));
        //    for (int i = 0; i < 1000; i++)
        //    {
        //        DataRow r = t.NewRow();
        //        r["text"] = "test";
        //        r["title"] = "元旦到了";
        //        t.Rows.Add(r);
        //    }
        //    ls.UpdateIndexByData(t);
        //}

        ///// <summary>
        ///// Usage example: creating the index from a List
        ///// </summary>
        //protected void creatLucene()
        //{
        //    DateTime dt = DateTime.Now;
        //    List<luseneTxt> l = new List<luseneTxt>();
        //    luseneTxt m; // was used without a declaration in the original
        //    bool flag = true;
        //    int i = 0;
        //    LuceneSearch ls = new LuceneSearch("~/App_Data/Files");
        //    while (flag)
        //    {
        //        m = new luseneTxt();
        //        m.text = "test";
        //        m.path = "http://www.baidu.com/?i=";
        //        m.title = "mofijeck ";
        //        m.des = "12";
        //        m.keyword = "34";
        //        m.id = i.ToString();
        //        m.createTime = DateTime.Now;
        //        l.Add(m);
        //        i++;
        //        m = null;
        //        if (i == 99999)
        //        {
        //            flag = false;
        //        }
        //    }
        //    ls.CreatIndexByData<luseneTxt>(l);
        //    l = new List<luseneTxt>();
        //    TimeSpan ts = DateTime.Now - dt;
        //}
    }
}
IntranetIndexer.cs — the file indexing library (this class comes from the Lucene.Net demo)
/*
 * Copyright 2012 dotlucene.net
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
using System.IO;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

namespace QueryLucene
{
    /// <summary>
    /// Summary description for Indexer.
    /// </summary>
    public class IntranetIndexer
    {
        private IndexWriter writer;
        private string docRootDirectory;
        private string pattern;

        /// <summary>
        /// Creates a new index in <c>directory</c>. Overwrites the existing index in that directory.
        /// </summary>
        /// <param name="directory">Path to index (will be created if not existing).</param>
        public IntranetIndexer(string directory)
        {
            writer = new IndexWriter(FSDirectory.Open(directory), new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.LIMITED);
            writer.UseCompoundFile = true;
        }

        /// <summary>
        /// Add HTML files from <c>directory</c> and its subdirectories that match <c>pattern</c>.
        /// </summary>
        /// <param name="directory">Directory with the HTML files.</param>
        /// <param name="pattern">Search pattern, e.g. <c>"*.html"</c></param>
        public void AddDirectory(DirectoryInfo directory, string pattern)
        {
            this.docRootDirectory = directory.FullName;
            this.pattern = pattern;
            addSubDirectory(directory);
        }

        private void addSubDirectory(DirectoryInfo directory)
        {
            foreach (FileInfo fi in directory.GetFiles(pattern))
            {
                AddHtmlDocument(fi.FullName);
            }
            foreach (DirectoryInfo di in directory.GetDirectories())
            {
                addSubDirectory(di);
            }
        }

        /// <summary>
        /// Loads, parses and indexes an HTML file.
        /// </summary>
        /// <param name="path"></param>
        public void AddHtmlDocument(string path)
        {
            Document doc = new Document();
            string html;
            using (StreamReader sr = new StreamReader(path, System.Text.Encoding.Default))
            {
                html = sr.ReadToEnd();
            }
            int relativePathStartsAt = this.docRootDirectory.EndsWith("\\") ? this.docRootDirectory.Length : this.docRootDirectory.Length + 1;
            string relativePath = path.Substring(relativePathStartsAt);
            doc.Add(new Field("text", ParseHtml(html), Field.Store.YES, Field.Index.ANALYZED));
            doc.Add(new Field("path", relativePath, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.Add(new Field("title", GetTitle(html), Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }

        /// <summary>
        /// Very simple, inefficient, and memory consuming HTML parser. Take a look at Demo/HtmlParser in DotLucene package for a better HTML parser.
        /// </summary>
        /// <param name="html">HTML document</param>
        /// <returns>Plain text.</returns>
        private static string ParseHtml(string html)
        {
            string temp = Regex.Replace(html, "<[^>]*>", "");
            return temp.Replace("&nbsp;", " ");
        }

        /// <summary>
        /// Finds a title of HTML file. Doesn't work if the title spans two or more lines.
        /// </summary>
        /// <param name="html">HTML document.</param>
        /// <returns>Title string.</returns>
        private static string GetTitle(string html)
        {
            Match m = Regex.Match(html, "<title>(.*)</title>");
            if (m.Groups.Count == 2)
                return m.Groups[1].Value;
            return "(unknown)";
        }

        /// <summary>
        /// Optimizes and save the index.
        /// </summary>
        public void Close()
        {
            writer.Optimize();
            writer.Dispose();
        }

        public void Delete()
        {
            writer.DeleteAll();
        }
    }
}
Here is a bonus class for sorting by date. The sort types Lucene.Net 3.0 supports are: BYTE, CUSTOM, DOC, DOUBLE, FLOAT, INT, LONG, SCORE, SHORT, STRING, STRING_VAL.
MySortComparatorSource.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Lucene.Net.Search;
using Lucene.Net.Index;
using System.Collections;

namespace QueryLucene
{
    // Inherit FieldComparatorSource and hand back our custom DateValComparator,
    // passing along the hit count and field name
    public class MySortComparatorSource : FieldComparatorSource
    {
        public override FieldComparator NewComparator(string fieldname, int numHits, int sortPos, bool reversed)
        {
            return new DateValComparator(numHits, fieldname);
        }
    }
}
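To plug the comparator in, pass it to a `SortField` through the overload that takes a `FieldComparatorSource`. A usage sketch, assuming the index stores a `createTime` field and that `searcher` and `query` are set up as in LuceneSearch above:

```csharp
// Sort results by the createTime field using the custom date comparator;
// the final 'true' requests descending order.
Sort sort = new Sort(new SortField("createTime", new MySortComparatorSource(), true));
TopDocs hits = searcher.Search(query, null, searcher.MaxDoc, sort);
```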
DateValComparator.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Lucene.Net.Index;
using Lucene.Net.Search;

namespace QueryLucene
{
    // Custom sort class: DateValComparator derives from FieldComparator and implements its methods
    public class DateValComparator : FieldComparator
    {
        private string[] values;
        private string[] currentReaderValues;
        private string field;
        private string bottom;

        public DateValComparator(int numHits, string field)
        {
            this.values = new String[numHits];
            this.field = field;
        }

        #region Overrides

        // Step 3: while the hit count stays within the configured queue size,
        // entries in the priority queue are compared with this method
        public override int Compare(int slot1, int slot2)
        {
            try
            {
                DateTime? val1 = null, val2 = null;
                try { val1 = Convert.ToDateTime(this.values[slot1]); } catch { }
                try { val2 = Convert.ToDateTime(this.values[slot2]); } catch { }
                if (null == val1)
                {
                    if (null == val2) { return 0; }
                    return -1;
                }
                if (null == val2) { return 1; }
                return DateTime.Compare(Convert.ToDateTime(val1), Convert.ToDateTime(val2));
            }
            catch
            {
                return 0;
            }
        }

        // Lucene keeps hits in a priority queue ordered via Compare; once the hit
        // count exceeds the configured queue size, this method is called (step 2)
        public override int CompareBottom(int doc)
        {
            try
            {
                DateTime? val2 = null, tempBottom = null;
                try { val2 = Convert.ToDateTime(this.currentReaderValues[doc]); } catch { }
                try { tempBottom = Convert.ToDateTime(this.bottom); } catch { }
                if (tempBottom == null)
                {
                    if (val2 == null) { return 0; }
                    return -1;
                }
                if (val2 == null) { return 1; }
                return DateTime.Compare(Convert.ToDateTime(val2), Convert.ToDateTime(tempBottom));
            }
            catch
            {
                return 0;
            }
        }

        // Step 2: copies currentReaderValues[doc] into the matching slot of values;
        // note it is invoked twice on the first call
        public override void Copy(int slot, int doc)
        {
            this.values[slot] = this.currentReaderValues[doc];
        }

        // Called first once the hit count exceeds the configured queue size
        public override void SetBottom(int slot)
        {
            this.bottom = this.values[slot];
        }

        // Step 1: fetches every indexed value of the field and initializes
        // currentReaderValues; docBase determines how many arrays are needed
        public override void SetNextReader(IndexReader reader, int docBase)
        {
            this.currentReaderValues = FieldCache_Fields.DEFAULT.GetStrings(reader, this.field);
        }

        public override IComparable this[int slot]
        {
            get { return this.values[slot]; }
        }

        #endregion
    }
}