Using Solr to index attachments (Word, PDF, TXT, and other files)
2016-11-03 21:43
The official ContentStreamUpdateRequest example:
package javaapplicationsolrcell;

import java.io.File;
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;

import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

/**
 * @author EDaniel
 */
public class SolrExampleTests {

  public static void main(String[] args) {
    try {
      //Solr cell can also index MS file (2003 version and 2007 version) types.
      String fileName = "c:/Sample.pdf";
      //this will be unique Id used by Solr to index the file contents.
      String solrId = "Sample.pdf";

      indexFilesSolrCell(fileName, solrId);

    } catch (Exception ex) {
      System.out.println(ex.toString());
    }
  }

  /**
   * Method to index all types of files into Solr.
   * @param fileName
   * @param solrId
   * @throws IOException
   * @throws SolrServerException
   */
  public static void indexFilesSolrCell(String fileName, String solrId)
    throws IOException, SolrServerException {

    String urlString = "http://localhost:8983/solr";
    SolrServer solr = new CommonsHttpSolrServer(urlString);

    ContentStreamUpdateRequest up
      = new ContentStreamUpdateRequest("/update/extract");

    up.addFile(new File(fileName));

    up.setParam("literal.id", solrId);
    up.setParam("uprefix", "attr_");
    up.setParam("fmap.content", "attr_content");

    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    solr.request(up);

    QueryResponse rsp = solr.query(new SolrQuery("*:*"));

    System.out.println(rsp);
  }
}
As the example shows, a file is uploaded to Solr by wrapping the request in a ContentStreamUpdateRequest object and sending it as a POST request with solr.request(up).
The id set through literal.id is the key field in your schema.xml, and solrId is the value assigned to it.
In schema.xml the field is configured as <field name="id" type="string" indexed="true" stored="true" required="false" multiValued="false" />
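To make these parameters concrete, here is a minimal sketch of the URL that such an extract request targets, built with the Java standard library only. The class ExtractUrlSketch and its methods are hypothetical names for illustration and are not part of SolrJ, and commit=true is a simplified stand-in for the setAction(COMMIT, ...) call in the sample. The file itself would still be sent in the POST body.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SolrJ: shows the URL an extract request targets.
final class ExtractUrlSketch {

    // Same parameters as the sample above; insertion order is preserved.
    static Map<String, String> exampleParams(String solrId) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", solrId);            // value for the schema's id field
        params.put("uprefix", "attr_");              // prefix for fields unknown to the schema
        params.put("fmap.content", "attr_content");  // move Tika's content field
        params.put("commit", "true");                // assumption: simplified commit flag
        return params;
    }

    // URL-encodes one key or value as UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Appends /update/extract and the encoded parameters to the Solr base URL.
    static String buildExtractUrl(String solrBaseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(solrBaseUrl).append("/update/extract");
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator).append(encode(p.getKey()))
               .append('=').append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            buildExtractUrl("http://localhost:8983/solr", exampleParams("Sample.pdf")));
    }
}
```

Printing the URL this way can help when checking by hand which literal.*, fmap.* and uprefix values a request carries.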
The official site describes the request parameters in detail as follows:
Input Parameters
fmap.<source_field>=<target_field> - Maps (moves) one field name to another. Example: fmap.content=text will cause the content field normally generated by Tika to be moved to the "text" field.
boost.<fieldname>=<float> - Boost the specified field.
literal.<fieldname>=<value> - Create a field with the specified value. May be multivalued if the Field is multivalued.
uprefix=<prefix> - Prefix all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ would effectively ignore all unknown fields generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>
defaultField=<Field Name> - If uprefix is not specified and a Field cannot be determined, the default field will be used.
extractOnly=true|false - Default is false. If true, return the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. See TikaExtractOnlyExampleOutput.
resource.name=<File Name> - The optional name of the file. Tika can use it as a hint for detecting mime type.
capture=<Tika XHTML NAME> - Capture XHTML elements with the name separately for adding to the Solr document. This can be useful for grabbing chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (<p>) and index them into a separate field. Note that content is also still captured into the overall "content" field.
captureAttr=true|false - Index attributes of the Tika XHTML elements into separate fields, named after the element. For example, when extracting from HTML, Tika can return the href attributes in <a> tags as fields named "a". See the examples below.
xpath=<XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression. See http://tika.apache.org/1.2/parser.html for details on the format of Tika XHTML. See also TikaExtractOnlyExampleOutput.
lowernames=true|false - Map all field names to lowercase with underscores. For example, Content-Type would be mapped to content_type.
literalsOverride=true|false - (Solr 4.0) When true, literal field values will override other values with the same field name, such as metadata and content. If false, literal field values will be appended to any extracted data from Tika, and the resulting field needs to be multivalued. Default: true
resource.password=<password> - (Solr 4.0) The optional password for a password-protected PDF or OOXML file. File format support depends on Tika.
passwordsFile=<file name> - (Solr 4.0) The optional name of a file containing file name pattern to password mappings. See chapter "Encrypted Files" below.
If extractOnly is true, additional input parameters:
extractFormat=xml|text - Default is xml. Controls the serialization format of the extracted content. The xml format is actually XHTML, like passing the -x command to the tika command line application, while text is like the -t command.
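The literalsOverride rule in the list above is easy to misread, so here is a small sketch of how the values of one field might be combined. It is an illustration under the wording of the parameter description, not Solr's actual code; LiteralsOverrideSketch and mergeField are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the literalsOverride rule; not Solr's implementation.
final class LiteralsOverrideSketch {

    // Combines Tika-extracted values with literal.<fieldname> values for one field.
    static List<String> mergeField(List<String> extracted,
                                   List<String> literals,
                                   boolean literalsOverride) {
        if (literals.isEmpty()) {
            return new ArrayList<>(extracted);   // nothing to override or append
        }
        if (literalsOverride) {
            return new ArrayList<>(literals);    // literal values replace extracted ones
        }
        List<String> merged = new ArrayList<>(extracted);
        merged.addAll(literals);                 // appended: the field must be multivalued
        return merged;
    }

    public static void main(String[] args) {
        List<String> extracted = Arrays.asList("Title from Tika");
        List<String> literals = Arrays.asList("Title from literal.title");
        System.out.println(mergeField(extracted, literals, true));
        System.out.println(mergeField(extracted, literals, false));
    }
}
```

With literalsOverride=false the result holds two values, which is why the description requires the target field to be multivalued.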
Order of field operations
fields are generated by Tika or passed in as literals via literal.fieldname=value. Before Solr 4.0, or if literalsOverride=false, literals will be appended as multi-value to the Tika-generated field.
if lowernames==true, fields are mapped to lower case
mapping rules fmap.source=target are applied
if uprefix is specified, any unknown field names are prefixed with that value, else if defaultField is specified, unknown fields are copied to that.
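The order above can be sketched as a single name-resolution function. This is a simplified illustration, not Solr's implementation: FieldOrderSketch, lowerName and resolve are hypothetical names, the schema is reduced to a plain set of field names (dynamic fields are ignored), and the rule that every non-alphanumeric character becomes an underscore is an assumption that agrees with the Content-Type example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the field-name steps; not Solr's implementation.
final class FieldOrderSketch {

    // Assumed schema: only these two fields are "known". Dynamic fields are ignored here.
    static final Set<String> SCHEMA_FIELDS = new HashSet<>(Arrays.asList("id", "links"));

    // Assumed mapping rule, equivalent to fmap.a=links.
    static final Map<String, String> FMAP = Collections.singletonMap("a", "links");

    // lowernames: lowercase letters and digits, everything else becomes '_'.
    static String lowerName(String name) {
        StringBuilder sb = new StringBuilder();
        for (char ch : name.toCharArray()) {
            sb.append(Character.isLetterOrDigit(ch) ? Character.toLowerCase(ch) : '_');
        }
        return sb.toString();
    }

    static String resolve(String name, boolean lowernames, Map<String, String> fmap,
                          Set<String> schemaFields, String uprefix, String defaultField) {
        String field = lowernames ? lowerName(name) : name;  // step 2: lowernames
        field = fmap.getOrDefault(field, field);              // step 3: fmap.source=target
        if (schemaFields.contains(field)) {
            return field;                                     // known field, keep as is
        }
        if (uprefix != null) {
            return uprefix + field;                           // step 4: prefix unknown fields
        }
        if (defaultField != null) {
            return defaultField;                              // otherwise use defaultField
        }
        return field;
    }

    public static void main(String[] args) {
        // prints ignored_content_type
        System.out.println(resolve("Content-Type", true, FMAP, SCHEMA_FIELDS, "ignored_", null));
        // prints links
        System.out.println(resolve("a", true, FMAP, SCHEMA_FIELDS, "ignored_", null));
        // prints text
        System.out.println(resolve("Author", true, FMAP, SCHEMA_FIELDS, null, "text"));
    }
}
```

Because lowernames runs before the fmap rules, mapping keys in this sketch have to be written in their lowercased form.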
The code above is not enough on its own.
The request is sent to /update/extract, and this request handler must have a matching configuration in solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
In addition, the jars from contrib/extraction/lib and the solr-cell- jar must be placed under Solr's lib directory.