The previous post mentioned this code:
DocHander docHander = DocHanderFactory.buildDocHander(fileName);
attachDocument = docHander.getDocument(attach);
Let's look at the implementation details.
Here is the abstract class DocHander:
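import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public abstract class DocHander {
    // Name of the Lucene field that holds the extracted text.
    public static final String FIELD_CONTENT = "contents";

    // Builds a Lucene Document from the raw bytes of an attachment.
    public abstract Document getDocument(byte[] inputByte) throws Exception;

    // Stores the text (Store.YES) and indexes it tokenized so it can be searched.
    protected Document addContent(Document document, String content) {
        document.add(new Field(FIELD_CONTENT, content, Field.Store.YES, Field.Index.TOKENIZED));
        return document;
    }
}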
Now let's look at the factory class DocHanderFactory:
public abstract class DocHanderFactory {
    // Picks a handler from the file extension; anything unrecognized is treated as plain text.
    public static DocHander buildDocHander(String fileName) {
        String name = fileName.toLowerCase();
        if (name.endsWith(".doc")) {
            return new WordDocHander();
        } else if (name.endsWith(".xls")) {
            return new ExcelDocHander();
        } else if (name.endsWith(".pdf")) {
            return new PdfDocHander();
        } else if (name.endsWith(".html") || name.endsWith(".htm")) {
            return new HtmlDocHander();
        } else {
            return new TxtDocHander();
        }
    }
}
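The extension checks above can also be written as a lookup table, which keeps the list of supported types in one place. The following is only a sketch of that idea using the standard library: HandlerLookup and handlerNameFor are hypothetical names, and it returns the handler class name as a string rather than an instance so that it runs without Lucene, POI, or PDFBox on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the original post: maps a file name to the
// simple name of the handler class that DocHanderFactory would pick.
final class HandlerLookup {
    // Insertion-ordered table of extension -> handler class name.
    private static final Map<String, String> HANDLERS = new LinkedHashMap<>();
    static {
        HANDLERS.put(".doc", "WordDocHander");
        HANDLERS.put(".xls", "ExcelDocHander");
        HANDLERS.put(".pdf", "PdfDocHander");
        HANDLERS.put(".html", "HtmlDocHander");
        HANDLERS.put(".htm", "HtmlDocHander");
    }

    static String handlerNameFor(String fileName) {
        // Lowercase once, independent of the default locale.
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : HANDLERS.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        // Same fallback as the factory: unknown types are handled as plain text.
        return "TxtDocHander";
    }
}
```

Calling HandlerLookup.handlerNameFor("Report.DOC") selects the Word handler, while a name such as "notes.docx" matches no entry and falls through to the text handler, just as it would in the if/else chain.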
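Below is the code for WordDocHander, ExcelDocHander, and PdfDocHander. Other people have already wrapped the hard parts for us, so these classes are very simple to write. Many thanks to them!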
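import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.document.Document;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class WordDocHander extends DocHander {
    public Document getDocument(byte[] inputByte) throws IOException {
        InputStream inputStream = new ByteArrayInputStream(inputByte);
        Document document = new Document();
        // POI's WordExtractor pulls the plain text out of the .doc file.
        WordExtractor extractor = new WordExtractor(inputStream);
        return addContent(document, extractor.getText());
    }
}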
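import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.lucene.document.Document;
import org.apache.poi.hssf.extractor.ExcelExtractor;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

public class ExcelDocHander extends DocHander {
    public Document getDocument(byte[] inputByte) throws IOException {
        InputStream inputStream = new ByteArrayInputStream(inputByte);
        Document document = new Document();
        HSSFWorkbook wb = new HSSFWorkbook(inputStream);
        ExcelExtractor extractor = new ExcelExtractor(wb);
        // Extract the formula text itself rather than the computed cell values.
        extractor.setFormulasNotResults(true);
        // Leave sheet names out of the extracted text.
        extractor.setIncludeSheetNames(false);
        return addContent(document, extractor.getText());
    }
}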
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import org.apache.lucene.document.Document;
// Assumed PDFBox 0.7.x package names; Apache PDFBox 1.x uses the org.apache.pdfbox prefix.
import org.pdfbox.exceptions.CryptographyException;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfDocHander extends DocHander {
    public Document getDocument(byte[] inputByte) throws IOException {
        // If you do not need to show excerpts of the matched content, this simpler call is enough:
        // Document document = LucenePDFDocument.getDocument(inputStream);
        InputStream inputStream = new ByteArrayInputStream(inputByte);
        Document document = new Document();
        PDDocument pdfDocument = PDDocument.load(inputStream);
        try {
            if (pdfDocument.isEncrypted()) {
                // Just try the default (empty) password and move on.
                pdfDocument.decrypt("");
            }
            // Create a writer to collect the text content.
            StringWriter writer = new StringWriter();
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.writeText(pdfDocument, writer);
            addContent(document, writer.toString());
        } catch (CryptographyException e) {
            throw new IOException("Error decrypting document: " + e);
        } catch (InvalidPasswordException e) {
            throw new IOException("Error decrypting document: " + e);
        } finally {
            // Release the resources held by the loaded PDF.
            pdfDocument.close();
        }
        return document;
    }
}
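The factory's fallback, TxtDocHander, is not listed in this post. As a rough sketch of the one step it would need beyond the handlers above (turning the attachment bytes into a string that can be passed to addContent), something like the following could work. The class name TextDecoding, the method decode, and the choice of UTF-8 are all assumptions for illustration, not the author's code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the original post.
final class TextDecoding {
    // Decodes attachment bytes as UTF-8 (an assumption about the input)
    // and drops a leading byte-order mark if one is present.
    static String decode(byte[] inputByte) {
        String text = new String(inputByte, StandardCharsets.UTF_8);
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }
}
```

A text handler could then pass the decoded string to addContent on a new Document, in the same way the Word and Excel handlers pass their extractor output. Files in other encodings would need a different charset.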