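Apache is a great organization.
While Lucene search is in full swing, Apache keeps pushing forward and has recently released a solution for parsing files of many formats: Apache Tika. It has not reached version 1.0 yet, but it is already very usable: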
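import java.io.File;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.utils.ParseUtils;

/**
 * Parses files of various types.
 *
 * @param path path of the file to parse
 * @return the file content as a string, or "" if parsing fails
 */
public static String parse(String path) {
    String result = "";
    TikaConfig tikaConfig = TikaConfig.getDefaultConfig();
    try {
        result = ParseUtils.getStringContent(new File(path), tikaConfig);
    } catch (Exception e) {
        // log is the logger of the enclosing class (not shown here)
        log.debug("[by ninja.hzw]" + e);
    }
    return result;
}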
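It is very simple: it parses all kinds of files and returns the document content as a string. Word 2003/2007, PDF and TXT files have all been tested, and all of them parse without garbled characters.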
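As a rough sketch of how a method like parse might be called from indexing code, the following collects the text of several files into a map. The Tika call is hidden behind a hypothetical TextExtractor interface so that the sketch compiles without Tika on the classpath; in real code the extractor would simply delegate to the parse method shown above. The names TextExtractor, BatchExtract and extractAll are illustrative assumptions, not part of Tika.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical abstraction over a parse(String path) method like the one above. */
interface TextExtractor {
    String extract(String path) throws Exception;
}

final class BatchExtract {
    /**
     * Runs the extractor over every path, in order. A failure for one file
     * yields an empty string for that file, mirroring how parse returns ""
     * when an exception occurs, so one bad file does not stop the batch.
     */
    static Map<String, String> extractAll(List<String> paths, TextExtractor extractor) {
        Map<String, String> contents = new LinkedHashMap<>();
        for (String path : paths) {
            String text;
            try {
                text = extractor.extract(path);
            } catch (Exception e) {
                text = "";
            }
            contents.put(path, text == null ? "" : text);
        }
        return contents;
    }

    public static void main(String[] args) {
        // Stand-in extractor for demonstration only; real code would call parse(path).
        TextExtractor fake = path -> {
            if (path.endsWith(".bad")) {
                throw new IllegalStateException("unreadable");
            }
            return "text of " + path;
        };
        Map<String, String> result = extractAll(Arrays.asList("a.pdf", "b.bad"), fake);
        System.out.println(result);
    }
}
```

Because failures collapse to an empty string, a caller that needs to tell "empty document" apart from "parse failed" would have to record the exception separately.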
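Oh, great Apache!
Downloading and packaging Tika:
The download needs little explanation: search Google for "apache tika", find the official site, and download it from there.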
To build Tika from sources you first need to either download a source release or checkout the latest sources from version control. Once you have the sources, you can build them using the Maven 2 build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.
    mvn install
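Apache has put it clearly: go into the downloaded tika directory and run mvn install. (This does assume you know how to use Maven 2; if you don't, feel free to contact me ^^. Also note that JDK 1.5 or later is required for the build and packaging to succeed.)
After packaging, the following jars are produced: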
tika-core/target/tika-core-0.7.jar
    Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.
tika-parsers/target/tika-parsers-0.7.jar
    Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
tika-app/target/tika-app-0.7.jar
    Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
tika-bundle/target/tika-bundle-0.7.jar
    Tika bundle. An OSGi bundle that includes everything you need to use all Tika functionality in an OSGi environment.
For document parsing, we only need to include tika-core and tika-parsers.
If your project is built with Maven, even better. Add these dependencies to the pom:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>0.7</version>
</dependency>
and
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>0.7</version>
</dependency>
Maven will download them automatically. (Thanks to the Maven project for the support.)
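To check that both jars really ended up on the classpath, a small sanity check along the following lines may help. It is only a sketch: the class TikaClasspathCheck is hypothetical, and the two class names it probes are assumptions. org.apache.tika.config.TikaConfig is the class used in the parse method above and belongs to tika-core, while org.apache.tika.parser.pdf.PDFParser is assumed to ship in tika-parsers. Adjust the names if your version differs.

```java
final class TikaClasspathCheck {
    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] expected = {
            "org.apache.tika.config.TikaConfig",    // assumed to come from tika-core
            "org.apache.tika.parser.pdf.PDFParser"  // assumed to come from tika-parsers
        };
        for (String name : expected) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the second class is reported missing while the first is found, the most likely cause is that only tika-core was added and tika-parsers was left out.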
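Comments
#2
海底的乌鸦
2013-08-20
Excuse me, does this have to be used in a Maven environment? Can it be used in a plain Java environment? If so, could you put together a package for me? I really have no idea what Maven is.
#1
eidolonprince
2011-04-13
Hello, I followed your approach and used Tika with Lucene. I now have one problem: every time a PDF is parsed, no error is reported, but a large amount of output is printed. This does not happen when parsing other formats. Any advice would be appreciated: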
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{16, 0}
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{15, 0}
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{13, 0}
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{14, 0}
2011-04-13 19:50:22,984 DEBUG [http-9080-2] org.apache.pdfbox.pdfparser.PDFObjectStreamParser: parsed=COSObject{17, 0}
2011-04-13 19:50:23,275 DEBUG [http-9080-2] org.apache.pdfbox.pdmodel.font.PDSimpleFont: Debug: Could not find encoding for COSName{Identity-H}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSName{P}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSDictionary{(COSName{MCID}:COSInt{0}) }
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{BDC}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{BT}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSName{F1}
2011-04-13 19:50:23,318 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{10.56}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{Tf}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{1}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{1}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{90.024}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{758.28}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{Tm}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{g}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{G}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSArray{[COSString{08?}, COSInt{11}, COSString{-;-;}, COSInt{11}, COSString{??}, COSInt{11}, COSString{7-
V}, COSInt{11}, COSString{>?L}, COSInt{11}, COSString{*?}]}
2011-04-13 19:50:23,320 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{TJ}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{ET}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{EMC}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSName{P}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSDictionary{(COSName{MCID}:COSInt{1}) }
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{BDC}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{BT}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSName{F2}
2011-04-13 19:50:23,323 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{10.56}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{Tf}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{1}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{0}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSInt{1}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{226.61}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSFloat{758.28}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{Tm}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: COSArray{[COSString{ }]}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{TJ}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{ET}
2011-04-13 19:50:23,325 DEBUG [http-9080-2] org.apache.pdfbox.util.PDFStreamEngine: processing substream token: PDFOperator{EMC}