Apache Tika based binary file plain text extraction
Last updated: Thursday 15 September 2011 10:39
| UNIX name | Owner | Status | Version | Compatible with |
| eztika | Paul Borgermans | stable | 1.5 | 4.x |
A wrapper script for the standalone Tika toolkit that converts a wide variety of binary file types (MS Word and other MS Office formats, PDF, ODF and more) to plain text so that their content can be indexed.
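As a rough illustration of what such a wrapper does, the sketch below builds and runs a Tika command line that prints the plain text of one file. It is not the eztika script itself: the function names, the jar location and the `java` binary name are assumptions made for the example. The `--text` and `--encoding=X` options are those of the standalone Tika application.

```python
import subprocess


def build_tika_command(jar_path, input_path, encoding="UTF-8", java_bin="java"):
    """Return the argument list for a plain-text extraction run.

    jar_path and java_bin are placeholders; adapt them to your environment.
    """
    return [
        java_bin,
        "-jar", str(jar_path),
        f"--encoding={encoding}",  # output encoding of the extracted text
        "--text",                  # ask Tika for plain text output
        str(input_path),
    ]


def extract_text(jar_path, input_path, encoding="UTF-8"):
    """Run Tika and return the extracted text.

    Raises subprocess.CalledProcessError if the Java process exits
    with a non-zero status.
    """
    command = build_tika_command(jar_path, input_path, encoding)
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode(encoding, errors="replace")
```

Calling `extract_text("/path/to/tika.jar", "report.pdf")` would return the text of the document, provided a Java runtime is on the PATH and the jar path is correct.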
Requirements
- Sun Java VM (JRE 1.6)
- eZ Publish 4.x
Supported binary file formats
[application/pdf]
[application/msword]
[application/vnd.ms-excel]
[application/vnd.ms-powerpoint]
[application/vnd.visio]
[application/vnd.ms-outlook]
[application/xml]
[application/rtf]
[application/vnd.oasis.opendocument.text]
[application/vnd.oasis.opendocument.presentation]
[application/vnd.oasis.opendocument.spreadsheet]
[application/vnd.oasis.opendocument.formula]
[application/vnd.openxmlformats-officedocument.wordprocessingml.document]
[application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
[application/vnd.openxmlformats-officedocument.presentationml.presentation]
[application/octet-stream] (anything Tika recognizes, which covers a large number of formats)
[application/zip]
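For scripts that need to decide up front whether a file is worth handing to the extractor, the list above can be treated as a simple lookup set. The following is a minimal sketch; the names `SUPPORTED_MIME_TYPES` and `is_supported_mime_type` are hypothetical and not part of the extension.

```python
# MIME types copied from the list of supported binary file formats.
SUPPORTED_MIME_TYPES = frozenset({
    "application/pdf",
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
    "application/vnd.visio",
    "application/vnd.ms-outlook",
    "application/xml",
    "application/rtf",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.presentation",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.formula",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/octet-stream",
    "application/zip",
})


def is_supported_mime_type(mime_type):
    """Return True if mime_type appears in the supported list.

    Parameters such as "; charset=binary" are dropped and the
    comparison is case-insensitive.
    """
    base_type = mime_type.split(";", 1)[0].strip().lower()
    return base_type in SUPPORTED_MIME_TYPES
```

For example, `is_supported_mime_type("application/pdf")` is true, while `is_supported_mime_type("image/png")` is false.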
Installation
See the included INSTALL.txt. If necessary, adapt the various paths to your environment, and check that the permissions and ownership of the executables match your server configuration.
Known issues
Changelog
Changelog eztika 1.4 to 1.5
- updated tika.jar to 1.0-snapshot (svn rev 1169702)
- "zero-configuration" option: no need to edit the paths in the executable scripts; they are executed relative to your eZ Publish installation root (Felix Woldt)
- additional debug settings for success/failure of text extraction and optionally keeping the extracted text temp files (Felix Woldt)
Changelog eztika 1.3 to 1.4
- updated tika.jar to version 1.0-snapshot (svn rev 1156078)
- includes support for iWork, chm
Changelog eztika 1.2 to 1.3
- updated tika.jar to 0.8-snapshot (rev 933934); CJK PDFs are now supported correctly
- updated meta-info
Changelog eztika 1.1 to 1.2
- updated tika.jar to 0.6-dev (rev 897576), which has better support for MS Excel, a reduced footprint and output encoding options
- OOo and MSxx formats with Asian content are now correctly converted to UTF-8
- added encoding option --encoding=utf8 to eztika wrapper script
- updated ezinfo.php
- changed structure for building with http://projects.ez.no/ezextensionbuilder
Changelog eztika 1.0 to 1.1
- (added 2010-01-17) let ezpdftotext specify the no-pagebreaks option; page breaks can potentially break the UTF payload sent to eZ Find
- updated tika.jar to 0.5-dev (rev 814142), including more support for office xml formats and various bugfixes
- added office xml format mimetypes to binaryfile.ini
- updated ezinfo.php
eZ Tika 1.5 released (binary file indexing)
Thursday 15 September 2011 10:27
Paul Borgermans