Apache Tika based binary file plain text extraction

UNIX name  Owner            Status  Version  Compatible with
eztika     Paul Borgermans  stable  1.5      4.x
A wrapper script for the standalone Apache Tika toolkit. It converts a large variety of binary file types (MS Word, Excel and other MS Office formats, PDF, ODF and more) to plain text so that their content can be indexed.
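
As a rough illustration of what such a wrapper does, the sketch below builds and runs a Tika command line that prints the plain text of a file. It is not the script shipped with eztika. The function names are hypothetical, and it assumes that `java` is on the PATH and that the bundled tika.jar accepts the standard Tika command-line flags `--text` and `--encoding`.

```python
import subprocess
from typing import List, Optional


def build_tika_command(tika_jar: str, source: str,
                       encoding: Optional[str] = "UTF-8") -> List[str]:
    """Build the argument list for a plain text extraction run.

    tika_jar and source are paths chosen by the caller (hypothetical here).
    """
    cmd = ["java", "-jar", tika_jar, "--text"]
    if encoding:
        # Ask Tika to write its output in the given encoding.
        cmd.append(f"--encoding={encoding}")
    cmd.append(source)
    return cmd


def extract_text(tika_jar: str, source: str, encoding: str = "UTF-8") -> str:
    """Run Tika and return the extracted text (requires Java and the jar)."""
    result = subprocess.run(
        build_tika_command(tika_jar, source, encoding),
        capture_output=True,
        check=True,  # raise CalledProcessError if Tika exits with an error
    )
    return result.stdout.decode(encoding, errors="replace")
```

Keeping the command construction separate from the call to `subprocess.run` makes it possible to inspect or log the command without needing a Java runtime.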

Requirements

  • Sun Java VM (JRE 1.6)
  • eZ Publish 4.x

Supported binary file formats

[application/pdf]
[application/msword]
[application/vnd.ms-excel]
[application/vnd.ms-powerpoint]
[application/vnd.visio]
[application/vnd.ms-outlook]
[application/xml]
[application/rtf]
[application/vnd.oasis.opendocument.text]
[application/vnd.oasis.opendocument.presentation]
[application/vnd.oasis.opendocument.spreadsheet]
[application/vnd.oasis.opendocument.formula]
[application/vnd.openxmlformats-officedocument.wordprocessingml.document]
[application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
[application/vnd.openxmlformats-officedocument.presentationml.presentation]
[application/octet-stream] (anything else Tika recognizes, which covers many more formats)
[application/zip]
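
The list above can be used to decide up front whether a file is worth handing to the extractor. The following is a minimal sketch of such a check, assuming the MIME type is guessed from the file name with Python's standard `mimetypes` module; eztika itself relies on eZ Publish for MIME detection, and the function names here are hypothetical.

```python
import mimetypes

# MIME types taken from the list of supported formats above.
SUPPORTED_MIME_TYPES = frozenset({
    "application/pdf",
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
    "application/vnd.visio",
    "application/vnd.ms-outlook",
    "application/xml",
    "application/rtf",
    "application/vnd.oasis.opendocument.text",
    "application/vnd.oasis.opendocument.presentation",
    "application/vnd.oasis.opendocument.spreadsheet",
    "application/vnd.oasis.opendocument.formula",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/octet-stream",
    "application/zip",
})

FALLBACK_MIME_TYPE = "application/octet-stream"


def mime_type_for(filename: str) -> str:
    """Guess a MIME type from the file name, falling back to octet-stream."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or FALLBACK_MIME_TYPE


def is_supported(filename: str) -> bool:
    """Return True if the guessed MIME type is in the supported list."""
    return mime_type_for(filename) in SUPPORTED_MIME_TYPES
```

Because application/octet-stream is itself in the list, files with an unknown extension are still passed on, which leaves the final detection to Tika.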

Installation

See the included INSTALL.txt. If needed, adapt the various paths to your environment, and check that the permissions and ownership of the executables suit your server configuration.
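
The permission check can be automated. The sketch below reports which of a given set of files are missing or not executable by the current user. The helper name is hypothetical, and the paths to pass in depend on where the extension and its scripts live in your installation; INSTALL.txt remains the reference for the actual locations.

```python
import os
from typing import Iterable, List


def find_non_executable(paths: Iterable[str]) -> List[str]:
    """Return the paths that are missing or not executable by this user."""
    return [
        path for path in paths
        if not (os.path.isfile(path) and os.access(path, os.X_OK))
    ]
```

Run it as the user your web server or indexing cronjob runs as, since `os.access` tests the permissions of the calling user.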

Known issues

  • none so far

Changelog

Changelog eztika 1.4 to 1.5

  • updated tika.jar to 1.0-snapshot (svn rev 1169702)
  • "zero-configuration" option: no need to edit the paths in the executable scripts; they are executed relative to your eZ Publish root installation (Felix Woldt)
  • additional debug settings for success/failure of text extraction and optionally keeping the extracted text temp files (Felix Woldt)

Changelog eztika 1.3 to 1.4

  • updated tika.jar to version 1.0-snapshot (svn rev 1156078)
  • includes support for iWork, chm

Changelog eztika 1.2 to 1.3

  • updated tika.jar to 0.8-snapshot (rev 933934); now correctly supports CJK PDFs
  • updated meta-info

Changelog eztika 1.1 to 1.2

  • updated tika.jar to 0.6-dev (rev 897576), which has better support for MS Excel, a reduced footprint and output encoding options
  • it now correctly converts OOo and MSxx formats with Asian content to UTF-8
  • added encoding option --encoding=utf8 to eztika wrapper script
  • updated ezinfo.php
  • changed structure for building with http://projects.ez.no/ezextensionbuilder

Changelog eztika 1.0 to 1.1

  • (added 2010-01-17) let ezpdftotext specify the no page breaks option; page breaks could potentially break the UTF payload sent to eZ Find
  • updated tika.jar to 0.5-dev (rev 814142), including more support for office xml formats and various bugfixes
  • added office xml format mimetypes to binaryfile.ini
  • updated ezinfo.php
