Apache Tika based binary file plain text extraction

UNIX name   Owner             Status   Version   Compatible with
eztika      Paul Borgermans   stable   1.5       4.x
A wrapper script for the standalone Tika toolkit that converts a large variety of binary file types to plain text for indexing, such as MS Word and other MS Office formats, PDF, Excel, ODF, and more.
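
As a rough illustration of what such a wrapper does, the sketch below builds and runs a plain-text extraction command for the standalone Tika jar. It is not the script shipped with eztika. The `--text` and `--encoding=` flags are those of the Tika command-line application; the function names, the jar location, and the timeout are assumptions made for this example.

```python
import subprocess
from pathlib import Path


def build_tika_command(jar_path, input_path, encoding="UTF-8"):
    """Build the argument list for a plain-text extraction run of the Tika CLI."""
    return [
        "java",
        "-jar", str(jar_path),
        "--text",                   # ask Tika for plain text output
        f"--encoding={encoding}",   # encoding of the text written to stdout
        str(input_path),
    ]


def extract_text(jar_path, input_path, encoding="UTF-8", timeout=120):
    """Run Tika on one file and return the extracted text as a string.

    Raises FileNotFoundError if the input file is missing and
    subprocess.CalledProcessError if Tika exits with a non-zero status.
    """
    if not Path(input_path).is_file():
        raise FileNotFoundError(input_path)
    result = subprocess.run(
        build_tika_command(jar_path, input_path, encoding),
        capture_output=True,   # collect stdout/stderr instead of printing them
        timeout=timeout,       # hypothetical limit so a stuck JVM cannot block indexing
        check=True,
    )
    return result.stdout.decode(encoding, errors="replace")
```

Calling `extract_text("tika.jar", "report.pdf")` with a hypothetical jar and input file would return the text that can then be handed to the search index.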


Requirements

  • Sun Java VM (JRE 1.6)
  • eZ Publish 4.x

Supported binary file formats

[application/octet-stream] (anything Tika recognizes, which is a lot)

Installation


See the included INSTALL.txt. Adapt the various paths to your environment if needed, and check that the permissions and ownership of the executables suit your server configuration.
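
The permission and ownership check can be scripted. The following is a minimal sketch using only the Python standard library on a POSIX system; the function name and the script path in the usage note are hypothetical and should be replaced with the locations given in INSTALL.txt.

```python
import os
import stat


def describe_script(path):
    """Report mode, owner and execute bits of one script file."""
    st = os.stat(path)
    return {
        "mode": stat.filemode(st.st_mode),          # e.g. "-rwxr-xr-x"
        "owner_uid": st.st_uid,                     # numeric id of the file owner
        "executable_bits": bool(st.st_mode & 0o111) # True if any execute bit is set
    }
```

For example, `describe_script("extension/eztika/bin/eztika")` (a hypothetical path) shows whether the wrapper has an execute bit set and which user owns it, which can then be compared with the account your web server or indexing cron job runs as.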

Known issues

  • none so far


Changelog eztika 1.4 to 1.5

  • updated tika.jar to 1.0-snapshot (svn rev 1169702)
  • "zero-configuration" option: no need to edit paths in the executable scripts; they are executed relative to your eZ Publish root installation (Felix Woldt)
  • additional debug settings to log the success or failure of text extraction and, optionally, to keep the temporary files holding the extracted text (Felix Woldt)

Changelog eztika 1.3 to 1.4

  • updated tika.jar to version 1.0-snapshot (svn rev 1156078)
  • includes support for iWork and CHM

Changelog eztika 1.2 to 1.3

  • updated tika.jar to 0.8-snapshot (rev 933934); CJK PDFs are now supported correctly
  • updated meta-info

Changelog eztika 1.1 to 1.2

  • updated tika.jar to 0.6-dev (rev 897576), which has better support for MS Excel, a reduced footprint and output encoding options
  • OOo and MSxx formats with Asian content are now correctly converted to UTF-8
  • added encoding option --encoding=utf8 to eztika wrapper script
  • updated ezinfo.php
  • changed structure for building with

Changelog eztika 1.0 to 1.1

  • (added 2010-01-17) let ezpdftotext specify the no-pagebreaks option, since page breaks can potentially break the UTF payload sent to eZ Find
  • updated tika.jar to 0.5-dev (rev 814142), including more support for office xml formats and various bugfixes
  • added office xml format mimetypes to binaryfile.ini
  • updated ezinfo.php
