Content Detection, Metadata and Content Extraction with Apache Tika

Sun, 02 Dec 2012 00:00:00 +0100

If you need to extract metadata or content from a file, be it an office document, a spreadsheet, an MP3 or an image, or you'd like to detect the content type of a given file, then Apache Tika might be a helpful tool for you.

Apache Tika supports a wide variety of document formats and offers a clean, extensible parser and detection API with many built-in parsers.

