https://www.hascode.com/content-detection-metadata-and-content-extraction-with-apache-tika/