We use Tika for text extraction. Some sites return zip files that contain stored (uncompressed) entries with data descriptors, and these fail to extract because ZipArchiveInputStream (in commons-compress) defaults 'allowStoredEntriesWithDataDescriptor' to false.
Since ZipArchiveInputStream supports reading such entries, we should retry the zip with that feature enabled when we get an UnsupportedZipFeatureException for the data descriptor feature.
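A minimal sketch of the retry, not the code in the pull request. It uses only the JDK so it runs on its own: DataDescriptorException and ZipReader are hypothetical stand-ins, the first for an UnsupportedZipFeatureException whose feature is DATA_DESCRIPTOR, the second for whatever code walks the entries. It also assumes the zip fits in memory, since the first attempt consumes part of the stream and the retry has to start from the beginning.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class DataDescriptorRetry {

    // Hypothetical stand-in for commons-compress's UnsupportedZipFeatureException
    // raised with Feature.DATA_DESCRIPTOR. Real code would catch that exception
    // and check getFeature() instead.
    static class DataDescriptorException extends IOException {
        DataDescriptorException(String message) {
            super(message);
        }
    }

    // Hypothetical callback that reads every entry from the stream.
    // In real code the flag would be passed as the fourth argument of
    // new ZipArchiveInputStream(in, encoding, useUnicodeExtraFields,
    //                           allowStoredEntriesWithDataDescriptor).
    interface ZipReader<T> {
        T read(InputStream in, boolean allowStoredEntriesWithDataDescriptor)
                throws IOException;
    }

    // First try with the default (flag off). If that fails on a data
    // descriptor, read again from a fresh stream with the flag on.
    static <T> T readWithFallback(byte[] zipBytes, ZipReader<T> reader)
            throws IOException {
        try {
            return reader.read(new ByteArrayInputStream(zipBytes), false);
        } catch (DataDescriptorException e) {
            // The failed attempt already consumed part of its stream,
            // so the retry must not reuse it.
            return reader.read(new ByteArrayInputStream(zipBytes), true);
        }
    }
}
```

Any other IOException propagates unchanged, so only the data descriptor case triggers a second pass. Output collected during the failed first attempt would need to be discarded before the retry, otherwise entries read before the failure are reported twice.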
Pull Request: https://github.com/apache/tika/pull/356