Description
I think it would be very useful to add a general parser that extracts raw strings from files (like the strings command). It could serve as the fallback parser for all mimetypes that have no specific parser implementation, such as application/octet-stream, and also as a fallback for corrupt files that throw a TikaException.
It must be configured with the script/language to extract from the files (so far I have implemented one specific to Latin1).
It can use heuristics to extract strings encoded with different charsets within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
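To make the mixed-charset idea concrete, here is a minimal sketch in plain Java. It is not the implementation proposed in this issue and it does not use any Tika API: the class RawStringsSketch and its methods are hypothetical names chosen for illustration. It assumes a Latin script, recognizes only two-byte UTF-8 sequences and BOM-less UTF-16LE, and treats tab, 0x20-0x7E and 0xA0-0xFF as printable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Tika.
class RawStringsSketch {

    // Printable in ISO-8859-1: tab, 0x20-0x7E and 0xA0-0xFF.
    static boolean printable(int c) {
        return c == '\t' || (c >= 0x20 && c <= 0x7E) || (c >= 0xA0 && c <= 0xFF);
    }

    // Returns every run of at least minLen printable characters.
    static List<String> extract(byte[] data, int minLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < data.length) {
            StringBuilder sb = new StringBuilder();
            // Try UTF-16LE first; fall back to an 8-bit run if it is too short.
            int end = readUtf16Le(data, i, sb);
            if (sb.length() < minLen) {
                sb.setLength(0);
                end = readEightBit(data, i, sb);
            }
            if (sb.length() >= minLen) {
                out.add(sb.toString());
            }
            i = Math.max(end, i + 1); // always make progress
        }
        return out;
    }

    // Consumes (printable byte, 0x00) pairs, i.e. Latin-1 range text in UTF-16LE.
    static int readUtf16Le(byte[] d, int i, StringBuilder sb) {
        while (i + 1 < d.length && d[i + 1] == 0 && printable(d[i] & 0xFF)) {
            sb.append((char) (d[i] & 0xFF));
            i += 2;
        }
        return i;
    }

    // Consumes a run of 8-bit text, preferring a valid two-byte UTF-8
    // sequence over two separate ISO-8859-1 characters.
    static int readEightBit(byte[] d, int i, StringBuilder sb) {
        while (i < d.length) {
            int b = d[i] & 0xFF;
            int next = i + 1 < d.length ? d[i + 1] & 0xFF : -1;
            int cp = ((b & 0x1F) << 6) | (next & 0x3F);
            if (b >= 0xC2 && b <= 0xDF && next >= 0x80 && next <= 0xBF && cp >= 0xA0) {
                sb.append((char) cp);
                i += 2;
            } else if (printable(b)) {
                sb.append((char) b);
                i++;
            } else {
                break; // non-printable byte ends the run
            }
        }
        return i;
    }
}
```

A caller would read the file into a byte array and call RawStringsSketch.extract(bytes, 4). The preference for UTF-8 over ISO-8859-1 is a judgment call: a byte pair that is valid UTF-8 could also be two legitimate Latin1 characters, so a real parser would probably need a stronger scoring heuristic.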
What does the community think about this?
Attachments
Issue Links
- supercedes TIKA-1541 StringsParser: a simple strings-based parser for Tika (Resolved)