[TIKA-1483] Create a Latin1 charset raw string parser - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6
Fix Version/s: 1.8
Component/s: parser
Labels: None

Description

I think it would be very useful to add a general parser that extracts raw strings from files (like the strings command). It could serve as the fallback parser for all mimetypes that have no specific parser implementation, such as application/octet-stream. It could also serve as a fallback for corrupt files whose regular parser throws a TikaException.

The parser must be configured with the script/language to extract from the files (so far I have implemented one specific to Latin1).
It can use heuristics to extract strings encoded with different charsets within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
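To make the idea concrete, here is a minimal sketch of strings-style extraction for the Latin1 case. It is an illustration only, not the code in the attached patches: the function name `extract_latin1_strings`, the minimum run length of 4 (borrowed from the default of the Unix strings tool) and the choice of which bytes count as printable are all assumptions.

```python
import re

# Assumption: a run must be at least 4 bytes long, as in the Unix `strings` tool.
MIN_LEN = 4

# Assumption: "printable Latin1" means tab, ASCII 0x20-0x7E and ISO-8859-1 0xA0-0xFF.
# Control bytes (0x00-0x1F except tab, 0x7F-0x9F) end a run.
_LATIN1_RUN = re.compile(rb"[\t\x20-\x7e\xa0-\xff]{%d,}" % MIN_LEN)


def extract_latin1_strings(data: bytes) -> list:
    """Return every run of printable ISO-8859-1 bytes in `data`, decoded to str."""
    return [m.group().decode("latin-1") for m in _LATIN1_RUN.finditer(data)]


if __name__ == "__main__":
    blob = b"\x00\x01Hello\x00ab\x00caf\xe9 ol\xe9\x02"
    for text in extract_latin1_strings(blob):
        print(text)
```

Note that this single-byte scan would also accept the bytes of UTF-8 multi-byte sequences and decode them as Latin1, so handling several charsets in one file would need an extra heuristic on top of it.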

What does the community think about this?

Attachments

- TIKA-1483_v2.patch (13 kB), 17/Feb/15 15:42, Luís Filipe Nassif
- TIKA-1483.patch (13 kB), 15/Feb/15 18:46, Luís Filipe Nassif

Issue Links

supersedes: TIKA-1541 StringsParser: a simple strings-based parser for Tika (Resolved)

People

Assignee: Chris A. Mattmann
Reporter: Luís Filipe Nassif
Votes: 1
Watchers: 6

Dates

Created: 19/Nov/14 01:52
Updated: 26/Feb/15 04:43
Resolved: 26/Feb/15 03:46