[TIKA-1541] StringsParser: a simple strings-based parser for Tika - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.8
Component/s: parser
Labels: None

Description

I have implemented a very simple StringsParser, a parser based on the strings command (or an alternative to it), to use instead of the dummy EmptyParser for undetected files. This is preliminary work (you will see many TODOs), inspired by the work on TesseractOCRParser. The patch is attached.

I created a GitHub repository to share the code. As a first test, you can clone the repo, build the code with the build.sh script, and then run the parser with the run.sh script on some govdocs1 files (taken from the "016" subset) that are detected as application/octet-stream. The run.sh script launches a simple StringsTest class for testing.

I hope you will find the StringsParser a good solution for extracting ASCII strings from undetected file types. As far as I understand, many "sophisticated" forensics tools work in a similar way for indexing purposes: they run some form of the strings command against files whose type they cannot detect.
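To illustrate the kind of output a strings-style pass produces, here is a minimal sketch in Python. It is not the code from the patch (the StringsParser invokes the external strings command); the function name extract_ascii_strings, the minimum run length of 4, and the choice of the printable range from space to tilde are assumptions made for this illustration.

```python
import re
from typing import List


def extract_ascii_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII bytes that are at least min_length long.

    Hypothetical helper for illustration only; it approximates what a
    strings-like tool reports for a binary blob.
    """
    # 0x20 (space) through 0x7e (tilde) is the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


if __name__ == "__main__":
    blob = b"\x00\x01Hello, Tika\x00ab\x02dBASE III\xff"
    for text in extract_ascii_strings(blob):
        print(text)
```

Short runs such as "ab" in the sample blob are dropped because they fall below the minimum length, which is what keeps the extracted text mostly free of noise.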

In addition to running strings on undetected files, the StringsParser runs the file command on them and writes its output to the strings:file_output property (I noticed that the file command can sometimes detect the media type of documents that Tika does not detect).
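The idea of recording the file command's output as a property can be sketched as follows. This is a Python illustration, not the Java code from the patch: the function name add_file_output, the plain dict standing in for Tika's metadata object, the command parameter, and the timeout value are all assumptions. Only the strings:file_output property name comes from the description above.

```python
import subprocess
from typing import Dict, Sequence

# Property name taken from the issue description.
FILE_OUTPUT_KEY = "strings:file_output"


def add_file_output(
    path: str,
    metadata: Dict[str, str],
    command: Sequence[str] = ("file",),
    timeout: float = 10.0,
) -> Dict[str, str]:
    """Run `command path` and store its stdout under strings:file_output.

    Hypothetical helper: `metadata` is a plain dict used here in place of a
    real metadata object. If the command is missing, times out, or fails,
    the metadata is returned unchanged.
    """
    try:
        result = subprocess.run(
            [*command, path],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return metadata
    output = result.stdout.strip()
    if result.returncode == 0 and output:
        metadata[FILE_OUTPUT_KEY] = output
    return metadata
```

Leaving the metadata untouched when the command is unavailable keeps the sketch usable on systems without the file utility, such as a default Windows installation.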

Finally, you can find an old discussion about this topic here. Thanks chrismattmann.

Attachments

- testOCTET_header.dbase3 (08/Feb/15 02:24, 0.2 kB, Giuseppe Totaro)
- TIKA-1541.patch (05/Feb/15 23:33, 11 kB, Giuseppe Totaro)
- TIKA-1541.TotaroMattmann.020615.patch.txt (07/Feb/15 17:20, 12 kB, Chris A. Mattmann)
- TIKA-1541.TotaroMattmann.020615.patch.txt (07/Feb/15 02:34, 12 kB, Chris A. Mattmann)
- TIKA-1541.TotaroMattmannBurchNassif.020715.patch (08/Feb/15 02:24, 24 kB, Giuseppe Totaro)
- TIKA-1541.TotaroMattmannBurchNassif.020815.patch (09/Feb/15 07:59, 24 kB, Giuseppe Totaro)
- TIKA-1541.TotaroMattmannBurchNassif.020915.patch (10/Feb/15 08:04, 24 kB, Giuseppe Totaro)
- TIKA-1541.v02.02182015.patch (19/Feb/15 05:56, 14 kB, Giuseppe Totaro)

Issue Links

is superseded by

TIKA-1483 Create a Latin1 charset raw string parser

Resolved

People

Assignee: Chris A. Mattmann

Reporter: Giuseppe Totaro

Votes: 0

Watchers: 5

Dates

Created: 05/Feb/15 23:13

Updated: 29/Oct/19 15:52

Resolved: 10/Feb/15 23:30