Tika / TIKA-1541

StringsParser: a simple strings-based parser for Tika

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8
    • Component/s: parser
    • Labels: None

    Description

      I have written an extremely simple implementation of StringsParser, a parser based on the strings command (or an alternative to it), to be used instead of the dummy EmptyParser for undetected files. It is preliminary work (you will see many TODOs), inspired by the work on TesseractOCRParser. The patch is attached.

      I created a GitHub repository to share the code. As a first test, you can clone the repo, build the code with the build.sh script, and then run the parser with the run.sh script on some govdocs1 files (taken from the "016" subset) that are detected as application/octet-stream. The run.sh script launches a simple StringsTest class for testing.

      I hope you will find StringsParser a good solution for extracting ASCII strings from undetected file types. As far as I understand, many "sophisticated" forensics tools work in a similar way for indexing purposes: they run some form of the strings command against files whose type they cannot detect.
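
      To illustrate what "extracting ASCII strings" means here, the following is a minimal pure-Java sketch of the idea behind the strings command. It is not the code from the patch (the patch invokes the external command); the class name AsciiStringsSketch, the extract helper, and the minimum run length of 4 are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of the TIKA-1541 patch.
final class AsciiStringsSketch {

    // Returns every run of at least minLength printable ASCII bytes
    // (0x20..0x7E plus tab), in the order they appear in the data.
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            boolean printable = (c >= 0x20 && c <= 0x7E) || c == '\t';
            if (printable) {
                run.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (run.length() >= minLength) {
                    found.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Flush a run that reaches the end of the data.
        if (run.length() >= minLength) {
            found.add(run.toString());
        }
        return found;
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x01, 'T', 'i', 'k', 'a', 0x00,
                         'a', 'b', (byte) 0xFF, 'p', 'a', 'r', 's', 'e', 'r'};
        // "ab" is shorter than the minimum length, so it is dropped.
        System.out.println(extract(sample, 4)); // prints [Tika, parser]
    }
}
```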

      In addition to running strings on undetected files, StringsParser also runs the file command on them and writes its output to the strings:file_output property (I noticed that the file command can sometimes detect the media type of documents that Tika does not detect).
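
      The file-command step could be sketched roughly as follows. This is an assumption-laden illustration rather than the patch itself: a plain Map stands in for Tika's metadata object, the class FileCommandSketch and its run and describe helpers are hypothetical, and the -b (brief) flag is a choice made for this example. Only the strings:file_output property name comes from the description above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration, not part of the TIKA-1541 patch.
final class FileCommandSketch {

    // Property name taken from the issue description.
    static final String FILE_OUTPUT_KEY = "strings:file_output";

    // Runs an external command and returns its combined stdout/stderr,
    // or an empty Optional if the command cannot be started.
    static Optional<String> run(List<String> command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            p.getOutputStream().close(); // the command gets no stdin
            byte[] out = p.getInputStream().readAllBytes();
            p.waitFor();
            return Optional.of(new String(out, StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            // Typically means the command is not installed.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    // Stores the output of `file -b <path>` under strings:file_output.
    // The map is an assumed stand-in for Tika's metadata.
    static void describe(String path, Map<String, String> metadata) {
        run(List.of("file", "-b", path))
                .ifPresent(out -> metadata.put(FILE_OUTPUT_KEY, out));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (args.length > 0) {
            describe(args[0], metadata);
        }
        System.out.println(metadata);
    }
}
```

      If the file command is not installed, run returns an empty Optional and the property is simply left unset, so the caller can fall back to whatever Tika detected.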

      Finally, you can find an old discussion about this topic here. Thanks chrismattmann.

      Attachments

        1. TIKA-1541.patch
          11 kB
          Giuseppe Totaro
        2. TIKA-1541.TotaroMattmann.020615.patch.txt
          12 kB
          Chris A. Mattmann
        3. TIKA-1541.TotaroMattmann.020615.patch.txt
          12 kB
          Chris A. Mattmann
        4. testOCTET_header.dbase3
          0.2 kB
          Giuseppe Totaro
        5. TIKA-1541.TotaroMattmannBurchNassif.020715.patch
          24 kB
          Giuseppe Totaro
        6. TIKA-1541.TotaroMattmannBurchNassif.020815.patch
          24 kB
          Giuseppe Totaro
        7. TIKA-1541.TotaroMattmannBurchNassif.020915.patch
          24 kB
          Giuseppe Totaro
        8. TIKA-1541.v02.02182015.patch
          14 kB
          Giuseppe Totaro

            People

              Assignee: Chris A. Mattmann (chrismattmann)
              Reporter: Giuseppe Totaro (gostep)
              Votes: 0
              Watchers: 5
