Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-1483

Create a Latin1 charset raw string parser

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.6
    • 1.8
    • parser
    • None

    Description

      I think it can be very useful adding a general parser able to extract raw strings from files (like the strings command), which can be used as the fallback parser for all mimetypes not having a specific parser implementation, like application/octet-stream. It can also be used as a fallback for corrupt files throwing a TikaException.

      It must be configured with the script/language to be extracted from the files (currently I implemented one specific for Latin1).
      It can use heuristics to extract strings encoded with different charsets within the same file, mainly the common ISO-8859-1, UTF8 and UTF16.

      What the community thinks about that?

      Attachments

        1. TIKA-1483_v2.patch
          13 kB
          Luís Filipe Nassif
        2. TIKA-1483.patch
          13 kB
          Luís Filipe Nassif

        Issue Links

          Activity

            People

              chrismattmann Chris A. Mattmann
              lfcnassif Luís Filipe Nassif
              Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: