Tika / TIKA-40

Tika needs to support diverse character encodings.

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1-incubating
    • Fix Version/s: 0.1-incubating
    • Component/s: general
    • Labels: None

    Description

      Currently, the text parser implementation uses the default encoding of the Java runtime when instantiating a Reader for the passed input stream. We need to support other encodings as well.
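
To illustrate why relying on the runtime default is a problem, here is a minimal sketch (not taken from the Tika source) that decodes the same bytes through a Reader with two different charsets. The class name `DefaultEncodingDemo` and the method `decode` are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of Tika.
final class DefaultEncodingDemo {

    // Decodes the bytes through a Reader bound to the given charset.
    // Passing Charset.defaultCharset() behaves like the one-argument
    // InputStreamReader constructor, which uses the runtime default.
    static String decode(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final letter.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals("caf\u00e9"));      // true
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals("caf\u00e9")); // false
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

Whether the default charset happens to match the document depends on the machine running the parser, so the same file can yield different text on different systems.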

      It would be helpful if the parse method accepted an encoding as a parameter.
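
One possible shape for such a method is sketched below. This is an assumption for illustration only: the class `EncodingAwareParseSketch`, the `parse(InputStream, String)` signature, and the rule that a null encoding falls back to the runtime default are all hypothetical and do not describe Tika's actual API or the attached patch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical sketch; names and signature are not Tika's API.
final class EncodingAwareParseSketch {

    // Extracts text from the stream using the named encoding.
    // A null encoding falls back to the runtime default (an assumed
    // convention for this sketch). Charset.forName throws an unchecked
    // exception if the name is illegal or unsupported.
    static String parse(InputStream stream, String encoding) {
        Charset charset = (encoding == null)
                ? Charset.defaultCharset()
                : Charset.forName(encoding);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(stream, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

A caller that knows the document's encoding could then write, for example, `EncodingAwareParseSketch.parse(stream, "UTF-16BE")`, while callers that pass null would keep the existing default-encoding behavior.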

      Ideally, Tika would also be able to determine the encoding automatically from the data stream. Unicode files may have byte order marks (see http://unicode.org/faq/utf_bom.html#BOM), but I don't know whether other encodings can be inferred from content.
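
As a rough illustration of the byte order mark case only, the following sketch inspects the first bytes of a stream and maps a recognized BOM to a charset. The class `BomSniffer` and the method `detect` are hypothetical names. For brevity the sketch covers only UTF-8 and UTF-16, so it ignores the UTF-32 marks and would misreport a UTF-32LE stream, whose BOM begins with the same two bytes as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it is not part of Tika.
final class BomSniffer {

    // Returns the charset indicated by a leading byte order mark,
    // or null if the data does not start with a recognized BOM.
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Compares the leading bytes (as unsigned values) with the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A null result only means that no BOM was found, so the caller would still need an explicit or default encoding in that case. Note also that Java's UTF-8, UTF-16BE, and UTF-16LE decoders do not strip the mark, so a caller would likely want to skip the BOM bytes before decoding.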

      Attachments

        1. TIKA-40.patch
          6 kB
          Jukka Zitting

        Activity

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Keith Bennett (kbennett)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: