Tika / TIKA-574

Support for IBM866 (CP866) encoding in TXTParser

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8, 0.9, 0.10
    • Component/s: parser
    • Labels:
      None
    • Environment:

      GNU/Linux 2.6.35-23, openjdk6

      Description

      There is no recognizer for CP866 (the DOS Russian encoding) in Tika yet.

      1. TIKA-574.patch
        8 kB
        Maxim Valyanskiy
      2. tika-0.8-cp866.patch
        6 kB
        Konstantin Gribov

        Activity

        Konstantin Gribov added a comment -

        I reused the n-grams from cp1251 and wrote a custom byteMap. All Russian letters used in cp1251 are present in cp866, so no changes to the n-grams are needed.

        I added an inner static class in CharsetRecog_sbcs and modified CharsetDetector#createRecognizers to register it.
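
The byteMap idea can be illustrated with a minimal sketch. It is not the code from the attached patch: the class name `Cp866ByteMapSketch` and the method `buildByteMap` are hypothetical, and the normalization rule (letters lowercased, everything else collapsed to a space) is an assumption about how such a table is typically built. It also assumes the JVM provides the `IBM866` and `windows-1251` charsets. The table translates each CP866 byte into the corresponding lowercase windows-1251 byte, which is why n-grams expressed in cp1251 byte values can be reused unchanged.

```java
import java.nio.charset.Charset;

// Hypothetical illustration; not the class added by the patch.
final class Cp866ByteMapSketch {

    // Builds a 256-entry table: CP866 byte -> lowercase windows-1251 byte.
    // Bytes that are not letters (digits, punctuation, box-drawing
    // characters) are mapped to a space, 0x20. This rule is an assumption.
    static byte[] buildByteMap() {
        Charset cp866 = Charset.forName("IBM866");
        Charset cp1251 = Charset.forName("windows-1251");
        byte[] map = new byte[256];
        for (int b = 0; b < 256; b++) {
            map[b] = (byte) 0x20;
            // Every byte of a single-byte charset decodes to exactly one char.
            char c = new String(new byte[] {(byte) b}, cp866).charAt(0);
            if (!Character.isLetter(c)) {
                continue;
            }
            char lower = Character.toLowerCase(c);
            byte[] encoded = String.valueOf(lower).getBytes(cp1251);
            // Keep the mapping only if the letter survives a round trip,
            // i.e. it really exists in windows-1251.
            if (encoded.length == 1
                    && new String(encoded, cp1251).charAt(0) == lower) {
                map[b] = encoded[0];
            }
        }
        return map;
    }

    public static void main(String[] args) {
        byte[] map = buildByteMap();
        // 0x80 is the Cyrillic capital A in CP866.
        System.out.printf("CP866 0x80 -> cp1251 0x%02X%n", map[0x80] & 0xFF);
    }
}
```

With a table like this, a recognizer can run the input bytes through the map before matching, so the cp1251 n-gram list needs no CP866-specific copy.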

        Maxim Valyanskiy added a comment -

        Thank you. I added a unit test for this issue.

        Maxim Valyanskiy added a comment -

        Thank you. Committed in r1050348.


          People

          • Assignee:
            Unassigned
            Reporter:
            Konstantin Gribov
          • Votes:
            0
            Watchers:
            1
