TIKA-1723

Integrate language-detector into Tika

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.13
    • Component/s: languageidentifier
    • Labels: None
    • Flags: Patch

      Description

      The language-detector project at https://github.com/optimaize/language-detector is faster, supports more languages (70 vs. 13), and is more accurate than Tika's built-in language detector.

      This is a first stab at integrating it, with some initial findings. The integration raises a number of issues, especially if Chris A. Mattmann moves forward with turning language detection into a pluggable extension point.
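
      To make the "pluggable extension point" idea concrete, here is a minimal sketch of what such a design could look like. Everything in it is hypothetical: `LanguageDetectorPlugin`, `StopwordDetector`, and `DetectorRegistry` are illustrative names, not Tika's or language-detector's actual API, and the stopword matcher is only a toy stand-in for a real detector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical extension-point interface (an assumption, not Tika's API).
interface LanguageDetectorPlugin {
    // Returns a language code, or empty if this detector cannot decide.
    Optional<String> detect(String text);
}

// Toy detector: reports its language if the text contains any of its stopwords.
// A real plugin would wrap an actual detection library instead.
final class StopwordDetector implements LanguageDetectorPlugin {
    private final String language;
    private final Set<String> stopwords;

    StopwordDetector(String language, String... stopwords) {
        this.language = language;
        this.stopwords = new HashSet<>(Arrays.asList(stopwords));
    }

    @Override
    public Optional<String> detect(String text) {
        // Lowercase, then split on runs of non-letter characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (stopwords.contains(token)) {
                return Optional.of(language);
            }
        }
        return Optional.empty();
    }
}

// Registry that tries each registered plugin in order and returns the first answer.
final class DetectorRegistry {
    private final List<LanguageDetectorPlugin> detectors = new ArrayList<>();

    void register(LanguageDetectorPlugin detector) {
        detectors.add(detector);
    }

    Optional<String> detect(String text) {
        for (LanguageDetectorPlugin detector : detectors) {
            Optional<String> result = detector.detect(text);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

      Callers would depend only on the interface, so swapping the built-in detector for another implementation becomes a registration change. In a real setup the manual `register` calls could plausibly be replaced by discovery through `java.util.ServiceLoader`, but that is a design assumption rather than something this issue specifies.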

      I'll add comments with results below.

        Attachments

        1. TIKA-1723.patch
          74 kB
          Ken Krugler
        2. TIKA-1723-2.patch
          109 kB
          Ken Krugler
        3. TIKA-1723-3.patch
          110 kB
          Ken Krugler
        4. TIKA-1723v2.patch
          770 kB
          Tim Allison

              People

              • Assignee: Ken Krugler (kkrugler)
              • Reporter: Ken Krugler (kkrugler)
              • Votes: 1
              • Watchers: 6
