  1. Tika
  2. TIKA-1723

Integrate language-detector into Tika

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.13
    • Component/s: languageidentifier
    • Labels: None
    • Flags: Patch

    Description

      The language-detector project at https://github.com/optimaize/language-detector is faster, supports more languages (70 vs. 13), and is more accurate than the built-in language detector.

      This is a stab at integrating it, with some initial findings. The integration raises a number of issues, especially if chrismattmann moves forward with turning language detection into a pluggable extension point.

      I'll add comments with results below.

      Attachments

        1. TIKA-1723v2.patch
          770 kB
          Tim Allison
        2. TIKA-1723-3.patch
          110 kB
          Kenneth William Krugler
        3. TIKA-1723-2.patch
          109 kB
          Kenneth William Krugler
        4. TIKA-1723.patch
          74 kB
          Kenneth William Krugler

            People

              Assignee: Kenneth William Krugler (kkrugler)
              Reporter: Kenneth William Krugler (kkrugler)
              Votes: 1
              Watchers: 6
