Tika / TIKA-1723

Integrate language-detector into Tika


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.13
    • Component/s: languageidentifier
    • Labels: None
    • Flags: Patch

    Description

      The language-detector project at https://github.com/optimaize/language-detector is faster, supports more languages (70 vs. 13), and is more accurate than Tika's built-in language detector.

      This is a first attempt at integrating it, with some initial findings. The integration raises a number of issues, especially if Chris A. Mattmann moves forward with turning language detection into a pluggable extension point.

      I'll add comments with results below.
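
      To make the "pluggable extension point" idea concrete, here is a minimal sketch of what such a design might look like. Every name in it (LanguageDetectorBackend, StopwordBackend, LanguageDetectorRegistry) is hypothetical and is not Tika's or language-detector's actual API. The stopword-counting backend is only a toy stand-in so the example runs on its own; a real backend would wrap the built-in detector or language-detector instead.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical extension point; not Tika's actual API. */
interface LanguageDetectorBackend {
    /** Returns a language code such as "en", or "unknown" if undecided. */
    String detect(String text);
}

/** Toy backend (assumption for illustration): picks the language whose stopwords occur most often. */
final class StopwordBackend implements LanguageDetectorBackend {
    private final Map<String, List<String>> stopwords = new LinkedHashMap<>();

    StopwordBackend() {
        stopwords.put("en", Arrays.asList("the", "and", "is", "of"));
        stopwords.put("de", Arrays.asList("der", "und", "ist", "das"));
    }

    @Override
    public String detect(String text) {
        // Lower-case and split on runs of non-word characters.
        List<String> tokens = Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+"));
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, List<String>> entry : stopwords.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: ties keep the earlier language, zero hits stay "unknown".
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}

/** Hypothetical registry that lets callers select a backend by name. */
final class LanguageDetectorRegistry {
    private final Map<String, LanguageDetectorBackend> backends = new LinkedHashMap<>();

    void register(String name, LanguageDetectorBackend backend) {
        backends.put(name, backend);
    }

    String detect(String backendName, String text) {
        LanguageDetectorBackend backend = backends.get(backendName);
        if (backend == null) {
            throw new IllegalArgumentException("No such backend: " + backendName);
        }
        return backend.detect(text);
    }
}
```

      With this shape, callers depend only on the interface, so swapping the built-in detector for another implementation would be a registration change rather than a change to calling code.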

      Attachments

        1. TIKA-1723v2.patch (770 kB, Tim Allison)
        2. TIKA-1723-3.patch (110 kB, Kenneth William Krugler)
        3. TIKA-1723-2.patch (109 kB, Kenneth William Krugler)
        4. TIKA-1723.patch (74 kB, Kenneth William Krugler)



          People

            Assignee: Kenneth William Krugler (kkrugler)
            Reporter: Kenneth William Krugler (kkrugler)
            Votes: 1
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved:
