  1. Tika
  2. TIKA-1696

Language Identification with Text Processing Toolkit from MITLL

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.13
    • Component/s: languageidentifier
    • Labels:
      None

      Description

      The aim here is to extend the methods available for language identification within text. MIT Lincoln Labs (MITLL) has an open source library [1] written in Julia. Having spoken with the MITLL guys, I understand there may also be a Scala version of this library, which would make it easier to package with Tika.

      At this point I'm not sure how many languages this library supports by default, but it can be extended when provided with training data.

      [1] https://github.com/mit-nlp/Text.jl

          Activity

          kkrugler Ken Krugler added a comment -

          Hi Paul - see https://issues.apache.org/jira/browse/TIKA-369 for a lengthy discussion of possible improvements to language detection.

          I started a new open source project to improve on what's available in language-detector (see https://github.com/kkrugler/yalder), and found that the latest version of language-detector is pretty darn good... I could beat it on speed or accuracy, but for both it was about equal.

          So I'd be interested in finding out where/how the MITLL improves on the current version of language-detector, as that's a known Java-based solution that covers a lot of languages and has good performance/accuracy.

          pramirez Paul Ramirez added a comment -

          Ken, thanks for the fast feedback and references. I've not dug into this much, so it may take a couple of weeks to get something up here to test. As I dig into this I'll update the Jira issue with more details to help drive discussion. I'll also look to get the MITLL guys posting here, as they would be better able to describe the details.

          What wasn't clear on TIKA-369 is whether yalder was going to come back into Tika. The intent here is to get to a patch integrating their code so that it can be tested in the same way that Tika's current approach was tested. Hopefully that patch will help answer the questions above.

          They are forwarding me some research papers so I can come up to speed on this too; as I gain knowledge I'll flesh out the details here.

          Do you think this should instead happen on TIKA-369?
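
The speed and accuracy comparison raised in this thread could be set up with a small harness along the following lines. This is only a sketch in Python, not Tika's actual test setup: `evaluate`, `stopword_detector`, and the four sample sentences are hypothetical stand-ins, and a real comparison would wrap each candidate detector in the same `detect(text) -> language code` shape and run it over a much larger labeled corpus.

```python
import time


def evaluate(detect, samples):
    """Return (accuracy, seconds_per_sample) for a detector on labeled samples.

    `detect` is any callable mapping a text string to a language code;
    `samples` is a list of (text, expected_language_code) pairs.
    """
    correct = 0
    start = time.perf_counter()
    for text, expected in samples:
        if detect(text) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(samples), elapsed / len(samples)


def stopword_detector(text):
    """Hypothetical toy detector: counts a few English vs. German stopwords."""
    words = set(text.lower().split())
    english_hits = len(words & {"the", "is", "and", "of"})
    german_hits = len(words & {"der", "ist", "und", "das"})
    return "en" if english_hits >= german_hits else "de"


# Tiny illustrative data set (an assumption, not a real benchmark).
SAMPLES = [
    ("the cat is on the roof", "en"),
    ("das ist der hund", "de"),
    ("wine and cheese", "en"),
    ("brot und wasser", "de"),
]

if __name__ == "__main__":
    accuracy, seconds = evaluate(stopword_detector, SAMPLES)
    print(f"accuracy={accuracy:.2f}, seconds per sample={seconds:.6f}")
```

Running every detector through the same function keeps the accuracy and timing numbers comparable, which is the point of testing a new patch the same way the current approach was tested.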

          chris.a.mattmann@jpl.nasa.gov Mattmann, Chris A (388J) added a comment -

          It's fine to discuss this on TIKA-1696.

          pramirez Paul Ramirez added a comment -

          The algorithm that is used is described here:

          https://en.wikipedia.org/wiki/Margin_Infused_Relaxed_Algorithm
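
For readers unfamiliar with it, MIRA is an online learner: for each training example it makes the smallest weight change that gives the correct class a margin over the best-scoring wrong class. The following is a minimal Python sketch of the single-best multiclass update, with the step size clipped at a cap `c` as is commonly done. It is not taken from Text.jl; the character-trigram features, the toy training sentences, and the names `char_ngrams`, `MiraClassifier`, and `train` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Sparse feature vector: counts of overlapping character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def score(weights, feats):
    """Dot product between one class's weight vector and the features."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


class MiraClassifier:
    def __init__(self, labels, c=1.0):
        self.c = c  # cap on the step size (assumed hyperparameter)
        self.weights = {label: defaultdict(float) for label in labels}

    def predict(self, feats):
        return max(self.weights, key=lambda lab: score(self.weights[lab], feats))

    def update(self, feats, gold):
        if not feats:
            return
        # Best-scoring class other than the correct one.
        rival = max((lab for lab in self.weights if lab != gold),
                    key=lambda lab: score(self.weights[lab], feats))
        margin = score(self.weights[gold], feats) - score(self.weights[rival], feats)
        loss = 1.0 - margin
        if loss <= 0.0:
            return  # margin already satisfied, leave the weights alone
        norm_sq = sum(v * v for v in feats.values())
        # Smallest step that restores a margin of 1, clipped at c.
        tau = min(self.c, loss / (2.0 * norm_sq))
        for f, v in feats.items():
            self.weights[gold][f] += tau * v
            self.weights[rival][f] -= tau * v


# Toy training data (an assumption; real models need far more text per language).
TRAIN = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("this is a short sentence written in english", "en"),
    ("der schnelle braune fuchs springt auf den faulen hund", "de"),
    ("dies ist ein kurzer satz auf deutsch", "de"),
]


def train(samples, epochs=5, c=1.0):
    clf = MiraClassifier({lab for _, lab in samples}, c)
    for _ in range(epochs):
        for text, lab in samples:
            clf.update(char_ngrams(text), lab)
    return clf


if __name__ == "__main__":
    clf = train(TRAIN)
    print(clf.predict(char_ngrams("the lazy dog")))
```

Adding a language in this sketch only requires more labeled samples, since the label set is derived from the training data.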

          davemeikle Dave Meikle added a comment -
          • Pushed to 1.11 following 1.10 release
          chrismattmann Chris A. Mattmann added a comment -

          Trevor Lewis FYI

          chrismattmann Chris A. Mattmann added a comment -

          Paul Ramirez Trevor Lewis where do we stand with this?

          pramirez Paul Ramirez added a comment -

          Trevor has a patch to make this work with Tika 1.11. He mentioned that he posted the patch, but I'm not seeing it here. I'll hit him up, as it may just be that he posted it in his GitHub repo.

          lewistre Trevor Lewis added a comment -

          Actually, I am working on making it work with Tika 1.13 and I created a pull request.

          This is the link to the Text.jl REST Server Repo:
          https://github.com/trevorlewis/TEXT-Language-REST

          chrismattmann Chris A. Mattmann added a comment -

          This is now done. Ken's Optimaize langdetect, N-gram langdetect and Text.jl from MIT are all now integrated:

          LMC-053601:tika1.13 mattmann$ git commit -m "Resolve conflicts in CHANGES.txt"
          [master 2caf3da] Resolve conflicts in CHANGES.txt
          LMC-053601:tika1.13 mattmann$ git push -u origin master
          Counting objects: 477, done.
          Delta compression using up to 8 threads.
          Compressing objects: 100% (237/237), done.
          Writing objects: 100% (477/477), 113.91 KiB | 0 bytes/s, done.
          Total 477 (delta 134), reused 320 (delta 67)
          remote: tika git commit: Resolve conflicts in CHANGES.txt
          remote: tika git commit: Update with information about TIKA-1872, TIKA-1696 and TIKA-1723.
          remote: tika git commit: Merge branch 'TIKA-1872'
          remote: tika git commit: Merge branch 'TIKA-1872' of https://github.com/trevorlewis/tika into TIKA-1872
          remote: tika git commit: Updated TextLangDetector and fixed build errors
          remote: tika git commit: Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tika into TIKA-1872
          remote: tika git commit: Depend on 1.13-SNAPSHOT, not 2.0.
          remote: tika git commit: Merge branch 'TIKA-1872' of https://github.com/trevorlewis/tika into TIKA-1872
          remote: tika git commit: Added missing license headers
          remote: tika git commit: Add missing license headers
          remote: tika git commit: fix for TIKA-1872 contributed by trevorlewis
          remote: tika git commit: Make detector "discoverable", use that everywhere
          remote: tika git commit: Move base lang detect classes to core
          remote: tika git commit: Remove built-in lang detector
          remote: tika git commit: Add tika-langdetect dependency in other modules
          remote: tika git commit: Add project.build.sourceEncoding to properties
          remote: tika git commit: Roll in new lang detect support in new module
          remote: tika git commit: Add missing dependency on tika-test-resources
          To https://git-wip-us.apache.org/repos/asf/tika.git
             c9d508d..2caf3da  master -> master
          Branch master set up to track remote
          

          Thanks Ken Krugler and Trevor Lewis!


            People

            • Assignee: chrismattmann Chris A. Mattmann
            • Reporter: pramirez Paul Ramirez
            • Votes: 0
            • Watchers: 6
