  1. Tika
  2. TIKA-1876

Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.13
    • Component/s: parser
    • Labels:
      None

      Description

      Hi all,

      Apache Tika already performs Named Entity Recognition (NER) using OpenNLP and Stanford CoreNLP. The Natural Language Toolkit (NLTK) is another open source Python library, and I believe it would be a great idea to integrate it with Tika as well.
      NLTK can extract named entities as well as classify them. For this purpose, Prof. Chris Mattmann and I have published NLTKRest, a Python module installable via pip/setuptools that exposes NLTK as a REST service.

      I have tested Tika together with NLTKRest in my local repository and will submit a pull request soon.
      Link to the REST server: https://github.com/manalishah/NLTKRest
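
To illustrate the idea of talking to NLTK over REST, here is a minimal client sketch. It is not taken from NLTKRest or from the Tika patch: the endpoint URL (host, port, and path), the use of POST with a plain-text body, and the response shape (a JSON object mapping each entity type to a list of names) are all assumptions made for illustration, so check the NLTKRest README for the actual interface.

```python
import json
import urllib.request
from collections import defaultdict

# Hypothetical endpoint: host, port, and path are assumptions for
# illustration, not values confirmed by this issue.
NLTK_REST_URL = "http://localhost:8881/nltk"


def group_entities(response_body):
    """Convert a JSON object of {entity_type: [names]} into {entity_type: set(names)}.

    The response shape is an assumption. Values that are not lists
    (for example a status string) are skipped.
    """
    grouped = defaultdict(set)
    for entity_type, names in json.loads(response_body).items():
        if isinstance(names, list):
            grouped[entity_type].update(str(name) for name in names)
    return dict(grouped)


def recognise(text, url=NLTK_REST_URL, timeout=10):
    """POST raw text to the (assumed) REST endpoint and group the reply."""
    request = urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return group_entities(response.read().decode("utf-8"))
```

With a server running, `recognise("Some text mentioning people and places")` would return the grouped entities. Collapsing each entity type into a set of unique names is a design choice of this sketch; it keeps repeated mentions from appearing more than once.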

        Activity

        githubbot ASF GitHub Bot added a comment -

        GitHub user manalishah opened a pull request:

        https://github.com/apache/tika/pull/80

        Integrate NLTK with Tika fix for TIKA-1876 contributed by manalishah

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/manalishah/tika TIKA-1876

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tika/pull/80.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #80


        commit c809690ec87ffa600018dbc5eee6d6756645adb0
        Author: manali <manalishah.91@gmail.com>
        Date: 2016-02-27T03:58:06Z

        fix for TIKA-1876 contributed by manalishah

        commit 3a7e24c9a5d77ae41bde0c2106233a2064c5e707
        Author: manali <manalishah.91@gmail.com>
        Date: 2016-02-27T04:00:05Z

        fix for TIKA-1876 contributed by manalishah

        commit 114d0ff24bd04395852012a3382d50c3e906e6db
        Author: manali <manalishah.91@gmail.com>
        Date: 2016-02-27T04:06:20Z

        fix for TIKA-1876 contributed by manalishah

        commit cdb684d9c1b0ebb01a783180f07417760fa04d6f
        Author: manali <manalishah.91@gmail.com>
        Date: 2016-02-27T10:10:06Z

        fix for TIKA-1876 contributed by manalishah


        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tika/pull/80

        chrismattmann Chris A. Mattmann added a comment -

        Thanks, Manali Shah, I integrated this!

        [mattmann-0420740:~/tmp/tika1.13] mattmann% git push -u origin master
        Counting objects: 452, done.
        Delta compression using up to 4 threads.
        Compressing objects: 100% (140/140), done.
        Writing objects: 100% (321/321), 29.69 KiB | 0 bytes/s, done.
        Total 321 (delta 93), reused 274 (delta 67)
        To https://git-wip-us.apache.org/repos/asf/tika.git
           7c245fa..9056894  master -> master
        Branch master set up to track remote branch master from origin.
        [mattmann-0420740:~/tmp/tika1.13] mattmann% 
        
        hudson Hudson added a comment -

        UNSTABLE: Integrated in tika-trunk-jdk1.7 #917 (See https://builds.apache.org/job/tika-trunk-jdk1.7/917/)

        fix for TIKA-1876 contributed by manalishah (manalishah.91: rev a13369b098bea09421e35023c131adc092dcb6e4)
        • tika-parsers/src/test/java/org/apache/tika/parser/ner/nltk/NLTKNERecogniserTest.java
        • tika-parsers/src/main/java/org/apache/tika/parser/ner/nltk/NLTKNERecogniser.java
        • tika-parsers/src/main/resources/org/apache/tika/parser/ner/nltk/NLTKServer.properties

        fix for TIKA-1876 contributed by manalishah (manalishah.91: rev 7ebe007ec03088449f67619ef1e6cb564178b14b)
        • tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java
        • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        • tika-server/src/main/java/org/apache/tika/server/RichTextContentHandler.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
        • tika-parsers/src/main/java/org/apache/tika/parser/ner/NERecogniser.java
        • tika-core/src/main/java/org/apache/tika/mime/MimeType.java
        • CHANGES.txt
        • tika-server/src/main/java/org/apache/tika/server/resource/UnpackerResource.java

        fix for TIKA-1876 contributed by manalishah (manalishah.91: rev c809690ec87ffa600018dbc5eee6d6756645adb0)
        • .gitignore
        • tika-parsers/src/main/resources/org/apache/tika/parser/ner/nltk/NLTKServer.properties
        • tika-parsers/src/main/java/org/apache/tika/parser/ner/nltk/NLTKNERecogniser.java
        • tika-parsers/src/test/java/org/apache/tika/parser/ner/nltk/NLTKNERecogniserTest.java

        fix for TIKA-1876 contributed by manalishah (manalishah.91: rev 3a7e24c9a5d77ae41bde0c2106233a2064c5e707)
        • .gitignore

        fix for TIKA-1876 contributed by manalishah (manalishah.91: rev 114d0ff24bd04395852012a3382d50c3e906e6db)
        • tika-parsers/pom.xml

        fix for TIKA-1876 contributed by manalishah (manalishah.91: rev cdb684d9c1b0ebb01a783180f07417760fa04d6f)
        • tika-parsers/src/main/java/org/apache/tika/parser/ner/nltk/NLTKNERecogniser.java

        Fix for TIKA-1876 Integrate Natural Language Toolkit (NLTK) into Tika (mattmann: rev 3fbc03cead1c54bd023a19e52e31609b51926d7d)
        • CHANGES.txt

          People

          • Assignee: Chris A. Mattmann (chrismattmann)
          • Reporter: Manali Shah (manalishah.91@gmail.com)
          • Votes: 0
          • Watchers: 4

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              • Original Estimate: 168h
              • Remaining Estimate: 168h
              • Time Spent: Not Specified
