
Include Stanford Name Entity Recognition in Tika

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12
    • Fix Version/s: 1.12
    • Component/s: mime, parser
    • Labels:
    • Environment:

      Java 1.8, Mac OSX 10.11

    • Flags:
      Patch

      Description

      Using the Stanford Named Entity Recognizer (NER), Tika will be able to extract named entities such as PERSON, ORGANIZATION, and LOCATION from the given text. The extracted named entities will be added to the metadata.

        Activity

        githubbot ASF GitHub Bot added a comment -

        GitHub user TaichiHo opened a pull request:

        https://github.com/apache/tika/pull/62

        fix for TIKA-1787 contributed by Yueheng He

        Builds successfully with Java 1.8.0_65.
        To see the effect, create a text file like the following:
        ```
        Good afternoon Rajat Raina, how are you today? Hi, I am Tom Brady. I go to school at Stanford University, which is located in California.
        ```
        Save it as test.ner and feed it to Tika:
        ```
        java -classpath tika-app/target/tika-app-1.12-SNAPSHOT.jar org.apache.tika.cli.TikaCLI -m test.ner
        ```
        The result should look like this:
        ```
        Content-Length: 137
        Content-Type: application/stanford-ner
        LOCATION: [California]
        ORGANIZATION: [Stanford University]
        PERSON: [Rajat Raina, Tom Brady]
        X-Parsed-By: org.apache.tika.parser.DefaultParser
        X-Parsed-By: org.apache.tika.parser.stanfordNer.StanfordNerParser
        resourceName: test.ner
        ```
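If the metadata printed by `-m` needs to be consumed by another program, the `Key: value` lines shown above can be read back into a map. The following is only a minimal sketch: `MetadataOutput` is a hypothetical helper, not part of Tika, and it assumes the bracketed list format from the example output and entity names that contain no commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Tika) for reading "Key: value" metadata lines. */
final class MetadataOutput {

    /** Groups values by key; repeated keys such as X-Parsed-By keep every value. */
    static Map<String, List<String>> parse(String output) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue; // skip blank or malformed lines
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            result.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return result;
    }

    /** Splits a bracketed list such as "[Rajat Raina, Tom Brady]" into its entries. */
    static List<String> entities(String value) {
        String v = value.trim();
        if (v.startsWith("[") && v.endsWith("]")) {
            v = v.substring(1, v.length() - 1);
        }
        if (v.isEmpty()) {
            return new ArrayList<>();
        }
        // Assumption: entity names themselves contain no commas.
        return Arrays.asList(v.split(",\\s*"));
    }
}
```

With the example output, `parse(...)` keeps both `X-Parsed-By` values under one key, and `entities(...)` applied to the `PERSON` value yields the two names as separate entries.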

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/TaichiHo/tika TIKA-1787

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tika/pull/62.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #62


        commit b94331ece262bb8d8408dda7b22b6dc0bb69557e
        Author: Taichi <heyuehengtaichi@gmail.com>
        Date: 2015-11-05T22:47:22Z

        fix for TIKA-1787 contributed by Yueheng He


        chrismattmann Chris A. Mattmann added a comment -

        Great work as a start, Yueheng He! The catch is that binding directly to the library isn't possible because of the Stanford NER license (GPL): http://nlp.stanford.edu/software/CRF-NER.shtml#Download

        However, we can include it in the way Thamme Gowda did in #61 on GitHub. He and I talked about a command-line invocation of the tool that we could host on GitHub and have Tika call at runtime, which means we wouldn't have to bind to the license.

        Let me think about this. Thank you!
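The runtime-invocation idea in this comment can be sketched with the JDK's `ProcessBuilder`. This is an illustration only, not the code from #61: `ExternalToolRunner` is a hypothetical name, and the actual command line for an externally hosted NER tool is not given in this issue, so the caller has to supply it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch: run an external tool as a separate process and capture its output. */
final class ExternalToolRunner {

    /** Exit code plus everything the tool printed (stdout and stderr merged). */
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    static Result run(List<String> command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process process = builder.start();
            process.getOutputStream().close(); // the tool receives no stdin in this sketch
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (InputStream in = process.getInputStream()) {
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            }
            int exitCode = process.waitFor();
            return new Result(exitCode, new String(buffer.toByteArray(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the external tool", e);
        }
    }
}
```

Because the tool runs in its own process, the calling code has no compile-time dependency on it; the input file path would be passed as one of the command arguments.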

        Yueheng Yueheng He added a comment -

        Oh sorry about not noticing that. Thank you for pointing that out, Professor!

        Please let me know if there is anything I can do.

        thammegowda Thamme Gowda added a comment -

        With #61, the CoreNLP NER can be activated with the following steps:

        • Add the CoreNLP jars and models to the classpath. If you are using Maven, add:
              <dependency>
                  <groupId>edu.stanford.nlp</groupId>
                  <artifactId>stanford-corenlp</artifactId>
                  <version>${corenlp.version}</version>
              </dependency>

              <!-- This is a HUGE FILE -->
              <dependency>
                  <groupId>edu.stanford.nlp</groupId>
                  <artifactId>stanford-corenlp</artifactId>
                  <version>${corenlp.version}</version>
                  <classifier>models</classifier>
              </dependency>

        • Set the system property "ner.impl.class" to "org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser".
          You can do this either by calling `System.setProperty()` before instantiating the Tika parsers in code, or on the command line by passing "-Dner.impl.class=org.apache.tika.parser.ner.corenlp.CoreNLPNERecogniser" when launching the JVM.
        • Activate the NamedEntityParser

        A demo project is set up at: https://github.com/thammegowda/tika-ner-corenlp
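The system-property step in the list above works because the implementation class is named at runtime rather than referenced at compile time. A minimal sketch of that pattern follows, under loud assumptions: `Recogniser`, `DefaultRecogniser`, and `RecogniserLoader` are hypothetical stand-ins, and Tika's real `NERecogniser` interface and `NamedEntityParser` loading logic may differ. Only the property name `ner.impl.class` comes from the steps above.

```java
/** Hypothetical stand-in for a recogniser interface; Tika's real NERecogniser may differ. */
interface Recogniser {
    String name();
}

/** Hypothetical fallback implementation used when the property is not set. */
final class DefaultRecogniser implements Recogniser {
    @Override
    public String name() {
        return "default";
    }
}

/** Hypothetical loader: picks the implementation named by the system property. */
final class RecogniserLoader {

    static final String PROPERTY = "ner.impl.class";

    static Recogniser load(String fallbackClassName) {
        // The property wins if set (via System.setProperty or -Dner.impl.class=...).
        String className = System.getProperty(PROPERTY, fallbackClassName);
        try {
            Class<?> type = Class.forName(className);
            return (Recogniser) type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot load recogniser " + className, e);
        }
    }
}
```

Since the class is looked up by name, the property has to be set before the loader runs, which matches the advice to call `System.setProperty()` before instantiating the parsers.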

        chrismattmann Chris A. Mattmann added a comment -

        Great work Thamme Gowda and Yueheng He!

        Thamme - please take your docs below and add them to the wiki page. Thanks!

        [mattmann-0420740:~/tmp/tika1.12] mattmann% svn commit -m "Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62"
        Sending        .gitignore
        Sending        CHANGES.txt
        Sending        tika-parsers/pom.xml
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/NERecogniser.java
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/NamedEntityParser.java
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/corenlp
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/corenlp/CoreNLPNERecogniser.java
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/opennlp
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNERecogniser.java
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNameFinder.java
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/regex
        Adding         tika-parsers/src/main/java/org/apache/tika/parser/ner/regex/RegexNERecogniser.java
        Adding         tika-parsers/src/main/resources/org/apache/tika/parser/ner
        Adding         tika-parsers/src/main/resources/org/apache/tika/parser/ner/regex
        Adding         tika-parsers/src/main/resources/org/apache/tika/parser/ner/regex/ner-regex.txt
        Adding         tika-parsers/src/test/java/org/apache/tika/parser/ner
        Adding         tika-parsers/src/test/java/org/apache/tika/parser/ner/NamedEntityParserTest.java
        Adding         tika-parsers/src/test/java/org/apache/tika/parser/ner/regex
        Adding         tika-parsers/src/test/java/org/apache/tika/parser/ner/regex/RegexNERecogniserTest.java
        Adding         tika-parsers/src/test/resources/org/apache/tika/parser
        Adding         tika-parsers/src/test/resources/org/apache/tika/parser/ner
        Adding         tika-parsers/src/test/resources/org/apache/tika/parser/ner/opennlp
        Adding         tika-parsers/src/test/resources/org/apache/tika/parser/ner/opennlp/ModelGetter.groovy
        Adding         tika-parsers/src/test/resources/org/apache/tika/parser/ner/opennlp/get-models.sh
        Adding         tika-parsers/src/test/resources/org/apache/tika/parser/ner/regex
        Adding         tika-parsers/src/test/resources/org/apache/tika/parser/ner/regex/ner-regex.txt
        Adding         tika-parsers/src/test/resources/org/apache/tika/parser/ner/tika-config.xml
        Transmitting file data ................
        Committed revision 1714835.
        [mattmann-0420740:~/tmp/tika1.12] mattmann% 
        
        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tika/pull/62

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tika/pull/61

        hudson Hudson added a comment -

        UNSTABLE: Integrated in tika-trunk-jdk1.7 #887 (See https://builds.apache.org/job/tika-trunk-jdk1.7/887/)
        Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1714835)

        • trunk/.gitignore
        • trunk/CHANGES.txt
        • trunk/tika-parsers/pom.xml
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/NERecogniser.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/NamedEntityParser.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/corenlp
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/corenlp/CoreNLPNERecogniser.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/opennlp
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNERecogniser.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/opennlp/OpenNLPNameFinder.java
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/regex
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/regex/RegexNERecogniser.java
        • trunk/tika-parsers/src/main/resources/org/apache/tika/parser/ner
        • trunk/tika-parsers/src/main/resources/org/apache/tika/parser/ner/regex
        • trunk/tika-parsers/src/main/resources/org/apache/tika/parser/ner/regex/ner-regex.txt
        • trunk/tika-parsers/src/test/java/org/apache/tika/parser/ner
        • trunk/tika-parsers/src/test/java/org/apache/tika/parser/ner/NamedEntityParserTest.java
        • trunk/tika-parsers/src/test/java/org/apache/tika/parser/ner/regex
        • trunk/tika-parsers/src/test/java/org/apache/tika/parser/ner/regex/RegexNERecogniserTest.java
        • trunk/tika-parsers/src/test/resources/org/apache/tika/parser
        • trunk/tika-parsers/src/test/resources/org/apache/tika/parser/ner
        • trunk/tika-parsers/src/test/resources/org/apache/tika/parser/ner/opennlp
        • trunk/tika-parsers/src/test/resources/org/apache/tika/parser/ner/opennlp/ModelGetter.groovy
        • trunk/tika-parsers/src/test/resources/org/apache/tika/parser/ner/opennlp/get-models.sh
        • trunk/tika-parsers/src/test/resources/org/apache/tika/parser/ner/regex
        • trunk/tika-parsers/src/test/resources/org/apache/tika/parser/ner/regex/ner-regex.txt
        • trunk/tika-parsers/src/test/resources/org/apache/tika/parser/ner/tika-config.xml
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #889 (See https://builds.apache.org/job/tika-trunk-jdk1.7/889/)
        Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1714931)

        • trunk/tika-parsers/pom.xml

    People

    • Assignee: Chris A. Mattmann (chrismattmann)
    • Reporter: Yueheng He (Yueheng)
    • Votes: 0
    • Watchers: 5

    Time Tracking

    • Original Estimate: 168h
    • Remaining Estimate: 168h
    • Time Spent: Not Specified