TIKA-1815: Text content from parser is empty when NamedEntityParser is enabled

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.15
    • Component/s: parser
    • Labels:

      Description

      When the NamedEntityParser is enabled, Tika#parseToString() and the other parse() methods produce an empty string.
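
      The symptom can be illustrated with a small, JDK-only sketch. It does not use Tika's classes: `SketchParser`, `MetadataOnlyParser`, `ForwardingParser`, `TextCollector`, and the `sketch.length` metadata key are hypothetical stand-ins. It assumes, based on the fix described in the pull requests below (writing the text content to the XML content handler), that the text was empty because the parser recorded metadata but never emitted SAX character events.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.util.LinkedHashMap;
import java.util.Map;

public class EmptyTextSketch {

    /** Hypothetical stand-in for a parser: reports results via a SAX handler and a metadata map. */
    public interface SketchParser {
        void parse(String text, ContentHandler handler, Map<String, String> metadata)
                throws SAXException;
    }

    /** Collects every characters() event, the way a text-extracting handler would. */
    public static class TextCollector extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Assumed buggy shape: fills in metadata but sends nothing to the handler. */
    public static class MetadataOnlyParser implements SketchParser {
        @Override
        public void parse(String text, ContentHandler handler, Map<String, String> metadata) {
            metadata.put("sketch.length", String.valueOf(text.length()));
        }
    }

    /** Assumed fixed shape: same metadata, plus the text forwarded as SAX character events. */
    public static class ForwardingParser implements SketchParser {
        @Override
        public void parse(String text, ContentHandler handler, Map<String, String> metadata)
                throws SAXException {
            metadata.put("sketch.length", String.valueOf(text.length()));
            handler.startDocument();
            char[] chars = text.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endDocument();
        }
    }

    /** Mimics a parse-to-string call: whatever reaches the handler becomes the result. */
    public static String parseToString(SketchParser parser, String text) throws SAXException {
        TextCollector collector = new TextCollector();
        parser.parse(text, collector, new LinkedHashMap<>());
        return collector.toString();
    }

    public static void main(String[] args) throws SAXException {
        String input = "Text content from parser";
        System.out.println("[" + parseToString(new MetadataOnlyParser(), input) + "]");
        System.out.println("[" + parseToString(new ForwardingParser(), input) + "]");
    }
}
```

      In this sketch the caller only sees text that reaches the handler, so a parser that works purely through metadata returns an empty string even though parsing succeeded.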

        Activity

        githubbot ASF GitHub Bot added a comment -

        GitHub user thammegowda opened a pull request:

        https://github.com/apache/tika/pull/66

        Fix for TIKA-1815 contributed by Thamme Gowda

        + Outputting the text content to XMLDocumentHandler

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/thammegowda/tika fix-TIKA-1815

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tika/pull/66.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #66


        commit e96da2bc28d5eef81d034e39eb05099ed5d38ac1
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-10-30T21:47:45Z

        Add NamedEntityParser

        Add OpenNLPNERecogniser as default

        commit a720507a1c1906a501470a7d5c5cec335412fcd3
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-10-30T22:16:11Z

        Set charset for converting text to stream

        commit 6b1a20e681a5d319886464ec147967c876b7e60d
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-10-31T04:23:43Z

        Automated OpenNLP NER model downloader

        commit e381ea88ebd2bb8f5adfe36d710acfce673e30aa
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-04T00:31:40Z

        using a secondary parser to convert non-text streams

        commit ea7871bd4afae7d18e500ffc285e58afd08f5e86
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-08T07:36:48Z

        Add regex based NER

        commit 084985b3612438e9ca7107fecdffd67757d04d10
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-08T07:38:17Z

        Add CoreNLP NER with runtime binding

        commit e4d74218ece77143d1e5245a3ef64ddf5578c310
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-08T23:41:15Z

        Added support for chaining NER implementations

        commit 7e6b43c83ec6cdd35ea258f52c0110ba986c82b3
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-09T05:58:58Z

        charset specified

        commit caba68773a287752dea43f3366e6d4309fde861c
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-10T01:34:04Z

        Merge branch 'trunk' of github.com:apache/tika into trunk

        commit 08b916790b279cda0201f2529ca58646dea4b2f9
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-10T19:06:29Z

        Resolved Code formatting issues

        + Removed star imports
        + Removed dead code / commented code
        + Added License header to missing files

        commit e07ac630d54cc79d9a7bfc9ac82332474d07434b
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-16T09:05:07Z

        Add missing doc strings, fix code formatting issues

        commit 96d4d7cc29d4bcd8ac0cf7a595c39b6ed64d4d19
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-11-18T03:03:41Z

        Fix: build phase for model downloader

        commit 6d0b121b8b321e8a31257fc608bb001d3fe7afb5
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-12-11T14:33:36Z

        Merge branch 'trunk' of github.com:apache/tika into trunk

        commit 66d3a10ffabf1f54cff384ce1c7325c2a3c16279
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-12-19T18:59:26Z

        Fix : TIKA-1815 by Thamme Gowda N.

        1. Writing text content to XMLContentHandler
        2. Added RegexNERParser to Default parser chain


        githubbot ASF GitHub Bot added a comment -

        GitHub user thammegowda opened a pull request:

        https://github.com/apache/tika/pull/67

        FIX for TIKA-1815 contributed by Thamme Gowda

        + Writing the text content to XML Document
        + Added Regex recogniser to default NER chain

        Closes #66 (this is a simpler version of the same). Fixes #TIKA-1815

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/thammegowda/tika TIKA-1815

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tika/pull/67.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #67


        commit a40a18e2f61f2152fa065bda193ceb74e7e60c97
        Author: Thamme Gowda <tgowdan@gmail.com>
        Date: 2015-12-19T20:56:21Z

        FIX for TIKA-1815 contributed by Thamme Gowda

        + Writing the text content to XML Document
        + Added Regex recogniser to default NER chain


        githubbot ASF GitHub Bot added a comment -

        Github user thammegowda closed the pull request at:

        https://github.com/apache/tika/pull/66

        chrismattmann Chris A. Mattmann added a comment -

        Applied and all tests pass, committing:

        [INFO] Installing /Users/mattmann/tmp/tika1.12/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.12-SNAPSHOT/tika-1.12-SNAPSHOT.pom
        [INFO] ------------------------------------------------------------------------
        [INFO] Reactor Summary:
        [INFO] 
        [INFO] Apache Tika parent ................................. SUCCESS [  1.917 s]
        [INFO] Apache Tika core ................................... SUCCESS [ 18.468 s]
        [INFO] Apache Tika parsers ................................ SUCCESS [03:25 min]
        [INFO] Apache Tika XMP .................................... SUCCESS [  2.795 s]
        [INFO] Apache Tika serialization .......................... SUCCESS [  1.663 s]
        [INFO] Apache Tika batch .................................. SUCCESS [01:57 min]
        [INFO] Apache Tika application ............................ SUCCESS [ 38.660 s]
        [INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 21.145 s]
        [INFO] Apache Tika translate .............................. SUCCESS [  1.832 s]
        [INFO] Apache Tika server ................................. SUCCESS [ 24.827 s]
        [INFO] Apache Tika examples ............................... SUCCESS [ 18.857 s]
        [INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.609 s]
        [INFO] Apache Tika ........................................ SUCCESS [  0.032 s]
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 07:36 min
        [INFO] Finished at: 2015-12-20T11:08:58-08:00
        [INFO] Final Memory: 102M/1708M
        
        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tika/pull/67

        chrismattmann Chris A. Mattmann added a comment -

        Applied, thanks Thamme Gowda!

        [chipotle:~/tmp/tika1.12] mattmann% svn commit -m "Fix for TIKA-1815 Text content from parser is empty when NamedEntityParser is enabled contributed by Thamme Gowda <tgowdan@gmail.com> this closes #67"
        Sending        CHANGES.txt
        Sending        tika-parsers/src/main/java/org/apache/tika/parser/ner/NamedEntityParser.java
        Transmitting file data ..
        Committed revision 1721058.
        [chipotle:~/tmp/tika1.12] mattmann% 
        
        hudson Hudson added a comment -

        UNSTABLE: Integrated in tika-trunk-jdk1.7 #893 (See https://builds.apache.org/job/tika-trunk-jdk1.7/893/)
        Fix for TIKA-1815 Text content from parser is empty when NamedEntityParser is enabled contributed by Thamme Gowda <tgowdan@gmail.com> this closes #67 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1721058)

        • trunk/CHANGES.txt
        • trunk/tika-parsers/src/main/java/org/apache/tika/parser/ner/NamedEntityParser.java

          People

          • Assignee: Chris A. Mattmann (chrismattmann)
          • Reporter: Thamme Gowda (thammegowda)
          • Votes: 0
          • Watchers: 4

              Time Tracking

              • Original Estimate: 0.5h
              • Remaining Estimate: 0.5h
              • Time Spent: Not Specified