Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: parser
    • Labels: None

      Description

      Tika 0.10 brings significant improvements, and it would be nice to use the latest Tika in Nutch 1.4.

      1. NUTCH-1154.diff (3 kB, Andrzej Bialecki)

        Activity

        Markus Jelsma added a comment -

        Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220

        Hudson added a comment -

        Integrated in nutch-trunk-maven #3 (See https://builds.apache.org/job/nutch-trunk-maven/3/)
        NUTCH-1154 Upgrade to Tika 0.10.

        ab : http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1181665
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/ivy/ivy.xml
        • /nutch/trunk/src/plugin/parse-tika/ivy.xml
        • /nutch/trunk/src/plugin/parse-tika/plugin.xml
        • /nutch/trunk/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestRTFParser.java
        Hudson added a comment -

        Integrated in Nutch-trunk #1631 (See https://builds.apache.org/job/Nutch-trunk/1631/)
        NUTCH-1154 Upgrade to Tika 0.10.

        ab : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1181665
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/ivy/ivy.xml
        • /nutch/trunk/src/plugin/parse-tika/ivy.xml
        • /nutch/trunk/src/plugin/parse-tika/plugin.xml
        • /nutch/trunk/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestRTFParser.java
        Hudson added a comment -

        Integrated in Nutch-nutchgora #32 (See https://builds.apache.org/job/Nutch-nutchgora/32/)
        NUTCH-1154 Upgrade to Tika 0.10.

        ab : http://svn.apache.org/viewvc/nutch/branches/nutchgora/viewvc/?view=rev&root=&revision=1181758
        Files :

        • /nutch/branches/nutchgora/CHANGES.txt
        • /nutch/branches/nutchgora/ivy/ivy.xml
        • /nutch/branches/nutchgora/src/plugin/parse-tika/ivy.xml
        • /nutch/branches/nutchgora/src/plugin/parse-tika/plugin.xml
        • /nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestRTFParser.java
        Andrzej Bialecki added a comment -

        Committed in rev. 1181665.

        Lewis John McGibbney added a comment -

        Good luck with that one then. Thumbs up from me, it looks like the benefits far outweigh the loss of our beloved TestRTFParser for a small period of time (sob sob). +1

        Chris A. Mattmann added a comment -

        +1, I'm fine with disabling the test and upgrading to 0.10. I hope to get 1.0 out the door before ApacheCon (wish us luck) in which case the test will only be disabled for a short time.

        Andrzej Bialecki added a comment -

        The case for inclusion is here: http://s.apache.org/vR. In short, Tika 0.10 has several important improvements over 0.9.

        With the attached patch, all tests pass except TestRTFParser, which fails due to an issue that has just been fixed in Tika trunk. The underlying problem is that our test document is malformed and Tika's new RTF parser wasn't robust enough to handle it.

        This means that for now we would have to disable this test, and re-enable it once we upgrade to Tika 1.0.

        Lewis John McGibbney added a comment -

        Hi Andrzej, if there is a strong case for inclusion in the forthcoming 1.4 release then I say we fire on with this. Do you have any indication as to what would need to be done to resolve the tests once this has been committed?

        Andrzej Bialecki added a comment -

        TIKA-748 has been fixed and is scheduled to be included in Tika 1.0. If there are no objections, I'd like to commit Tika 0.10, put a comment in CHANGES.txt, and disable this part of the test until we upgrade to Tika 1.0.

        Andrzej Bialecki added a comment -

        Patch to upgrade to Tika 0.10. Unfortunately, TestRTFParser fails with this version of Tika - the extracted body of the text is empty. See TIKA-748. Still, I think the improvements in PDF and Office parsers are worth the upgrade.


          People

          • Assignee: Andrzej Bialecki
          • Reporter: Andrzej Bialecki
          • Votes: 0
          • Watchers: 0
