Details

      Description

      As the title says.


          Activity

          Uwe Schindler added a comment -

          Bulk close after 3.5 is released

          Jan Høydahl added a comment -

          Also fixed the dot.classpath for Eclipse so that the new Tika JARs are found.

          Jan Høydahl added a comment -

          Done for trunk and 3.x

          Jan Høydahl added a comment -

          Will commit Tika 0.10 with these jar changes:

          +   solr/contrib/extraction/lib/apache-mime4j-core-0.7.jar
          +   solr/contrib/extraction/lib/apache-mime4j-dom-0.7.jar
          -   solr/contrib/extraction/lib/fontbox-1.3.1.jar
          +   solr/contrib/extraction/lib/fontbox-1.6.0.jar
          -   solr/contrib/extraction/lib/jempbox-1.3.1.jar
          +   solr/contrib/extraction/lib/jempbox-1.6.0.jar
          -   solr/contrib/extraction/lib/netcdf-4.2.jar
          +   solr/contrib/extraction/lib/netcdf-4.2-min.jar
          -   solr/contrib/extraction/lib/pdfbox-1.3.1.jar
          +   solr/contrib/extraction/lib/pdfbox-1.6.0.jar
          -   solr/contrib/extraction/lib/poi-3.7.jar
          +   solr/contrib/extraction/lib/poi-3.8-beta4.jar
          -   solr/contrib/extraction/lib/poi-ooxml-3.7.jar
          +   solr/contrib/extraction/lib/poi-ooxml-3.8-beta4.jar
          -   solr/contrib/extraction/lib/poi-ooxml-schemas-3.7.jar
          +   solr/contrib/extraction/lib/poi-ooxml-schemas-3.8-beta4.jar
          -   solr/contrib/extraction/lib/poi-scratchpad-3.7.jar
          +   solr/contrib/extraction/lib/poi-scratchpad-3.8-beta4.jar
          -   solr/contrib/extraction/lib/tagsoup-1.2.jar
          +   solr/contrib/extraction/lib/tagsoup-1.2.1.jar
          -   solr/contrib/extraction/lib/tika-core-0.8.jar
          +   solr/contrib/extraction/lib/tika-core-0.10.jar
          -   solr/contrib/extraction/lib/tika-parsers-0.8.jar
          +   solr/contrib/extraction/lib/tika-parsers-0.10.jar
          
          Jan Høydahl added a comment -

          Tika 0.10 is a few days away, so we'll skip 0.9 and jump directly to 0.10.
          Serious bugs have been fixed for PDF, HTML, and XLSX.
          http://search-lucene.com/m/kFMGc2BwzA4

          Koji Sekiguchi added a comment -

          +1

          Jan Høydahl added a comment -

          I see that Tika 1.0 may be just around the corner, so I'll wait a few more days to see if it materializes. If it does, we can jump directly to 1.0, which has many more bug fixes, a newer PDFBox, and more flexible configuration of plugin parsers.

          Jan Høydahl added a comment -

          Here's the diff between the old libs and what I plan to commit. Does it look right?

          Only in lib-0.9: apache-mime4j-0.6.jar
          Only in lib-0.9: apache-mime4j-LICENSE-ASL.txt
          Only in lib-0.9: apache-mime4j-NOTICE.txt
          Only in lib-0.8: fontbox-1.3.1.jar
          Only in lib-0.9: fontbox-1.4.0.jar
          Only in lib-0.8: jempbox-1.3.1.jar
          Only in lib-0.9: jempbox-1.4.0.jar
          Only in lib-0.9: netcdf-4.2-min.jar
          Only in lib-0.8: netcdf-4.2.jar
          Only in lib-0.8: pdfbox-1.3.1.jar
          Only in lib-0.9: pdfbox-1.4.0.jar
          Only in lib-0.8: tika-core-0.8.jar
          Only in lib-0.9: tika-core-0.9.jar
          Only in lib-0.8: tika-parsers-0.8.jar
          Only in lib-0.9: tika-parsers-0.9.jar
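
A listing in this "Only in ..." style can be produced by comparing the file names in two directories. The following is a minimal sketch of that idea, not the command actually used above. The function name `compare_lib_dirs` is hypothetical, and the sketch assumes both directories are flat (no subdirectories need comparing) and that files are matched by name only.

```python
from pathlib import Path


def compare_lib_dirs(old_dir, new_dir):
    """Return 'Only in <dir>: <name>' lines for files found in just one directory.

    Hypothetical helper for illustration; files are matched by name, not content.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    old_names = {p.name for p in old_dir.iterdir() if p.is_file()}
    new_names = {p.name for p in new_dir.iterdir() if p.is_file()}

    lines = []
    # The symmetric difference holds names present on exactly one side.
    for name in sorted(old_names ^ new_names):
        owner = old_dir if name in old_names else new_dir
        lines.append(f"Only in {owner.name}: {name}")
    return lines
```

Calling `compare_lib_dirs("lib-0.8", "lib-0.9")` and printing each returned line would list the JARs that exist on only one side. Files with the same name in both directories are skipped even if their contents differ.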

          PS: I've built the Tika JARs using Java 1.5. Would that be an issue?
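
One way to see which Java level a JAR's classes were compiled for is to read the class-file header of each entry. Every `.class` file starts with the magic number `0xCAFEBABE`, followed by a two-byte minor version and a two-byte major version, and a Java 1.5 compiler target writes major version 49. The sketch below is illustrative only; the function name `class_major_versions` is hypothetical. It reports the target level recorded in the classes, which is not necessarily the JDK that ran the build.

```python
import struct
import zipfile

JAVA_5_MAJOR = 49  # class-file major version for a Java 1.5 target


def class_major_versions(jar_path):
    """Return the set of class-file major versions found in a JAR.

    Hypothetical helper for illustration; non-class entries are ignored.
    """
    versions = set()
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue
            header = jar.read(name)[:8]
            if len(header) < 8:
                continue  # too short to be a valid class file
            # Header layout: u4 magic, u2 minor_version, u2 major_version (big-endian).
            magic, _minor, major = struct.unpack(">IHH", header)
            if magic == 0xCAFEBABE:
                versions.add(major)
    return versions
```

If the returned set contains only `JAVA_5_MAJOR`, every class in the JAR targets Java 1.5. Whether that is acceptable depends on the minimum Java version the consuming project supports.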

          Jan Høydahl added a comment -

          Marking for 3.3 and bumping priority to major due to the good cost/benefit ratio, especially for PDF parsing.
          I'd love to contribute but I think this kind of change cannot be done with a patch.

          Jan Høydahl added a comment -

          Aiming for 3.3. Important PDF bugs fixed...


            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Grant Ingersoll
            • Votes:
              6
              Watchers:
              6
