Details

      Description

      Tika 1.0 was released November 7th and includes a number of improvements: http://tika.apache.org/1.0/

      1. SOLR-2901.patch
        21 kB
        Jan Høydahl
      2. SOLR-2901.patch
        20 kB
        Jan Høydahl
      3. SOLR-2901.patch
        21 kB
        Jan Høydahl
      4. SOLR-2901.patch
        19 kB
        Jan Høydahl
      5. SOLR-2901.patch
        19 kB
        Jan Høydahl
      6. SOLR-2901.patch
        18 kB
        Jan Høydahl

          Activity

          Jan Høydahl added a comment -

          First patch version.

          • Tika 1.0 removes previously deprecated APIs, so this patch changes how the API is used in a few places.
          • For MailEntityProcessor, we also improve detection by passing the part's fileName in as Metadata.
          • For ExtractingDocumentLoader, we now provide the stream's content type as a hint in Metadata, but this is not tested extensively.
          • Added tests for the newly detected languages.
          • Updated the Eclipse classpath file to point to the new jars. Nothing done for other IDEs.

          One place still uses a deprecated method: in ExtractingDocumentLoader we call parser = config.getParser(mediaType), and I did not find the new equivalent.

          Jan Høydahl added a comment -

          If you want to try this patch, you also need to put three jars in contrib/extraction/lib:

          • http://dl.dropbox.com/u/20080302/tikajars/commons-compress-1.3.jar
          • http://dl.dropbox.com/u/20080302/tikajars/tika-core-1.0.jar
          • http://dl.dropbox.com/u/20080302/tikajars/tika-parsers-1.0.jar

          Robert Muir added a comment -

          Patch seems to work... though the test is more evidence, in addition to Mike's experiments,
          that something is seriously up with Spanish/Galician and Tika's detector.

          Jan Høydahl added a comment -

          Thanks for looking at it. I'd prefer if the old Spanish text had still been detected as Spanish. Yet another proof that the Tika algorithm is not very strong on short texts in very similar languages, but as you say, "we knew that".

          Jan Høydahl added a comment -

          Could someone fix the classpath config for IntelliJ IDEA in dev-tools?

          Steve Rowe added a comment -

          "Could someone fix the classpath config for IntelliJ IDEA in dev-tools?"

          IntelliJ IDEA effectively grabs **/lib/*.jar for its classpath (where ** refers to all modules with lib/ directories), rather than referring to explicitly named jar files, so as long as you rename jars (or add or remove jars, for that matter) in library directories that were already there, nothing needs to be done.
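
          The wildcard behaviour described above can be illustrated with a short, self-contained sketch. It uses Java's standard PathMatcher glob support as a stand-in for IDEA's library scanning (an assumption: IDEA's own matching is not implemented this way, and the class name and jar paths are only examples) to show why a renamed jar in an existing lib/ directory is still picked up.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class LibGlobSketch {
    // Stand-in for the "**/lib/*.jar" rule; not IntelliJ's actual implementation.
    private static final PathMatcher LIB_JARS =
        FileSystems.getDefault().getPathMatcher("glob:**/lib/*.jar");

    // Returns true if the (example) relative path would be covered by the rule.
    static boolean onClasspath(String relativePath) {
        return LIB_JARS.matches(Paths.get(relativePath));
    }

    public static void main(String[] args) {
        String[] candidates = {
            "contrib/extraction/lib/tika-core-0.10.jar", // old jar name
            "contrib/extraction/lib/tika-core-1.0.jar",  // renamed jar, same directory
            "contrib/extraction/src/Foo.java"            // not under lib/, not a jar
        };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + onClasspath(candidate));
        }
    }
}
```

          Both jar names match because the rule keys on the directory and the .jar extension, not on the version embedded in the file name.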

          However, the Maven configuration will need fixing, since dependencies' versions are by contrast explicitly declared: In dev-tools/maven/pom.xml.template, the tika.version property setting should be changed from <tika.version>0.10</tika.version> to <tika.version>1.0</tika.version>. (This property is used in both the tika-core and the tika-parsers dependency version declarations in the <dependencyManagement> section in the same file.) The commons-compress dependency is handled through Maven's transitive dependency mechanism, since it's declared as a dependency in the tika-parsers POM, and so no configuration changes are required for it.

          Jan Høydahl added a comment - - edited

          New patch. Bumps Tika version in CHANGES files. Replaces deprecated getParser(mt) (2nd take, this time DefaultParser):

          -      parser = config.getParser(mt);
          +      parser = new DefaultParser().getParsers().get(mt);
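
          Since the replacement queries the result of getParsers() with get(mt), as one would a map, a media type with no registered parser presumably yields null, and the caller has to handle that case. The sketch below models only that lookup behaviour with a plain java.util.Map. ParserLookupSketch, findParser, and the string stand-ins for parsers are hypothetical and are not part of Tika or of the patch.

```java
import java.util.HashMap;
import java.util.Map;

public class ParserLookupSketch {
    // Hypothetical registry: media type -> parser name (strings stand in for real parsers).
    private static final Map<String, String> PARSERS = new HashMap<>();
    static {
        PARSERS.put("application/pdf", "pdf-parser");
        PARSERS.put("text/html", "html-parser");
    }

    // Looks up a parser by media type and fails loudly instead of returning null.
    static String findParser(String streamType) {
        String parser = PARSERS.get(streamType);
        if (parser == null) {
            throw new IllegalArgumentException(
                "No parser registered for stream.type=" + streamType);
        }
        return parser;
    }

    public static void main(String[] args) {
        System.out.println(findParser("text/html"));
        try {
            findParser("foo/bar"); // unknown type, as in a wrong stream.type
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```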
          
          Jan Høydahl added a comment -

          This even includes the pom.xml.template change

          Jan Høydahl added a comment -

          Fixes bug in the new stream.type code

          Jan Høydahl added a comment -

          Cleaned up excess imports. I think it's good to go.

          Jan Høydahl added a comment -

          Small change: also respect a potential custom Tika config when loading the parser for stream.type. Added a few tests for the exceptional case of a wrong stream.type.

          Jan Høydahl added a comment -

          Checked in to trunk and merged back to 3x

          Jan Høydahl added a comment -

          Re-opening: jdom-1.0.jar must also be included, since it is a dependency of Rome, which FeedParser uses.

          Jan Høydahl added a comment -

          Checked in jdom-1.0.jar with LICENSE and NOTICE files in both 3.x and trunk.


            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Jan Høydahl