Solr / SOLR-2116

TikaEntityProcessor does not find parser by default

    Details

      Description

      The TikaEntityProcessor does not find the correct document parser by default.
      This happens with a two-level DIH config file. I have attached pdflist-data-config.xml and pdflist.xml, the XML file that supplies the file list. To test this, you will need the current 3.x branch or 4.0 trunk.

      1. Set up a Tika-enabled Solr.
      2. Copy any PDF file to /tmp/testfile.pdf.
      3. Copy pdflist-data-config.xml to your solr/conf directory.
      4. Add this snippet to your solrconfig.xml:
        <requestHandler name="/pdflist"
                        class="org.apache.solr.handler.dataimport.DataImportHandler">
          <lst name="defaults">
            <str name="config">pdflist-data-config.xml</str>
          </lst>
        </requestHandler>
        

      http://localhost:8983/solr/pdflist?command=full-import will make one document with the id and text fields populated. If you remove this line:

       parser="org.apache.tika.parser.pdf.PDFParser"
      

      from the TikaEntityProcessor entity, the parser will not be found and you will get a document with the "id" field and nothing else.

      1. SOLR-2116.patch
        3 kB
        Martijn van Groningen
      2. pdflist-data-config.xml
        0.9 kB
        Lance Norskog
      3. pdflist.xml
        0.1 kB
        Lance Norskog

          Activity

          Lance Norskog created issue -
          Lance Norskog made changes -
          Field Original Value New Value
          Attachment pdflist.xml [ 12454172 ]
          Attachment pdflist-data-config.xml [ 12454173 ]
          Lance Norskog added a comment -

          It does not work if the parser= attribute is set to

          parser="org.apache.tika.parser.AutoDetectParser"
          

          So, the AutoDetectParser does not work.

          Lance

          Lance Norskog made changes -
          Link This issue is duplicated by SOLR-2101 [ SOLR-2101 ]
          Martijn van Groningen added a comment - - edited

          I've encountered the same issue in my Solr setup. After some digging I found the problem: classes are simply not being loaded from the lib directory.

          When no tika config is specified in the data-config.xml, the TikaEntityProcessor tries to load the TikaConfig in the manner specified below:

          ....
          String tikaConfigFile = context.getResolvedEntityAttribute("tikaConfig");
          if (tikaConfigFile == null) {
            tikaConfig = TikaConfig.getDefaultConfig();
          } else {
          ....
          

          The problem with this way of loading the TikaConfig is that it doesn't use the classloader from the SolrResourceLoader and therefore doesn't load any jars from the Solr lib directory. The attached patch resolves the issue that no content is parsed by Tika. I simply use the constructor that requires a ClassLoader as argument, and I retrieve that classloader from the SolrCore.

          ...
          String tikaConfigFile = context.getResolvedEntityAttribute("tikaConfig");
          if (tikaConfigFile == null) {
             ClassLoader classLoader = context.getSolrCore().getResourceLoader().getClassLoader();
             tikaConfig = new TikaConfig(classLoader);
          } else {
          ...
          

          I haven't added a test that demonstrates this bug, since it only occurs when Tika libs (and its dependencies) are in the Solr lib directory and I don't know how to replicate this situation in the solr build. The TestTikaEntityProcessor class doesn't have this problem since all classes are on the normal classpath when the build is running.
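
The visibility problem described in this comment can be reproduced without Solr or Tika. The sketch below is an illustration only, not the patch: it uses a temporary directory as a stand-in for the Solr lib directory and a hypothetical resource name (`demo.HypotheticalParser`) as a stand-in for whatever Tika needs to find there. A resource that is reachable only through a child `URLClassLoader` is invisible to the parent loader, which is the general reason the choice of ClassLoader passed to `TikaConfig` matters.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ClassLoaderVisibilityDemo {

    // Hypothetical resource name; it stands in for a file that would live
    // inside a jar in the lib directory.
    static final String RESOURCE = "META-INF/services/demo.HypotheticalParser";

    /** Creates a temp directory that plays the role of the lib directory. */
    static Path createLibDir() throws IOException {
        Path dir = Files.createTempDirectory("demo-lib");
        Path file = dir.resolve(RESOURCE);
        Files.createDirectories(file.getParent());
        Files.write(file, "demo.HypotheticalPdfParser\n".getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    /** Returns true if the given loader (or one of its parents) can see the resource. */
    static boolean visibleFrom(ClassLoader loader) {
        return loader.getResource(RESOURCE) != null;
    }

    public static void main(String[] args) throws IOException {
        Path libDir = createLibDir();
        ClassLoader parent = ClassLoaderVisibilityDemo.class.getClassLoader();
        // The child loader adds the temp directory on top of the parent's classpath.
        try (URLClassLoader libLoader =
                 new URLClassLoader(new URL[] { libDir.toUri().toURL() }, parent)) {
            System.out.println("visible from parent loader: " + visibleFrom(parent));
            System.out.println("visible from lib loader:    " + visibleFrom(libLoader));
        }
    }
}
```

Run on a plain classpath, the parent loader should not see the resource while the child loader should. This mirrors the situation in the comment, where the lib directory is only reachable through the loader obtained from the SolrCore.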

          Martijn van Groningen made changes -
          Attachment SOLR-2116.patch [ 12467372 ]
          Lance Norskog added a comment -

          Great! I'll try it out on 3.x and trunk.

          Speaking of Tika, have you ever seen a tikaconfig file? I can't find one anywhere on the web or in the Tika source.

          Chris A. Mattmann added a comment -

          Hey Lance,

          Speaking of Tika, have you ever seen a tikaconfig file? I can't find one anywhere on the web or in the Tika source.

          In later versions of Tika (I think since 0.7) we've moved to an all Service Provider Interface (SPI) mechanism for Parser config and resource loading, obviating the need for a Tika config XML file:

          https://issues.apache.org/jira/browse/TIKA-317

          However, you still have the option of specifying and using one. See:

          http://svn.apache.org/repos/asf/tika/tags/0.8/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java

          You can find an example of the XML-based Tika config here:

          http://svn.apache.org/repos/asf/tika/tags/0.6/tika-core/src/main/resources/org/apache/tika/

          Part of this is also due to the ParseContext, which was also introduced as a configuration mechanism. See:

          https://issues.apache.org/jira/browse/TIKA-275

          Cheers,
          Chris
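
As a rough illustration of why SPI-based discovery depends on the class loader, the following sketch uses the JDK's `java.util.ServiceLoader`. This is an assumption made to keep the example self-contained: it is not necessarily the lookup mechanism Tika uses internally, and `DemoParser` and `DemoPdfParser` are hypothetical names. The provider registration file is written to a temporary directory that only the child loader can see, so discovery through the parent loader finds nothing.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class SpiLookupDemo {

    /** Hypothetical service interface, standing in for a parser contract. */
    public interface DemoParser {
        String name();
    }

    /** Hypothetical provider; ServiceLoader requires a public no-arg constructor. */
    public static class DemoPdfParser implements DemoParser {
        public DemoPdfParser() {}
        @Override public String name() { return "pdf"; }
    }

    /** Writes a META-INF/services registration file into a fresh temp directory. */
    static Path writeRegistration() throws IOException {
        Path dir = Files.createTempDirectory("spi-demo");
        Path services = dir.resolve("META-INF").resolve("services");
        Files.createDirectories(services);
        // The file is named after the service interface and lists the provider class.
        Files.write(services.resolve(DemoParser.class.getName()),
                (DemoPdfParser.class.getName() + "\n").getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    /** Collects the names of all providers discoverable through the given loader. */
    static List<String> discover(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (DemoParser parser : ServiceLoader.load(DemoParser.class, loader)) {
            names.add(parser.name());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = writeRegistration();
        ClassLoader parent = SpiLookupDemo.class.getClassLoader();
        try (URLClassLoader child =
                 new URLClassLoader(new URL[] { dir.toUri().toURL() }, parent)) {
            System.out.println("parent loader finds: " + discover(parent));
            System.out.println("child loader finds:  " + discover(child));
        }
    }
}
```

The same provider class is loadable in both cases; only the registration file differs in visibility. That is the general shape of the failure in this issue: a default lookup that does not consult the loader covering the lib directory comes back without any parsers.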

          David Smiley added a comment -

          I encountered this bug and fixed it independently just now, in the same way the patch file here does. This is how Solr Cell configures Tika too. I encountered this on 3x by using the example-DIH that comes with Solr, which includes a core named "tika".

          Furthermore, I found a configuration bug in that core's solrconfig.xml: the <dataDir> is specified explicitly instead of defaulting to the correct place. As a result, this core erroneously uses the sample example/solr/data directory, which is bad.

          Can a committer please commit the patch and remove the dataDir in that tika core on branch 3x? This is a bug after all.

          Hoss Man added a comment -

          I don't fully understand this yet, and it doesn't have a test, but I see a patch and recent comments from David that the bug is real and the patch fixes it, so I'm going to look into it.

          Hoss Man made changes -
          Assignee Hoss Man [ hossman ]
          Fix Version/s 3.1 [ 12314371 ]
          Fix Version/s 4.0 [ 12314992 ]
          Affects Version/s 3.1 [ 12314371 ]
          Affects Version/s 4.0 [ 12314992 ]
          Hoss Man added a comment -

          Martijn: thank you very much for the patch.

          you are correct, the way the tests are all currently run, simulating class loader problems like this is pretty much impossible – even using the hooks we have to spin up a JettyServer wouldn't really help since it still runs in the same JVM and all the libs are already loaded.

          Hoss Man made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Grant Ingersoll made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            Assignee: Hoss Man
            Reporter: Lance Norskog