SOLR-1902: Tika no longer properly extracts content in Solr

    Details

      Description

      See http://www.lucidimagination.com/search/document/2ca3fe953038a54f/problem_with_pdf_upgrading_cell#22360c8261801f24

      It appears that since the upgrade to Tika 0.7, Tika is now selecting an EmptyParser when uploading docs, which then outputs an empty XHTML representation. Still, it's strange that the tests pass.

          Activity

          Grant Ingersoll created issue -
          Grant Ingersoll made changes -
          Field Original Value New Value
          Assignee Grant Ingersoll [ gsingers ]
          Grant Ingersoll added a comment -

          Further debugging shows that on startup Tika did not load any parsers. That is the difference from the test environment, and it explains why the tests still pass.

          Jukka Zitting made changes -
          Link This issue is blocked by TIKA-419 [ TIKA-419 ]
          Grant Ingersoll added a comment -

          Upgraded to Tika 0.8-SNAPSHOT and added class loading capabilities.
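
          The role of the class loader can be illustrated without Tika at all. The sketch below assumes, and this thread does not confirm it, that parser discovery relies on the standard Java service-provider lookup (a META-INF/services registration resolved through a class loader). DemoParser, DemoPdfParser and the temporary directory are hypothetical stand-ins, not Tika or Solr types. It shows that the same lookup returns no providers when the loader it is given cannot see the registration, which is the kind of situation that would leave only an empty fallback parser.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ParserDiscoveryDemo {

    /** Hypothetical stand-in for a parser interface; not a Tika type. */
    public interface DemoParser {
        String name();
    }

    /** Hypothetical provider, registered only through the temporary directory below. */
    public static class DemoPdfParser implements DemoParser {
        public DemoPdfParser() {}

        @Override
        public String name() {
            return "demo-pdf";
        }
    }

    /** Names of all DemoParser providers visible through the given class loader. */
    static List<String> discover(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (DemoParser parser : ServiceLoader.load(DemoParser.class, loader)) {
            names.add(parser.name());
        }
        return names;
    }

    /** Builds a loader whose search path contains a META-INF/services registration. */
    static ClassLoader loaderWithRegistration() {
        try {
            Path root = Files.createTempDirectory("services-demo");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            // The registration file is named after the service interface and
            // lists the binary name of the provider class.
            Files.write(services.resolve(DemoParser.class.getName()),
                    DemoPdfParser.class.getName().getBytes(StandardCharsets.UTF_8));
            return new URLClassLoader(new URL[] { root.toUri().toURL() },
                    ParserDiscoveryDemo.class.getClassLoader());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // This loader has no registration for DemoParser on its search path.
        System.out.println(discover(ParserDiscoveryDemo.class.getClassLoader()));
        // This loader can see the registration, so the provider is found.
        System.out.println(discover(loaderWithRegistration()));
    }
}
```

          If the assumption holds, the lookup only succeeds when it is handed a loader that can see the extraction jars, which would be consistent with the patched ExtractingRequestHandler asking SolrResourceLoader for its class loader.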

          Grant Ingersoll made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Erik Hatcher added a comment -

          Is there a test case that could have caught this issue that we can add?

          Grant Ingersoll added a comment -

          I suppose one could set up a test that fires up Jetty to do it.
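
          Whatever harness ends up starting the server, the assertion itself could be small. A minimal sketch, assuming the extracted XHTML is available to the test as a string: hasBodyText is a hypothetical helper, not part of Solr or Tika, and the sample documents used to exercise it are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ExtractionCheck {

    /** True if any body element in the XHTML contains non-whitespace text. */
    static boolean hasBodyText(String xhtml) {
        try {
            // The default factory is not namespace-aware, so "body" matches
            // the element even when an xmlns attribute is present.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList bodies = doc.getElementsByTagName("body");
            for (int i = 0; i < bodies.getLength(); i++) {
                if (!bodies.item(i).getTextContent().trim().isEmpty()) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up samples: an empty extraction result and one with content.
        String empty = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head><body/></html>";
        String full = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
                + "<body><p>Hello from a PDF</p></body></html>";
        System.out.println(hasBodyText(empty));
        System.out.println(hasBodyText(full));
    }
}
```

          A test that fed a known document through extraction and required hasBodyText to be true would fail on the empty XHTML described in this issue.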

          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Hoss Man made changes -
          Fix Version/s 4.0 [ 12314992 ]
          Brad Greenlee added a comment -

          I am still seeing this issue. It works if I downgrade Tika to 0.6, but neither the 0.8-SNAPSHOT included in the current Solr trunk nor a snapshot from the Tika trunk work for me. I'm running Java 1.6.0_20 on OS X 10.6.3. I posted about the issue to the solr-user mailing list: http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html

          Hoss Man added a comment -

          Reopening based on mailing list discussion indicating the problem is still in trunk.

          Hoss Man made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Grant Ingersoll added a comment -

          I'm not seeing this. I just tried trunk and it works for me. Brad, can you produce a test case? What happens if you run with extractOnly? Does it return the content? I tried both that and indexing using trunk and the example per the wiki docs and it all appears to work for me.

          David Thibault added a comment -

          I just tried this patch, and the patch for ExtractingRequestHandler does not work when applied to the ExtractingRequestHandler from Solr 1.4.1. If it's a 1.4.0-specific patch, maybe it should say so. I was able to read the patch and change the code manually, but I have not yet tried the resulting compiled classes to see if they fix my issue.

          David Thibault added a comment -

          OK, I just did an ant clean dist with these patches applied. When I try to use curl to post a file to Solr it gives me this error:
          SEVERE: java.lang.NoSuchMethodError: org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
          at org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
          at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
          at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
          at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
          at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
          at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
          at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
          at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
          at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
          at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
          at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
          at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
          at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
          at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
          at java.lang.Thread.run(Thread.java:619)

          I'm not sure why, because if I look at the patched Java source for SolrResourceLoader the getClassLoader() method is there. Any thoughts?
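
          A NoSuchMethodError for a method that exists in the patched source usually means the JVM loaded an older compiled copy of the class, which matches the later finding in this thread that the unpatched solr.war was still deployed. A minimal diagnostic sketch using only JDK reflection follows; LoadedClassInspector is a hypothetical helper, and the Solr class and method names are taken from the stack trace above.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

class LoadedClassInspector {

    /** Location the running JVM loaded the class from, if it is known. */
    static String origin(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return (source == null || source.getLocation() == null)
                ? "(bootstrap or unknown)"
                : source.getLocation().toString();
    }

    /** True if the loaded class has a public method with this name and no parameters. */
    static boolean hasNoArgMethod(Class<?> type, String methodName) {
        for (Method method : type.getMethods()) {
            if (method.getName().equals(methodName) && method.getParameterCount() == 0) {
                return true;
            }
        }
        return false;
    }

    /** One-line summary of where a class came from and whether the method is there. */
    static String report(String className, String methodName) {
        try {
            Class<?> type = Class.forName(className);
            return className + " from " + origin(type) + ", " + methodName
                    + "() present: " + hasNoArgMethod(type, methodName);
        } catch (ClassNotFoundException e) {
            return className + " not visible to this class loader";
        }
    }

    public static void main(String[] args) {
        // Names from the stack trace; only meaningful when Solr is on the class path.
        System.out.println(report("org.apache.solr.core.SolrResourceLoader", "getClassLoader"));
        // A JDK class, so the sketch also prints something useful without Solr.
        System.out.println(report("java.lang.String", "length"));
    }
}
```

          The report is only meaningful when it runs with the same class path as the deployed webapp; run standalone, it simply states that the Solr class is not visible.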

          Tommaso Teofili added a comment -

          Hi all, I had the same issue David has, so I applied the patch (modifying files one by one) to a fresh Solr 1.4.1 checkout and managed to get most of my PDFs indexed with their text extracted (with the "example" Solr instance).
          Within the apache-solr-1.4.1 release I replaced all the files inside apache-solr-1.4.1/dist with the ones generated (inside the dist directory) by invoking 'ant dist' on the patched 1.4.1 source code. I also replaced the release war inside example/webapps with the generated (patched) war; this last step was mandatory to avoid the NoSuchMethodError reported above. Then I ran 'java -jar start.jar' from the example dir and everything worked.
          Note that I used the latest version of pdfbox, jempbox and fontbox (1.2.1).
          I can attach the patch to 1.4.1 code I used.

          Tommaso Teofili made changes -
          Attachment SOLR1902_patch_to_141.txt [ 12450639 ]
          David Thibault added a comment -

          OK, I tried Tommaso's patch and it worked great. With the solr.war that came with the 1.4.1 distribution it still gave the NoSuchMethodError, but with the patched and newly recompiled apache-solr-1.4.2-dev.war it worked. I thought I had tried that before with the other patches and it didn't work. In any case, it seems to be working with the following:
          SOLR_DIST=the folder where the solr 1.4.1 distribution was unzipped.
          SOLR_HOME=the folder where tomcat or jetty will look to load SOLR.

          1) fresh copy of solr 1.4.1 distribution unzipped to SOLR_DIST

          2) update SOLR_DIST/contrib/extraction/lib with the following:
          jempbox-1.2.1.jar
          fontbox-1.2.1.jar
          pdfbox-1.2.1.jar
          tika-core-0.8-SNAPSHOT.jar
          tika-parsers-0.8-SNAPSHOT.jar
          (and remove old tika and pdfbox-related jars)

          3) patch with Tommaso's patch above in the SOLR_DIST directory:
          patch -p0 < SOLR1902_patch_to_141.txt

          4) still in SOLR_DIST, run "ant dist"

          5) copy SOLR_DIST/dist/*.jar to SOLR_HOME/lib
          6) copy SOLR_DIST/dist/solrj-lib to SOLR_HOME/lib/solrj-lib
          7) copy SOLR_DIST/dist/apache-solr-1.4.2-dev.war to SOLR_HOME/
          8) remove SOLR_HOME/contrib/extraction/lib/*.jar
          9) copy SOLR_DIST/contrib/extraction/lib/*.jar to SOLR_HOME/contrib/extraction/lib/
          10) (for me in tomcat) add CATALINA_HOME/conf/Catalina/localhost/solr.xml with the following content (substitute the actual directory for <SOLR_HOME> as that is not correct syntax):
          <?xml version="1.0" encoding="utf-8"?>
          <Context docBase="<SOLR_HOME>\apache-solr-1.4.2-dev.war" debug="0" crossContext="true">
          <Environment name="solr/home" type="java.lang.String" value="<SOLR_HOME>" override="true"/>
          </Context>
          11) restart tomcat.
          12) upload content via curl.
          13) jump for joy when it doesn't crash on me again...=)

          Best,
          Dave

          Grant Ingersoll added a comment -

          OK, so it seems that the comments on this were all about 1.4 and 1.4.1, which were never upgraded. I believe trunk is working, so I'm going to mark this as a Fix Version for 1.4.2 as well and put up a patch for that based on the patch here.

          Grant Ingersoll added a comment -

          Trunk, branch-1.4 (i.e. 1.4.2) and branch-3.x should all be on the same version of Tika at this point.

          Grant Ingersoll made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Fix Version/s 1.4.2 [ 12315231 ]
          Fix Version/s 3.1 [ 12314371 ]
          Resolution Fixed [ 1 ]
          Lance Norskog made changes -
          Link This issue is related to SOLR-2101 [ SOLR-2101 ]
          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release

          Grant Ingersoll made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee: Grant Ingersoll
            • Reporter: Grant Ingersoll
            • Votes: 1
            • Watchers: 4