SOLR-7137

Upgrade to Tika 1.7 in 4_10_3 branch

    Details

      Description

      I have been trying out SolrCell as an alternative for ingesting around 40M images using Tesseract/OCR and Tika. I noticed that in 4.10.3 Tika is pinned to 1.5. With Tika 1.5 in SolrCell 4.10.3, only about 5600 images out of a subset of 50,000 are ingested when I run a series of 50k cURL commands against the extract handler. I had a feeling it had something to do with the fact that some of the extracted characters are oddball characters (4@#@#/ ^^^^) due to Tesseract not always extracting the right text. But then I remembered Tesseract didn't land in Tika until 1.7.

      So regardless, I thought I'd upgrade the 4.10.x branch to Tika 1.7. The attached patch to do so is trivial (Tika + compress updates). Now all 50K images in the 50K subset are ingested, but I'm noticing something else weird. Despite the fact that Tesseract is called, and despite the fact that on certain images I can verify text is extracted by running Tesseract from the command line on that file, all I am getting in the "content" field of SolrCell is a bunch of "\n \n \n \n \n \n" text. So the text is extracted, there are weird characters, but they don't make it into Solr. Extremely odd.
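
      The symptom described above (documents are accepted, but "content" holds only whitespace) can be checked per image by asking the extract handler to return what Tika produced instead of indexing it. The following is a minimal sketch, not the reporter's actual script: the base URL, the document id, and the helper names build_extract_url, is_blank_extraction, and extract_only are assumptions made for illustration. It relies only on the handler's extractOnly, extractFormat, and literal.* request parameters.

```python
import urllib.parse
import urllib.request


def build_extract_url(base_url, doc_id, extract_only=True):
    """Build a URL for Solr's extracting request handler (/update/extract).

    base_url is an assumed core URL such as http://localhost:8983/solr.
    """
    params = {"literal.id": doc_id}
    if extract_only:
        # extractOnly makes the handler return Tika's output instead of indexing it.
        params["extractOnly"] = "true"
        params["extractFormat"] = "text"
    else:
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def is_blank_extraction(text):
    """True when the extracted text is empty or whitespace only (e.g. '\\n \\n \\n')."""
    return text.strip() == ""


def extract_only(base_url, doc_id, image_path, content_type="image/png"):
    """POST one image to the handler and return the raw response body as text.

    Requires a running Solr instance; the response still carries Solr's
    response wrapper around the extracted text.
    """
    with open(image_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(base_url, doc_id),
        data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

      Running is_blank_extraction over the text pulled out of each response would separate images for which Tesseract produced nothing from images whose text is lost later, on the Solr side.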

          Activity

          chrismattmann Chris A. Mattmann added a comment -

          I originally set this to a blocker, sorry. To me this is a blocker, since now I have to use my alternative method to ingest (which pushes Tika to the client side using OODT; but I was hoping to do all of this with SolrCell). So to me it's at least critical/Major, but it's up to you guys to decide of course.

          thetaphi Uwe Schindler added a comment -

          Use Solr 5. It has Tika 1.7

          thetaphi Uwe Schindler added a comment -

          We will not update TIKA in a bugfix version.

          You can easily replace the JAR files in Solr 4.10 so that it runs with TIKA 1.7 (only some test cases may fail because of minor problems). But the problem with the Tesseract parser is a different issue. We may fix it in a possible 4.10.4 release.

          thetaphi Uwe Schindler added a comment -

          Sorry, resolution was wrong

          thetaphi Uwe Schindler added a comment -

          FYI, the patch for a full upgrade to TIKA 1.7 is in SOLR-6991 (yours does not upgrade all required dependencies).

          thetaphi Uwe Schindler added a comment -

          See also SOLR-6488 (must be applied first).

          chrismattmann Chris A. Mattmann added a comment -

          Wow that's a lot Uwe! My patch worked for me on the 4_10_3 branch, along with SOLR-7139 (and updating solrconfig.xml to not set div tags to _ignored), so maybe it's b/c I'm only doing image parsing/OCR and not running any of the other deps.

          thetaphi Uwe Schindler added a comment -

          Most of those patches are hashes of JAR files and some license changes. In fact, it is indeed enough to update the ivy.properties file with all upgraded versions; only the release process and validation tasks of Solr would not pass.

          As said before, it is also enough to download TIKA 1.7 and drop its JAR files into the contrib/extraction/lib folder of your Solr installation.
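
          The JAR swap described in this comment could be scripted roughly as follows. This is a sketch under assumptions, not a documented Solr procedure: the helper names artifact_of and replace_jars are hypothetical, the directory paths are whatever your installation uses, and the script assumes JAR files are named artifact-version.jar. It removes the older copy of every artifact that is about to be replaced, so that two versions of the same library do not end up side by side on the classpath, and then copies the new JARs in.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme: <artifact>-<version>.jar, where the version starts with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def artifact_of(jar_name):
    """Return the artifact part of a JAR file name, or None if it does not match."""
    match = JAR_NAME.match(jar_name)
    return match.group("artifact") if match else None


def replace_jars(new_jar_dir, solr_lib_dir):
    """Copy JARs from new_jar_dir into solr_lib_dir, removing older versions first.

    Returns the names of the JAR files that were removed from solr_lib_dir.
    """
    new_jar_dir, solr_lib_dir = Path(new_jar_dir), Path(solr_lib_dir)
    new_jars = sorted(new_jar_dir.glob("*.jar"))
    new_artifacts = {artifact_of(jar.name) for jar in new_jars} - {None}

    removed = []
    for old in sorted(solr_lib_dir.glob("*.jar")):
        # Delete only JARs whose artifact is being replaced; leave the rest alone.
        if artifact_of(old.name) in new_artifacts:
            old.unlink()
            removed.append(old.name)

    for jar in new_jars:
        shutil.copy2(jar, solr_lib_dir / jar.name)
    return removed
```

          A call might look like replace_jars("/path/to/tika-1.7-jars", "/path/to/solr/contrib/extraction/lib"), with both paths being placeholders. Backing up the lib folder first is a sensible precaution.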

          chrismattmann Chris A. Mattmann added a comment -

          Thanks dude got it


            People

            • Assignee:
              thetaphi Uwe Schindler
              Reporter:
              chrismattmann Chris A. Mattmann