SOLR-7139

SolrContentHandler for TIKA is broken by TikaOCR (caused by multiple startDocument events)

    Details

      Description

      While testing my large scale Tika/SolrCell indexing (great work on /extraction guys, really really appreciate it) on my 40M image dataset, I was pulling my frickin' hair out trying to figure out why the TesseractOCR extracted content wasn't actually making it into the index. Well I figured it out lol (many many System.out.printlns later) - it's the disabling of div tags (=>ignored) in the default solrconfig.xml. This basically renders TesseractOCR output in SolrCell useless since it is surrounded by a div tag.

      1. SOLR-7139.Mattmann.022115.patch.txt (1 kB, Chris A. Mattmann)
      2. SOLR-7139.patch (5 kB, Uwe Schindler)

          Activity

          Chris A. Mattmann added a comment -

          More info on this - apparently even removing the ignored default params in SolrCell doesn't fix it. I'm debugging further and will report back soon.

          Chris A. Mattmann added a comment -

          OK here is some more info: apparently startDocument is invoked more than 1x on the TesseractOCR Tika output, causing the SolrContentHandler to reset builder fields, making the div field get stomped.
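
The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not the actual SolrContentHandler code: `ResettingHandler`, `emitDocument`, and the sample strings are hypothetical names made up for illustration, and the order of the two passes is an assumption. It shows how a SAX handler that re-initializes its buffer in `startDocument()` loses everything collected before a second `startDocument` event.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class ResettingHandlerDemo {

    /** Hypothetical stand-in for a handler that resets its state in startDocument(). */
    static class ResettingHandler extends DefaultHandler {
        private StringBuilder text = new StringBuilder();

        @Override
        public void startDocument() {
            // Every startDocument event discards whatever was collected so far.
            text = new StringBuilder();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String collected() {
            return text.toString();
        }
    }

    /** Emits one <div> containing the given text, wrapped in start/endDocument events. */
    static void emitDocument(DefaultHandler handler, String body) throws SAXException {
        handler.startDocument();
        handler.startElement("", "div", "div", new AttributesImpl());
        handler.characters(body.toCharArray(), 0, body.length());
        handler.endElement("", "div", "div");
        handler.endDocument();
    }

    /** Sends two documents to the same handler instance and returns what survived. */
    static String run() {
        ResettingHandler handler = new ResettingHandler();
        try {
            emitDocument(handler, "ocr text");        // first pass (assumed: OCR output)
            emitDocument(handler, "image metadata");  // second pass triggers the reset
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return handler.collected();
    }

    public static void main(String[] args) {
        System.out.println(run()); // only the second pass survives
    }
}
```

In this sketch the text from the first pass is gone by the time the handler is read, which matches the "stomped" behavior reported here.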

          Chris A. Mattmann added a comment -

          Got it! It's because of TIKA-1445 - to make sure we get ImageParser metadata, along with OCR text, we had to call the handler 2x. That's why startDocument is invoked 2x. There is probably a workaround though to be done in Solr, which is to flush the current field builder information if it hasn't been flushed yet. Going to whip up a patch, see if it fixes it, and if so, throw it up.

          Chris A. Mattmann added a comment -

          You can see the relevant Tika code here: http://svn.apache.org/repos/asf/tika/tags/1.7/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java

          Chris A. Mattmann added a comment -
          • trivial patch that fixes SolrCell for OCR in 4.10.3 branch.
          Chris A. Mattmann added a comment -

          ok all fixed, hope you find it useful!

          Uwe Schindler added a comment -

          Hi Chris,
          I understand the problem. We can add the workaround to SolrContentHandler until this double "startDocument" problem is solved in TIKA.

          Uwe Schindler added a comment -

          Chris A. Mattmann pointed me to TIKA-1445. The problems described there are causing the duplicate startDocument events. For TIKA 1.7 we can only work around by filtering them.

          Uwe Schindler added a comment -

          Hi,
          I analyzed the whole thing. Basically, the simplest fix is to remove the whole startDocument() method because it does not do anything useful. The whole setup for a new document is already done by the constructor.
          The startDocument setup looks like the original code writer wanted to "reuse" instances. But in fact this is never done (I checked extraction and morphlines).
          I will attach a patch that removes the startDocument() and adds documentation to javadocs that you can only process one document.
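
The approach described above can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: `SingleUseHandler`, `emitDocument`, and the sample strings are hypothetical. The handler keeps its state in a final field set up at construction, does not override `startDocument()`, and is meant to be used for exactly one document, so a repeated `startDocument` event cannot wipe earlier content.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SingleUseHandlerDemo {

    /** Hypothetical single-use handler: create one instance per document. */
    static final class SingleUseHandler extends DefaultHandler {
        // final: the buffer is created once at construction and never replaced.
        private final StringBuilder text = new StringBuilder();

        // startDocument() is intentionally not overridden. The inherited
        // DefaultHandler implementation is a no-op, so repeated events are harmless.

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        String collected() {
            return text.toString();
        }
    }

    /** Emits one <div> containing the given text, wrapped in start/endDocument events. */
    static void emitDocument(DefaultHandler handler, String body) throws SAXException {
        handler.startDocument();
        handler.startElement("", "div", "div", new AttributesImpl());
        handler.characters(body.toCharArray(), 0, body.length());
        handler.endElement("", "div", "div");
        handler.endDocument();
    }

    /** Sends two start/endDocument cycles to one handler and returns what was kept. */
    static String run() {
        SingleUseHandler handler = new SingleUseHandler();
        try {
            emitDocument(handler, "ocr text ");       // first pass
            emitDocument(handler, "image metadata");  // second startDocument changes nothing
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return handler.collected();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The trade-off in this design is that an instance can no longer be reused across documents, which is why the javadoc note about processing only one document matters.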

          Uwe Schindler added a comment - - edited

          Patch. This also makes the fields final to ensure no accidental reuse & co.

          I would also like to get this into 4.10.4, because Mattmann set this as the fix version and marked the issue "critical".

          ASF subversion and git services added a comment -

          Commit 1662457 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1662457 ]

          SOLR-7139: Fix SolrContentHandler for TIKA to ignore multiple startDocument events

          ASF subversion and git services added a comment -

          Commit 1662461 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1662461 ]

          Merged revision(s) 1662457 from lucene/dev/trunk:
          SOLR-7139: Fix SolrContentHandler for TIKA to ignore multiple startDocument events

          ASF subversion and git services added a comment -

          Commit 1662462 from Uwe Schindler in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1662462 ]

          Merged revision(s) 1662457 from lucene/dev/trunk:
          SOLR-7139: Fix SolrContentHandler for TIKA to ignore multiple startDocument events

          Chris A. Mattmann added a comment -

          Thank you Uwe Schindler! You rock!

          Uwe Schindler added a comment -

          Can you also confirm this in your case?

          Chris A. Mattmann added a comment -

          Sure I'll download the latest branch 4_10 and then test it out ASAP and report back.

          Uwe Schindler added a comment -

          There is already a RC: http://people.apache.org/~mikemccand/staging_area/lucene-solr-4.10.4-RC0-rev1662817/

          Chris A. Mattmann added a comment -

          awesome thanks Uwe and Mike. Will report back.

          Chris A. Mattmann added a comment -

          sorry haven't had a chance to test this yet. Hopefully tomorrow or Friday.

          Michael McCandless added a comment -

          Bulk close for 4.10.4 release


            People

            • Assignee: Uwe Schindler
            • Reporter: Chris A. Mattmann