SOLR-4908

SolrContentHandler produces glued words when extracting HTML

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 4.3
    • Fix Version/s: None
    • Labels:
      None
    • Environment:

      Windows 7, 64bit, Solr 4.3 example

      Description

      The SolrContentHandler produces glued words when extracting text from HTML documents like:

      <html><head></head><body>glued<br/>words</body></html>
      

      This was solved in Tika (TIKA-343), but the problem still occurs when using the extraction handler because the SolrContentHandler discards ignorableWhitespace events.
      The Tika XHTMLContentHandler issues ignorableWhitespace events with a newline in the character stream when a <br> tag is encountered.

      The SolrContentHandler should be modified to add the ignorable whitespace to the content.
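
      The proposed change can be illustrated with a minimal sketch. This is not the actual SolrContentHandler code: the class name WhitespaceKeepingHandler is hypothetical, and the SAX events in main are simulated by hand to match the sequence described above for "glued<br/>words". It only shows the idea of appending ignorable whitespace to the collected text instead of dropping it.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only (not Solr's implementation).
public class WhitespaceKeepingHandler extends DefaultHandler {
    private final StringBuilder content = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        content.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // DefaultHandler does nothing here by default; keeping the
        // whitespace is what prevents "glued" + "words" from merging.
        content.append(ch, start, length);
    }

    public String getContent() {
        return content.toString();
    }

    public static void main(String[] args) {
        WhitespaceKeepingHandler handler = new WhitespaceKeepingHandler();
        // Simulated event sequence for: glued<br/>words
        handler.characters("glued".toCharArray(), 0, 5);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("words".toCharArray(), 0, 5);
        // Make the newline visible when printing.
        System.out.println(handler.getContent().replace("\n", "\\n")); // prints glued\nwords
    }
}
```

      Without the ignorableWhitespace override, the same event sequence would yield "gluedwords".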

      Reproduction Steps:

      1. POST the HTML example file from the attachments to http://localhost:8983/solr/update/extract?literal.id=html-test-1&commit=true (e.g. with curl or HTTP Requester Plugin in Firefox)
      2. Query for the document http://localhost:8983/solr/collection1/select?q=id%3A%22html-test-1%22&fl=content&wt=xml&indent=true
      3. Look for the field content, which contains the word "Shouldnotbeglued"
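
      For anyone who prefers to script the steps above, here is a rough sketch using the JDK's java.net.http client (Java 11+). The two URLs are taken verbatim from the steps. The class name, the text/html Content-Type header, and the default file name are assumptions, and the program needs the Solr 4.3 example running locally.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for reproducing the report.
public class ReproduceGluedWords {
    static final String BASE = "http://localhost:8983/solr";

    // Step 1: POST the HTML bytes to the extraction handler.
    public static HttpRequest extractRequest(byte[] html) {
        return HttpRequest.newBuilder(
                URI.create(BASE + "/update/extract?literal.id=html-test-1&commit=true"))
            .header("Content-Type", "text/html") // assumed content type
            .POST(HttpRequest.BodyPublishers.ofByteArray(html))
            .build();
    }

    // Step 2: query for the indexed document, returning only the content field.
    public static HttpRequest queryRequest() {
        return HttpRequest.newBuilder(
                URI.create(BASE + "/collection1/select?q=id%3A%22html-test-1%22&fl=content&wt=xml&indent=true"))
            .GET()
            .build();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args.length > 0 ? args[0] : "tika-test.html");
        HttpClient client = HttpClient.newHttpClient();

        HttpResponse<String> posted =
            client.send(extractRequest(Files.readAllBytes(file)), HttpResponse.BodyHandlers.ofString());
        System.out.println("extract status: " + posted.statusCode());

        // Step 3: inspect the printed content field for glued words.
        HttpResponse<String> result =
            client.send(queryRequest(), HttpResponse.BodyHandlers.ofString());
        System.out.println(result.body());
    }
}
```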
      1. tika-test.html
        0.1 kB
        Markus Schuch

          Activity

          ASF subversion and git services added a comment -

          Commit 1512297 from Uwe Schindler in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1512297 ]

          Merged revision(s) 1512296 from lucene/dev/trunk:
          SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.

          ASF subversion and git services added a comment -

          Commit 1512296 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1512296 ]

          SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.

          Hoss Man added a comment -

          resolving as a Dup of SOLR-4679, but thank you so much for your investigation into the root cause ... that really helps.

          The Tika XHTMLContentHandler issues ignorableWhitespace events with a newline in the character stream when a <br> tag is encountered.

          The SolrContentHandler should be modified to add the ignorable whitespace to the content.

          I don't believe this modification to SolrContentHandler would actually make sense. The fact that <br> tags in HTML only produce ignorableWhitespace events in the resulting XHTML SAX stream seems like a bug in Tika, so I've opened TIKA-1134 to try to get to the bottom of it.


            People

            • Assignee: Unassigned
            • Reporter: Markus Schuch
            • Votes: 0
            • Watchers: 2
