Solr
  1. Solr
  2. SOLR-4679

HTML line breaks (<br>) are removed during indexing; causes wrong search results

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.2
    • Fix Version/s: 4.5
    • Labels:
      None
    • Environment:

      Windows Server 2008 R2, Java 6, Tomcat 7

      Description

      HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during extraction of content from HTML-Files. They need to be replaced with a empty space.

      Test-File:
      <html>
      <head>
      <title>Test mit HTML-Zeilenschaltungen</title>
      </head>
      <p>
      word1<br>word2<br/>
      Some other words, a special name like linz<br>and another special name - vienna
      </p>
      </html>

      The Solr-content-attribute contains the following text:
      Test mit HTML-Zeilenschaltungen
      word1word2
      Some other words, a special name like linzand another special name - vienna

      So we are not able to find the word "linz".

      We use the ExtractingRequestHandler to put content into Solr. (wiki.apache.org/solr/ExtractingRequestHandler)

      1. external.htm
        0.2 kB
        Christoph Straßer
      2. Solr_HtmlLineBreak_Linz_NotFound.png
        89 kB
        Christoph Straßer
      3. Solr_HtmlLineBreak_Vienna.png
        108 kB
        Christoph Straßer
      4. SOLR-4679__weird_TIKA-1134.patch
        2 kB
        Hoss Man

        Issue Links

          Activity

          Hide
          Christoph Straßer added a comment -

          File to reproduce the issue.

          Show
          Christoph Straßer added a comment - File to reproduce the issue.
          Hide
          Christoph Straßer added a comment -

          Screenshots showing the issue within Solr-Console

          Show
          Christoph Straßer added a comment - Screenshots showing the issue within Solr-Console
          Hide
          Hoss Man added a comment -

          FYI, i've confirmed this isn't a general problem with Tika 1.3 – using the tika-app.jar with the "--text" option there is a newline generated in place of the <br/> tag, so something about Solr's use of Tika is loosing this information...

          hossman@frisbee:~/tmp$ cat external.htm 
          <html>
          <head>
          <title>Test mit HTML-Zeilenschaltungen</title>
          </head>
          <p>
          word1<br>word2<br/>
          Some other words, a special name like linz<br>and another special name - vienna
          </p>
          </html>hossman@frisbee:~/tmp$ java -jar tika-app-1.3.jar --text external.htm 
          
          word1
          word2
          
          Some other words, a special name like linz
          and another special name - vienna
          
          
          hossman@frisbee:~/tmp$ java -jar tika-app-1.3.jar --text external.htm | cat -vet
          $
          word1$
          word2$
          $
          Some other words, a special name like linz$
          and another special name - vienna$
          $
          $
          
          Show
          Hoss Man added a comment - FYI, i've confirmed this isn't a general problem with Tika 1.3 – using the tika-app.jar with the "--text" option there is a newline generated in place of the <br/> tag, so something about Solr's use of Tika is loosing this information... hossman@frisbee:~/tmp$ cat external.htm <html> <head> <title>Test mit HTML-Zeilenschaltungen</title> </head> <p> word1<br>word2<br/> Some other words, a special name like linz<br>and another special name - vienna </p> </html>hossman@frisbee:~/tmp$ java -jar tika-app-1.3.jar --text external.htm word1 word2 Some other words, a special name like linz and another special name - vienna hossman@frisbee:~/tmp$ java -jar tika-app-1.3.jar --text external.htm | cat -vet $ word1$ word2$ $ Some other words, a special name like linz$ and another special name - vienna$ $ $
          Hide
          Christoph Straßer added a comment -

          Thank you for checking Tika.

          As far as i understand http://wiki.apache.org/solr/ExtractingRequestHandler extracts XHTML, not text. Tika XHTML-option-output looks okay too.

          Root issue - like you said - probably somewhere within Solr.

          D:\temp\20130409>java -jar tika-app-1.3.jar --xml external.htm
          <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
          <head>
          <meta name="Content-Length" content="193"/>
          <meta name="Content-Encoding" content="windows-1252"/>
          <meta name="Content-Type" content="text/html; charset=windows-1252"/>
          <meta name="resourceName" content="external.htm"/>
          <meta name="dc:title" content="Test mit HTML-Zeilenschaltungen"/>
          <title>Test mit HTML-Zeilenschaltungen</title>
          </head>
          <body><p>
          word1
          word2
          
          Some other words, a special name like linz
          and another special name - vienna
          </p>
          
          </body></html>
          
          Show
          Christoph Straßer added a comment - Thank you for checking Tika. As far as i understand http://wiki.apache.org/solr/ExtractingRequestHandler extracts XHTML, not text. Tika XHTML-option-output looks okay too. Root issue - like you said - probably somewhere within Solr. D:\temp\20130409>java -jar tika-app-1.3.jar --xml external.htm <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta name="Content-Length" content="193"/> <meta name="Content-Encoding" content="windows-1252"/> <meta name="Content-Type" content="text/html; charset=windows-1252"/> <meta name="resourceName" content="external.htm"/> <meta name="dc:title" content="Test mit HTML-Zeilenschaltungen"/> <title>Test mit HTML-Zeilenschaltungen</title> </head> <body><p> word1 word2 Some other words, a special name like linz and another special name - vienna </p> </body></html>
          Hide
          Hoss Man added a comment -

          Right ... i wonder if somewhere in the flow of SAX events these newline are being treated as "ignorable whitespace" ... i can't imagine why they would be, but that's the best guess i have at the moment.

          Show
          Hoss Man added a comment - Right ... i wonder if somewhere in the flow of SAX events these newline are being treated as "ignorable whitespace" ... i can't imagine why they would be, but that's the best guess i have at the moment.
          Hide
          Hoss Man added a comment -

          i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for <br> tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

          It's possible that SolrContentHandler should be changed to "fix" this behavior, but not in any clear way i can see (if we start treating all ignorable whitespace as significant that would lead to other problems when there is in fact ignorable whitespace)

          Show
          Hoss Man added a comment - i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for <br> tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it. It's possible that SolrContentHandler should be changed to "fix" this behavior, but not in any clear way i can see (if we start treating all ignorable whitespace as significant that would lead to other problems when there is in fact ignorable whitespace)
          Hide
          Hoss Man added a comment -

          patch includes a test demonstrating hte problem in Solr, and an example of how we could work around this in SolrContentHandler – but i don't think the workarround is a good idea ... not w/o a lot more careful thought about how all that extra ignorblae whitespace might affect people (not just from html docs, but from any other types of docs where Tika produces ignorable whitespace sax events)

          Show
          Hoss Man added a comment - patch includes a test demonstrating hte problem in Solr, and an example of how we could work around this in SolrContentHandler – but i don't think the workarround is a good idea ... not w/o a lot more careful thought about how all that extra ignorblae whitespace might affect people (not just from html docs, but from any other types of docs where Tika produces ignorable whitespace sax events)
          Hide
          Uwe Schindler added a comment - - edited

          There is another occurence of this bug with PDF files (SOLR-5124). I think we should apply the workaround and make the ignoreable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so some additional whitespace would disappear.

          i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for <br> tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

          I know this bug and I was discussing about that since the early beginning in TIKA and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes for that to "corectly produce" ignorable whitespace in some parsers, which were missing to do this. I also added the XHTMLContentHandler stuff that makes "block" XHTML elements like <p/>, <div/> also emit a newline as ignoreable on the closing element, see TIKA-171).

          FYI: "ignoreable whitespace" is XML semantics only, in (X)HTML this does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to "reuse" (its a bit "incorrect") the ignoreableWhitespace SAX event to report this "added whitespace". The rule that was choosen in TIKA is:

          • If you ignore all elements of HTML and only extract plain text, use the ignoreable whitespace. This is e.g. done by the TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignoreable whitepsace as significant. Ignoreable whitespace is only produced by TIKA so if it exists, you know that it is coming from TIKA.
          • If you want to keep the XHTML structure and you "understand" block tags and <br/>, then you can ignore the ignorable whitespace.

          Regarding this guideline, your patch is correct and should be applied to solr.

          Show
          Uwe Schindler added a comment - - edited There is another occurence of this bug with PDF files ( SOLR-5124 ). I think we should apply the workaround and make the ignoreable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so some additional whitespace would disappear. i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for <br> tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it. I know this bug and I was discussing about that since the early beginning in TIKA and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes for that to "corectly produce" ignorable whitespace in some parsers, which were missing to do this. I also added the XHTMLContentHandler stuff that makes "block" XHTML elements like <p/>, <div/> also emit a newline as ignoreable on the closing element, see TIKA-171 ). FYI: "ignoreable whitespace" is XML semantics only, in (X)HTML this does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to "reuse" (its a bit "incorrect") the ignoreableWhitespace SAX event to report this "added whitespace". The rule that was choosen in TIKA is: If you ignore all elements of HTML and only extract plain text, use the ignoreable whitespace. This is e.g. done by the TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignoreable whitepsace as significant. Ignoreable whitespace is only produced by TIKA so if it exists, you know that it is coming from TIKA. If you want to keep the XHTML structure and you "understand" block tags and <br/>, then you can ignore the ignorable whitespace. Regarding this guideline, your patch is correct and should be applied to solr.
          Hide
          Uwe Schindler added a comment - - edited

          The stuff with ignorableWhitespace was discussed between Jukka Zitting and me in TIKA-171. I think this was the issue when we decided to emit ignorableWhitespace for all "synthetic" whitespace added to support text-only extraction.

          Hoss Man: I can take the issue if you like. I am +1 to committing your current patch, because it makes use of the stuff we decided in TIKA-171. In my opinion, TIKA-1134 is obsolete but you/I can add a comments there to explain "one more time" and document under which circumstances TIKA emits ignorableWhitepsace.

          Show
          Uwe Schindler added a comment - - edited The stuff with ignorableWhitespace was discussed between Jukka Zitting and me in TIKA-171 . I think this was the issue when we decided to emit ignorableWhitespace for all "synthetic" whitespace added to support text-only extraction. Hoss Man : I can take the issue if you like. I am +1 to committing your current patch, because it makes use of the stuff we decided in TIKA-171 . In my opinion, TIKA-1134 is obsolete but you/I can add a comments there to explain "one more time" and document under which circumstances TIKA emits ignorableWhitepsace.
          Hide
          Hoss Man added a comment -

          Uwe: I defer to your judgement on this. if you think the patch is hte right way to go, then +1 from me.

          Show
          Hoss Man added a comment - Uwe: I defer to your judgement on this. if you think the patch is hte right way to go, then +1 from me.
          Hide
          Uwe Schindler added a comment -

          Hoss: I just took this issue because it was unassigned and I was the one mandating to add ignorable whitespace at that time in TIKA. So Jukka and I decided this would be the best.

          Because you are still not convinced with my argumentation, let me recapitulate TIKA's problems:

          • TIKA decided to use XHTML as its output format to report the parsed documents to the consumer. This is nice, because it allows to preserve some of the formatting (like bold fonts, paragraphs,...) originating from the original document. Of course most of this formatting is lost, but you can still "detect" things like emphasized text. By choosing XHTML as output format, of course TIKA must use XHTML formatting for new lines and similar. So whenever a line break is needed, the TIKA pasrer emits a <br/> tag or places the "paragraph" (in a PDF) inside a <p/> element. As we all know, HTML ignores formatting like newlines, tabs,... (all are treated as one single whitespace, so means like this regreplace: s/\s+/ /
          • On the other hand, TIKA wants to make it simple for people to extract the plain text contents. With the XHTML-only approach this would be hard for the consumer. Because to add the correct newlines, the consumer has to fully understand XHTML and detect block elements and replace them by \n

          To support both usages of TIKA the idea was to embed this information which is unimportant to HTML (as HTML ignores whitespaces completely) as ignorableWhitespace as "convenience" for the user. A fully compliant XHTML consumer would not parse the ignoreable stuff. As it understands HTML it would detect a <p> element as a block element and format the output.

          Solr unfortunately has some strange approach: It is mainly interested in the text only contents, so ideally when consuming the HTLL it could use WriteoutContentHandler(StringBuilder, BodyContentHandler(parserConmtentHandler). In that case TIKA would do the right thing automatically: It would extract only text from the body element and would use the "convenience whitespace" to format the text in ASCII-ART-like way (using tabs, newlines,...)
          Solr has a hybrid approach: It collects all into a content tag (which is similar to the above approcha), but the bug is that in contrast to TIKA's official WriteOutContentHandler it does not use the ignorable whitespace inserted for convenience. In addition TIKA also has a stack where it allows to process parts of the documents (like the title element or all <em> elements). In that case it has several StringBuilders in parallel that are populated with the contents. The problems are here too, but cannot be solved by using ignorable whitespace: e.g. one indexes only all <em> elements (which are inline HTML elements no block elements), there is no whitespace so all em elements would be glued together in the em field of your index... I just mention this, in my opinion the SolrContentHandler needs more work to "correctly" understand HTML and not just collect element names in a map!

          Now to your complaint: You proposed to report the newlines as real character() events - but this is not the right thing to do here. As I said, HTML does not know these characters, they are ignored. The "formatting" is done by the element names (like <p>, <div>, <table>). So the "helper" whitespace for text-only consumers should be inserted as ignorableWhitespace only, if we would add it to the real character data we would report things that every HTML parser (like nekohtml) would never report to the consumer. Nekohtml would also report this useless extra whitespace as ignorable.

          The convenience here is that TIKA's XHTMLContentHandler used by all parsers is "configured" to help the text-only user, but don't hurt the HTML-only user. This differentiation is done by reporting the HTML element names (p, div, table, th, td, tr, abbr, em, strong,...) but also report the ASCII-ART-text-only content like TABs indide tables, newlines after block elements,... This is always done as ignorableWhitespace (for convenience), a real HTML parser must ignore it - and its correct to do this.

          Show
          Uwe Schindler added a comment - Hoss: I just took this issue because it was unassigned and I was the one mandating to add ignorable whitespace at that time in TIKA. So Jukka and I decided this would be the best. Because you are still not convinced with my argumentation, let me recapitulate TIKA's problems: TIKA decided to use XHTML as its output format to report the parsed documents to the consumer. This is nice, because it allows to preserve some of the formatting (like bold fonts, paragraphs,...) originating from the original document. Of course most of this formatting is lost, but you can still "detect" things like emphasized text. By choosing XHTML as output format, of course TIKA must use XHTML formatting for new lines and similar. So whenever a line break is needed, the TIKA pasrer emits a <br/> tag or places the "paragraph" (in a PDF) inside a <p/> element. As we all know, HTML ignores formatting like newlines, tabs,... (all are treated as one single whitespace, so means like this regreplace: s/\s+/ / On the other hand, TIKA wants to make it simple for people to extract the plain text contents. With the XHTML-only approach this would be hard for the consumer. Because to add the correct newlines, the consumer has to fully understand XHTML and detect block elements and replace them by \n To support both usages of TIKA the idea was to embed this information which is unimportant to HTML (as HTML ignores whitespaces completely) as ignorableWhitespace as "convenience" for the user. A fully compliant XHTML consumer would not parse the ignoreable stuff. As it understands HTML it would detect a <p> element as a block element and format the output. Solr unfortunately has some strange approach: It is mainly interested in the text only contents, so ideally when consuming the HTLL it could use WriteoutContentHandler(StringBuilder, BodyContentHandler(parserConmtentHandler) . In that case TIKA would do the right thing automatically: It would extract only text from the body element and would use the "convenience whitespace" to format the text in ASCII-ART-like way (using tabs, newlines,...) Solr has a hybrid approach: It collects all into a content tag (which is similar to the above approcha), but the bug is that in contrast to TIKA's official WriteOutContentHandler it does not use the ignorable whitespace inserted for convenience. In addition TIKA also has a stack where it allows to process parts of the documents (like the title element or all <em> elements). In that case it has several StringBuilders in parallel that are populated with the contents. The problems are here too, but cannot be solved by using ignorable whitespace: e.g. one indexes only all <em> elements (which are inline HTML elements no block elements), there is no whitespace so all em elements would be glued together in the em field of your index... I just mention this, in my opinion the SolrContentHandler needs more work to "correctly" understand HTML and not just collect element names in a map! Now to your complaint: You proposed to report the newlines as real character() events - but this is not the right thing to do here. As I said, HTML does not know these characters, they are ignored. The "formatting" is done by the element names (like <p>, <div>, <table>). So the "helper" whitespace for text-only consumers should be inserted as ignorableWhitespace only, if we would add it to the real character data we would report things that every HTML parser (like nekohtml) would never report to the consumer. Nekohtml would also report this useless extra whitespace as ignorable. The convenience here is that TIKA's XHTMLContentHandler used by all parsers is "configured" to help the text-only user, but don't hurt the HTML-only user. This differentiation is done by reporting the HTML element names (p, div, table, th, td, tr, abbr, em, strong,...) but also report the ASCII-ART-text-only content like TABs indide tables, newlines after block elements,... This is always done as ignorableWhitespace (for convenience), a real HTML parser must ignore it - and its correct to do this.
          Hide
          Hoss Man added a comment -

          Because you are still not convinced with my argumentation, let me recapitulate TIKA's problems:

          I never said that ... you said "I can take the issue if you like." and you explained why the existing patch should be committed – i'm totally willing to go along with that, so have at it. it seems sketchy to me, but if that's the way Tika works that's the way tika works, you certainly understand it better then me, so i defer to your assesment.

          (as mentioned in TIKA-1134 it would be nice if this type of behavior was better documented for people implementing their own ContentHandlers, but that's a Tika issue not a Solr issue.)

          Show
          Hoss Man added a comment - Because you are still not convinced with my argumentation, let me recapitulate TIKA's problems: I never said that ... you said "I can take the issue if you like." and you explained why the existing patch should be committed – i'm totally willing to go along with that, so have at it. it seems sketchy to me, but if that's the way Tika works that's the way tika works, you certainly understand it better then me, so i defer to your assesment. (as mentioned in TIKA-1134 it would be nice if this type of behavior was better documented for people implementing their own ContentHandlers, but that's a Tika issue not a Solr issue.)
          Hide
          Uwe Schindler added a comment -

          I never said that ...

          You somehow said:

          I defer to your judgement on this

          So I assumed that you are still not 100% convinced. Sorry.

          In any case I will take the issue. In my opinion there is more work to be done with this crazy stack of StringBuilders to better handle the ignorableWhitepace when a new field begins/ends. Currently its insered after the block end tag, so it would go one up in the stack only. I have to think a little bit about it, but the fix in your patch is the easiest for now. And the maybe useless whitespace on some lower stacked StringBuilders is generally removed by text analysis.

          Show
          Uwe Schindler added a comment - I never said that ... You somehow said: I defer to your judgement on this So I assumed that you are still not 100% convinced. Sorry. In any case I will take the issue. In my opinion there is more work to be done with this crazy stack of StringBuilders to better handle the ignorableWhitepace when a new field begins/ends. Currently its insered after the block end tag, so it would go one up in the stack only. I have to think a little bit about it, but the fix in your patch is the easiest for now. And the maybe useless whitespace on some lower stacked StringBuilders is generally removed by text analysis.
          Hide
          Christoph Straßer added a comment - - edited

          @Uwe: Big thanks for taking care of this issue!
          @Hoss Man: Thank you for your input!

          Show
          Christoph Straßer added a comment - - edited @Uwe: Big thanks for taking care of this issue! @Hoss Man: Thank you for your input!
          Hide
          ASF subversion and git services added a comment -

          Commit 1512296 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1512296 ]

          SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.

          Show
          ASF subversion and git services added a comment - Commit 1512296 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1512296 ] SOLR-4679 , SOLR-4908 , SOLR-5124 : Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.
          Hide
          ASF subversion and git services added a comment -

          Commit 1512297 from Uwe Schindler in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1512297 ]

          Merged revision(s) 1512296 from lucene/dev/trunk:
          SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.

          Show
          ASF subversion and git services added a comment - Commit 1512297 from Uwe Schindler in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1512297 ] Merged revision(s) 1512296 from lucene/dev/trunk: SOLR-4679 , SOLR-4908 , SOLR-5124 : Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.
          Hide
          Adrien Grand added a comment -

          4.5 release -> bulk close

          Show
          Adrien Grand added a comment - 4.5 release -> bulk close

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Christoph Straßer
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development