SOLR-6856

regression in /update/extract ? ref guide examples of fmap & xpath don't seem to be working

    Details

      Description

      I updated this page to know about the new bin/solr and example/exampledocs structure/contents...
      https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

      however i noticed that several of the examples listed on that page didn't seem to work any more – notably...

      • examples using "fmap" don't seem to create the fields they say they will
      • examples using "xpath" don't seem to create any docs at all

      Specific examples i had problems with...

      curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div&commit=true" -F "sample=@example/exampledocs/sample.html"
      curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc3&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&commit=true" -F "sample=@example/exampledocs/sample.html"
      curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah&commit=true" -F "sample=@example/exampledocs/sample.html"
      curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.id=id&xpath=/xhtml:html/xhtml:body/xhtml:div/descendant:node()&commit=true" -F "sample=@example/exampledocs/sample.html"
      

      ...none of these example commands produced an error, but they also didn't seem to create the fields/docs they said they would (ie: no "foo_t" field was created)

      1. SOLR-6856.patch (8 kB, Steve Rowe)
      2. SOLR-6856.patch (4 kB, Steve Rowe)

          Activity

          Anshum Gupta added a comment -

          Is anyone looking at this yet? This is the only blocker for 5.0 now.

          Steve Rowe added a comment -

          I'm looking into the fmap issue, and I found that the same error is tripped whether defaultField or uprefix is used.

          However, when I switch to capturing <p> elements' content rather than <div> elements', using &fmap.p=foo_t&capture=p, the failure goes away. Looks like the problem is with the capturing phase, not the mapping phase.

          So maybe this is about block vs. inline elements?

          Digging more.

          Steve Rowe added a comment -

          So maybe this is about block vs. inline elements?

          That actually doesn't make any sense, since both <p> and <div> are block elements.

          I looked at the SAX events sent into SolrContentHandler.startElement(), where the capture mechanism is invoked, and it never sees <div> start-tag events, but <p> start-tags come through.

          Steve Rowe added a comment - - edited

          The <div> capture issue (also a mapping issue, since there's nothing to map) stems from Tika 0.6, which introduced a DefaultHtmlMapper that only creates events for a subset of HTML tags (Solr 3.1 upgraded Tika from 0.4 to 0.8). When HtmlMapper.mapSafeElement() returns null for a tag (as it does for any non-mapped tag), its child content is processed, but no event is created for the tag itself. HtmlParser uses DefaultHtmlMapper if no HtmlMapper.class mapping is supplied with ParseContext. Here's the 1.7 DefaultHtmlMapper.SAFE_ELEMENTS definition, where the mappings are initialized: http://svn.apache.org/viewvc/tika/tags/1.7/tika-parsers/src/main/java/org/apache/tika/parser/html/DefaultHtmlMapper.java?view=markup#l33 - no <div> in there.

          The attached patch maps the HtmlMapper.class in ParseContext to IdentityHtmlMapper, which creates events for every HTML element. A new test is added to check that elements including <div> are captured and mapped properly.
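
As a rough illustration of the behavior described above, here is a toy model in plain Java. It does not use Tika's classes: `safeSubset`, `identity`, and `startEvents` are hypothetical names, and the set of mapped tags is made up for the example. It only shows why a handler downstream of a mapper that returns null for `div` never sees a `div` start-tag event, while an identity-style mapper lets every tag through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class MapperSketch {
    // Hypothetical stand-in for a "safe elements" mapper: returns the tag name
    // for mapped tags and null for everything else.
    static String safeSubset(String name) {
        Set<String> mapped = Set.of("p", "h1", "a"); // illustrative, not Tika's real list
        return mapped.contains(name) ? name : null;
    }

    // Hypothetical stand-in for an identity-style mapper: every tag is kept.
    static String identity(String name) {
        return name;
    }

    // Returns the start-element events a downstream handler would see.
    static List<String> startEvents(List<String> tags, Function<String, String> mapper) {
        List<String> seen = new ArrayList<>();
        for (String tag : tags) {
            String mappedName = mapper.apply(tag);
            if (mappedName != null) { // null means "emit no event for this tag"
                seen.add(mappedName);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> tags = List.of("div", "p", "div", "h1");
        System.out.println(startEvents(tags, MapperSketch::safeSubset)); // [p, h1]
        System.out.println(startEvents(tags, MapperSketch::identity));   // [div, p, div, h1]
    }
}
```

With the subset mapper there is nothing for capture=div or fmap.div to act on, which matches the symptom in the report.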

          Steve Rowe added a comment -

          I'm looking into the xpath examples now to see if they are caused by the same issue.

          Uwe Schindler added a comment -

          So this issue has existed since Solr 3.1... Strange that nobody ever complained. I would prefer to clean up the HTML as this mapper does, but the IdentityMapper is also fine with me. I think nobody ever wished to capture useless div elements.

          Thanks Steve for investigating the problem!

          Hoss Man added a comment -

          my main concern when filing this issue is that examples from the ref guide (which presumably worked at some point) did not work (now).

          if folks think the current behavior is "better" than the "fixed" behavior in steve's patch – that's fine with me, we can just resolve the issue by fixing the docs to use better examples.

          i don't understand the diff in behavior enough to have an opinion.

          Alexandre Rafalovitch added a comment -

          I fixed something similar for DIH in 4.3 SOLR-4530

          Steve Rowe added a comment -

          I fixed something similar for DIH in 4.3 SOLR-4530

          Thanks for the pointer Alexandre Rafalovitch.

          Steve Rowe added a comment - - edited

          I noticed when using IdentityHtmlMapper that the <br/> tag causes the string "none\n" to show up in the catch-all content field Tika produces, e.g. for

          <p>distinct<br/>words</p>
          

          the following is extracted in the catch-all field:

           distinctnone
          words
          

          I suspect this is a Tika bug, but I didn't track down why this happens.

          I addressed this problem by copy/pasting IdentityHtmlMapper into a nested class of ExtractingDocumentLoader and overriding mapSafeElement(String name) to return null when name is br - this causes Tika to not output the "none\n" string in the catch-all field.
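
The override described above can be sketched with a hypothetical stand-in class. The comment says the real change copied IdentityHtmlMapper into a nested class; this sketch uses a subclass of a made-up `IdentityLikeMapper` purely for brevity, and it assumes tag names arrive in lower case.

```java
public class BrOverrideSketch {
    // Hypothetical stand-in for an identity-style mapper; not Tika's class.
    static class IdentityLikeMapper {
        String mapSafeElement(String name) {
            return name;
        }
    }

    // Same behavior as the identity-style mapper, except that "br" maps to
    // null, so no element event is produced for it.
    static class NoBrMapper extends IdentityLikeMapper {
        @Override
        String mapSafeElement(String name) {
            return "br".equals(name) ? null : super.mapSafeElement(name);
        }
    }

    public static void main(String[] args) {
        IdentityLikeMapper mapper = new NoBrMapper();
        for (String tag : new String[] {"div", "p", "br"}) {
            System.out.println(tag + " -> " + mapper.mapSafeElement(tag));
        }
    }
}
```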

          I noticed that DefaultHtmlMapper excludes <SCRIPT> and <STYLE> tags and their content, while IdentityHtmlMapper does not. I thought I would need to also address this, because a general-purpose HTML extraction facility should not include <SCRIPT> or <STYLE> content. But apparently Tika handles exclusion of these tags and their content at some location other than the HtmlMapper - even when using IdentityHtmlMapper, no start-element events are created for these tags. Nevertheless, I added a test to make sure that <SCRIPT> and <STYLE> content is not extracted.

          I added a new test to more fully demonstrate that xpath handling works properly.

          I think it's ready to go. I'm running all Solr tests now.

          Anshum Gupta added a comment -

          Thanks for the patch Steve Rowe. It's a blocker but doesn't have a fix version. 5.0 ?

          Steve Rowe added a comment -

          I'm running all Solr tests now.

          BUILD SUCCESSFUL
          Total time: 22 minutes 23 seconds
          

          I'll go and run all the ref guide examples before I commit.

          Steve Rowe added a comment -

          I'll go and run all the ref guide examples before I commit.

          They all run properly, though a couple need adjustment - I'll fix them in a minute.

          Committing shortly.

          ASF subversion and git services added a comment -

          Commit 1654431 from Use account "steve_rowe" instead in branch 'dev/trunk'
          [ https://svn.apache.org/r1654431 ]

          SOLR-6856: Restore ExtractingRequestHandler's ability to capture all HTML tags when parsing (X)HTML.

          ASF subversion and git services added a comment -

          Commit 1654432 from Use account "steve_rowe" instead in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1654432 ]

          SOLR-6856: Restore ExtractingRequestHandler's ability to capture all HTML tags when parsing (X)HTML. (merged trunk r1654431)

          ASF subversion and git services added a comment -

          Commit 1654433 from Use account "steve_rowe" instead in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1654433 ]

          SOLR-6856: Restore ExtractingRequestHandler's ability to capture all HTML tags when parsing (X)HTML. (merged trunk r1654431)

          Anshum Gupta added a comment -

          Thanks for fixing this Steve!

          Steve Rowe added a comment -

          Chris Hostetter (Unused) and Uwe Schindler figured out why <br/> caused "none\n" to be dumped into the catch-all field. See the #lucene-dev IRC conversation here.

          In short, the tagsoup parser used by Tika exposes default attributes on (some?) elements, and the <br> tag has a default attribute clear with value none. When solr-cell's captureAttr param is false, all attribute values are dumped into the catch-all text field. Uwe pointed out that this is terrible, since lots of junk like style and class attributes will get dumped there. We did identify several useful attributes though: href, title, alt, summary. I'll make another issue for post-5.0 to try to improve the situation.

          As noted above, the "none\n" (and all other attribute values) pasted into the catch-all field has no preceding whitespace, so tokenization will break. Uwe provided a patch to fix this:

          Index: solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java
          ===================================================================
          --- solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java	(revision 1654430)
          +++ solr/contrib/extraction/src/java/org/apache/solr/handler/extraction/SolrContentHandler.java	(working copy)
          @@ -280,7 +280,7 @@
                 }
               } else {
                 for (int i = 0; i < attributes.getLength(); i++) {
          -        bldrStack.getLast().append(attributes.getValue(i)).append(' ');
          +        bldrStack.getLast().append(' ').append(attributes.getValue(i));
                 }
               }
               bldrStack.getLast().append(' ');
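
To make the effect of the one-line change in the diff above concrete, here is a small standalone sketch using only `StringBuilder`. The method names `before` and `after` are invented for the illustration, and the inputs reuse the "distinct" / "none" example from earlier in this issue; it is not the handler code itself.

```java
public class AttrWhitespaceSketch {
    // Old order: attribute value first, then a space, so the value is glued
    // to whatever text was already in the builder.
    static String before(String textSoFar, String attrValue) {
        StringBuilder sb = new StringBuilder(textSoFar);
        sb.append(attrValue).append(' ');
        return sb.toString();
    }

    // Patched order: space first, then the attribute value, so the value
    // starts a new token.
    static String after(String textSoFar, String attrValue) {
        StringBuilder sb = new StringBuilder(textSoFar);
        sb.append(' ').append(attrValue);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + before("distinct", "none") + "]"); // [distinctnone ]
        System.out.println("[" + after("distinct", "none") + "]");  // [distinct none]
    }
}
```

In the patched handler the unchanged `append(' ')` after the loop still supplies the trailing separator, so the value ends up delimited on both sides.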
          

          Committing shortly.

          ASF subversion and git services added a comment -

          Commit 1654444 from Use account "steve_rowe" instead in branch 'dev/trunk'
          [ https://svn.apache.org/r1654444 ]

          SOLR-6856: fix preceding whitespace for attribute values dumped into the catch-all field.

          ASF subversion and git services added a comment -

          Commit 1654445 from Use account "steve_rowe" instead in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1654445 ]

          SOLR-6856: fix preceding whitespace for attribute values dumped into the catch-all field. (merged trunk r1654444)

          ASF subversion and git services added a comment -

          Commit 1654446 from Use account "steve_rowe" instead in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1654446 ]

          SOLR-6856: fix preceding whitespace for attribute values dumped into the catch-all field. (merged trunk r1654444)

          Steve Rowe added a comment -

          Here's the issue for improving catch-all attribute extraction: SOLR-7027

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Anshum Gupta added a comment -

          Reopening for backporting to 4.10.4

          ASF subversion and git services added a comment -

          Commit 1662499 from Anshum Gupta in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1662499 ]

          SOLR-6856: Restore ExtractingRequestHandler's ability to capture all HTML tags when parsing (X)HTML and fix preceding whitespace for attribute values dumped into the catch-all field. (merge from 5x)

          Michael McCandless added a comment -

          Bulk close for 4.10.4 release


            People

            • Assignee:
              Anshum Gupta
              Reporter:
              Hoss Man