Solr
  1. Solr
  2. SOLR-3895

For several reasons, disabling the resolving of external entities within the Solr UpdateRequestHandler for XML would be good.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.6.1, 4.0-BETA
    • Fix Version/s: 4.1, 5.0
    • Component/s: update
    • Labels:
      None

      Description

      The Solr UpdateRequestHandler for XML currently resolves so-called XML External Entities. Not resolving XML External Entities would - among other things - improve Solr's update performance.

      1. SOLR-3895+3614.patch
        15 kB
        Uwe Schindler
      2. SOLR-3895+3614.patch
        16 kB
        Uwe Schindler
      3. SOLR-3895.patch
        13 kB
        Uwe Schindler
      4. SOLR-3895.patch
        13 kB
        Hoss Man
      5. SOLR-3895.patch
        14 kB
        Hoss Man
      6. SOLR-3895.patch
        14 kB
        Hoss Man

        Issue Links

          Activity

          Hide
          ASF subversion and git services added a comment -

          Commit 1547011 from Uwe Schindler in branch 'dev/branches/lucene_solr_3_6'
          [ https://svn.apache.org/r1547011 ]

          SOLR-5520: Backports of:

          • SOLR-4881 (Fix DocumentAnalysisRequestHandler to correctly use EmptyEntityResolver to prevent loading of external entities like UpdateRequestHandler does)
          • SOLR-3895 (XML and XSLT UpdateRequestHandler should not try to resolve external entities)
          • SOLR-3614 (Fix XML parsing in XPathEntityProcessor to correctly expand named entities, but ignore external entities)
          Show
          ASF subversion and git services added a comment - Commit 1547011 from Uwe Schindler in branch 'dev/branches/lucene_solr_3_6' [ https://svn.apache.org/r1547011 ] SOLR-5520 : Backports of: SOLR-4881 (Fix DocumentAnalysisRequestHandler to correctly use EmptyEntityResolver to prevent loading of external entities like UpdateRequestHandler does) SOLR-3895 (XML and XSLT UpdateRequestHandler should not try to resolve external entities) SOLR-3614 (Fix XML parsing in XPathEntityProcessor to correctly expand named entities, but ignore external entities)
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1390924

          Merged revision(s) 1390921 from lucene/dev/trunk:
          SOLR-3895, SOLR-3614: XML and XSLT UpdateRequestHandler should not try to resolve external entities; fix XML parsing in XPathEntityProcessor to correctly expand named entities, but ignore external entities

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1390924 Merged revision(s) 1390921 from lucene/dev/trunk: SOLR-3895 , SOLR-3614 : XML and XSLT UpdateRequestHandler should not try to resolve external entities; fix XML parsing in XPathEntityProcessor to correctly expand named entities, but ignore external entities
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Uwe Schindler
          http://svn.apache.org/viewvc?view=revision&revision=1390999

          Merged revision(s) 1390991 from lucene/dev/trunk:
          SOLR-3895, SOLR-3614: Fix javadocs

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Uwe Schindler http://svn.apache.org/viewvc?view=revision&revision=1390999 Merged revision(s) 1390991 from lucene/dev/trunk: SOLR-3895 , SOLR-3614 : Fix javadocs
          Hide
          Uwe Schindler added a comment -

          Committed trunk revision: 1390921
          Committed 4.x revision: 1390924

          Show
          Uwe Schindler added a comment - Committed trunk revision: 1390921 Committed 4.x revision: 1390924
          Hide
          Uwe Schindler added a comment -

          New patch that also includes Hossmans testcase from SOLR-3614.

          By the way, this patch also makes the StAX factory thread safe like in XMLUpdater.

          Show
          Uwe Schindler added a comment - New patch that also includes Hossmans testcase from SOLR-3614 . By the way, this patch also makes the StAX factory thread safe like in XMLUpdater.
          Hide
          Uwe Schindler added a comment -

          Patch for this issue and SOLR-3614. I removed the changes in the XSL, as this is already tested in a separate test case.

          I think it's ready to commit.

          Show
          Uwe Schindler added a comment - Patch for this issue and SOLR-3614 . I removed the changes in the XSL, as this is already tested in a separate test case. I think it's ready to commit.
          Hide
          Uwe Schindler added a comment -
          +<!-- DOCTYPE is used test that external entities are not resolved, but named entities are -->
          

          This should be: "DOCTYPE is used to test that named entities are resolved". Or better: revert the whole XSL...
          There is already a test that explicitely tests external entities for XSLs You find the test and the entity in TestXSLTResponseWriter and related files. So the whole test with adding entities to the XSL is separately handled, we can undo the XSL change.

          Show
          Uwe Schindler added a comment - +<!-- DOCTYPE is used test that external entities are not resolved, but named entities are --> This should be: "DOCTYPE is used to test that named entities are resolved". Or better: revert the whole XSL... There is already a test that explicitely tests external entities for XSLs You find the test and the entity in TestXSLTResponseWriter and related files. So the whole test with adding entities to the XSL is separately handled, we can undo the XSL change.
          Hide
          Hoss Man added a comment -

          yeah .. that was a brain fart.

          I fixed the test, commit when ready.

          Show
          Hoss Man added a comment - yeah .. that was a brain fart. I fixed the test, commit when ready.
          Hide
          Uwe Schindler added a comment -

          Finally: the new patch is fine, I would just revert the changes in the XSL file, this is unrelated to this issue. of course value of "zippy_s" is then not "plop".

          Show
          Uwe Schindler added a comment - Finally: the new patch is fine, I would just revert the changes in the XSL file, this is unrelated to this issue. of course value of "zippy_s" is then not "plop".
          Hide
          Uwe Schindler added a comment -

          But this lead me to discover that SYSTEM entities in the xsl docs (aparently) aren't being ignored, so with the patch attached (trying to refer to a bogs file fro mthe XSL) you get a stylesheet compilation error – so i think maybe there is still a code path missing the use of your new EmptyEntityResolver?

          The XSL stylesheets are loaded by SolrResourceLoader like config files,... All those files use external entities (and SolrResourceLoader has an own EntityResolver to support xinclude and external entities -> it resolves external entities and xincludes using ResourceLoader, too) -> different story. This patch is about disabling entities for loaded documents, not for config files like solrconfig.xml, schema.xml or the transformation.

          Show
          Uwe Schindler added a comment - But this lead me to discover that SYSTEM entities in the xsl docs (aparently) aren't being ignored, so with the patch attached (trying to refer to a bogs file fro mthe XSL) you get a stylesheet compilation error – so i think maybe there is still a code path missing the use of your new EmptyEntityResolver? The XSL stylesheets are loaded by SolrResourceLoader like config files,... All those files use external entities (and SolrResourceLoader has an own EntityResolver to support xinclude and external entities -> it resolves external entities and xincludes using ResourceLoader, too) -> different story. This patch is about disabling entities for loaded documents, not for config files like solrconfig.xml, schema.xml or the transformation.
          Hide
          Hoss Man added a comment -

          If a user requests this feature, we can enable in a later issue. But then also xinclude should be enabled. Any other opinions?

          Nope, like i said: we can worry about the detials of making it configurable if/when anyone asks for it.

          I can extend the patch to also take care of DIH (SOLR-3614)

          i figured maybe we could - but we can also worry about it in that issue after this one gets commited, whatever you think is easier.

          We should add the same one for XSLReqHandler (because it uses other parser).

          good call – i updated the patch to test that named entities still work in XSL transformed docs & stylesheets, and they do. But this lead me to discover that SYSTEM entities in the xsl docs (aparently) aren't being ignored, so with the patch attached (trying to refer to a bogs file fro mthe XSL) you get a stylesheet compilation error – so i think maybe there is still a code path missing the use of your new EmptyEntityResolver? (either that or i've just got a silly bug in the stylesheet that isn't obvious to me)

          Show
          Hoss Man added a comment - If a user requests this feature, we can enable in a later issue. But then also xinclude should be enabled. Any other opinions? Nope, like i said: we can worry about the detials of making it configurable if/when anyone asks for it. I can extend the patch to also take care of DIH ( SOLR-3614 ) i figured maybe we could - but we can also worry about it in that issue after this one gets commited, whatever you think is easier. We should add the same one for XSLReqHandler (because it uses other parser). good call – i updated the patch to test that named entities still work in XSL transformed docs & stylesheets, and they do. But this lead me to discover that SYSTEM entities in the xsl docs (aparently) aren't being ignored, so with the patch attached (trying to refer to a bogs file fro mthe XSL) you get a stylesheet compilation error – so i think maybe there is still a code path missing the use of your new EmptyEntityResolver? (either that or i've just got a silly bug in the stylesheet that isn't obvious to me)
          Hide
          Uwe Schindler added a comment -

          Hoss: I can extend the patch to also take care of DIH (SOLR-3614). Instead of the static initializer in SOLR-964, it should call the helper method in the resolver to configure the EnityResolver for DIH's StaX parser!

          Show
          Uwe Schindler added a comment - Hoss: I can extend the patch to also take care of DIH ( SOLR-3614 ). Instead of the static initializer in SOLR-964 , it should call the helper method in the resolver to configure the EnityResolver for DIH's StaX parser!
          Hide
          Uwe Schindler added a comment -

          Thanks Hoss for the test! We should add the same one for XSLReqHandler (because it uses other parser).

          If a user requests this feature, we can enable in a later issue. But then also xinclude should be enabled. Any other opinions?

          Show
          Uwe Schindler added a comment - Thanks Hoss for the test! We should add the same one for XSLReqHandler (because it uses other parser). If a user requests this feature, we can enable in a later issue. But then also xinclude should be enabled. Any other opinions?
          Hide
          Hoss Man added a comment -

          Arguably this is a feature (using entity includes to split up files and manage updates) – but since it's never one we advertised, and can cause Solr to fetch external URLs like DTD that aren't needed, I agree it should either be off or configurable and off by default (we can worry about that if anyone asks for it)

          One thing we should worry about though is ensuring that named entities still work (because we've definitely seen users take advantage of those) ... i've updated Uwe's patch with a test to ensure we don't break that.

          Show
          Hoss Man added a comment - Arguably this is a feature (using entity includes to split up files and manage updates) – but since it's never one we advertised, and can cause Solr to fetch external URLs like DTD that aren't needed, I agree it should either be off or configurable and off by default (we can worry about that if anyone asks for it) One thing we should worry about though is ensuring that named entities still work (because we've definitely seen users take advantage of those) ... i've updated Uwe's patch with a test to ensure we don't break that.
          Hide
          Uwe Schindler added a comment -

          Patch. This approach adds a generic EntityResolver for SAX and StaX parsers that can be used also outside XMLUpdateReqHandler, e.g. in DIH. It resolves all external entities to an empty InputStream. By this XMLUpdater ignores external entities like DTDs or entity declarations (&foobar;) and ignores them.

          The downside by this is, that XML documents that actually make use of these features to include some external stuff would silently ignore those &foobar; entities.

          Show
          Uwe Schindler added a comment - Patch. This approach adds a generic EntityResolver for SAX and StaX parsers that can be used also outside XMLUpdateReqHandler, e.g. in DIH. It resolves all external entities to an empty InputStream. By this XMLUpdater ignores external entities like DTDs or entity declarations (&foobar;) and ignores them. The downside by this is, that XML documents that actually make use of these features to include some external stuff would silently ignore those &foobar; entities.
          Hide
          Uwe Schindler added a comment -

          Hi Martin,
          thanks for your report after our communication about this before. I agree, it would be a good idea to not allow external entities (those can be e.g., references to external DTDs - but we never check XML validity according to a DTD) and also other external entities like &foobar; introduced by those DTDs should not be loaded:

          • Lot's of XML files come with a DTD declaration (like XHTML document or similar things). If you would pass those XML documents through the update handler (with e.g. XSL transforming to Solr XML), those DTDs would be resolved and loaded by the xml parser - with no use for Solr.
          • All documents passed to XMLRequestHandler should be self-complete, means no includes or similar things. xinclude is not enabled for XML-updates, so external entities should also be ignored.
          Show
          Uwe Schindler added a comment - Hi Martin, thanks for your report after our communication about this before. I agree, it would be a good idea to not allow external entities (those can be e.g., references to external DTDs - but we never check XML validity according to a DTD) and also other external entities like &foobar; introduced by those DTDs should not be loaded: Lot's of XML files come with a DTD declaration (like XHTML document or similar things). If you would pass those XML documents through the update handler (with e.g. XSL transforming to Solr XML), those DTDs would be resolved and loaded by the xml parser - with no use for Solr. All documents passed to XMLRequestHandler should be self-complete, means no includes or similar things. xinclude is not enabled for XML-updates, so external entities should also be ignored.

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Martin Herfurt
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development