Solr / SOLR-2854

Load URL content stream on-demand, rather than automatically

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:

      Description

      I think the remote streaming feature should be limited to update request processors. I'm not sure there is even any use for it on a /select, but even if there is, it's an unintended security risk. Observe this URL, which is roughly the equivalent of an SQL injection attack:

      http://localhost:8983/solr/select?q=*:*&indent=on&wt=ruby&rows=2&stream.url=http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fupdate%3Fcommit%3Dtrue%26stream.body%3D%3Cdelete%3E%3Cquery%3E*%3A*%3C%2Fquery%3E%3C%2Fdelete%3E

      Decoded, the stream.url parameter points Solr back at its own /update handler with commit=true&stream.body=<delete><query>*:*</query></delete>. Yep; that's right – this search deletes all the data in your Solr instance! If you have blocked off access to /update* by IP, that isn't good enough.

        Issue Links

          Activity

          David Smiley created issue -
          Erik Hatcher added a comment -

          I don't think we should restrict non-update request handlers from handling streams. Consider DocumentAnalysisRequestHandler - it's handy to be able to stream in text for analysis. There are other request handlers that leverage this as well.

          Perhaps, though, a solution is to not allow streams to be resolved unless they are truly needed by the request handler? In your example, there is no need for the standard search request handler to access any streams and thus that URL shouldn't be hit.

          Ryan McKinley added a comment -

          Here is a quick, totally untested patch that should behave as Erik describes. Rather than create the URLConnection in the constructor, it waits for someone to call getStream().

          This will effectively limit streaming to requests that hit something that actually uses the stream.
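The idea Ryan describes can be sketched as follows. This is a minimal, hedged illustration, not the actual SOLR-2854 patch; the class name LazyUrlStream is a hypothetical stand-in for ContentStreamBase.URLStream:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

/**
 * Sketch of the lazy-loading idea: the URLConnection is opened in
 * getStream() rather than in the constructor, so a request handler
 * that never reads the stream never causes the remote URL to be fetched.
 */
public class LazyUrlStream {
    private final URL url;
    private Long size;          // unknown until the connection is opened
    private String contentType; // unknown until the connection is opened

    public LazyUrlStream(URL url) {
        this.url = url;         // no network I/O happens here
    }

    /** Opens the connection on demand and fills in the metadata. */
    public InputStream getStream() throws IOException {
        URLConnection conn = url.openConnection();
        contentType = conn.getContentType();
        long len = conn.getContentLengthLong();
        size = len < 0 ? null : Long.valueOf(len);
        return conn.getInputStream();
    }

    // Only meaningful after getStream() has been called.
    public Long getSize() { return size; }
    public String getContentType() { return contentType; }
}
```

Note the consequence this design carries: getSize() and getContentType() return nothing useful until getStream() has been called, which is exactly what the later comments in this issue run into.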

          Ryan McKinley made changes -
          Attachment: SOLR-2854-delay-stream-opening.patch [ 12500878 ]
          Erik Hatcher added a comment -

          +1 on Ryan's approach. David, does that work for you? Can someone drum up some test cases to add along with this?

          Also, this isn't quite like a SQL injection attack, where malicious strings can be put right into the query. It would require some application-level flaw to send an arbitrary stream.url parameter along with user data, especially to a /select search request. But the point is certainly taken that bad things can happen with stream.url abuse.

          David Smiley added a comment -

          I will add a patch later to ensure that a select request doesn't access the stream.

          David Smiley added a comment -

          This patch simply adds a test class with a few tests. One of those tests tries to use remote streaming with a select URL and it fails – by design. Once a fix for this issue works, this test should succeed.

          This is Test Driven Development, by the way.

          I don't have time left at the moment to see if Ryan's patch works.

          David Smiley made changes -
          Erik Hatcher made changes -
          Assignee: Erik Hatcher [ ehatcher ]
          Erik Hatcher added a comment -

          David - Thanks for the strong TDD example! Thanks a lot for that, srsly.

          Ryan - Thanks to you for the quick fix.

          I tried out the test patch first, got the failure, applied Ryan's patch, test passes. TDD by the book.

          I've committed this to trunk, with the change history log of: "Now load URL content stream data (via stream.url) when called for during request handling, rather than loading URL content streams automatically regardless of use."

          I think the security aspect of this is a separate issue. What we've done here is only load URL content when a component actually calls for it (I double-checked file and other content streams; they already load lazily, as they should). So someone could still send that same evil stream.url to /analysis/document. Let's spin off another issue for something like "Enable fine grained control over allowed content streams", such that one could disable URL content streams but leave local file content streams possible, say. I'm not sure that entirely satisfies this issue though: there are certainly situations where loading content via stream.url is really handy, but you don't want any loopback (or fan-out) from malicious data to kill a system either. What do others think about how to address this appropriately on the Solr side (even if that means simply making clearer what stream.url really does underneath)?

          Erik Hatcher added a comment -

          What's left on this issue? I suppose backporting it to 3.x is desirable for the masses? Do these patches work out of the box for 3.x? (if not, can someone whip that up?) Is this particular issue done now? Should we rename it to "Load URL content streams when needed, rather than automatically regardless"?

          Yonik Seeley added a comment -

          I assume the current test failures are due to this change?

          1 tests failed.
          REGRESSION:  org.apache.solr.common.util.ContentStreamTest.testURLStream
          
          Error Message:
          null
          
          Stack Trace:
          java.lang.NullPointerException
                 at org.apache.solr.common.util.ContentStreamTest.testURLStream(ContentStreamTest.java:91)
                 at org.apache.lucene.util.LuceneTestCase$2$1.evaluate(LuceneTestCase.java:613)
                 at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
                 at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
          
          Erik Hatcher added a comment -

          Yonik - oops, you're right. I was running the newly added test with -Dtestcase and just forgot to run the full suite before committing. Thank goodness for the continuous integration build.

          I committed a fix for the test, since ContentStreamBase.URLStream does not set the size until getStream/getReader is called. But looking at where we use stream.getSize() (and the other getters), this causes other problems: once a ContentStream is instantiated, it is assumed to have a size and a type, which is no longer the case after Ryan's patch. Should we adjust all the places where we use content streams to call getStream/getReader before anything else? I think so. And we should note on getStream/getReader that it must be called before asking for any details of the stream, like size and type.

          Erik Hatcher added a comment -

          I've been looking at the usage of content streams and the only place I found of concern is ExtractingDocumentLoader, which calls stream.getName()/getSize()/getContentType() before calling getStream(), but this seems to work fine on trunk despite the changes to lazy load the stream and those parameters. This doesn't seem like it should work properly. No?

          Yonik Seeley added a comment -

          This doesn't seem like it should work properly. No?

          Definitely should be investigated... exception being swallowed somewhere? getStream() being called more than once?

          Erik Hatcher added a comment -

          Here's a patch that fixes the extracting request handler to pull the stream properties after getStream() is called. Stepping through with a debugger showed that before this change the metadata values were being set to null.

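The ordering bug the patch fixes can be sketched like this. It is a hedged, simplified illustration (the names are hypothetical; this is not the actual ExtractingDocumentLoader code): with a lazily opened stream, the metadata getters return null until getStream() has run, so they must be read afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Sketch of the call-order contract introduced by lazy stream loading:
 * metadata is populated as a side effect of getStream(), so reading it
 * beforehand yields null.
 */
public class StreamOrderingSketch {
    /** Minimal stand-in for Solr's ContentStream interface (hypothetical). */
    public interface LazyStream {
        InputStream getStream() throws IOException;
        String getContentType(); // null until getStream() is called
    }

    public static String contentTypeReadTooEarly(LazyStream s) throws IOException {
        String type = s.getContentType(); // BUG: read before getStream() -> null
        s.getStream().close();
        return type;
    }

    public static String contentTypeReadAfter(LazyStream s) throws IOException {
        s.getStream().close();            // FIX: open the stream first ...
        return s.getContentType();        // ... then read the metadata
    }

    /** A lazy stream whose metadata appears only once the stream is opened. */
    public static LazyStream demoStream() {
        return new LazyStream() {
            private String type; // unset until getStream()
            public InputStream getStream() {
                type = "text/plain";
                return new ByteArrayInputStream("hi".getBytes());
            }
            public String getContentType() { return type; }
        };
    }
}
```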
          Erik Hatcher made changes -
          Attachment: SOLR-2854-extract_fix.patch [ 12501102 ]
          Erik Hatcher added a comment -

          Definitely should be investigated... exception being swallowed somewhere? getStream() being called more than once?

          No exception or multiple getStream() calls. It turns out the metadata attributes (which are ignored in the example /update/extract config) were being set to null, as evidenced by stepping through with a debugger attached. The latest patch fixes this. Everything else looks fine: in all the other content stream usage I investigated, getStream() is called first. I'm going to commit this patch and add a comment to the getStream() method noting that it should be called before the other properties.

          David Smiley added a comment -

          Regarding future steps to make Solr more secure with regard to remote streaming: personally, I think that, by default, the only handlers that should be able to use this are handlers registered under /update/. That makes Solr easier to secure, and it is also the biggest use case for this feature. I'd like it to be clearer in solrconfig.xml exactly which handlers can use remote streaming. Presently, you have to have internal knowledge to know that /analysis/document will use it – and that's not cool from a security perspective. You suggested limiting specific URLs or files, or files vs URLs, but I don't think that is important.

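For context, remote streaming at the time was governed by a single global switch in solrconfig.xml, with no per-handler control; that gap is what David is pointing at. A sketch of turning it off entirely (attribute values are illustrative):

```xml
<!-- Sketch: enableRemoteStreaming="false" disables stream.url and
     stream.file for all handlers at once; there is no per-handler
     setting. -->
<requestDispatcher>
  <requestParsers enableRemoteStreaming="false"
                  multipartUploadLimitInKB="2048"/>
</requestDispatcher>
```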
          Erik Hatcher made changes -
          Summary: Limit remote streaming to update handlers → Load URL content stream on-demand, rather than automatically
          Fix Version/s: 4.0 [ 12314992 ]
          Erik Hatcher made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          David Smiley made changes -
          Link: This issue relates to SOLR-2859 [ SOLR-2859 ]
          David Smiley added a comment -

          Whoops; we didn't back-port this to 3x! That was a major oversight, and we just missed a release. Do we need to create a new issue?

          David Smiley added a comment -

          re-opening for branch_3x fix

          David Smiley made changes -
          Resolution: Fixed [ 1 ]
          Status: Resolved [ 5 ] → Reopened [ 4 ]
          David Smiley added a comment -

          Attached is a patch for 3x. I saw the 4 commits Erik made to trunk and applied them to my 3x checkout. I also added a comment in solrconfig.xml to say: "SearchRequestHandler won't fetch it, but some others do."

          The patch was made with git; I hope that won't create a problem.

          David Smiley made changes -
          David Smiley made changes -
          Fix Version/s: 3.6 [ 12319065 ]
          David Smiley added a comment -

          I committed the 3x backport in r1293115.

          David Smiley made changes -
          Status: Reopened [ 4 ] → Closed [ 6 ]
          Resolution: Fixed [ 1 ]

            People

             • Assignee: Erik Hatcher
             • Reporter: David Smiley
             • Votes: 0
             • Watchers: 0
