Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.2
    • Component/s: update
    • Labels:
      None

      Description

      The plugin framework should work for 'update' actions in addition to 'search' actions.

      For more discussion on this, see:
      http://www.nabble.com/Re%3A-Handling-disparate-data-sources-in-Solr-tf2918621.html#a8305828

      1. HandlerRefactoring-DRAFT-SRC.zip
        370 kB
        Ryan McKinley
      2. HandlerRefactoring-DRAFT-SRC.zip
        749 kB
        Ryan McKinley
      3. HandlerRefactoring.DRAFT.zip
        23 kB
        Ryan McKinley
      4. HandlerRefactoring.DRAFT.patch
        127 kB
        Ryan McKinley
      5. HandlerRefactoring.DRAFT.patch
        260 kB
        Ryan McKinley
      6. DispatchFilter.patch
        43 kB
        Ryan McKinley
      7. DispatchFilter.patch
        41 kB
        Ryan McKinley
      8. DispatchFilter.patch
        62 kB
        Ryan McKinley
      9. DispatchFilter.patch
        62 kB
        Ryan McKinley
      10. DispatchFilter.patch
        90 kB
        Ryan McKinley
      11. DispatchFilter.patch
        95 kB
        Ryan McKinley
      12. DispatchFilter.patch
        95 kB
        Ryan McKinley
      13. commons-io-1.2.jar
        64 kB
        Ryan McKinley
      14. commons-fileupload-20070107.jar
        49 kB
        Ryan McKinley

        Issue Links

          Activity

          Hide
          Ryan McKinley added a comment -

          I tried my hand at refactoring solr to have a more flexible plugin framework.

          To apply this patch, you will need to:
          1) apply: HandlerRefactoring.DRAFT.patch
          2) download HandlerRefactoring.DRAFT.zip and extract the contents to:
          \solr\src\java\org\apache\solr\handler

          (svn patches don' t let you add new directories!)

          All tests pass, and http://localhost:8983/solr/select/ & http://localhost:8983/solr/update point to the same servlets as before and should behave exactly as they did before.

          • I changed the RequestHandlers framework to map "action" > "name" > handler, rather then just "name" > handler
          • I added a filter (SolrRequestFilter.java) that gets applied to every request and checks if the path is a registered "action". If it is, it will be handled by the RequestHander otherwise it is passed down the filter chain.
          • I refactored 'SolrParams' to handles the general case, not only query params. I moved the specific params to o.a.s.handler.search.QueryParams. I deleted a few deprecated parameter options: it just got too messy to refactor depricated things.
          • I moved the 'query' handlers into org.apache.solr.handler.search.* In general, handlers are in a package with the name of their action.

          TODO:

          • The generic SolrRequest must somehow encapsulate the posted stream
          • Add cookies, remote host, remote user to the SolrRequest? (standard requests won't use them, but custom handlers may have a good use for them)
          • and much much more!
          • - - - - - - -

          Usage:
          http://localhost:8983/solr/$

          {action}

          /$

          {handler}

          ?args

          If no handler is specified, it will use the default handler for that action.

          Current: (still works)
          http://localhost:8983/solr/select/?q=solr (current style)

          New: (should be exact same output)
          http://localhost:8983/solr/search/?q=solr
          http://localhost:8983/solr/search/standard/?q=solr
          http://localhost:8983/solr/search/dismax/?q=solr

          http://localhost:8983/solr/update2/commit
          http://localhost:8983/solr/update2/optimize

          note: I am using 'search' and 'update2' so that the old URL still works and points to what people are used to.

          Document adding stubs:
          http://localhost:8983/solr/add/xml/
          http://localhost:8983/solr/add/csv/
          http://localhost:8983/solr/add/sql/

          Let me know what you think!

          Show
          Ryan McKinley added a comment - I tried my hand at refactoring solr to have a more flexible plugin framework. To apply this patch, you will need to: 1) apply: HandlerRefactoring.DRAFT.patch 2) download HandlerRefactoring.DRAFT.zip and extract the contents to: \solr\src\java\org\apache\solr\handler (svn patches don' t let you add new directories!) All tests pass, and http://localhost:8983/solr/select/ & http://localhost:8983/solr/update point to the same servlets as before and should behave exactly as they did before. I changed the RequestHandlers framework to map "action" > "name" > handler, rather then just "name" > handler I added a filter (SolrRequestFilter.java) that gets applied to every request and checks if the path is a registered "action". If it is, it will be handled by the RequestHander otherwise it is passed down the filter chain. I refactored 'SolrParams' to handles the general case, not only query params. I moved the specific params to o.a.s.handler.search.QueryParams. I deleted a few deprecated parameter options: it just got too messy to refactor depricated things. I moved the 'query' handlers into org.apache.solr.handler.search.* In general, handlers are in a package with the name of their action. TODO: The generic SolrRequest must somehow encapsulate the posted stream Add cookies, remote host, remote user to the SolrRequest? (standard requests won't use them, but custom handlers may have a good use for them) and much much more! - - - - - - - Usage: http://localhost:8983/solr/$ {action} /$ {handler} ?args If no handler is specified, it will use the default handler for that action. Current: (still works) http://localhost:8983/solr/select/?q=solr (current style) New: (should be exact same output) http://localhost:8983/solr/search/?q=solr http://localhost:8983/solr/search/standard/?q=solr http://localhost:8983/solr/search/dismax/?q=solr http://localhost:8983/solr/update2/commit http://localhost:8983/solr/update2/optimize note: I am using 'search' and 'update2' so that the old URL still works and points to what people are used to. Document adding stubs: http://localhost:8983/solr/add/xml/ http://localhost:8983/solr/add/csv/ http://localhost:8983/solr/add/sql/ Let me know what you think!
          Hide
          Ryan McKinley added a comment -

          I just uploaded: HandlerRefactoring-DRAFT-SRC.zip. It is not a patch, but it will be easier to integrate with /trunk.

          1) Extract the zip files
          2) Copy them into your solr directory. Overwrite all files (svn will give you diff)
          3) delete the following files: (they have been renamed or moved)

          src/java/org/apache/solr/request/DisMaxRequestHandler.java
          src/java/org/apache/solr/request/LocalSolrQueryRequest.java
          src/java/org/apache/solr/request/SolrQueryRequest.java
          src/java/org/apache/solr/request/SolrQueryRequestBase.java
          src/java/org/apache/solr/request/SolrRequestHandler.java
          src/java/org/apache/solr/request/StandardRequestHandler.java
          src/java/org/apache/solr/util/CommonParams.java
          src/java/org/apache/solr/util/DisMaxParams.java

          Show
          Ryan McKinley added a comment - I just uploaded: HandlerRefactoring-DRAFT-SRC.zip. It is not a patch, but it will be easier to integrate with /trunk. 1) Extract the zip files 2) Copy them into your solr directory. Overwrite all files (svn will give you diff) 3) delete the following files: (they have been renamed or moved) src/java/org/apache/solr/request/DisMaxRequestHandler.java src/java/org/apache/solr/request/LocalSolrQueryRequest.java src/java/org/apache/solr/request/SolrQueryRequest.java src/java/org/apache/solr/request/SolrQueryRequestBase.java src/java/org/apache/solr/request/SolrRequestHandler.java src/java/org/apache/solr/request/StandardRequestHandler.java src/java/org/apache/solr/util/CommonParams.java src/java/org/apache/solr/util/DisMaxParams.java
          Hide
          Ryan McKinley added a comment -

          I just upload a new version of HandlerRefactoring-DRAFT-SRC.zip

          In addition to the 8 files above, also delete:
          src/java/org/apache/solr/request/SolrQueryResponse.java
          src/webapp/src/org/apache/solr/servlet/SolrServlet.java
          src/webapp/src/org/apache/solr/servlet/SolrUpdateServlet.java

          There is also a clean copy on:
          http://svn.lapnap.net/solr/handler-draft/solr/
          This should be easier to install - or look at (without having to install)

          This version converts everything to use the new framework rather then keeping /select and /update on the old one. It also includes a draft proposal on how to deal with deal with GET vs POST body vs multipart content.

          It passes all the tests and seems to work exactly as before (with a few exceptions)

          to call:
          curl $URL --data-binary '<commit/>' -H 'Content-type:text/xml;'
          rather then just:
          curl $URL --data-binary '<commit/>'

          (any ideas?)

          • - - - - - - -

          I define three basic types of request handlers in: http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/
          1) standard. This gets everything from parameters (get or post)
          2) posted. This gets a reader from the posted body:
          http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/PostedRequestHandler.java
          3) multipart. This gets an iterator over each file item using commons-upload streaming API
          http://jakarta.apache.org/commons/fileupload/streaming.html
          http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/MultipartRequestHandler.java

          I think this takes care of every case... is anything missing?

          The http://svn.lapnap.net/solr/handler-draft/solr/src/webapp/src/org/apache/solr/servlet/SolrRequestFilter.java RequestFilter manages setting the reader or iterator for the proper handlers.

          When you run the example, i added the page http://localhost:8983/solr/up.html that should help you see a little of it in action.

          I added an example for each type:
          http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/add/

          • - -

          This added:
          commons-io-1.2.jar
          mysql-connector-java-5.0.4.jar
          commons-fileupload-20070107.jar

          to the library. If we want to get rid of commons-io, I am only using IOUtils.java

          Show
          Ryan McKinley added a comment - I just upload a new version of HandlerRefactoring-DRAFT-SRC.zip In addition to the 8 files above, also delete: src/java/org/apache/solr/request/SolrQueryResponse.java src/webapp/src/org/apache/solr/servlet/SolrServlet.java src/webapp/src/org/apache/solr/servlet/SolrUpdateServlet.java There is also a clean copy on: http://svn.lapnap.net/solr/handler-draft/solr/ This should be easier to install - or look at (without having to install) This version converts everything to use the new framework rather then keeping /select and /update on the old one. It also includes a draft proposal on how to deal with deal with GET vs POST body vs multipart content. It passes all the tests and seems to work exactly as before (with a few exceptions) /update content is returned with a ResponseWriter [my-BUG] I am unable to get some posted content to read its stream properly. I had to modify: http://svn.lapnap.net/solr/handler-draft/solr/example/exampledocs/post.sh to call: curl $URL --data-binary '<commit/>' -H 'Content-type:text/xml;' rather then just: curl $URL --data-binary '<commit/>' (any ideas?) - - - - - - - I define three basic types of request handlers in: http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/ 1) standard. This gets everything from parameters (get or post) 2) posted. This gets a reader from the posted body: http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/PostedRequestHandler.java 3) multipart. This gets an iterator over each file item using commons-upload streaming API http://jakarta.apache.org/commons/fileupload/streaming.html http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/MultipartRequestHandler.java I think this takes care of every case... is anything missing? The http://svn.lapnap.net/solr/handler-draft/solr/src/webapp/src/org/apache/solr/servlet/SolrRequestFilter.java RequestFilter manages setting the reader or iterator for the proper handlers. When you run the example, i added the page http://localhost:8983/solr/up.html that should help you see a little of it in action. I added an example for each type: http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/add/ - - This added: commons-io-1.2.jar mysql-connector-java-5.0.4.jar commons-fileupload-20070107.jar to the library. If we want to get rid of commons-io, I am only using IOUtils.java
          Hide
          J.J. Larrea added a comment -

          I think this is a fantastic effort, so please take my comments and suggestions for improvement below in the context of my appreciation you took the time to wade deeply into the SOLR request-handling code looking for places to improve it. If the request structure is cleaned up along these lines it will make it much simpler for people to develop and contribute alternate request handlers (both update and query varieties) and further SOLRs standing as an open-source community-driven project.

          1. I really like your idea of using the URL suffix to specify the handler. But it looks like you have required this to be a fixed 2-level hierarchy, with URLs of the form

          http://<host>:<port>/<solr-root>/<action>/<handler>

          which are looked up in a handler table keyed by <action> and then <handler>. For example search/standard looks for a <requestHandler action="search" name="standard"...>, loads the indicated handler class, and associates it with the config.

          But this hierarchy seems a little overdetermined and the implementation overcomplex. It could be argued that one would want <handler>/<action>, for the <handler> alone to resolve to a handler class, and the handler class to be responsible for deciding how to act on the <action> part... but any hierarchical arrangement that makes perfect sense to one person can seem wrong to another. And in your implementation I see no actual action taken by the <action> argument other than selection of a per-<action> default <handler>, and that seems more complexity than it's worth; there are easier ways.

          I would suggest the simpler approach of simply taking the entire path after <solr-root> to be a handler configuration name, without conforming it to any fixed hierarchy, e.g. a developer could set up

          <requestHandler name="search/products/bysku" class="...">...</>
          <requestHandler name="search/products/byname" class="...">...</>
          <requestHandler name="search/companies" class="...">...</>
          <requestHandler name="index/csv" class="...">...</>

          In that way establishing a command/subcommand hierarchy is entirely up to the user's solrconfig.xml setup, and there is no imposed logic as to whether the different behavior between the 3 search examples is achieved through different config of the same handler class, different handler classes, or both.

          As for default actions, there is no need for special code, they can entirely be defined in solrconfig. For example, if a developer sets up a /search/xxx space as above, the response to a client request /search without further qualification is entirely up to what is defined in solrconfig.xml:

          • If there is no request handler defined under name="search" SOLR would return a standard "No handler found" message
          • If it has a query request handler under that name (e.g. with name="search" class="solr.StandardRequestHandler") it would get to handle less-qualified requests with developer-defined defaults.
          • It could be defined to explicitly invoke your UnavailableRequestHandler – a great idea which should be extended so the error code and error message could be custom-configured with handler config params.

          Thus I think this free-form hierarchy would achieve greater simplicity and greater flexibility at the same time.

          2. What would make this even more powerful would be the ability to "subclass" (meaning refine and/or extend) request handler configs: If the requestHandler element allowed an attribute extends="<another-requesthandler-name>" and chained the SolrParams, then one could do something like:

          <requestHandler name="search/products/all" class="solr.DisMaxRequestHandler" >
          <lst name="defaults">
          <float name="tie">0.01</float>
          <str name="qf">
          text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
          </str>
          ... much more, per the "dismax" example in the sample solrconfig.xml ...
          </requestHandler>

          ... and replacing the "partitioned" example ...
          <requestHandler name="search/products/instock" extends="search/products/all" >
          <lst name="appends">
          <str name="fq">inStock:true</str>
          </lst>
          </requestHandler>

          One could even allow the extending requestHandler to set a different handler class, in the case where the difference in behavior requires a different handler implementation but can share all or part of the params.

          3. Structuring the code and action under /add is conceptually limiting because update-style request plugins such as SQL-based or CSV-based (and certainly XML-based) should still be able to add, replace, and delete, either based on internal logic or external command.

          Your code suggests further refactoring improvements along those lines. For example, in your SQLUpdateHandler example you call:

          AddUpdateCommand cmd = UpdateUtils.getAddUpdateCommandFromParams( params );

          and then for each assembled Document

          cmd.doc = docmap.toDocument( schema );
          UpdateUtils.addDoc( cmd );
          [which does SolrCore.getSolrCore().getUpdateHandler().addDoc( cmd );]
          addedDocumentCount++;

          Lets say:

          A. We standardized on action=... as a way to define an action in a param

          B. A new method UpdateUtils.getUpdateCommandFromParams( params ) would use the action= param to decide which xxxUpdateCommand class to instantiate – though this might be better placed as a static class method getCommandFromParams defined in UpdateCommand itself.

          (Perhaps once it decodes the action param it could call another UpdateCommand static method getCommandFromString( name ), which would centralize the instantiation logic, and provide a pathway for it it to evolve into a name->class soft mapping should extensibility to other commands ever be desired.)

          C. Responsibility for initializing each update command from params would be the responsibility of the xxxUpdateCommand itself, e.g. base class UpdateCommand would define an abstract initializeFromParams( params ) which each subclass would implement.

          D. Responsibility for executing each update command would also lay with the xxxUpdateCommand via an abstract execute() method, which for AddUpdateCommand would be:

          SolrCore.getSolrCore().getUpdateHandler().addDoc( this );

          (It could be argued that some of the various UpdateHandler functionality should migrate to the xxxUpdateCommands themselves, but that's another discussion)

          E. The DeleteUpdateCommand could allow the target record to be defined by setting a Document, and using the schema keyfield definition to extract the ID. The logic could be encapsulated in DeleteUpdateCommand.execute() itself, extracting the keyfield value from the Document and setting it in its own id field before calling getUpdateHandler().delete( cmd ). Or if setters were defined for the internal fields, setDocument(doc) could call setId(id) and not need to keep a reference to the document. In either case the existing UpdateHandlers don't even need to know about this new modality.

          Then SQLUpdateHandler or CSVUpdateHandler or whatever (which should really be called xxxRequestHandler since UpdateHandler is quite another beast) would be agnostic as to whether they are doing adds or deletes of the selected rows; they could just ask for an UpdateCommand implementation, and in a loop set the constructed Document, and call updateCommand.execute(). While it would be most efficient if for deletes the query or data file only specified a single keyfield column, there is no harm in creating Documents with many more fields and simply ignoring all but the keyfield.

          Either way one could set up an explicit deletion-only CSV handler:

          <!-- Could be csv/delete, delete/csv, product/delete, or whatever -->
          <requestHandler name="csv/delete" class="solr.CSVUpdateHandler">
          <lst name="invariants">
          <str name="action">delete</str>
          </lst>
          </requestHandler>

          or more flexibly allow the parameter to be passed in a URL query param:

          /<solr-root>/update/product?action=delete&...

          4. Similarly with this structure your CommitHandler and OptimizeHandler classes can be replaced with a simple CommandHandler which is defined in solrconfig; it could even handle deletion by ID or query:

          <requestHandler name="commit-and-flush" class="solr.CommandHandler">
          <lst name="invariants">
          <str name="action">commit</str>
          <str name="waitFlush">true</str>
          </lst>
          <lst name="defaults">
          <str name="waitSearcher">false</str>
          </lst>
          </requestHandler>

          5. Of course a RequestHandler could be non-command-agnostic as well, for example the XML update parsing code now in SolrCore.update() could be recast as a RequestHandler which reads the <add>, <delete>, <commit>, <optimize> tags and decides itself which type of UpdateCommand(s) to instantiate (using UpdateCommand.getCommandFromString as described above).

          But it could be made more agnostic and simpler still, for example by parsing the action and parameters from XML into a SolrParams and calling getCommandFromParams as described above. That would forge a path for other non-XML command-stream-based update handlers by leaving only XML parsing logic in the XML update RequestHandler – all the rest of the logic would be in UpdateCommand, the UpdateCommand implementations, and UpdateHandler.

          Anyway, those are some ideas, for what they're worth.

          Show
          J.J. Larrea added a comment - I think this is a fantastic effort, so please take my comments and suggestions for improvement below in the context of my appreciation you took the time to wade deeply into the SOLR request-handling code looking for places to improve it. If the request structure is cleaned up along these lines it will make it much simpler for people to develop and contribute alternate request handlers (both update and query varieties) and further SOLRs standing as an open-source community-driven project. 1. I really like your idea of using the URL suffix to specify the handler. But it looks like you have required this to be a fixed 2-level hierarchy, with URLs of the form http://<host>:<port>/<solr-root>/<action>/<handler> which are looked up in a handler table keyed by <action> and then <handler>. For example search/standard looks for a <requestHandler action="search" name="standard"...>, loads the indicated handler class, and associates it with the config. But this hierarchy seems a little overdetermined and the implementation overcomplex. It could be argued that one would want <handler>/<action>, for the <handler> alone to resolve to a handler class, and the handler class to be responsible for deciding how to act on the <action> part... but any hierarchical arrangement that makes perfect sense to one person can seem wrong to another. And in your implementation I see no actual action taken by the <action> argument other than selection of a per-<action> default <handler>, and that seems more complexity than it's worth; there are easier ways. I would suggest the simpler approach of simply taking the entire path after <solr-root> to be a handler configuration name, without conforming it to any fixed hierarchy, e.g. a developer could set up <requestHandler name="search/products/bysku" class="...">...</> <requestHandler name="search/products/byname" class="...">...</> <requestHandler name="search/companies" class="...">...</> <requestHandler name="index/csv" class="...">...</> In that way establishing a command/subcommand hierarchy is entirely up to the user's solrconfig.xml setup, and there is no imposed logic as to whether the different behavior between the 3 search examples is achieved through different config of the same handler class, different handler classes, or both. As for default actions, there is no need for special code, they can entirely be defined in solrconfig. For example, if a developer sets up a /search/xxx space as above, the response to a client request /search without further qualification is entirely up to what is defined in solrconfig.xml: If there is no request handler defined under name="search" SOLR would return a standard "No handler found" message If it has a query request handler under that name (e.g. with name="search" class="solr.StandardRequestHandler") it would get to handle less-qualified requests with developer-defined defaults. It could be defined to explicitly invoke your UnavailableRequestHandler – a great idea which should be extended so the error code and error message could be custom-configured with handler config params. Thus I think this free-form hierarchy would achieve greater simplicity and greater flexibility at the same time. 2. What would make this even more powerful would be the ability to "subclass" (meaning refine and/or extend) request handler configs: If the requestHandler element allowed an attribute extends="<another-requesthandler-name>" and chained the SolrParams, then one could do something like: <requestHandler name="search/products/all" class="solr.DisMaxRequestHandler" > <lst name="defaults"> <float name="tie">0.01</float> <str name="qf"> text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 </str> ... much more, per the "dismax" example in the sample solrconfig.xml ... </requestHandler> ... and replacing the "partitioned" example ... <requestHandler name="search/products/instock" extends="search/products/all" > <lst name="appends"> <str name="fq">inStock:true</str> </lst> </requestHandler> One could even allow the extending requestHandler to set a different handler class, in the case where the difference in behavior requires a different handler implementation but can share all or part of the params. 3. Structuring the code and action under /add is conceptually limiting because update-style request plugins such as SQL-based or CSV-based (and certainly XML-based) should still be able to add, replace, and delete, either based on internal logic or external command. Your code suggests further refactoring improvements along those lines. For example, in your SQLUpdateHandler example you call: AddUpdateCommand cmd = UpdateUtils.getAddUpdateCommandFromParams( params ); and then for each assembled Document cmd.doc = docmap.toDocument( schema ); UpdateUtils.addDoc( cmd ); [which does SolrCore.getSolrCore().getUpdateHandler().addDoc( cmd );] addedDocumentCount++; Lets say: A. We standardized on action=... as a way to define an action in a param B. A new method UpdateUtils.getUpdateCommandFromParams( params ) would use the action= param to decide which xxxUpdateCommand class to instantiate – though this might be better placed as a static class method getCommandFromParams defined in UpdateCommand itself. (Perhaps once it decodes the action param it could call another UpdateCommand static method getCommandFromString( name ), which would centralize the instantiation logic, and provide a pathway for it it to evolve into a name->class soft mapping should extensibility to other commands ever be desired.) C. Responsibility for initializing each update command from params would be the responsibility of the xxxUpdateCommand itself, e.g. base class UpdateCommand would define an abstract initializeFromParams( params ) which each subclass would implement. D. Responsibility for executing each update command would also lay with the xxxUpdateCommand via an abstract execute() method, which for AddUpdateCommand would be: SolrCore.getSolrCore().getUpdateHandler().addDoc( this ); (It could be argued that some of the various UpdateHandler functionality should migrate to the xxxUpdateCommands themselves, but that's another discussion) E. The DeleteUpdateCommand could allow the target record to be defined by setting a Document, and using the schema keyfield definition to extract the ID. The logic could be encapsulated in DeleteUpdateCommand.execute() itself, extracting the keyfield value from the Document and setting it in its own id field before calling getUpdateHandler().delete( cmd ). Or if setters were defined for the internal fields, setDocument(doc) could call setId(id) and not need to keep a reference to the document. In either case the existing UpdateHandlers don't even need to know about this new modality. Then SQLUpdateHandler or CSVUpdateHandler or whatever (which should really be called xxxRequestHandler since UpdateHandler is quite another beast) would be agnostic as to whether they are doing adds or deletes of the selected rows; they could just ask for an UpdateCommand implementation, and in a loop set the constructed Document, and call updateCommand.execute(). While it would be most efficient if for deletes the query or data file only specified a single keyfield column, there is no harm in creating Documents with many more fields and simply ignoring all but the keyfield. Either way one could set up an explicit deletion-only CSV handler: <!-- Could be csv/delete, delete/csv, product/delete, or whatever --> <requestHandler name="csv/delete" class="solr.CSVUpdateHandler"> <lst name="invariants"> <str name="action">delete</str> </lst> </requestHandler> or more flexibly allow the parameter to be passed in a URL query param: /<solr-root>/update/product?action=delete&... 4. Similarly with this structure your CommitHandler and OptimizeHandler classes can be replaced with a simple CommandHandler which is defined in solrconfig; it could even handle deletion by ID or query: <requestHandler name="commit-and-flush" class="solr.CommandHandler"> <lst name="invariants"> <str name="action">commit</str> <str name="waitFlush">true</str> </lst> <lst name="defaults"> <str name="waitSearcher">false</str> </lst> </requestHandler> 5. Of course a RequestHandler could be non-command-agnostic as well, for example the XML update parsing code now in SolrCore.update() could be recast as a RequestHandler which reads the <add>, <delete>, <commit>, <optimize> tags and decides itself which type of UpdateCommand(s) to instantiate (using UpdateCommand.getCommandFromString as described above). But it could be made more agnostic and simpler still, for example by parsing the action and parameters from XML into a SolrParams and calling getCommandFromParams as described above. That would forge a path for other non-XML command-stream-based update handlers by leaving only XML parsing logic in the XML update RequestHandler – all the rest of the logic would be in UpdateCommand, the UpdateCommand implementations, and UpdateHandler. Anyway, those are some ideas, for what they're worth.
          Hide
          J.J. Larrea added a comment -

          6. I may be going a little crazy with this soft-configuration concept, but thinking about how to support the legacy

          /select?qt=faceted...

          format leads me to think there could be a trivial (3-line handleRequestBody) NamedRequestHandler which uses one parameter to provide the name of another parameter which names another requestHandler definition which it would then invoke. With that,

          <requestHandler name="select" class="solr. NamedRequestHandler">
          <lst name="invariants">
          <str name="handlerNameParameter">qt</str>
          </lst>
          <lst name="defaults">
          <str name="qt">standard</str>
          </lst>
          </requestHandler>

          would allow /select?qt=dismax... to be soft-implemented; a developer who had no use for the non-URL-path selectors could strip it out, another developer who wanted to use a different parameter to set the handler could define it that way.

          Show
          J.J. Larrea added a comment - 6. I may be going a little crazy with this soft-configuration concept, but thinking about how to support the legacy /select?qt=faceted... format leads me to think there could be a trivial (3-line handleRequestBody) NamedRequestHandler which uses one parameter to provide the name of another parameter which names another requestHandler definition which it would then invoke. With that, <requestHandler name="select" class="solr. NamedRequestHandler"> <lst name="invariants"> <str name="handlerNameParameter">qt</str> </lst> <lst name="defaults"> <str name="qt">standard</str> </lst> </requestHandler> would allow /select?qt=dismax... to be soft-implemented; a developer who had no use for the non-URL-path selectors could strip it out, another developer who wanted to use a different parameter to set the handler could define it that way.
          Hide
          Yonik Seeley added a comment -

          I haven't had a chance to look at all this stuff yet, but we should take care to not try and implement too much.
          In some cases the right "plugin" mechanism might be the servlet spec and web.xml (made me think of it when I saw the "cookies" comment

          Show
          Yonik Seeley added a comment - I haven't had a chance to look at all this stuff yet, but we should take care to not try and implement too much. In some cases the right "plugin" mechanism might be the servlet spec and web.xml (made me think of it when I saw the "cookies" comment
          Hide
          Ryan McKinley added a comment -

          I agree with points J.J's points 1-6. thanks

          If one were to look at only one thing, the stuff i to look at woud be how Handlers get their parameters and content streams.

          from my previous post:

          I define three basic types of request handlers in: http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/
          1) standard. This gets everything from parameters (get or post)
          2) posted. This gets a reader from the posted body:
          http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/PostedRequestHandler.java
          3) multipart. This gets an iterator over each file item using commons-upload streaming API
          http://jakarta.apache.org/commons/fileupload/streaming.html
          http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/MultipartRequestHandler.java

          and: http://svn.lapnap.net/solr/handler-draft/solr/src/webapp/src/org/apache/solr/servlet/SolrRequestFilter.java fills them up.

          It currently uses instanceof... any other ideas?

          Show
          Ryan McKinley added a comment - I agree with points J.J's points 1-6. thanks If one were to look at only one thing, the stuff i to look at woud be how Handlers get their parameters and content streams. from my previous post: I define three basic types of request handlers in: http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/ 1) standard. This gets everything from parameters (get or post) 2) posted. This gets a reader from the posted body: http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/PostedRequestHandler.java 3) multipart. This gets an iterator over each file item using commons-upload streaming API http://jakarta.apache.org/commons/fileupload/streaming.html http://svn.lapnap.net/solr/handler-draft/solr/src/java/org/apache/solr/handler/MultipartRequestHandler.java and: http://svn.lapnap.net/solr/handler-draft/solr/src/webapp/src/org/apache/solr/servlet/SolrRequestFilter.java fills them up. It currently uses instanceof... any other ideas?
          Hide
          Ryan McKinley added a comment -

          I attached 'DispatchFilter.patch' This extracts some stuff from my previous (monstrous) refactoring that will be useful. I did not refactor anything, but I did:

          • Added SolrRequestParser interface
          • Added ContentStream interface
          • Extracted common handler code into RequestHandlerBase.
          • Added 'getRequestParser() to Handler interface. The RequestHandlerBase would return a Parser from the 'invariant' 'rp'
          • modified StandardRequest/DisMaxRequest to use the HandlerBase
          • Added Iterable<ContentStream> getContentStreams() to SolrQueryRequest
          • Added UpdateHandlerRequest that acts as /update The only difference is the returned format
          • Replaced SolrServelt and UpdateServlet with SolrRequestFilter. This now dispatches all requests

          With the example solrconfig.xml, you can run queries like:
          http://localhost:8983/solr/select/?q=solr
          http://localhost:8983/solr/standard?q=solr
          http://localhost:8983/solr/dismax?q=solr
          http://localhost:8983/solr/dismax?q=solr
          http://localhost:8983/solr/instock?q=solr
          http://localhost:8983/solr/update

          TODO:

          • actually build the RequestParser registry. currently everything returns StandardRequestParser
          • handle multipart requests in StandardRequestParser
          • move update( reader, writer ) out of SolrCore into UpdateRequestHandler
          • - - - - -
            SVN NOTE:

          when i tested this patch, I got strange results for the files it added. Specifically, new files are repeated twice.

          src\java\org\apache\solr\handler\UpdateRequestHandler.java
          package org.apache.solr.handler;
          public class UpdateRequestHandler()

          { ... }

          package org.apache.solr.handler;
          public class UpdateRequestHandler()
          { ... }

          Delete the second copy of the class it it will compile ok.

          • - - - - -

          Is it possible to delete previous file attachments? I don't think HandlerRefactoring.DRAFT.* is useful for anyone to see anymore.

          ryan

          Show
          Ryan McKinley added a comment - I attached 'DispatchFilter.patch' This extracts some stuff from my previous (monstrous) refactoring that will be useful. I did not refactor anything, but I did: Added SolrRequestParser interface Added ContentStream interface Extracted common handler code into RequestHandlerBase. Added 'getRequestParser() to Handler interface. The RequestHandlerBase would return a Parser from the 'invariant' 'rp' modified StandardRequest/DisMaxRequest to use the HandlerBase Added Iterable<ContentStream> getContentStreams() to SolrQueryRequest Added UpdateHandlerRequest that acts as /update The only difference is the returned format Replaced SolrServelt and UpdateServlet with SolrRequestFilter. This now dispatches all requests With the example solrconfig.xml, you can run queries like: http://localhost:8983/solr/select/?q=solr http://localhost:8983/solr/standard?q=solr http://localhost:8983/solr/dismax?q=solr http://localhost:8983/solr/dismax?q=solr http://localhost:8983/solr/instock?q=solr http://localhost:8983/solr/update TODO: actually build the RequestParser registry. currently everything returns StandardRequestParser handle multipart requests in StandardRequestParser move update( reader, writer ) out of SolrCore into UpdateRequestHandler - - - - - SVN NOTE: when i tested this patch, I got strange results for the files it added. Specifically, new files are repeated twice. src\java\org\apache\solr\handler\UpdateRequestHandler.java package org.apache.solr.handler; public class UpdateRequestHandler() { ... } package org.apache.solr.handler; public class UpdateRequestHandler() { ... } Delete the second copy of the class it it will compile ok. - - - - - Is it possible to delete previous file attachments? I don't think HandlerRefactoring.DRAFT.* is useful for anyone to see anymore. ryan
          Hide
          Ryan McKinley added a comment -

          removed getRequestParser() from Handler interface.

          using ':' in the URL to specify the request parser.

          http://localhost:8983/solr/standard:requestparser?q=video

          NOTE: it still uses a defalt request parser.

          Show
          Ryan McKinley added a comment - removed getRequestParser() from Handler interface. using ':' in the URL to specify the request parser. http://localhost:8983/solr/standard:requestparser?q=video NOTE: it still uses a defalt request parser.
          Hide
          Ryan McKinley added a comment -

          I just updated DispatchFilter.path to implement most of our discussion on solr-dev

          The implemented URL structure is:
          http://$

          {host}

          :$

          {port}

          /$

          {context}/${path/defined/in/solrconfig.xml}:${optional/path/for/handler}?${params}

          (If there needs to be a constant between ${context}

          and $

          {path}

          I am ok with it, but i don't think its necessary.)

          If you get this running, check:
          http://localhost:8983/solr/test.html

          This is a test page that shows the various methods to get streamed content into the handler

          • with param stream.URL - puts the content of remote url into stream
          • with stream.BODY - puts the content of the parameter into a stream
          • multipart upload. put the fields into SolrParams and the Files into streams
          • POST with no query string. - uses the fields to fill SolrParams
          • POST with query string. - uses the post body as the ContentStream, fills SolrParams from the query string

          I think this covers all the normal cases. If you can think of others, let me know. I believe things that would iterate over a huge collection of streams should be implemented as a RequestHandler, not as the RequestBuilder

          • - - - - - - - - - -

          Things to note:

          1) /select and /update are handled with their same old servlets. They have just been refactored to LegacyUpdateServlet etc. I think the example solrconfig.xml should map /update to the new framework, not the old one. This would get people who start using solr to use the new framework, but still work for people who don't map /update in their solrconfig.xml. This would also require we change the included 'post.sh' to use: URL=http://localhost:8983/solr/update?stream (so the content is read as a stream)

          2) Even when /update is mapped to the legacy servlet, you can map subfolders to the new one. I included /update/commit in this patch

          3) Configuration? Where should we configure enable/disable streams? max file upload size? upload temp directory? I REALLY think its a bad idea to enable stream.URL by default. Although the model is that solr sits in a private network, we know that is not always the case. It may also be good to configure a required user role to be able to stream. for example, stream.URL requires isUserInRole( 'admin' );

          4) Sending context to handlers. Some handlers will want/need additional information about the request (headers,user,remote host,path, etc). In this patch, I add 'path' to all requests. There should be a way for the handler to say what information it needs

          ryan

          Show
          Ryan McKinley added a comment - I just updated DispatchFilter.path to implement most of our discussion on solr-dev The implemented URL structure is: http://$ {host} :$ {port} /$ {context}/${path/defined/in/solrconfig.xml}:${optional/path/for/handler}?${params} (If there needs to be a constant between ${context} and $ {path} I am ok with it, but i don't think its necessary.) If you get this running, check: http://localhost:8983/solr/test.html This is a test page that shows the various methods to get streamed content into the handler with param stream.URL - puts the content of remote url into stream with stream.BODY - puts the content of the parameter into a stream multipart upload. put the fields into SolrParams and the Files into streams POST with no query string. - uses the fields to fill SolrParams POST with query string. - uses the post body as the ContentStream, fills SolrParams from the query string I think this covers all the normal cases. If you can think of others, let me know. I believe things that would iterate over a huge collection of streams should be implemented as a RequestHandler, not as the RequestBuilder - - - - - - - - - - Things to note: 1) /select and /update are handled with their same old servlets. They have just been refactored to LegacyUpdateServlet etc. I think the example solrconfig.xml should map /update to the new framework, not the old one. This would get people who start using solr to use the new framework, but still work for people who don't map /update in their solrconfig.xml. This would also require we change the included 'post.sh' to use: URL= http://localhost:8983/solr/update?stream (so the content is read as a stream) 2) Even when /update is mapped to the legacy servlet, you can map subfolders to the new one. I included /update/commit in this patch 3) Configuration? Where should we configure enable/disable streams? max file upload size? upload temp directory? I REALLY think its a bad idea to enable stream.URL by default. Although the model is that solr sits in a private network, we know that is not always the case. It may also be good to configure a required user role to be able to stream. for example, stream.URL requires isUserInRole( 'admin' ); 4) Sending context to handlers. Some handlers will want/need additional information about the request (headers,user,remote host,path, etc). In this patch, I add 'path' to all requests. There should be a way for the handler to say what information it needs ryan
          Hide
          Ryan McKinley added a comment -

          I just thought of something that will make Hoss' blod curl! I KNOW it
          is a bad idea for things within solr-core, but it would be the
          cleanest/cheapest way to expose the unknown things a potential
          RequestHandler would want from the HttpServletRequest without changing
          the existing API. It goes like this:

          SolrRequest solrReq = (build the solr request)
          solrReq.getContent().put( "HttpServletRequest", req );

          It would never be used by anything in core.

          The alternative I see is to give each handler some mechanism to tell
          the RequestBuilder what attributes it needs set, then have the
          RequestBuilder put those attributes in the context or solr params. In
          my opinion, that is a lot of overhead to do stuff that clearly falls
          outside of what solr-core should be doing.

          ryan

          Show
          Ryan McKinley added a comment - I just thought of something that will make Hoss' blod curl! I KNOW it is a bad idea for things within solr-core, but it would be the cleanest/cheapest way to expose the unknown things a potential RequestHandler would want from the HttpServletRequest without changing the existing API. It goes like this: SolrRequest solrReq = (build the solr request) solrReq.getContent().put( "HttpServletRequest", req ); It would never be used by anything in core. The alternative I see is to give each handler some mechanism to tell the RequestBuilder what attributes it needs set, then have the RequestBuilder put those attributes in the context or solr params. In my opinion, that is a lot of overhead to do stuff that clearly falls outside of what solr-core should be doing. ryan
          Hide
          Ryan McKinley added a comment -

          this can replace the special servlet for SOLR-85

          Show
          Ryan McKinley added a comment - this can replace the special servlet for SOLR-85
          Hide
          Ryan McKinley added a comment -

          can be implemented with RequestHandler with ContentStreams

          Show
          Ryan McKinley added a comment - can be implemented with RequestHandler with ContentStreams
          Hide
          Ryan McKinley added a comment -

          could be implemented with ContentStreams

          Show
          Ryan McKinley added a comment - could be implemented with ContentStreams
          Hide
          Ryan McKinley added a comment -

          this extracts the common things into RequestHandlerBase

          Show
          Ryan McKinley added a comment - this extracts the common things into RequestHandlerBase
          Hide
          Yonik Seeley added a comment -

          Finally got a chance to review this... it's really great stuff Ryan!

          Random questions:

          • What's the funny chars at the start of web.xml?
          • Why is there a 2MB upload limit, and why does it only apply to multi-part uploads? I think I'm missing some background in this area...

          I'm not going to try and pick out little bugs or suggest little changes to a patch of this scope and size...
          I think we should have more frequent Solr releases, and that the Solr trunk can have APIs change from one day to the next w/o having to maintaining back-compatibility.

          So since we agree on the direction, I think this patch should be committed and we should work from there.
          I'd like to hear from hoss though, since he was following along more than I was, esp at the start of that marathon thread.

          Show
          Yonik Seeley added a comment - Finally got a chance to review this... it's really great stuff Ryan! Random questions: What's the funny chars at the start of web.xml? Why is there a 2MB upload limit, and why does it only apply to multi-part uploads? I think I'm missing some background in this area... I'm not going to try and pick out little bugs or suggest little changes to a patch of this scope and size... I think we should have more frequent Solr releases, and that the Solr trunk can have APIs change from one day to the next w/o having to maintaining back-compatibility. So since we agree on the direction, I think this patch should be committed and we should work from there. I'd like to hear from hoss though, since he was following along more than I was, esp at the start of that marathon thread.
          Hide
          Yonik Seeley added a comment -

          Oh, and where possible, it would be nice to have some unit tests.
          Servlet related stuff is probably too hard, but things like stream.body should be easy.

          Show
          Yonik Seeley added a comment - Oh, and where possible, it would be nice to have some unit tests. Servlet related stuff is probably too hard, but things like stream.body should be easy.
          Hide
          Ryan McKinley added a comment -

          If there are funny chars at the start of web.xml, it was an SVN error.

          the 2MB limit is set in solrconfig.xml

          <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

          maybe 2MB is too small as the default, but i figured it shoudl be configurable.

          Show
          Ryan McKinley added a comment - If there are funny chars at the start of web.xml, it was an SVN error. the 2MB limit is set in solrconfig.xml <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" /> maybe 2MB is too small as the default, but i figured it shoudl be configurable.
          Hide
          Ryan McKinley added a comment -

          This is minor changes to the previous filter it

          1. Adds unit tests for stream.BODY and stream.URL
          2. uses different init logic for HandlerBase and DisMax (legacy stuff)
          3. Adds Size() to the content stream. filled up if we know it otherwise not.
          4. A few comments here and there...

          Show
          Ryan McKinley added a comment - This is minor changes to the previous filter it 1. Adds unit tests for stream.BODY and stream.URL 2. uses different init logic for HandlerBase and DisMax (legacy stuff) 3. Adds Size() to the content stream. filled up if we know it otherwise not. 4. A few comments here and there...
          Hide
          Hoss Man added a comment -

          Woot! ... i think we're really close to comiting this.

          I made a hodgepodge list of comments as i read through everything, and then tried to organize them. I agree with yonik that we should feel free to commit new functionality without being afraid of needing to change the api of that functionality befor the next release, but i'm not 100% comfortable with how backwards compatible this patch is for the existing /select and /update URLs ... this may just be an issue of me being paranoid (and tired) but there's at least one code path difference.

          Anyway, here are my notes...

          Comments regarding backwards compatibility of the patch...

          • SolrCore.update(Reader,Writer) was a public method that's been
            removed ... this is probably fine, just pointing it out for the
            record.
          • SolrUpdateServlet used HttpServletRequest.getReader, the new
            UpdateRequestHandler uses an InputStreamReader arround
            HttpServletRequest.getInputStream() ... this seems bad for legacy
            update support from a char encoding standpoint.
          • While i think it's important to refactor the XML Update
            parsing out of SolrCore - I'm still not clear what is gained by
            eliminating SolrServlet and SolrUpdate. The big advantage of
            the new dispatcher being a Filter is that it can pass requests on
            that it doesn't want to deal with, so why not leave the existing
            servlets arround with only the minimum neccessary changes...
          • move SolrCore's init to Dispatcher
          • use 3 arg core.execute in SolrServlet
          • have SolrUpdateServlet call UpdateRequestHandler.update(Reader)
            and generate the legacy response XML
            ...in order to reduce the possibility of an introducing bugs
            (particularly since the existing Servlets are the one area where we
            don't have any unit tests)

          Comments regarding functionality that i think we may want to address
          before commiting (but i won't fight over if i'm the only one that cares)...

          • UpdateRequestHandler should probably renamed XmlUpdateRequestHandler
            (particularly since i expect Yonik to commit a
            CsvUpdateRequestHandler real soon now)
          • StandardRequestParser can't assume that a POST which isn't
            multipart/* should be handled by a RawRequestParser ... if the
            content type is "application/x-www-form-urlencoded" then
            SimpleRequestParser should be used (so all params from query string
            and body are included)
          • What should the expectations of
            ContentStream.getInputStream().close() be? Should the Dispatcher
            iterate over any Iterable streams when writing the output and try
            to close them, ignoring any Exceptions?
          • I'm really not fond of having SolrParams.STREAM_TYPE. Can we please,
            please leave it out for now and rely on on content-type detection?
            We can add it back in if/when we make RequestParser a public
            interface and let people register them in solrconfig.
          • I really don't think we want to open the pandoras box of putting
            the HttpServletRequest in the SolrQueryRequest ... i'd hate to put
            that in and then have to support it forever.

          Things in the current patch that aren't strictly neccessary
          for the current issue and can (should?) be commited seperately...

          • are we definitely deprecating SolrQueryResponse.getException ?
          • StandardRequestHandler and DisMaxRequestHandler have only been
            changed to subclass the new base class.
          • only whitespace changes in SolrRequestHandler.java
          • SolrServletRequest has only imports rearranged

          Things which definitely shouldn't block up the patch, but should go on
          a short term todo list...

          • see backwards compatibility comment about (Xml)UpdateRequestHandler
            using InputStreamReader without specifying a charset ... in general
            the handler should look at the ContentStream's content type to determine
            the encoding of the InputStream (and probably default to UTF-8)
          • need to work out what kind of NamedList should be returned by
            (Xml)UpdateRequestHandler.update(Reader)
          • some of the new files are missing the Apache boilerplate.
          • a use case we talked about that still isn't covered is opening local
            files as a stream ... this should be easy to add later right along
            side STREAM_URL.
          • we should fill in the getURL methods for DisMax/Standard to point at wiki
          • CommitRequestHandler should use UpdateParams.OPTIMIZE
          • the init semantics for (Xml)UpdateRequestHandler are odd: as a
            RequestHandler it's garunteed that init(NamedList) will be called, but
            instead it uses it's own private init() that's called lazily.
          • DumpRequestHandler should dump ContentStream.getSize().
          • doFilter should call parsers.parse( path, req ) as soon as it has
            the path, and then delegate to a helper method that doesn't have
            access to the HttpServletRequest ... this reduces both the
            complexity of the method, and the likelyhood of
            someone inadvertently introducing an error (mangling the POST body)
            when making future changes in the long deep method.
            The one thing to watch out for is forcing POST in legacy
            update. (assuming legacy update stays in the Filter)
          • STREAM_URL based ContentStreams can have a meaningful getSize()
            and getContentType() if we use openConnection instead of openStream.
          • STREAM_BODY based ContentStreams can get their size from the char[]
          • it's not clear to me why the interface for SolrRequestParser is...
            public SolrParams parseParamsAndFillStreams(
            final HttpServletRequest, ArrayList<ContentStream>)
            ...instead of just...
            public SolrQueryRequest(final HttpServletRequest)
            ...with the param parsing loops in
            SolrRequestParsers.buildRequestFrom
            in static utility methods (in case a RequestParser doesn't want to
            support STREAM_URL or STREAM_BODY)
          Show
          Hoss Man added a comment - Woot! ... i think we're really close to comiting this. I made a hodgepodge list of comments as i read through everything, and then tried to organize them. I agree with yonik that we should feel free to commit new functionality without being afraid of needing to change the api of that functionality befor the next release, but i'm not 100% comfortable with how backwards compatible this patch is for the existing /select and /update URLs ... this may just be an issue of me being paranoid (and tired) but there's at least one code path difference. Anyway, here are my notes... Comments regarding backwards compatibility of the patch... SolrCore.update(Reader,Writer) was a public method that's been removed ... this is probably fine, just pointing it out for the record. SolrUpdateServlet used HttpServletRequest.getReader, the new UpdateRequestHandler uses an InputStreamReader arround HttpServletRequest.getInputStream() ... this seems bad for legacy update support from a char encoding standpoint. While i think it's important to refactor the XML Update parsing out of SolrCore - I'm still not clear what is gained by eliminating SolrServlet and SolrUpdate. The big advantage of the new dispatcher being a Filter is that it can pass requests on that it doesn't want to deal with, so why not leave the existing servlets arround with only the minimum neccessary changes... move SolrCore's init to Dispatcher use 3 arg core.execute in SolrServlet have SolrUpdateServlet call UpdateRequestHandler.update(Reader) and generate the legacy response XML ...in order to reduce the possibility of an introducing bugs (particularly since the existing Servlets are the one area where we don't have any unit tests) Comments regarding functionality that i think we may want to address before commiting (but i won't fight over if i'm the only one that cares)... UpdateRequestHandler should probably renamed XmlUpdateRequestHandler (particularly since i expect Yonik to commit a CsvUpdateRequestHandler real soon now) StandardRequestParser can't assume that a POST which isn't multipart/* should be handled by a RawRequestParser ... if the content type is "application/x-www-form-urlencoded" then SimpleRequestParser should be used (so all params from query string and body are included) What should the expectations of ContentStream.getInputStream().close() be? Should the Dispatcher iterate over any Iterable streams when writing the output and try to close them, ignoring any Exceptions? I'm really not fond of having SolrParams.STREAM_TYPE. Can we please, please leave it out for now and rely on on content-type detection? We can add it back in if/when we make RequestParser a public interface and let people register them in solrconfig. I really don't think we want to open the pandoras box of putting the HttpServletRequest in the SolrQueryRequest ... i'd hate to put that in and then have to support it forever. Things in the current patch that aren't strictly neccessary for the current issue and can (should?) be commited seperately... are we definitely deprecating SolrQueryResponse.getException ? StandardRequestHandler and DisMaxRequestHandler have only been changed to subclass the new base class. only whitespace changes in SolrRequestHandler.java SolrServletRequest has only imports rearranged Things which definitely shouldn't block up the patch, but should go on a short term todo list... see backwards compatibility comment about (Xml)UpdateRequestHandler using InputStreamReader without specifying a charset ... in general the handler should look at the ContentStream's content type to determine the encoding of the InputStream (and probably default to UTF-8) need to work out what kind of NamedList should be returned by (Xml)UpdateRequestHandler.update(Reader) some of the new files are missing the Apache boilerplate. a use case we talked about that still isn't covered is opening local files as a stream ... this should be easy to add later right along side STREAM_URL. we should fill in the getURL methods for DisMax/Standard to point at wiki CommitRequestHandler should use UpdateParams.OPTIMIZE the init semantics for (Xml)UpdateRequestHandler are odd: as a RequestHandler it's garunteed that init(NamedList) will be called, but instead it uses it's own private init() that's called lazily. DumpRequestHandler should dump ContentStream.getSize(). doFilter should call parsers.parse( path, req ) as soon as it has the path, and then delegate to a helper method that doesn't have access to the HttpServletRequest ... this reduces both the complexity of the method, and the likelyhood of someone inadvertently introducing an error (mangling the POST body) when making future changes in the long deep method. The one thing to watch out for is forcing POST in legacy update. (assuming legacy update stays in the Filter) STREAM_URL based ContentStreams can have a meaningful getSize() and getContentType() if we use openConnection instead of openStream. STREAM_BODY based ContentStreams can get their size from the char[] it's not clear to me why the interface for SolrRequestParser is... public SolrParams parseParamsAndFillStreams( final HttpServletRequest, ArrayList<ContentStream>) ...instead of just... public SolrQueryRequest(final HttpServletRequest) ...with the param parsing loops in SolrRequestParsers.buildRequestFrom in static utility methods (in case a RequestParser doesn't want to support STREAM_URL or STREAM_BODY)
          Hide
          Ryan McKinley added a comment -

          Thanks for going through this!

          I'll comment on points i have answers or questions. The rest will go
          on the TODO list.

          Ok, so we should make sure to put the charset into
          ContentStream.getContentType() and open the Reader with:

          String charset = getCharset( stream.getContentType() );
          new InputStreamReader( stream.getStream(), charset );

          Sounds reasonable. I took them out because (at the time) it seemed
          clearer and has less duplicated code.

          yes. At some point it would also be good to make a stronger name
          distinction between UpdateHandler (the thing that handles the nity
          gritty lucene indexing) and the UpdateRequestHandler – but lets save
          that for another day!

          As written, the StandardRequestParser:
          1) checks if multipart
          2) checks if it has parameters in the URL (?xxx=yyy)
          if it has parameters (?xxx=yyy) then use the RawRequestParser
          otherwise it pulls parameters from the map. (SimpleRequestParser)

          To trigger raw request reading you must have a parameter on the URL.
          This was my design in response to Yonik's observation that curl puts
          "application/x-www-form-urlencoded" in the header even if it is not
          form-urlencoded encoded.

          As written, it does not rely on clients putting accurate headers
          (except for multipart) - it relies on a URL convention.

          I only put it in there to make you happy! I'll take it out and we can
          deal with it later if necessary.

          I didn't think i could get that past you! I'll take it out and save
          the pleeding for another time.

          for a local file, you can use stream.url=file:///C:/pathtofile.txt,
          for remote ones, you use stream.url=http://...

          We should have a good notice in the config warning people to have some
          security running before enabling streaming.

          I had implemented it the normal way, BUT it broke many tests (since
          they never call init). The better solution is to make sure the tests
          call init a standard way, but that got me into editing many files I
          don't quite understand, so i opted for lazy init.

          That sounds fine. Since it is a tenative private interface, i was not
          too worried about it.

          Show
          Ryan McKinley added a comment - Thanks for going through this! I'll comment on points i have answers or questions. The rest will go on the TODO list. Ok, so we should make sure to put the charset into ContentStream.getContentType() and open the Reader with: String charset = getCharset( stream.getContentType() ); new InputStreamReader( stream.getStream(), charset ); Sounds reasonable. I took them out because (at the time) it seemed clearer and has less duplicated code. yes. At some point it would also be good to make a stronger name distinction between UpdateHandler (the thing that handles the nity gritty lucene indexing) and the UpdateRequestHandler – but lets save that for another day! As written, the StandardRequestParser: 1) checks if multipart 2) checks if it has parameters in the URL (?xxx=yyy) if it has parameters (?xxx=yyy) then use the RawRequestParser otherwise it pulls parameters from the map. (SimpleRequestParser) To trigger raw request reading you must have a parameter on the URL. This was my design in response to Yonik's observation that curl puts "application/x-www-form-urlencoded" in the header even if it is not form-urlencoded encoded. As written, it does not rely on clients putting accurate headers (except for multipart) - it relies on a URL convention. I only put it in there to make you happy! I'll take it out and we can deal with it later if necessary. I didn't think i could get that past you! I'll take it out and save the pleeding for another time. for a local file, you can use stream.url= file:///C:/pathtofile.txt , for remote ones, you use stream.url= http:// ... We should have a good notice in the config warning people to have some security running before enabling streaming. I had implemented it the normal way, BUT it broke many tests (since they never call init). The better solution is to make sure the tests call init a standard way, but that got me into editing many files I don't quite understand, so i opted for lazy init. That sounds fine. Since it is a tenative private interface, i was not too worried about it.
          Hide
          Ryan McKinley added a comment -

          Hymm, this is awkward. I'm not quite sure what happened. But this is what I thought I sent. (I apologize that you already took the time to try to decipher it). I'll include the original, for clarification

          • - - - - - -

          >
          > Woot! ... i think we're really close to comiting this.
          >

          Thanks for going through this!

          I'll comment on points i have answers or questions. The rest will go
          on the TODO list.

          > - SolrUpdateServlet used HttpServletRequest.getReader, the new
          > UpdateRequestHandler uses an InputStreamReader arround
          > HttpServletRequest.getInputStream() ... this seems bad for legacy
          > update support from a char encoding standpoint.

          Ok, so we should make sure to put the charset into
          ContentStream.getContentType() and open the Reader with:

          String charset = getCharset( stream.getContentType() );
          new InputStreamReader( stream.getStream(), charset );

          > - While i think it's important to refactor the XML Update
          > parsing out of SolrCore - I'm still not clear what is gained by
          > eliminating SolrServlet and SolrUpdate. The big advantage of
          > the new dispatcher being a Filter is that it can pass requests on
          > that it doesn't want to deal with, so why not leave the existing
          > servlets arround with only the minimum neccessary changes...

          Sounds reasonable. I took them out because (at the time) it seemed
          clearer and has less duplicated code.

          > Comments regarding functionality that i think we may want to address
          > before commiting (but i won't fight over if i'm the only one that cares)...
          >
          > - UpdateRequestHandler should probably renamed XmlUpdateRequestHandler
          > (particularly since i expect Yonik to commit a
          > CsvUpdateRequestHandler real soon now)

          yes. At some point it would also be good to make a stronger name
          distinction between UpdateHandler (the thing that handles the nity
          gritty lucene indexing) and the UpdateRequestHandler – but lets save
          that for another day!

          > - StandardRequestParser can't assume that a POST which isn't
          > multipart/* should be handled by a RawRequestParser ... if the
          > content type is "application/x-www-form-urlencoded" then
          > SimpleRequestParser should be used (so all params from query string
          > and body are included)

          As written, the StandardRequestParser:
          1) checks if multipart
          2) checks if it has parameters in the URL (?xxx=yyy)
          if it has parameters (?xxx=yyy) then use the RawRequestParser
          otherwise it pulls parameters from the map. (SimpleRequestParser)

          To trigger raw request reading you must have a parameter on the URL.
          This was my design in response to Yonik's observation that curl puts
          "application/x-www-form-urlencoded" in the header even if it is not
          form-urlencoded encoded.

          As written, it does not rely on clients putting accurate headers
          (except for multipart) - it relies on a URL convention.

          > - I'm really not fond of having SolrParams.STREAM_TYPE. Can we please,
          > please leave it out for now and rely on on content-type detection?
          > We can add it back in if/when we make RequestParser a public
          > interface and let people register them in solrconfig.

          I only put it in there to make you happy! I'll take it out and we can
          deal with it later if necessary.

          > - I really don't think we want to open the pandoras box of putting
          > the HttpServletRequest in the SolrQueryRequest ... i'd hate to put
          > that in and then have to support it forever.

          I didn't think i could get that past you! I'll take it out and save
          the pleading for another time.

          > - a use case we talked about that still isn't covered is opening local
          > files as a stream ... this should be easy to add later right along
          > side STREAM_URL.

          for a local file, you can use stream.url=file:///C:/pathtofile.txt,
          for remote ones, you use stream.url=http://...

          We should have a good notice in the config warning people to have some
          security running before enabling streaming.

          > - the init semantics for (Xml)UpdateRequestHandler are odd: as a
          > RequestHandler it's garunteed that init(NamedList) will be called, but
          > instead it uses it's own private init() that's called lazily.

          I had implemented it the normal way, BUT it broke many tests (since
          they never call init). The better solution is to make sure the tests
          call init a standard way, but that got me into editing many files I
          don't quite understand, so i opted for lazy init.

          > - it's not clear to me why the interface for SolrRequestParser is...
          > public SolrParams parseParamsAndFillStreams(
          > final HttpServletRequest, ArrayList<ContentStream>)
          > ...instead of just...
          > public SolrQueryRequest(final HttpServletRequest)
          > ...with the param parsing loops in
          > SolrRequestParsers.buildRequestFrom
          > in static utility methods (in case a RequestParser doesn't want to
          > support STREAM_URL or STREAM_BODY)
          >

          That sounds fine. Since it is a tentative private interface, i was not
          too worried about it.

          Show
          Ryan McKinley added a comment - Hymm, this is awkward. I'm not quite sure what happened. But this is what I thought I sent. (I apologize that you already took the time to try to decipher it). I'll include the original, for clarification - - - - - - > > Woot! ... i think we're really close to comiting this. > Thanks for going through this! I'll comment on points i have answers or questions. The rest will go on the TODO list. > - SolrUpdateServlet used HttpServletRequest.getReader, the new > UpdateRequestHandler uses an InputStreamReader arround > HttpServletRequest.getInputStream() ... this seems bad for legacy > update support from a char encoding standpoint. Ok, so we should make sure to put the charset into ContentStream.getContentType() and open the Reader with: String charset = getCharset( stream.getContentType() ); new InputStreamReader( stream.getStream(), charset ); > - While i think it's important to refactor the XML Update > parsing out of SolrCore - I'm still not clear what is gained by > eliminating SolrServlet and SolrUpdate. The big advantage of > the new dispatcher being a Filter is that it can pass requests on > that it doesn't want to deal with, so why not leave the existing > servlets arround with only the minimum neccessary changes... Sounds reasonable. I took them out because (at the time) it seemed clearer and has less duplicated code. > Comments regarding functionality that i think we may want to address > before commiting (but i won't fight over if i'm the only one that cares)... > > - UpdateRequestHandler should probably renamed XmlUpdateRequestHandler > (particularly since i expect Yonik to commit a > CsvUpdateRequestHandler real soon now) yes. At some point it would also be good to make a stronger name distinction between UpdateHandler (the thing that handles the nity gritty lucene indexing) and the UpdateRequestHandler – but lets save that for another day! > - StandardRequestParser can't assume that a POST which isn't > multipart/* should be handled by a RawRequestParser ... if the > content type is "application/x-www-form-urlencoded" then > SimpleRequestParser should be used (so all params from query string > and body are included) As written, the StandardRequestParser: 1) checks if multipart 2) checks if it has parameters in the URL (?xxx=yyy) if it has parameters (?xxx=yyy) then use the RawRequestParser otherwise it pulls parameters from the map. (SimpleRequestParser) To trigger raw request reading you must have a parameter on the URL. This was my design in response to Yonik's observation that curl puts "application/x-www-form-urlencoded" in the header even if it is not form-urlencoded encoded. As written, it does not rely on clients putting accurate headers (except for multipart) - it relies on a URL convention. > - I'm really not fond of having SolrParams.STREAM_TYPE. Can we please, > please leave it out for now and rely on on content-type detection? > We can add it back in if/when we make RequestParser a public > interface and let people register them in solrconfig. I only put it in there to make you happy! I'll take it out and we can deal with it later if necessary. > - I really don't think we want to open the pandoras box of putting > the HttpServletRequest in the SolrQueryRequest ... i'd hate to put > that in and then have to support it forever. I didn't think i could get that past you! I'll take it out and save the pleading for another time. > - a use case we talked about that still isn't covered is opening local > files as a stream ... this should be easy to add later right along > side STREAM_URL. for a local file, you can use stream.url= file:///C:/pathtofile.txt , for remote ones, you use stream.url= http:// ... We should have a good notice in the config warning people to have some security running before enabling streaming. > - the init semantics for (Xml)UpdateRequestHandler are odd: as a > RequestHandler it's garunteed that init(NamedList) will be called, but > instead it uses it's own private init() that's called lazily. I had implemented it the normal way, BUT it broke many tests (since they never call init). The better solution is to make sure the tests call init a standard way, but that got me into editing many files I don't quite understand, so i opted for lazy init. > - it's not clear to me why the interface for SolrRequestParser is... > public SolrParams parseParamsAndFillStreams( > final HttpServletRequest, ArrayList<ContentStream>) > ...instead of just... > public SolrQueryRequest(final HttpServletRequest) > ...with the param parsing loops in > SolrRequestParsers.buildRequestFrom > in static utility methods (in case a RequestParser doesn't want to > support STREAM_URL or STREAM_BODY) > That sounds fine. Since it is a tentative private interface, i was not too worried about it.
          Hide
          Ryan McKinley added a comment -

          Ok, I just uploaded (hopefully) a final version. It takes care of (most) everything you suggest.

          > - SolrUpdateServlet used HttpServletRequest.getReader, the new
          > UpdateRequestHandler uses an InputStreamReader arround
          > HttpServletRequest.getInputStream() ... this seems bad for legacy
          > update support from a char encoding standpoint.

          Can you check: XmlUpdateRequestHandler.handleRequestBody();

          I added a check for charset, but am not totally confident in the method.

          > - StandardRequestParser can't assume that a POST which isn't
          > multipart/* should be handled by a RawRequestParser ... if the
          > content type is "application/x-www-form-urlencoded" then
          > SimpleRequestParser should be used (so all params from query string
          > and body are included)

          done.

          > - What should the expectations of
          > ContentStream.getInputStream().close() be? Should the Dispatcher
          > iterate over any Iterable streams when writing the output and try
          > to close them, ignoring any Exceptions?

          Each handler is responsible to make sure they are closed. Since it is an Iterable<ContentStream> - I don't think it is OK to iterate a second time. If we make it a Collection<> then that would be a reasonable behavior.

          >
          > Things in the current patch that aren't strictly neccessary
          > for the current issue and can (should?) be commited seperately...
          >
          > - are we definitely deprecating SolrQueryResponse.getException ?

          I took it out. we can commit it separately.

          > - StandardRequestHandler and DisMaxRequestHandler have only been
          > changed to subclass the new base class.

          I left it in. You can ignore changes to StandardRequestHandler and DisMax if that makes more sense.

          > - doFilter should call parsers.parse( path, req ) as soon as it has
          > the path, and then delegate to a helper method that doesn't have
          > access to the HttpServletRequest ... this reduces both the
          > complexity of the method, and the likelyhood of
          > someone inadvertently introducing an error (mangling the POST body)
          > when making future changes in the long deep method.
          > The one thing to watch out for is forcing POST in legacy
          > update. (assuming legacy update stays in the Filter)

          I removed the legacy handling from SolrDispatchFilter - this simplifies it greatly, so i'm not sure if you still think this is necessary.

          > - it's not clear to me why the interface for SolrRequestParser is...
          > public SolrParams parseParamsAndFillStreams(
          > final HttpServletRequest, ArrayList<ContentStream>)
          > ...instead of just...
          > public SolrQueryRequest(final HttpServletRequest)
          > ...with the param parsing loops in
          > SolrRequestParsers.buildRequestFrom
          > in static utility methods (in case a RequestParser doesn't want to
          > support STREAM_URL or STREAM_BODY)
          >

          I left it as is for now. Since it is a private interface, we should feel free to clean it up / change it as we see fit. But now I go to sleep!

          Show
          Ryan McKinley added a comment - Ok, I just uploaded (hopefully) a final version. It takes care of (most) everything you suggest. > - SolrUpdateServlet used HttpServletRequest.getReader, the new > UpdateRequestHandler uses an InputStreamReader arround > HttpServletRequest.getInputStream() ... this seems bad for legacy > update support from a char encoding standpoint. Can you check: XmlUpdateRequestHandler.handleRequestBody(); I added a check for charset, but am not totally confident in the method. > - StandardRequestParser can't assume that a POST which isn't > multipart/* should be handled by a RawRequestParser ... if the > content type is "application/x-www-form-urlencoded" then > SimpleRequestParser should be used (so all params from query string > and body are included) done. > - What should the expectations of > ContentStream.getInputStream().close() be? Should the Dispatcher > iterate over any Iterable streams when writing the output and try > to close them, ignoring any Exceptions? Each handler is responsible to make sure they are closed. Since it is an Iterable<ContentStream> - I don't think it is OK to iterate a second time. If we make it a Collection<> then that would be a reasonable behavior. > > Things in the current patch that aren't strictly neccessary > for the current issue and can (should?) be commited seperately... > > - are we definitely deprecating SolrQueryResponse.getException ? I took it out. we can commit it separately. > - StandardRequestHandler and DisMaxRequestHandler have only been > changed to subclass the new base class. I left it in. You can ignore changes to StandardRequestHandler and DisMax if that makes more sense. > - doFilter should call parsers.parse( path, req ) as soon as it has > the path, and then delegate to a helper method that doesn't have > access to the HttpServletRequest ... this reduces both the > complexity of the method, and the likelyhood of > someone inadvertently introducing an error (mangling the POST body) > when making future changes in the long deep method. > The one thing to watch out for is forcing POST in legacy > update. (assuming legacy update stays in the Filter) I removed the legacy handling from SolrDispatchFilter - this simplifies it greatly, so i'm not sure if you still think this is necessary. > - it's not clear to me why the interface for SolrRequestParser is... > public SolrParams parseParamsAndFillStreams( > final HttpServletRequest, ArrayList<ContentStream>) > ...instead of just... > public SolrQueryRequest(final HttpServletRequest) > ...with the param parsing loops in > SolrRequestParsers.buildRequestFrom > in static utility methods (in case a RequestParser doesn't want to > support STREAM_URL or STREAM_BODY) > I left it as is for now. Since it is a private interface, we should feel free to clean it up / change it as we see fit. But now I go to sleep!
          Hide
          Hoss Man added a comment -

          Kick Ass!

          Ryan, Everything looks great ... I've commited a minor variation on your latest patch.

          Modifications I made to your patch before commiting...

          • Removed an empty handleRequest method from SolrDispatchFilter
          • cleaned up some funky chars in web.xml
          • moved LICENSE info to top of all new files (before package declaration)
          • refactored some duplicate code from SolrUpdateServlet and TestHarness
            into XmlUpdateRequestHandler.doLegacyUpdate
          • did not commit test.html
          • I also set svn:keywords on all of your new RequestHandlers, as well
            as the DisMaxRequestHandler (aparently it had never been set there,
            so the keywords weren't getting replaced properly)

          (The charset detection from content-type seems good enough for now ... there's some more crazy stuff that can be done diferentiating between text/xml and application/xml and inspecting the InputStream itself for specific indicator bytes – but that can be dealt with in SOLR-96).

          Thanks for putting in all of this work, and sticking with it through my (most likely extremely paranoid) critical comments ... what you've done is really going to help catapult Solr's flexibility.

          Show
          Hoss Man added a comment - Kick Ass! Ryan, Everything looks great ... I've commited a minor variation on your latest patch. Modifications I made to your patch before commiting... Removed an empty handleRequest method from SolrDispatchFilter cleaned up some funky chars in web.xml moved LICENSE info to top of all new files (before package declaration) refactored some duplicate code from SolrUpdateServlet and TestHarness into XmlUpdateRequestHandler.doLegacyUpdate did not commit test.html I also set svn:keywords on all of your new RequestHandlers, as well as the DisMaxRequestHandler (aparently it had never been set there, so the keywords weren't getting replaced properly) (The charset detection from content-type seems good enough for now ... there's some more crazy stuff that can be done diferentiating between text/xml and application/xml and inspecting the InputStream itself for specific indicator bytes – but that can be dealt with in SOLR-96 ). Thanks for putting in all of this work, and sticking with it through my (most likely extremely paranoid) critical comments ... what you've done is really going to help catapult Solr's flexibility.
          Hide
          Ryan McKinley added a comment -

          Thanks Hoss!

          Show
          Ryan McKinley added a comment - Thanks Hoss!

            People

            • Assignee:
              Hoss Man
              Reporter:
              Ryan McKinley
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development