Solr
  1. Solr
  2. SOLR-2857

Multi-content-type /update handler

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      Something I've been thinking about lately... it'd be great to get rid of all the specific update handlers like /update/csv, /update/extract, and /update/json and collapse them all into a single /update that underneath uses the content-type(s) to hand off to specific content handlers. This would make it much easier to toss content at Solr and provide a single entry point for updates.

      1. SOLR-2857-update-content-type.patch
        181 kB
        Ryan McKinley
      2. SOLR-2857-update-content-type.patch
        117 kB
        Ryan McKinley
      3. SOLR-2857-update-content-type.patch
        191 kB
        Ryan McKinley
      4. SOLR-2857-content-type-refactor.patch
        205 kB
        Ryan McKinley
      5. SOLR-2857-content-type-refactor.patch
        191 kB
        Ryan McKinley
      6. SOLR-2857-content-type-refactor.patch
        208 kB
        Ryan McKinley
      7. SOLR-2857_oldUrls.patch
        2 kB
        Yonik Seeley

        Issue Links

          Activity

          Hide
          Erik Hatcher added a comment -

          no such actual implementation in mind here, just an idea at this point.

          Show
          Erik Hatcher added a comment - no such actual implementation in mind here, just an idea at this point.
          Hide
          Chris Male added a comment -

          Fantastic idea +1

          Show
          Chris Male added a comment - Fantastic idea +1
          Hide
          Jan Høydahl added a comment -

          Yep, I've been thinking about the same. Some kind of DelegatingUpdateRequestHandler which could be the default mapped to /update. It could default to xml with a log warning, if no mimetype detected, for back compat.

          Questions:

          • How to distinguish between XML and XSLT updates? Both are 'text/xml'?
          • When would the ExtractingRequestHandler be selected? On all unknown types? What if you want to index an XML as verbatim text, but it has the XML content type?
          • There should perhaps be a parameter to let you override the auto-detection?
          Show
          Jan Høydahl added a comment - Yep, I've been thinking about the same. Some kind of DelegatingUpdateRequestHandler which could be the default mapped to /update. It could default to xml with a log warning, if no mimetype detected, for back compat. Questions: How to distinguish between XML and XSLT updates? Both are 'text/xml'? When would the ExtractingRequestHandler be selected? On all unknown types? What if you want to index an XML as verbatim text, but it has the XML content type? There should perhaps be a parameter to let you override the auto-detection?
          Hide
          Chris Male added a comment -

          Why not let the UpdateRequestHandlers be configured to the mime types that they should be used for? So in the configuration of the DelegatingUpdateRequestHandler you'd provide a mapping between mime type and UpdateRequestHandler. If you provided no mapping then we'd just use whichever supported the mime type of the request.

          Show
          Chris Male added a comment - Why not let the UpdateRequestHandlers be configured to the mime types that they should be used for? So in the configuration of the DelegatingUpdateRequestHandler you'd provide a mapping between mime type and UpdateRequestHandler. If you provided no mapping then we'd just use whichever supported the mime type of the request.
          Hide
          Erik Hatcher added a comment -

          Rather than building this around delegating to other update handlers, I'd rather see it built at least using pluggable ContentStreamLoader's. The granularity is that a ContentStreamLoader is per-stream, whereas the current ContentStreamHandlerBase extending classes handle the full request and all content streams. I envision a general purpose /update being able to hand each stream off to different loaders, rather than simply delegate the whole request to a handler.

          Then we tie loaders to content-type's as a solrconfig plugin kinda thing, with of course our built-in ones auto-registered. Perhaps we even add an "ContentType[] accepts(SolrQueryRequest req)" (where ContentType is just a String maybe?, req param desired?) to ContentStreamLoader so that content loaders can be auto-registered to the types they accept? Maybe then we need ContentStreamLoaderFactory's to get these things constructed from solrconfig with params? Looks like all ContentStreamLoader's also use UpdateRequestProcessor so that looks like a candidate to pull up to the base class, eh?

          ... just thinking out loud.

          Show
          Erik Hatcher added a comment - Rather than building this around delegating to other update handlers, I'd rather see it built at least using pluggable ContentStreamLoader's. The granularity is that a ContentStreamLoader is per-stream, whereas the current ContentStreamHandlerBase extending classes handle the full request and all content streams. I envision a general purpose /update being able to hand each stream off to different loaders, rather than simply delegate the whole request to a handler. Then we tie loaders to content-type's as a solrconfig plugin kinda thing, with of course our built-in ones auto-registered. Perhaps we even add an "ContentType[] accepts(SolrQueryRequest req)" (where ContentType is just a String maybe?, req param desired?) to ContentStreamLoader so that content loaders can be auto-registered to the types they accept? Maybe then we need ContentStreamLoaderFactory's to get these things constructed from solrconfig with params? Looks like all ContentStreamLoader's also use UpdateRequestProcessor so that looks like a candidate to pull up to the base class, eh? ... just thinking out loud.
          Hide
          Chris Male added a comment -

          I really like the idea of using ContentStreamLoaders, they seem quite lite-weight. Having a simple Factory seems to serve any configuration and instance creation issues.

          +1

          Show
          Chris Male added a comment - I really like the idea of using ContentStreamLoaders, they seem quite lite-weight. Having a simple Factory seems to serve any configuration and instance creation issues. +1
          Hide
          Erik Hatcher added a comment -

          A note of caution here... content-type cannot be determined without first getting the stream (or reader) with some recent changes on trunk (to avoid unused content streams hitting URLs unnecessarily). So the "director" aspect of a new catch-all /update will need to get the stream to determine the type, then hand that stream off to a loader. This seems like it'd entail some other refactorings to accommodate?

          Show
          Erik Hatcher added a comment - A note of caution here... content-type cannot be determined without first getting the stream (or reader) with some recent changes on trunk (to avoid unused content streams hitting URLs unnecessarily). So the "director" aspect of a new catch-all /update will need to get the stream to determine the type, then hand that stream off to a loader. This seems like it'd entail some other refactorings to accommodate?
          Hide
          Erik Hatcher added a comment - - edited

          When would the ExtractingRequestHandler be selected? On all unknown types?

          Tika can report the types it has registered to handle, IIRC. Can we just leverage that list?

          Show
          Erik Hatcher added a comment - - edited When would the ExtractingRequestHandler be selected? On all unknown types? Tika can report the types it has registered to handle, IIRC. Can we just leverage that list?
          Hide
          Erik Hatcher added a comment -

          There should perhaps be a parameter to let you override the auto-detection?

          But of course! And the detector itself can be made easily pluggable.

          Show
          Erik Hatcher added a comment - There should perhaps be a parameter to let you override the auto-detection? But of course! And the detector itself can be made easily pluggable.
          Hide
          Hoss Man added a comment -

          Regardless of the implementation...

          • How to distinguish between XML and XSLT updates? Both are 'text/xml'?
          • When would the ExtractingRequestHandler be selected? On all unknown types? What if you want to index an XML as verbatim text, but it has the XML content type?
          • There should perhaps be a parameter to let you override the auto-detection?

          ..these seem like the kinds of situations where you would make an explicit request to an explicitly named handler, instead of relying on content-type detection (ie: if we remap "/update" to some new snazzier handler, and someone says "but i want to do an xslt based update, not an xml update" then they can POST to "/update/xslt" instead. etc....)

          Show
          Hoss Man added a comment - Regardless of the implementation... How to distinguish between XML and XSLT updates? Both are 'text/xml'? When would the ExtractingRequestHandler be selected? On all unknown types? What if you want to index an XML as verbatim text, but it has the XML content type? There should perhaps be a parameter to let you override the auto-detection? ..these seem like the kinds of situations where you would make an explicit request to an explicitly named handler, instead of relying on content-type detection (ie: if we remap "/update" to some new snazzier handler, and someone says "but i want to do an xslt based update, not an xml update" then they can POST to "/update/xslt" instead. etc....)
          Hide
          Jan Høydahl added a comment -

          For the text/xml case, if "tr" is specified, we use XSLT loader, else we assume the normal XML handler.

          Show
          Jan Høydahl added a comment - For the text/xml case, if "tr" is specified, we use XSLT loader, else we assume the normal XML handler.
          Hide
          Ryan McKinley added a comment -

          Here is a patch that moves XML,CSV,JSON,javabin and XML+XSLT into a single handler that picks the right Loader based on the content type.

          The bulk of the patch is cleaning up the test config files and moving private inner classes to their own file.

          The single endpoint simplifies the oddities of multiple request formats in SolrServers

          This handles XSLT by looking for the 'tr' param when the content type is XML

          Show
          Ryan McKinley added a comment - Here is a patch that moves XML,CSV,JSON,javabin and XML+XSLT into a single handler that picks the right Loader based on the content type. The bulk of the patch is cleaning up the test config files and moving private inner classes to their own file. The single endpoint simplifies the oddities of multiple request formats in SolrServers This handles XSLT by looking for the 'tr' param when the content type is XML
          Hide
          Ryan McKinley added a comment -

          re ExtractingRequestHandler – I think this should stay its own Handler.

          There should perhaps be a parameter to let you override the auto-detection?

          isn't that the content type?

          This patch uses a lenient mime type matching approach sketched in SOLR-3387

          Show
          Ryan McKinley added a comment - re ExtractingRequestHandler – I think this should stay its own Handler. There should perhaps be a parameter to let you override the auto-detection? isn't that the content type? This patch uses a lenient mime type matching approach sketched in SOLR-3387
          Hide
          Ryan McKinley added a comment -

          What are thoughts on deprecation? I think we should go ahead and drop the specific flavors.

          Now that I think about the override parameter (and deprecation) I think it may make sense to have something like:

            <requestHandler name="/update/json" class="solr.UpdateRequestHandler">
              <str name="force.contentType">application/json</str>
            </requestHandler>
          

          This would make it behave like the 3.x JsonUpdateRequestHandler

          Show
          Ryan McKinley added a comment - What are thoughts on deprecation? I think we should go ahead and drop the specific flavors. Now that I think about the override parameter (and deprecation) I think it may make sense to have something like: <requestHandler name= "/update/json" class= "solr.UpdateRequestHandler" > <str name= "force.contentType" > application/json </str> </requestHandler> This would make it behave like the 3.x JsonUpdateRequestHandler
          Hide
          Ryan McKinley added a comment -

          cleans things up a bit more, but the UpdateRequestHandler is still hardcoded for CSV,XML,JSON, and javabin

          Next I will move more things around and clean up the API so it promotes registering Map<String,ContentStreamLoader>

          Show
          Ryan McKinley added a comment - cleans things up a bit more, but the UpdateRequestHandler is still hardcoded for CSV,XML,JSON, and javabin Next I will move more things around and clean up the API so it promotes registering Map<String,ContentStreamLoader>
          Hide
          Ryan McKinley added a comment -

          last patch was missing files

          Show
          Ryan McKinley added a comment - last patch was missing files
          Hide
          Ryan McKinley added a comment -

          Here is another version that refactors the base classes to better support reuse and a registry

              loaders.put("application/xml", new XMLLoader().init(p) );
              loaders.put("application/javabin", new JavabinLoader().init(p) );
              loaders.put("application/json", new JsonLoader().init(p) );
              loaders.put("text/csv", new CSVLoader().init(p) );
          

          Are people worried about back-compatibility of the ContentStreamLoader classes? Ignoring 3.x API will make the final solution much cleaner

          Show
          Ryan McKinley added a comment - Here is another version that refactors the base classes to better support reuse and a registry loaders.put( "application/xml" , new XMLLoader().init(p) ); loaders.put( "application/javabin" , new JavabinLoader().init(p) ); loaders.put( "application/json" , new JsonLoader().init(p) ); loaders.put( "text/csv" , new CSVLoader().init(p) ); Are people worried about back-compatibility of the ContentStreamLoader classes? Ignoring 3.x API will make the final solution much cleaner
          Hide
          Ryan McKinley added a comment -

          This patch is updated to trunk

          I would like to commit this in the next couple days, and (possibly) look at configuration in a later issue.

          Show
          Ryan McKinley added a comment - This patch is updated to trunk I would like to commit this in the next couple days, and (possibly) look at configuration in a later issue.
          Hide
          Ryan McKinley added a comment -

          updated with a deprecated replacement for XmlUpdateRequestHandler

          Show
          Ryan McKinley added a comment - updated with a deprecated replacement for XmlUpdateRequestHandler
          Hide
          Ryan McKinley added a comment -

          I added this in #1335768, and have put rough docs on the wiki – I'm sure there are more links/references that should be updated

          Show
          Ryan McKinley added a comment - I added this in #1335768, and have put rough docs on the wiki – I'm sure there are more links/references that should be updated
          Hide
          Yonik Seeley added a comment -

          I've noticed the lack of the /update/json URL has bit a few people (including me in the past).
          Should we add something for easier back compat by default?

          Show
          Yonik Seeley added a comment - I've noticed the lack of the /update/json URL has bit a few people (including me in the past). Should we add something for easier back compat by default?
          Hide
          Yonik Seeley added a comment -

          Here's a patch that restores the /update/json and /update/csv URLs so other clients/applications will have an easier time moving to Solr 4.

          Show
          Yonik Seeley added a comment - Here's a patch that restores the /update/json and /update/csv URLs so other clients/applications will have an easier time moving to Solr 4.
          Hide
          Yonik Seeley added a comment -
          Show
          Yonik Seeley added a comment - Committed to trunk & 4x: http://svn.apache.org/viewvc?view=revision&revision=1373904

            People

            • Assignee:
              Ryan McKinley
              Reporter:
              Erik Hatcher
            • Votes:
              2 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development