Solr / SOLR-6617

/update/json/docs handler needs to do a better job with tweet like JSON structures

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels: None

      Description

      SOLR-6304 allows me to send in an arbitrary JSON document and have Solr do something reasonable with it. I tried this with a simple tweet and got a weird error:

      curl "http://localhost:8983/solr/tutorial/update/json/docs" -H 'Content-type:application/json' -d @sample_tweet.json
      
      {"responseHeader":{"status":400,"QTime":11},"error":{"msg":"Document contains multiple values for uniqueKey field: id=[14065694, 136447843652214784]","code":400}}
      

      Here's the tweet I'm trying to index:

      {
              "user": {
                  "name": "John Doe",
                  "screen_name": "example",
                  "lang": "en",
                  "time_zone": "London",
                  "listed_count": 221,
                  "id": 14065694,
                  "geo_enabled": true
              },
              "id": "136447843652214784",
              "text": "Morning San Francisco - 36 hours and counting.. #datasift",
              "created_at": "Tue, 15 Nov 2011 14:17:55 +0000"
      }
      

      The error occurs because the nested user object within the tweet also has an "id" field. I then tried to map /user/id to user_id_s via:

      curl "http://localhost:8983/solr/tutorial/update/json/docs?f=user_id_s:/user/id" -H 'Content-type:application/json' -d @sample_tweet.json
      {"responseHeader":{"status":400,"QTime":0},"error":{"msg":"Document is missing mandatory uniqueKey field: id","code":400}}
      

      So then I added the mapping for id explicitly and it worked:

      curl "http://localhost:8983/solr/tutorial/update/json/docs?f=id:/id&f=user_id_s:/user/id" -H 'Content-type:application/json' -d @sample_tweet.json
      {"responseHeader":{"status":0,"QTime":25}}
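
The working request above repeats the f parameter once per mapping, each in the form field:/json/path. As a rough illustration only, the same URL can be assembled in Python; the helper name build_update_url is hypothetical and nothing is sent to Solr here.

```python
from urllib.parse import urlencode


def build_update_url(base_url, field_mappings):
    """Build an /update/json/docs URL with one f=field:/path pair per mapping.

    build_update_url is a hypothetical helper for illustration; it only
    assembles the query string and does not contact Solr.
    """
    pairs = [("f", f"{field}:{path}") for field, path in field_mappings]
    # Keep ':' and '/' unescaped so the result matches the curl example.
    query = urlencode(pairs, safe=":/")
    return f"{base_url}?{query}"


url = build_update_url(
    "http://localhost:8983/solr/tutorial/update/json/docs",
    [("id", "/id"), ("user_id_s", "/user/id")],
)
print(url)
```

The printed URL can be passed to curl exactly as in the command above.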

      Working through this wasn't terrible, but our goal with features like this is to have Solr make good decisions when possible, to ease the new user's burden of getting to know Solr.

      I'm wondering whether the reasonable thing to do would be to map the user fields with a user_ prefix, i.e. /user/id becomes user_id automatically.

      Lastly, I wanted to use field guessing with this so that my JSON document gets indexed in a reasonable way, but the only data that got indexed is:

      {
              "user_id_s": "14065694",
              "id": "136447843652214784",
              "_version_": 1481614081193410600
      }
      

      So I explicitly defined the /update/json/docs request handler in my solrconfig.xml as:

        <requestHandler name="/update/json/docs" class="solr.UpdateRequestHandler">
          <lst name="defaults">
            <str name="update.chain">add-unknown-fields-to-the-schema</str>
            <str name="stream.contentType">application/json</str>
          </lst>
        </requestHandler>
      

      Same result: no field guessing! (This is using the schemaless example config.)

        Activity

        Noble Paul added a comment (edited)

        The behavior is expected (but not desirable).
        Let me explain why this happens. By default (in the absence of any 'f' parameter), the value is f=/** , so all values are mapped to their corresponding names in the input JSON.

        So the following command should have worked:

        curl "http://localhost:8983/solr/tutorial/update/json/docs?f=user_id_s:/user/id&f=/**" -H 'Content-type:application/json' -d @sample_tweet.json
        

        One solution I can think of is to make f=/** use the fully expanded name as the key, with a reasonable delimiter, so that the field names become user.screen_name, user.lang, etc.

        If necessary, we can provide a flag to switch back to simple names.
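
To make the collision concrete, here is a minimal sketch of leaf-name mapping as described in this comment, written in Python. It is not Solr's parser; flatten_leaf_names is a hypothetical function and the tweet is a trimmed copy of the sample from the description.

```python
from collections import defaultdict


def flatten_leaf_names(node, out=None):
    """Collect every scalar under its leaf key, ignoring parent keys.

    Illustration only (hypothetical helper, not Solr code): nested objects
    are walked recursively and values with the same leaf key pile up.
    """
    if out is None:
        out = defaultdict(list)
    for key, value in node.items():
        if isinstance(value, dict):
            flatten_leaf_names(value, out)
        else:
            out[key].append(value)
    return out


tweet = {
    "user": {"name": "John Doe", "id": 14065694},
    "id": "136447843652214784",
    "text": "Morning San Francisco - 36 hours and counting.. #datasift",
}

fields = flatten_leaf_names(tweet)
# Both /user/id and /id land on the same field name "id".
print(fields["id"])  # [14065694, '136447843652214784']
```

Under this naming, the uniqueKey field receives two values, which corresponds to the "multiple values for uniqueKey field" error in the description.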

        Noble Paul added a comment (edited)

        The patch adds new functionality to the parser: f=$FQN:/** . $FQN builds each field name as a fully qualified name, and /update/json/docs will now default to f=$FQN:/** instead of f=/** .

        $FQN joins all the parent names to the leaf name, using "." as the delimiter.
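
The $FQN naming described above can be approximated with the following Python sketch. It is an illustration under stated assumptions, not the patch itself: flatten_fqn is a hypothetical function, and it handles only nested objects, not arrays.

```python
def flatten_fqn(node, prefix="", delimiter="."):
    """Map each scalar to a name built from all parent keys plus the leaf key.

    Hypothetical helper for illustration; arrays are not handled.
    """
    out = {}
    for key, value in node.items():
        name = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten_fqn(value, name, delimiter))
        else:
            out[name] = value
    return out


tweet = {
    "user": {"screen_name": "example", "lang": "en", "id": 14065694},
    "id": "136447843652214784",
}

fqn_fields = flatten_fqn(tweet)
# The nested id becomes "user.id" and no longer collides with the top-level "id".
print(sorted(fqn_fields))  # ['id', 'user.id', 'user.lang', 'user.screen_name']
```

Because every nested key carries its parents in the name, the top-level id stays single-valued.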

        Shalin Shekhar Mangar added a comment -

        I can see why you did not choose a simple parameter to switch between FQN and NAME. This makes mapping even more powerful because we can now choose how to map certain nested sections individually. We'll need to document that FQN is now the default, because it breaks backward compatibility with the previous release.

        Timothy Potter added a comment -

        Patch looks good, Noble Paul. I applied this to my test scenario:

        curl "http://localhost:8983/solr/tutorial/update/json/docs" -H 'Content-type:application/json' -d @sample_tweet.json
        

        Resulted in:

        {
                "user.name": [
                  "Stewart Townsend"
                ],
                "user.url": [
                  "http://www.stewarttownsend.com"
                ],
                "user.description": [
                  "Developer Relations at Datasift (www.datasift.com)  - Car racing petrol head, all things social lover, co-founder of www.flowerytweetup.com"
                ],
                "user.location": [
                  "iPhone: 53.852402,-2.220047"
                ],
                "user.statuses_count": [
                  28247
                ],
                "user.followers_count": [
                  3094
                ],
                "user.friends_count": [
                  510
                ],
                "user.screen_name": [
                  "stewarttownsend"
                ],
                "user.lang": [
                  "en"
                ],
                "user.time_zone": [
                  "London"
                ],
                "user.listed_count": [
                  221
                ],
                "user.id": [
                  14065694
                ],
                "user.id_str": [
                  14065694
                ],
                "user.geo_enabled": [
                  true
                ],
                "id": "136447843652214784",
                "text": [
                  "Morning San Francisco - 36 hours and counting.. #datasift"
                ],
                "source": [
                  "<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>"
                ],
                "created_at": [
                  "Tue, 15 Nov 2011 14:17:55 +0000"
                ],
                "_version_": 1481875073806631000
        }
        

        Which I'd say is very reasonable behavior on Solr's part. +1 for commit

        ASF subversion and git services added a comment -

        Commit 1631649 from Noble Paul in branch 'dev/trunk'
        [ https://svn.apache.org/r1631649 ]

        SOLR-6617

        ASF subversion and git services added a comment -

        Commit 1631656 from Noble Paul in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1631656 ]

        SOLR-6617

        Noble Paul added a comment -

        Thanks, Timothy Potter.

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee: Noble Paul
          • Reporter: Timothy Potter
          • Votes: 0
          • Watchers: 5
