SOLR-6633

Let /update/json/docs store the source JSON as well

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:

      Description

      It is a common requirement to store the entire JSON as a field in Solr.

      We can have an extra param, srcField=field_name, to specify the field name.

      /update/json/docs is only useful when all the JSON fields are predefined or when running in schemaless mode.

      In other modes, the better option would be to store the content in a store-only field and index the data in another field.

      The relevant section in solrconfig.xml:

       <initParams path="/update/json/docs">
          <lst name="defaults">
            <!--this ensures that the entire json doc will be stored verbatim into one field-->
            <str name="srcField">_src</str>
            <!--This means that the uniqueKeyField will be extracted from the fields and
             all fields go into the 'df' field. In this config df is already configured to be 'text'
              -->
            <str name="mapUniqueKeyOnly">true</str>
            <str name="df">text</str>
          </lst>
      
        </initParams>
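
The same three settings can presumably also be passed per request rather than as handler defaults. The following is a minimal sketch, assuming the handler accepts srcField, mapUniqueKeyOnly and df as request parameters; the host, port and collection name are placeholders that do not come from this issue. It only builds the request and does not send it.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

# Placeholder base URL: host, port and collection name are assumptions.
SOLR_BASE = "http://localhost:8983/solr/mycollection"


def build_json_docs_request(doc, src_field="_src", df="text"):
    """Build (but do not send) a POST to /update/json/docs.

    The parameter names mirror the initParams defaults shown above.
    """
    params = urlencode({
        "srcField": src_field,        # field that receives the verbatim JSON
        "mapUniqueKeyOnly": "true",   # extract only the uniqueKey as a field
        "df": df,                     # field that receives all values
    })
    return Request(
        url=f"{SOLR_BASE}/update/json/docs?{params}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_json_docs_request({"id": "doc-1", "title": "hello"})
print(req.full_url)
```

Sending the request (for example with urllib.request.urlopen) is left out on purpose, since it needs a running Solr instance.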
      
      1. SOLR-6633.patch
        25 kB
        Noble Paul
      2. SOLR-6633.patch
        19 kB
        Noble Paul

          Activity

          Noble Paul added a comment -

          added support in the default example schema

          ASF subversion and git services added a comment -

          Commit 1633390 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1633390 ]

          SOLR-6633

          ASF subversion and git services added a comment -

          Commit 1633391 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1633391 ]

          SOLR-6633

          ASF subversion and git services added a comment -

          Commit 1633392 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1633392 ]

          SOLR-6633

          ASF subversion and git services added a comment -

          Commit 1633394 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1633394 ]

          SOLR-6633 changed package

          ASF subversion and git services added a comment -

          Commit 1633398 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1633398 ]

          SOLR-6633 changed package

          Steve Rowe added a comment -

          It should be documented that atomic updates over documents indexed using this feature will cause the source field to become out of sync with the rest of the doc.
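
The drift described here can be illustrated with a toy model. This is only a sketch using plain Python dicts, not Solr code: `index_with_source` and `atomic_set` are hypothetical helpers, and the field name `_src` is simply the one configured in the description.

```python
import json


def index_with_source(raw_json, src_field="_src"):
    """Toy model of indexing: keep the parsed fields plus the verbatim JSON."""
    doc = dict(json.loads(raw_json))
    doc[src_field] = raw_json
    return doc


def atomic_set(doc, field, value):
    """Toy model of an atomic 'set' update: only the named field changes."""
    updated = dict(doc)
    updated[field] = value
    return updated


raw = '{"id": "doc-1", "price": 10}'
doc = index_with_source(raw)
doc = atomic_set(doc, "price", 12)

# The stored source still carries the old value, so the two views disagree.
source_price = json.loads(doc["_src"])["price"]
print(source_price, doc["price"])
```

In this model nothing rewrites the source field on update, which is the situation the comment asks to have documented.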

          Noble Paul added a comment -

          sure Use account "steve_rowe" instead

          Alexandre Rafalovitch added a comment -

          This is truly just storing the original document, right? And only returning the whole thing as well?

          Because, in Elasticsearch, the _src field is actually used as the source for several operations. For example, it is used as the source for dynamic updates since, by default, fields are not stored individually. And, I think, the _src field also gets rewritten/recreated on update, again because it is actually used as the source of truth.

          The second issue I wanted to raise is how this will interplay with UpdateRequestProcessors (ES does not really have those). I guess URPs will apply after the content of the field is captured, so the actual fields may look quite different from what's in _src.

          Finally, I am not clear on what this really means: all fields go into the 'df'. Do we mean there is a magic copyField or something?

          I think we need a more specific use case here than just an implementation/configuration, especially since a similar-but-different implementation in Elasticsearch does not fully match Solr's setup.

          Noble Paul added a comment -

          Because, in Elasticsearch, the _src field is actually used as source for several operations...

          This feature is not the same. It is a feature of the /update/json/docs request handler. We can't do it like ES because the same document can be updated using other commands as well.

          Finally, I am not clear on what this really means: all fields go into the 'df'.

          Solr is "strongly typed", so to say, which means we can't just put the content somewhere for searching. Because all components use "df" as the default search field, this component chooses to piggyback on the same field. The user can configure any other field as 'df' here. The next problem we need to address is that of the uniqueKey. The component must extract the uniqueKey field from the JSON itself, or it should create one. That is the purpose of the "mapUniqueKeyOnly" param.

          We are not trying to be ES here. The use case is this:
          A user has a bunch of JSON documents and needs to index the data without configuring anything in the schema. The search result has to return some stored fields. Because Solr is "strongly typed", we can't store them in individual fields. So we must store the whole thing in some field, and it made sense to store it as JSON itself.
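
As a rough mental model of the behaviour described in this comment, the following sketch mimics it with plain Python. It is not Solr's implementation: `map_unique_key_only` and `collect_values` are hypothetical helpers, the uniqueKey is assumed to be `id`, and generating a UUID when the key is missing is just one way to "create one".

```python
import json
import uuid


def collect_values(node):
    """Yield every leaf value (not the keys) of a parsed JSON structure."""
    if isinstance(node, dict):
        for child in node.values():
            yield from collect_values(child)
    elif isinstance(node, list):
        for child in node:
            yield from collect_values(child)
    elif node is not None:
        yield node


def map_unique_key_only(raw_json, unique_key="id", df="text", src_field="_src"):
    """Toy model: one key field, one catch-all search field, one source field."""
    parsed = json.loads(raw_json)  # assumes the top level is a JSON object
    if unique_key in parsed:
        key = parsed[unique_key]
    else:
        key = str(uuid.uuid4())    # assumption: create a key when none exists
    return {
        unique_key: key,
        df: [str(v) for v in collect_values(parsed)],
        src_field: raw_json,       # verbatim JSON, returned at search time
    }


raw = '{"id": "a1", "name": "x", "tags": ["p", "q"], "n": {"k": 3}}'
doc = map_unique_key_only(raw)
print(doc["id"], doc["text"])
```

No per-field schema entries are needed in this model: searching goes through the catch-all field and display goes through the stored source.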

          Alexandre Rafalovitch added a comment -

          Is this somehow superseding the behavior in SOLR-6304 and http://lucidworks.com/blog/indexing-custom-json-data/ ? I mean, the field extraction code can already do ID mapping by specifying an appropriate path, right? And for 'df', would you need to specify it as a param (like in example 4 in the article)?

          And I am still trying to wrap my head around the use case. I don't expect users not to want to configure anything. At least the dates would need to be parsed/detected. And, usually, after the initial dump, users go back and start adding specific definitions field by field, type by type (and reindex). Is that part of this scenario as well?

          P.S. I know Solr cannot clone Elasticsearch. I was just making sure that we are not somehow missing Solr specifics by assuming Elasticsearch-like behavior. Perhaps having the field also called _all was what confused me.

          Yonik Seeley added a comment -

          Finally, I am not clear on what this really means: all fields go into the 'df' . Do we mean, there is a magic copyField or something?

          I'm not clear on this either... my best guess is that it is like a copyField. And all values (but not keys) are copied into this field?

          I'm not quite clear on "mapUniqueKeyOnly" either... (what the "Only" refers to). I guess if it's false, then all the fields in JSON Object are mapped to Solr fields based on the key in the JSON?

          Oh, and when we have magic field names, the convention in Solr has been an underscore on both sides (or not at all).
          So can we use _src_ or src instead of _src, please?

          Noble Paul added a comment -

          I hope you are all clear about the functionality/use case. If the API/configuration needs to change, please suggest.

          I'm not clear on this either... my best guess is that it is like a copyField. And all values (but not keys) are copied into this field?

          It is like a copyField but without a source field.

          I'm not quite clear on "mapUniqueKeyOnly" either... (what the "Only" refers to).

          All the values are extracted and dumped into a field, but it ensures that a uniqueKey is created. They don't need to use this attribute at all. f=text:/**&f=uniqueKeyField:/unique-field-name should do the trick. In that case, if the JSON does not have a value for the uniqueKey, it fails.

          Oh, and when we have magic field names, the convention in Solr has been an underscore on both sides

          _src is not a magic field. It is explicitly added to the schema and it is explicitly specified here as well.
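
The explicit alternative mentioned in this comment, two `f` mappings instead of mapUniqueKeyOnly, can be sketched as a query string. This is an illustration only: `explicit_mapping_query` is a hypothetical helper, and the uniqueKey field `id` and JSON path `/id` stand in for the placeholders uniqueKeyField and /unique-field-name.

```python
from urllib.parse import urlencode, parse_qs


def explicit_mapping_query(unique_key_field, json_path, catch_all_field="text"):
    """Build the query string for the two explicit 'f' mappings."""
    pairs = [
        ("f", f"{catch_all_field}:/**"),           # every value goes to the catch-all field
        ("f", f"{unique_key_field}:{json_path}"),  # the key comes from one fixed path
    ]
    # A list of pairs keeps both 'f' parameters; a dict would drop one of them.
    return urlencode(pairs)


qs = explicit_mapping_query("id", "/id")
print(qs)
```

Unlike mapUniqueKeyOnly, this form names the uniqueKey field in the request, so it has to be updated by hand if the schema's uniqueKey changes.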

          Alexandre Rafalovitch added a comment -

          They don't need to use this attribute at all. f=text:/**&f=uniqueKeyField:/unique-field-name should do the trick.

          So, the advantage of not using the parameter syntax above is that it will automatically figure out what the uniqueKeyField is from the schema? Similar to the UUID URP?

          But what happens if somebody specifies both? Do we get double content in text? Can we also use the params to populate other fields anyway? (I guess yes.)

          And what happens if the original JSON is super fat? Can we specify exclusion rules? I bet this will be asked too. We don't have to implement it, but will it fit into the current model?

          I like the feature; I am just trying to make sure it does not cause confusion through a multiplication of options. In my own mind, when I was thinking about this use case (store original JSON), I imagined a URP that just pulls the original JSON from the request. Again, similar to the UUID URP, one that can be added into the chain.

          Noble Paul added a comment -

          So, the advantage of not using the parameter syntax above is that it will automatically figure out what the uniqueKeyField is from the schema? Similar to the UUID URP?

          Yes and no. I want this to work seamlessly even if the uniqueKey is changed, without mucking around with solrconfig.xml. I also want it to just work when there is no uniqueKey present in the JSON. Basically, out of the box, it should just work for any JSON. I hate to tell newbies that they need to edit solrconfig.xml just to get anything working.

          I would recommend this only if you are a newbie. I should document, in place, how to do the explicit mappings with wildcards.

          And what happens if original JSON is super fat, can we specify exclusion rules.

          No, there are only inclusion rules, but the syntax is powerful enough to achieve that.

          Again, similar to UUID URP one can add into the chain.

          A URP just fails the simplicity test. It is extremely hard for even experts to get their heads around.
          I HATE the fact that we recommend hard-to-do configuration to everyone. If we want to get first-time users on board, we will need to stop all that. First-time users just need stuff to work.

          Alexandre Rafalovitch added a comment -

          We agree completely on the newbie message. I am just trying to make sure it is clear how it fits into the rest of Solr without creating a jarring jump between step 1 and step 2.

          So, to be clear. This covers step 1. Then, for step 2 (e.g. and now to handle dates....) this connects smoothly to what? To a f=dateField:/xyz and schemaless mode? To an explicit creation of a date field/type in an Admin UI?

          Noble Paul added a comment -

          But what happens if somebody specifies both. Do we get double content in text?

          No. If mapUniqueKeyOnly is set, it overrides other field definitions.

          Noble Paul added a comment -

          for step 2 (e.g. and now to handle dates....) this connects smoothly to what? To a f=dateField:/xyz and schemaless mode? To an explicit creation of a date field/type in an Admin UI?

          This is not within the scope of this feature. Actually the objective of this was to introduce the srcField only. Then I realized that it needed to do more to achieve the objective

          Yonik Seeley added a comment -

          _src is not a magic field . It is explicitly added to the schema and it is explicitly specified here as well

          My point was that we currently have field/param names with both leading and trailing underscores, and field names without. We should probably stick with that unless we can come up with a common meaning for leading-underscore-only.

          And it's still partially magic (even though it's configurable)... it's another field that the user did not explicitly ask for (and hence I assume the underscore was a way to avoid clashing with actual user fields).

          Alexandre Rafalovitch added a comment -

          +1 on "it's still partially magic"

          Noble Paul added a comment -

          OK. let's rename the field to _src_

          ASF subversion and git services added a comment - - edited

          Commit 1644100 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1644100 ]

          SOLR-6633 field name changed from _src to _src_ by popular demand

          ASF subversion and git services added a comment - - edited

          Commit 1644103 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1644103 ]

          SOLR-6633 field name changed from _src to _src_ by popular demand

          ASF subversion and git services added a comment -

          Commit 1644135 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1644135 ]

          SOLR-6633 changing the field name from _src to _src_

          ASF subversion and git services added a comment -

          Commit 1644136 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1644136 ]

          SOLR-6633 changing the field name from _src to _src_

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee: Noble Paul
            • Reporter: Noble Paul
            • Votes: 0
            • Watchers: 5
