Solr
  1. Solr
  2. SOLR-3173

Database semantics - insert and update

    Details

      Description

      In order increase the ability of Solr to be used as a NoSql database (lots of concurrent inserts, updates, deletes and queries in the entire lifetime of the index) instead of just a search index (first: everything indexed (in one thread), after: only queries), I would like Solr to support the following features inspired by RDBMSs and other NoSql databases.

      • Given a solr-core with a schema containing a uniqueKey-field "uniqueField" and a document Dold, when trying to INSERT a new document Dnew where Dold.uniqueField is equal to Dnew.uniqueField, then I want a DocumentAlredyExists error. If no such document Dold exists I want Dnew indexed into the solr-core.
      • Given a solr-core with a schema containing a uniqueKey-field "uniqueField" and a document Dold, when trying to UPDATE a document Dnew where Dold.uniqueField is equal to Dnew.uniqueField I want Dold deleted from and Dnew added to the index (just as it is today).If no such document Dold exists I want nothing to happen (Dnew is not added to the index)

      The essence of this issue is to be able to state your intent (insert or update) and have slightly different semantics (from each other and the existing update) depending on you intent.

      The functionality provided by this issue is only really meaningfull when you run with "updateLog" activated.

      This issue might be solved more or less at the same time as SOLR-3178, and only one single SVN patch might be given to cover both issues.

        Activity

        Hide
        Per Steffensen added a comment - - edited

        Impl thoughts:

        • In solrconfig.xml turn this feature on by adding a tag <databaseSemantics> (probably to <updateHandler>). When this "flag" in not turned on "update"-requests work as always (add documents if not exist). With this "flag" turned on "update"-requests do not create a document if it does not already exist. Also when this "flag" is turned on "insert"-requests are suddenly also possible - they do as "update"-requets do today, except that they return DocumentAlreadyExist-error if document already exists.
        • Proper concurrency handling
        • Be carefull using this feature unless you are running with updateLog turned on (and therefore will never "lose" already accepted updates/deletes on crash)
        Show
        Per Steffensen added a comment - - edited Impl thoughts: In solrconfig.xml turn this feature on by adding a tag <databaseSemantics> (probably to <updateHandler>). When this "flag" in not turned on "update"-requests work as always (add documents if not exist). With this "flag" turned on "update"-requests do not create a document if it does not already exist. Also when this "flag" is turned on "insert"-requests are suddenly also possible - they do as "update"-requets do today, except that they return DocumentAlreadyExist-error if document already exists. Proper concurrency handling Be carefull using this feature unless you are running with updateLog turned on (and therefore will never "lose" already accepted updates/deletes on crash)
        Hide
        Per Steffensen added a comment -

        See thread "Unique key constraint and optimistic locking (versioning)" on solr-user mailing list (started 21.02.2012)

        Show
        Per Steffensen added a comment - See thread "Unique key constraint and optimistic locking (versioning)" on solr-user mailing list (started 21.02.2012)
        Hide
        Per Steffensen added a comment -

        Want to put myself as "Assignee". How to?

        Show
        Per Steffensen added a comment - Want to put myself as "Assignee". How to?
        Hide
        Steve Rowe added a comment -

        Hi Per, I've added you to the JIRA Solr contributor group, which enables you to be assigned to JIRA issues.

        Show
        Steve Rowe added a comment - Hi Per, I've added you to the JIRA Solr contributor group, which enables you to be assigned to JIRA issues.
        Hide
        Yonik Seeley added a comment -

        Here's an idea: we already have a version field for documents (that can also be passed in the URL for other things like deletes), we simply reuse that. Positive versions are adds, negative versions are deletes.

        If a document comes into the shad leader and already has a version, then it's considered an optimistic concurrency request... the document should be replacing an existing document with exactly that version. If the version passed is negative, then the document should not already exist (all deleted documents are considered equal).

        No new config needed, and optimistic concurrency can be selected on a per-request basis to the same handler.

        Show
        Yonik Seeley added a comment - Here's an idea: we already have a version field for documents (that can also be passed in the URL for other things like deletes), we simply reuse that. Positive versions are adds, negative versions are deletes. If a document comes into the shad leader and already has a version , then it's considered an optimistic concurrency request... the document should be replacing an existing document with exactly that version. If the version passed is negative, then the document should not already exist (all deleted documents are considered equal). No new config needed, and optimistic concurrency can be selected on a per-request basis to the same handler.
        Hide
        Per Steffensen added a comment -

        Thanks for great input Yonik Seeley. I belive you are already commenting on the related Jira issue that I havnt created yet This issue SOLR-3173 is not so much about versioning but basically only about being able to state your intent (insert or update) on Solr updates, and support failing on "insert"-intent if document already exists, and doing nothing on "update"-intent if document does not already exist.

        I want to reply a little to you comment here anyway:
        I fell over the version field, and wanted to ask about what it is used for. Can you point me in the direction of a (Wiki) description explaining more exactly what it is used for and how. Or else I will need to read the code, in order to make sure that I agree that we can just use that version number - or to what degree we can use it.

        I am not sure I ALWAYS like optimistic locking on a per-request basis. Then it is "too much" up to the clients to use it "correctly". So it depends on how much control you have over your clients. I have different thoughts on this area though. Those thoughts will be reflected soon in a change of the description of this issue and the upcomming related issue (the one actually dealing with versions). Please be patient.

        Providing me more info about the version would be greatly appreciated now, though

        Show
        Per Steffensen added a comment - Thanks for great input Yonik Seeley. I belive you are already commenting on the related Jira issue that I havnt created yet This issue SOLR-3173 is not so much about versioning but basically only about being able to state your intent (insert or update) on Solr updates, and support failing on "insert"-intent if document already exists, and doing nothing on "update"-intent if document does not already exist. I want to reply a little to you comment here anyway: I fell over the version field, and wanted to ask about what it is used for. Can you point me in the direction of a (Wiki) description explaining more exactly what it is used for and how. Or else I will need to read the code, in order to make sure that I agree that we can just use that version number - or to what degree we can use it. I am not sure I ALWAYS like optimistic locking on a per-request basis. Then it is "too much" up to the clients to use it "correctly". So it depends on how much control you have over your clients. I have different thoughts on this area though. Those thoughts will be reflected soon in a change of the description of this issue and the upcomming related issue (the one actually dealing with versions). Please be patient. Providing me more info about the version would be greatly appreciated now, though
        Hide
        Yonik Seeley added a comment -

        I belive you are already commenting on the related Jira issue that I havnt created yet

        Optimistic locking as a superset to insert/update:

        What I already had in mind:

        • update only a specific version of the document by specifying it's exact version: version=12345
        • add a document only if it doesn't already exist (i.e. insert): version=-1
        • add a document regardless: don't specify a version

        So now that I look at it again, it looks like what's missing is your "UPDATE" semantics which would only replace the record if it already existed (a weaker form of the first case... any positive version is OK). But I really wonder how useful those semantics are (only add a doc if it's overwriting an existing doc, regardless of what version or what data it contains?)

        If there are usecases, we certainly should be able to do it.

        As far as what _version_ is, it's new and used for solrcloud to handle reorders of updates to replicas (among other things).
        The leader shard decides what the version of a document should be (versions only increase), and forwards the doc with the version to the replicas.
        If a replica receives the same doc with a lower version, it knows that it can safely drop it because it already has a newer version.

        Show
        Yonik Seeley added a comment - I belive you are already commenting on the related Jira issue that I havnt created yet Optimistic locking as a superset to insert/update: What I already had in mind: update only a specific version of the document by specifying it's exact version: version =12345 add a document only if it doesn't already exist (i.e. insert): version =-1 add a document regardless: don't specify a version So now that I look at it again, it looks like what's missing is your "UPDATE" semantics which would only replace the record if it already existed (a weaker form of the first case... any positive version is OK). But I really wonder how useful those semantics are (only add a doc if it's overwriting an existing doc, regardless of what version or what data it contains?) If there are usecases, we certainly should be able to do it. As far as what _version_ is, it's new and used for solrcloud to handle reorders of updates to replicas (among other things). The leader shard decides what the version of a document should be (versions only increase), and forwards the doc with the version to the replicas. If a replica receives the same doc with a lower version, it knows that it can safely drop it because it already has a newer version.
        Hide
        Per Steffensen added a comment - - edited

        Optimistic locking as a superset to insert/update:

        What I already had in mind:

        - update only a specific version of the document by specifying it's exact version: version=12345

        - add a document only if it doesn't already exist (i.e. insert): version=-1

        - add a document regardless: don't specify a version

        I still need a little time to evaluate to what extend version can be used.

        So now that I look at it again, it looks like what's missing is your "UPDATE" semantics which would only replace the record if it already existed (a weaker form of the first case... any positive version is OK). But I really wonder how useful those semantics are (only add a doc if it's overwriting an existing doc, regardless of what version or what data it contains?)

        If there are usecases, we certainly should be able to do it.

        The only-insert-if-not-exists is needed by us. The only-update-if-exists is mostly for consistency with what we know from RDBMS. Basically simulating what happens when you do the following in SQL and you have unique-constraint on id column. 1) will fail with a unique-key constraint error if document already exists and 2) will not create the row/doc if it does not already exist.
        1) INSERT INTO docs (id, column2, column3,...) VALUES (id-value, value2, value3,...)
        and
        2) UPDATE docs SET column2=value2, column3=value3, ... WHERE id=id-value
        RDBMS people are used to a update operation that does no create a row/document if it has already been deleted. I will consider not making that feature - it is only there to give a consistent experince compared to what you are used to using RDBMS's, and actually seen from a distant perspective I think it is not logical with an "update"-operation that creates stuff if it does not exist (it is simple not logical from the word "update")

        Right now I believe the solution will be that you will have the following URL-extentions
        a) .../solr/.../update, the one already existing in Solr with unchanged semantics
        b) .../solr/.../database/update, that updates if document already exists and does nothing if it does not already exists. And when versioning is activated (SOLR-3178) only updates if correct version is given - give VersionConflict error if document exists but version is not correct.
        c) .../solr/.../database/insert, that creates a new document if document does not already exist. Fails with DocumentAlreadyExists error if document already exists.
        The you can keep using Solr exactly as you are used to, and you can start using the new "database semantics" features if you want that. I might create a optinal config for DirectUpdateHandler2 where you can deactivate the stuff behind a). This can be used when you dont trust clients to use a) correctly in a setup where you want to ensure consistency under high concurrent load.

        As far as what _version_ is, it's new and used for solrcloud to handle reorders of updates to replicas (among other things).

        The leader shard decides what the version of a document should be (versions only increase), and forwards the doc with the version to the replicas.

        If a replica receives the same doc with a lower version, it knows that it can safely drop it because it already has a newer version.

        Cool. I understand a little better now. So no (Wiki) documentation written yet?

        Show
        Per Steffensen added a comment - - edited Optimistic locking as a superset to insert/update: What I already had in mind: - update only a specific version of the document by specifying it's exact version: version =12345 - add a document only if it doesn't already exist (i.e. insert): version =-1 - add a document regardless: don't specify a version I still need a little time to evaluate to what extend version can be used. So now that I look at it again, it looks like what's missing is your "UPDATE" semantics which would only replace the record if it already existed (a weaker form of the first case... any positive version is OK). But I really wonder how useful those semantics are (only add a doc if it's overwriting an existing doc, regardless of what version or what data it contains?) If there are usecases, we certainly should be able to do it. The only-insert-if-not-exists is needed by us. The only-update-if-exists is mostly for consistency with what we know from RDBMS. Basically simulating what happens when you do the following in SQL and you have unique-constraint on id column. 1) will fail with a unique-key constraint error if document already exists and 2) will not create the row/doc if it does not already exist. 1) INSERT INTO docs (id, column2, column3,...) VALUES (id-value, value2, value3,...) and 2) UPDATE docs SET column2=value2, column3=value3, ... WHERE id=id-value RDBMS people are used to a update operation that does no create a row/document if it has already been deleted. I will consider not making that feature - it is only there to give a consistent experince compared to what you are used to using RDBMS's, and actually seen from a distant perspective I think it is not logical with an "update"-operation that creates stuff if it does not exist (it is simple not logical from the word "update") Right now I believe the solution will be that you will have the following URL-extentions a) .../solr/.../update, the one already existing in Solr with unchanged semantics b) .../solr/.../database/update, that updates if document already exists and does nothing if it does not already exists. And when versioning is activated ( SOLR-3178 ) only updates if correct version is given - give VersionConflict error if document exists but version is not correct. c) .../solr/.../database/insert, that creates a new document if document does not already exist. Fails with DocumentAlreadyExists error if document already exists. The you can keep using Solr exactly as you are used to, and you can start using the new "database semantics" features if you want that. I might create a optinal config for DirectUpdateHandler2 where you can deactivate the stuff behind a). This can be used when you dont trust clients to use a) correctly in a setup where you want to ensure consistency under high concurrent load. As far as what _version_ is, it's new and used for solrcloud to handle reorders of updates to replicas (among other things). The leader shard decides what the version of a document should be (versions only increase), and forwards the doc with the version to the replicas. If a replica receives the same doc with a lower version, it knows that it can safely drop it because it already has a newer version. Cool. I understand a little better now. So no (Wiki) documentation written yet?
        Hide
        Per Steffensen added a comment -

        Believe we will be able to use version if:
        a) There is a "realtime" way of getting the version corresponding to a given id (or whatever you use as uniqueKey). Lets call this getRealtimeVersion(id)
        b) The version for a given id returned by getRealtimeVersion(id) never changes unless changes has been made to the document with that id (created, updated or deleted)
        c) That getRealtimeVersion(id) will immediately return that new version as soon a change has been made - no soft- or hard-commit necessary. Well that is the realtime part
        d) I will always get a negative number (hopefully always -1) from getRealtimeVersion(id) when calling with an id, where there is no corresponding document in the solr-core. No matter if there has never been such a document or if it has been there but has been deleted.

        Can you please confirm or correct me on the above bullets, Yonik. It would also be very helpfull if you would provide the code for getRealtimeVersion(id), assuming that I am in the DirectUpdateHandler2. Thanks alot!

        Guess this version-checking stuff is only necessary on primary (or master or whatever you call it) shards and not on replica (or slave). How do I know in DirectUpdateHandler2 if I am primary/master- or replica/slave-shard?

        Regret a little bit the idea about different URLs stated in comment above. Guess I would just like to state info about the wanted semantics in the query in some other way. I guess it would be nice with a "semantics" URL-param with the possible values "db-insert", "db-update", "db-update-version-checked" and "classic-solr-update":

        • semantics=db-insert: Index document doc if and only if getRealtimeVersion(doc.id) returns -1. Else return DocumentAlreadyExist error
        • semantics=db-update: Replace existing document if it exists, else return DocumentDoesNotExist error
        • semantics=db-update-version-checked: As db-update but if version on the provided document does not correspond to existing getRealtimeVersion(doc.id) return VersionConflict error
        • semantics=classic-solr-update: Do exactly as update does today in Solr
          "classic-solr-update" will be used if "semantics" is not specified in update request - it is the default. In solrconfig.xml you will be able to change default semantics plus provide a list of semantics that are not allowed.

        Regards, Per Steffensen

        Show
        Per Steffensen added a comment - Believe we will be able to use version if: a) There is a "realtime" way of getting the version corresponding to a given id (or whatever you use as uniqueKey). Lets call this getRealtimeVersion(id) b) The version for a given id returned by getRealtimeVersion(id) never changes unless changes has been made to the document with that id (created, updated or deleted) c) That getRealtimeVersion(id) will immediately return that new version as soon a change has been made - no soft- or hard-commit necessary. Well that is the realtime part d) I will always get a negative number (hopefully always -1) from getRealtimeVersion(id) when calling with an id, where there is no corresponding document in the solr-core. No matter if there has never been such a document or if it has been there but has been deleted. Can you please confirm or correct me on the above bullets, Yonik. It would also be very helpfull if you would provide the code for getRealtimeVersion(id), assuming that I am in the DirectUpdateHandler2. Thanks alot! Guess this version-checking stuff is only necessary on primary (or master or whatever you call it) shards and not on replica (or slave). How do I know in DirectUpdateHandler2 if I am primary/master- or replica/slave-shard? Regret a little bit the idea about different URLs stated in comment above. Guess I would just like to state info about the wanted semantics in the query in some other way. I guess it would be nice with a "semantics" URL-param with the possible values "db-insert", "db-update", "db-update-version-checked" and "classic-solr-update": semantics=db-insert: Index document doc if and only if getRealtimeVersion(doc.id) returns -1. Else return DocumentAlreadyExist error semantics=db-update: Replace existing document if it exists, else return DocumentDoesNotExist error semantics=db-update-version-checked: As db-update but if version on the provided document does not correspond to existing getRealtimeVersion(doc.id) return VersionConflict error semantics=classic-solr-update: Do exactly as update does today in Solr "classic-solr-update" will be used if "semantics" is not specified in update request - it is the default. In solrconfig.xml you will be able to change default semantics plus provide a list of semantics that are not allowed. Regards, Per Steffensen
        Hide
        Per Steffensen added a comment -

        Hi again

        Have most of it coded by now. No tests yet though. Simply wasnt able to create tests before I knew what kind changes to do and where to do them.

        Yonik Seeley, I really hope for your help on the best way (quickest), from inside DirectUpdateHandler2 (e.g. addDoc method), to realtime-get the newest version number of the document in cmd (using its idField) - basically getRealtimeVersion(id) in comment above. Please help with some code or a few hints.
        I would also really like you to confirm or correct me on a) through d) in comment above.

        I have been playing a little with some code to get the newest document with same uniqueKey (idField) value as the document in cmd:
        SearchComponent realTimeGetComponent = core.getSearchComponent(RealTimeGetComponent.COMPONENT_NAME);
        SolrQueryRequest req = new SolrQueryRequestBase(core, ???) {};
        ResponseBuilder rb = ???;
        realTimeGetComponent.prepare(rb);
        realTimeGetComponent.process(rb);
        long currentVersion = rb.???;
        But I am in doubt if there is a more direct way than getting the SearchComponent from the core and use that? And exactly what to put in as SolrParams? How to create a ResponseBuilder from req? If I am allowed to just call prepare and process on a SearchComponent, or if that has to be handled by some framework that does more? How to get the currentVersion from rb after processing the query? In general about exactly how to do?
        I will make it work, but you can really save me some time, and help me get i right and as efficient as possible in the first go.
        Thanks in advance!

        Regards, Per Steffensen

        Show
        Per Steffensen added a comment - Hi again Have most of it coded by now. No tests yet though. Simply wasnt able to create tests before I knew what kind changes to do and where to do them. Yonik Seeley, I really hope for your help on the best way (quickest), from inside DirectUpdateHandler2 (e.g. addDoc method), to realtime-get the newest version number of the document in cmd (using its idField) - basically getRealtimeVersion(id) in comment above. Please help with some code or a few hints. I would also really like you to confirm or correct me on a) through d) in comment above. I have been playing a little with some code to get the newest document with same uniqueKey (idField) value as the document in cmd: SearchComponent realTimeGetComponent = core.getSearchComponent(RealTimeGetComponent.COMPONENT_NAME); SolrQueryRequest req = new SolrQueryRequestBase(core, ???) {}; ResponseBuilder rb = ???; realTimeGetComponent.prepare(rb); realTimeGetComponent.process(rb); long currentVersion = rb.???; But I am in doubt if there is a more direct way than getting the SearchComponent from the core and use that? And exactly what to put in as SolrParams? How to create a ResponseBuilder from req? If I am allowed to just call prepare and process on a SearchComponent, or if that has to be handled by some framework that does more? How to get the currentVersion from rb after processing the query? In general about exactly how to do? I will make it work, but you can really save me some time, and help me get i right and as efficient as possible in the first go. Thanks in advance! Regards, Per Steffensen
        Hide
        Per Steffensen added a comment -

        More detailed descriptions of the added features in SOLR-3173 and SOLR-3178 here: http://wiki.apache.org/solr/Per%20Steffensen/Update%20semantics

        Show
        Per Steffensen added a comment - More detailed descriptions of the added features in SOLR-3173 and SOLR-3178 here: http://wiki.apache.org/solr/Per%20Steffensen/Update%20semantics
        Hide
        Per Steffensen added a comment -

        See patch attached to SOLR-3178

        Show
        Per Steffensen added a comment - See patch attached to SOLR-3178
        Hide
        Per Steffensen added a comment -

        New patch available as part of SOLR-3173_3178_3382_3428_plus.patch on SOLR-3178

        Show
        Per Steffensen added a comment - New patch available as part of SOLR-3173 _3178_3382_3428_plus.patch on SOLR-3178
        Hide
        Hoss Man added a comment -

        bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

        Show
        Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
        Hide
        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Show
        Robert Muir added a comment - rmuir20120906-bulk-40-change
        Hide
        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Show
        Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
        Hide
        Uwe Schindler added a comment -

        Move issue to Solr 4.9.

        Show
        Uwe Schindler added a comment - Move issue to Solr 4.9.

          People

          • Assignee:
            Per Steffensen
            Reporter:
            Per Steffensen
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated:
              Original Estimate - 168h
              168h
              Remaining:
              Remaining Estimate - 168h
              168h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development