SOLR-828

A RequestProcessor to support updates

    Details

• Type: New Feature
    • Status: Resolved
• Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
• Labels: None

      Description

This is the same as SOLR-139. A new issue is opened so that the UpdateProcessor approach is highlighted and we can focus on that solution more easily.


          Activity

          Noble Paul added a comment -

The old approach is more work than the DB approach, and it was not good for very fast updates/commits.

Noble Paul added a comment - edited

The new UpdateProcessor (UpdateableIndexProcessor) must be inserted before RunUpdateProcessor; a minimal skeleton follows the list below.

          • The UpdateProcessor must add an update method.
• AddUpdateCommand gets a new boolean field, append. If append=true, multivalued fields are appended to the stored values; otherwise the old values are removed and the new ones are added.
          • The schema must have a <uniqueKey>
          • UpdateableIndexProcessor registers postCommit/postOptimize listeners.
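
A minimal sketch of such a processor, using the names from this proposal: UpdateableIndexProcessor and the append flag are hypothetical additions, while UpdateRequestProcessor and AddUpdateCommand are existing Solr classes.

{code:java}
import java.io.IOException;

import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Hypothetical skeleton of the proposed processor. It sits immediately
// before RunUpdateProcessor in the chain and hands each command on via
// super.processAdd(), which delegates to the next processor.
public class UpdateableIndexProcessor extends UpdateRequestProcessor {

  public UpdateableIndexProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    // 1. Persist cmd.solrDoc to the backup store (table sketch below).
    // 2. Continue the chain so RunUpdateProcessor indexes the document.
    super.processAdd(cmd);
  }
}
{code}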

          Implementation

UpdateableIndexProcessor uses a DB (JDBC / Berkeley DB Java Edition?) to store the data. Each document will be a row in the DB. The uniqueKey of the document will be used as the primary key. The data will be written as a BLOB into a DB column, in javabin serialized format. The javabin format in its current form is inefficient, but it is possible to enhance it (SOLR-810).

The schema of the table would be (a JDBC sketch of the DDL follows the list):
ID: VARCHAR. The primary key of the document, as a string.
DATA: LONGVARBINARY. A javabin-serialized SolrInputDocument.
STATUS: ENUM (COMMITTED = 0, UNCOMMITTED = 1, UNCOMMITTED_MARKED_FOR_DELETE = 2, COMMITTED_MARKED_FOR_DELETE = 3).
BOOST: DOUBLE. The document boost.
FIELD_BOOSTS: VARBINARY. Javabin-serialized data with the boost of each field.
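
As a sketch only, the table could be created over plain JDBC like this; the table name backup_docs and the concrete SQL types are assumptions (HSQLDB-style), not part of the proposal.

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class BackupStoreSchema {

  // Creates the proposed table. Column names mirror the list above;
  // the concrete SQL types are illustrative only.
  static void createTable(String jdbcUrl) throws SQLException {
    try (Connection con = DriverManager.getConnection(jdbcUrl);
         Statement st = con.createStatement()) {
      st.executeUpdate(
          "CREATE TABLE backup_docs ("
        + " ID VARCHAR(256) PRIMARY KEY," // uniqueKey of the document
        + " DATA LONGVARBINARY,"          // javabin-serialized SolrInputDocument
        + " STATUS SMALLINT,"             // 0..3, see the enum above
        + " BOOST DOUBLE,"                // document boost
        + " FIELD_BOOSTS VARBINARY(4096)" // javabin-serialized field boosts
        + ")");
    }
  }
}
{code}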

          Implementation of various methods

          processAdd()

UpdateableIndexProcessor writes the serialized document to the DB (STATUS = UNCOMMITTED), then calls processAdd() on the next UpdateProcessor.
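
A sketch of that persistence step, assuming the hypothetical backup_docs table above and an already-opened JDBC Connection; JavaBinCodec is Solr's real javabin serializer, everything else is illustrative.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.JavaBinCodec;

public class BackupStoreWriter {

  private final Connection con; // assumed to be opened and managed elsewhere

  public BackupStoreWriter(Connection con) {
    this.con = con;
  }

  // Stores one document as an UNCOMMITTED row keyed by its uniqueKey value.
  public void store(String id, SolrInputDocument doc)
      throws IOException, SQLException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    new JavaBinCodec().marshal(doc, bos); // javabin-serialize the document
    try (PreparedStatement ps = con.prepareStatement(
        "INSERT INTO backup_docs (ID, DATA, STATUS) VALUES (?, ?, 1)")) {
      ps.setString(1, id);
      ps.setBytes(2, bos.toByteArray());
      ps.executeUpdate(); // a real version would upsert; that syntax is DB-specific
    }
  }
}
{code}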

          processDelete()

For a delete-by-query, UpdateableIndexProcessor gets the Searcher from the core, finds the documents that match the query, and deletes them from the data table. For a delete-by-id, it deletes the document with that id from the data table. It then calls the next UpdateProcessor.
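
Continuing the UpdateableIndexProcessor skeleton above, the delete handling could look roughly like this; deleteRow and deleteByQueryFromTable are hypothetical helpers.

{code:java}
// Continues the UpdateableIndexProcessor sketch above; needs
// org.apache.solr.update.DeleteUpdateCommand and java.sql.SQLException.
@Override
public void processDelete(DeleteUpdateCommand cmd) throws IOException {
  try {
    if (cmd.id != null) {
      deleteRow(cmd.id);                 // hypothetical helper: one row by id
    } else {
      // resolve the query against the searcher to uniqueKey values first,
      // then remove the matching rows
      deleteByQueryFromTable(cmd.query); // hypothetical helper
    }
  } catch (SQLException e) {
    throw new IOException(e);
  }
  super.processDelete(cmd);
}
{code}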

          processCommit()

Calls the next UpdateProcessor.

On postCommit/postOptimize

UpdateableIndexProcessor fetches all documents from the data table whose status is UNCOMMITTED. If a document is present in the main index, it is marked as COMMITTED; otherwise it is deleted from the table, because a delete-by-query must have removed it before the commit.
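
A sketch of the reconciliation a postCommit listener could run, again assuming the hypothetical backup_docs table; IndexLookup is an assumed stand-in for a uniqueKey membership test against the main index, not a real Solr interface.

{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BackupStoreReconciler {

  // Assumed stand-in for "is this uniqueKey present in the main index?",
  // e.g. a TermQuery on the uniqueKey field.
  interface IndexLookup {
    boolean contains(String uniqueKey);
  }

  // Marks rows whose documents survived the commit as COMMITTED (0) and
  // drops rows whose documents a delete-by-query removed before the commit.
  static void reconcile(Connection con, IndexLookup index) throws SQLException {
    try (PreparedStatement sel = con.prepareStatement(
             "SELECT ID FROM backup_docs WHERE STATUS = 1"); // UNCOMMITTED
         PreparedStatement mark = con.prepareStatement(
             "UPDATE backup_docs SET STATUS = 0 WHERE ID = ?");
         PreparedStatement drop = con.prepareStatement(
             "DELETE FROM backup_docs WHERE ID = ?");
         ResultSet rs = sel.executeQuery()) {
      while (rs.next()) {
        String id = rs.getString(1);
        if (index.contains(id)) {
          mark.setString(1, id);
          mark.executeUpdate();
        } else {
          drop.setString(1, id);
          drop.executeUpdate();
        }
      }
    }
  }
}
{code}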

          processUpdate()

UpdateableIndexProcessor first checks the data table for the document. If it is present, the document is read from there; if it is not, the missing fields are read from the main index, and the backup document is prepared.

Single-valued fields are taken from the incoming document when present; the others are filled from the backup document. If append=true, all multivalued values from the backup document are appended to the incoming document; otherwise the backup document's values are ignored for fields that also appear in the incoming document. (A merge sketch follows below.)

processAdd() is then called on the next UpdateProcessor.
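
A self-contained sketch of that merge rule; the class and method names are illustrative, while SolrInputDocument and IndexSchema are real Solr classes.

{code:java}
import java.util.Collection;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.schema.IndexSchema;

public class DocumentMerger {

  // Merge rule sketch for processUpdate(): fields absent from the incoming
  // document are carried over from the backup copy; with append=true,
  // multivalued backup values are appended after the incoming ones.
  public static SolrInputDocument merge(SolrInputDocument incoming,
                                        SolrInputDocument backup,
                                        IndexSchema schema,
                                        boolean append) {
    for (String field : backup.getFieldNames()) {
      Collection<Object> backupValues = backup.getFieldValues(field);
      if (incoming.getFieldValues(field) == null) {
        // field missing from the incoming doc: fill it from the backup
        for (Object v : backupValues) {
          incoming.addField(field, v);
        }
      } else if (append && schema.getField(field).multiValued()) {
        // append=true: keep the new values and append the old ones
        for (Object v : backupValues) {
          incoming.addField(field, v);
        }
      }
      // otherwise the incoming values win and the backup values are dropped
    }
    return incoming;
  }
}
{code}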

A new BackupIndexRequestHandler is registered automatically at /backup.

This exposes the data present in the backup index. The user must be able to fetch any document by id by invoking /backup?id=<value> (multiple id values can be sent, e.g. id=1&id=2&id=4). This lets the user query the backup index and construct a new document if he wishes to do so.
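
For illustration, fetching documents from the proposed handler could be as simple as an HTTP GET; the /backup path and its response format are the proposal's, not an existing Solr endpoint, and the host and port are placeholders.

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class BackupFetchExample {

  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint from this proposal; multiple ids in one call.
    URL url = new URL("http://localhost:8983/solr/backup?id=1&id=2&id=4");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // backed-up documents for ids 1, 2 and 4
      }
    }
  }
}
{code}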

          Next steps

The datastore can be optimized by not keeping the stored fields in the DB. That means on postCommit/postOptimize we must read the data back, strip the fields that are already stored in the index, and write it back. That can be another iteration.

          Noble Paul added a comment -

There are a lot of useful comments on the mail thread:

          http://markmail.org/message/57dpsbz3z6dam7q7

          Shalin Shekhar Mangar added a comment -

          Marking for 1.5

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

Selection criteria were "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. Email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Robert Muir added a comment -

          3.4 -> 3.5

          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

Email notifications suppressed to prevent mass-spam.
Pseudo-unique token identifying these issues: hoss20120321nofix36

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Shalin Shekhar Mangar added a comment -

          I think this is redundant now that we have atomic updates via stored fields and transaction logs.


People

• Assignee: Unassigned
• Reporter: Noble Paul
• Votes: 1
• Watchers: 7
