Solr
  1. Solr
  2. SOLR-4661

Index Version & Gen look out of sync in replication UI when master searcher uses older commit then what is replicatable

    Details

      Description

      the ReplicationHandler (and the replication admin UI screen) report the index version & gen for the master based on what commit point is currently open for searching – but this is not necessarily the most recent commit point available for replication.

      Thus, it can appear that a slave has "gotten ahead" of the master, if there are "empty commits" (because of reader reopening shotcuts) or commits using openSearcher=false.

      We need to add additional data to help make it clear there is no actual problem in this sitation.

      Summary of original bug report..

      Index and Gen number on Slave is higher than master.
      If you apply commit on master with no pending docs then the commit time stamp and gen is incremented. When Slaves polls master for replication it see the index version difference and starts replicating but all files are skipped.

      On Admin UI (on Slaves) the version number displayed for master is old where as for slave is the latest which is higher than master.
      Below is the response from master (/replication?command=details) where i see two different Version an Gen numbers. This creates confusion of having version out of sync, though its not.

      ...

      1. hoss_test.zip
        342 kB
        Hoss Man
      2. IndexVersionSyncIssue.jpg
        41 kB
        Aditya
      3. SOLR-4661.patch
        14 kB
        Stefan Matheis (steffkes)
      4. SOLR-4661.patch
        9 kB
        Hoss Man
      5. SOLR-4661.patch
        8 kB
        Hoss Man

        Issue Links

          Activity

          Hide
          Aditya added a comment -

          Index and Gen version displayed on Admin UI (Slave)

          Show
          Aditya added a comment - Index and Gen version displayed on Admin UI (Slave)
          Hide
          Mark Miller added a comment -

          Can you post your replication handler config?

          Show
          Mark Miller added a comment - Can you post your replication handler config?
          Hide
          Aditya added a comment -

          <requestHandler name="/replication" class="solr.ReplicationHandler" >
          <lst name="master">
          <str name="enable">$

          {enable.master:false}

          </str>
          <str name="replicateAfter">commit</str>
          <str name="replicateAfter">startup</str>
          <str name="confFiles">schema.xml</str>
          </lst>
          <lst name="slave">
          <str name="enable">$

          {enable.slave:false}

          </str>
          <str name="masterUrl">http://XX.XX.XX.XX/solr</str>
          <str name="pollInterval">00:05:00</str>
          </lst>
          </requestHandler>

          Show
          Aditya added a comment - <requestHandler name="/replication" class="solr.ReplicationHandler" > <lst name="master"> <str name="enable">$ {enable.master:false} </str> <str name="replicateAfter">commit</str> <str name="replicateAfter">startup</str> <str name="confFiles">schema.xml</str> </lst> <lst name="slave"> <str name="enable">$ {enable.slave:false} </str> <str name="masterUrl"> http://XX.XX.XX.XX/solr </str> <str name="pollInterval">00:05:00</str> </lst> </requestHandler>
          Hide
          Erick Erickson added a comment -

          Is there any chance you started up the slave with the wrong environment variable so it thinks it's a master?

          Wild shot in the dark.

          Show
          Erick Erickson added a comment - Is there any chance you started up the slave with the wrong environment variable so it thinks it's a master? Wild shot in the dark.
          Hide
          Aditya added a comment - - edited

          Double checked .. and everything looks good. Step to reproduce this issue is simple. Just apply commit on master without posting any documents. The next replication you will see the difference of Version and gen on Slave. I remember in Solr3.5 applying commit with no pending document will not increment the version and gen on master.

          From the Slave -----------
          JVM
          Runtime : Oracle Corporation Java HotSpot(TM) 64-Bit Server VM (1.7.0_09 23.5-b02)
          Processors : 4
          Args:
          -Dlogging.configuration=file:/opt/jboss/standalone/configuration/logging.properties
          -Dorg.jboss.boot.log.file=/opt/jboss/standalone/log/boot.log
          -XX:+UseG1GC
          -XX:+UnlockExperimentalVMOptions
          -XX:MaxPermSize=256m
          -Xmx3g
          -Xms3g
          -Djboss.server.default.config=standalone.xml
          -Djava.awt.headless=true
          -Djboss.modules.system.pkgs=org.jboss.byteman
          -Dsolr.data.dir=/storage/solrdata/
          -Denable.slave=true
          -Denable.master=false
          -Dsolr.solr.home=/opt/solr
          -Dorg.jboss.resolver.warning=true
          -Djava.net.preferIPv4Stack=true
          -XX:+TieredCompilation
          -XX:+UseCompressedOops
          -D[Standalone]

          Show
          Aditya added a comment - - edited Double checked .. and everything looks good. Step to reproduce this issue is simple. Just apply commit on master without posting any documents. The next replication you will see the difference of Version and gen on Slave. I remember in Solr3.5 applying commit with no pending document will not increment the version and gen on master. From the Slave ----------- JVM Runtime : Oracle Corporation Java HotSpot(TM) 64-Bit Server VM (1.7.0_09 23.5-b02) Processors : 4 Args: -Dlogging.configuration= file:/opt/jboss/standalone/configuration/logging.properties -Dorg.jboss.boot.log.file=/opt/jboss/standalone/log/boot.log -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:MaxPermSize=256m -Xmx3g -Xms3g -Djboss.server.default.config=standalone.xml -Djava.awt.headless=true -Djboss.modules.system.pkgs=org.jboss.byteman -Dsolr.data.dir=/storage/solrdata/ -Denable.slave=true -Denable.master=false -Dsolr.solr.home=/opt/solr -Dorg.jboss.resolver.warning=true -Djava.net.preferIPv4Stack=true -XX:+TieredCompilation -XX:+UseCompressedOops -D [Standalone]
          Hide
          Hoss Man added a comment -

          Aditya: in your thread on the solr-user@lucene list, you posted some other examples of the generation being higher on the slave then on the master, and you also posted some short log files from both your master & slave, but the log files you posted don't seem to directly correspond to the snippets you posted from requesting /replication?command=details in any of your mails or in this issue (the generation numbers logged by SnapPuller on the slave don't match up)

          Can you please attach a zip file containing a single cohesive example of:

          • a master solrconfig.xml
          • a slave solrconfig.xml (if different)
          • the log files for both the master and slave from JVM startup to jvm shutdown
          • any number of files containing the full response of getting the replication details form the master & slave with the timestamp of when these files were generated.

          The timestamps being crucial to helping us stitch together where in the flow of log messages you are seeing this inconsistency.

          If you are using bash, running two commands like this (one for master and one for slave) while solr is running should be suitable...

          while true; do date --utc && curl -sS "http://localhost:8983/solr/collection1/replication?command=details&indent=truewt=json" && echo && echo && sleep 2; done 2>&1 > slave_rep_details.txt
          
          Show
          Hoss Man added a comment - Aditya: in your thread on the solr-user@lucene list, you posted some other examples of the generation being higher on the slave then on the master, and you also posted some short log files from both your master & slave, but the log files you posted don't seem to directly correspond to the snippets you posted from requesting /replication?command=details in any of your mails or in this issue (the generation numbers logged by SnapPuller on the slave don't match up) Can you please attach a zip file containing a single cohesive example of: a master solrconfig.xml a slave solrconfig.xml (if different) the log files for both the master and slave from JVM startup to jvm shutdown any number of files containing the full response of getting the replication details form the master & slave with the timestamp of when these files were generated. The timestamps being crucial to helping us stitch together where in the flow of log messages you are seeing this inconsistency. If you are using bash, running two commands like this (one for master and one for slave) while solr is running should be suitable... while true; do date --utc && curl -sS "http://localhost:8983/solr/collection1/replication?command=details&indent=truewt=json" && echo && echo && sleep 2; done 2>&1 > slave_rep_details.txt
          Hide
          Hoss Man added a comment -

          Step to reproduce this issue is simple. Just apply commit on master without posting any documents

          Ah... ok, wait a minute - this looks promising. in reviewing the logs you sent to solr-user@lucene i didn't notice that this "empty commit" was happening, but in attempting to reproduce i think i see what you're talking about.

          Steps i followed...

          1) took the Solr 4.2 example/solr dir and cloned it as "master-home"

          2) edited the /replication handler to match what you posted in this issue (but adjusted the replicaiton time to 30 seconds for a faster test)

          3) cloned "master-home" as "slave-home"

          4) ran two instances of Solr 4.2 using hte following commands...

          java -Denable.master=true -Dmaster.port=8999 -Djetty.port=8999 -Dsolr.solr.home=/home/hossman/tmp/ave_version_higher/master-home -jar start.jar &> /home/hossman/tmp/slave_version_higher/master.log
          java -Denable.slave=true -Dmaster.port=899-Djetty.port=9999 -Dsolr.solr.home=/home/hossman/tmp/slave_version_higher/slave-home -jar start.jar &> /home/hossman/tmp/slave_version_higher/slave.log
          

          5) ran two scripts to monitor replication details using the following commands...

          while true; do date --utc && curl -sS "http://localhost:9999/solr/collection1/replication?command=details&indent=truewt=json" && echo && echo && sleep 2; done &> slave_rep_details.txt
          while true; do date --utc && curl -sS "http://localhost:8999/solr/collection1/replication?command=details&indent=truewt=json" && echo && echo && sleep 2; done &> master_rep_details.txt
          

          6) triggered an indexing of all the example docs on master, and waited for replication.

          java -Durl=http://localhost:8999/solr/collection1/update -jar post.jar *.xml
          

          7) triggered an explicit commit on master...

          java -Durl=http://localhost:8999/solr/collection1/update -jar post.jar -
          

          8) shutdown both servers and the scripts (Ctrl-C)

          I've attached the full logs and home dirs at the completion of this test, but as a summmary of the results...

          a) slave & master index files are identical except for segments.gen

          b) the master's replication details indicate that the current commit being used is "indexVersion#1364927050819, generation#2" but it's list of commits does not include this, it contains a single commit of "indexVersion#1364927114002, generation#3"

          c) the slave's replication details indicate that the current commit being used is "indexVersion#1364927050819, generation#2" and that this is the only commit it has locally. The slave's information about hte master is consistent with what the master itself reports.

          I'm not certain, but I believe this is just an optimization where the searcher is not re-opened when the currently opened "commit" is identical to the new commit – this optimization is working on the master, but aparently not on the slave (maybe the slave can't tell that the commits are identical?)

          FWIW: after running this test, i restarted the master and it's replication details were consistent with the list of commit points – it was using generation #3.
          You can also observe the exact same behavior from master's replication details (current generation lower then the generation of any commit point) if you do a hard commit with openSearcher=false.


          I think most of the behavior here makes sense – the slave is replicating the commits from the master, even if the master isn't using them yet because it hasn't opened a new searcher. The key questions i wonder about:

          1) why was segments.get different when i ran my experiment? is that normal?
          2) Assuming i'm correct about their being an optimization to not open a new searcher if the commits are identical, can we make this same optimization work on slaves in the case of replication?

          Show
          Hoss Man added a comment - Step to reproduce this issue is simple. Just apply commit on master without posting any documents Ah... ok, wait a minute - this looks promising. in reviewing the logs you sent to solr-user@lucene i didn't notice that this "empty commit" was happening, but in attempting to reproduce i think i see what you're talking about. Steps i followed... 1) took the Solr 4.2 example/solr dir and cloned it as "master-home" 2) edited the /replication handler to match what you posted in this issue (but adjusted the replicaiton time to 30 seconds for a faster test) 3) cloned "master-home" as "slave-home" 4) ran two instances of Solr 4.2 using hte following commands... java -Denable.master=true -Dmaster.port=8999 -Djetty.port=8999 -Dsolr.solr.home=/home/hossman/tmp/ave_version_higher/master-home -jar start.jar &> /home/hossman/tmp/slave_version_higher/master.log java -Denable.slave=true -Dmaster.port=899-Djetty.port=9999 -Dsolr.solr.home=/home/hossman/tmp/slave_version_higher/slave-home -jar start.jar &> /home/hossman/tmp/slave_version_higher/slave.log 5) ran two scripts to monitor replication details using the following commands... while true; do date --utc && curl -sS "http://localhost:9999/solr/collection1/replication?command=details&indent=truewt=json" && echo && echo && sleep 2; done &> slave_rep_details.txt while true; do date --utc && curl -sS "http://localhost:8999/solr/collection1/replication?command=details&indent=truewt=json" && echo && echo && sleep 2; done &> master_rep_details.txt 6) triggered an indexing of all the example docs on master, and waited for replication. java -Durl=http://localhost:8999/solr/collection1/update -jar post.jar *.xml 7) triggered an explicit commit on master... java -Durl=http://localhost:8999/solr/collection1/update -jar post.jar - 8) shutdown both servers and the scripts (Ctrl-C) I've attached the full logs and home dirs at the completion of this test, but as a summmary of the results... a) slave & master index files are identical except for segments.gen b) the master's replication details indicate that the current commit being used is "indexVersion#1364927050819, generation#2" but it's list of commits does not include this, it contains a single commit of "indexVersion#1364927114002, generation#3" c) the slave's replication details indicate that the current commit being used is "indexVersion#1364927050819, generation#2" and that this is the only commit it has locally. The slave's information about hte master is consistent with what the master itself reports. I'm not certain, but I believe this is just an optimization where the searcher is not re-opened when the currently opened "commit" is identical to the new commit – this optimization is working on the master, but aparently not on the slave (maybe the slave can't tell that the commits are identical?) FWIW: after running this test, i restarted the master and it's replication details were consistent with the list of commit points – it was using generation #3. You can also observe the exact same behavior from master's replication details (current generation lower then the generation of any commit point) if you do a hard commit with openSearcher=false. I think most of the behavior here makes sense – the slave is replicating the commits from the master, even if the master isn't using them yet because it hasn't opened a new searcher. The key questions i wonder about: 1) why was segments.get different when i ran my experiment? is that normal? 2) Assuming i'm correct about their being an optimization to not open a new searcher if the commits are identical, can we make this same optimization work on slaves in the case of replication?
          Hide
          Mark Miller added a comment -

          The slave will actually do a reload if conf files are being replicated - so it's sure to open a new searcher currently. They generally shouldn't change every time though. It's a surprise that the master wouldn't open a new searcher - not sure where that would get short circuited - would have to dig a little.

          Show
          Mark Miller added a comment - The slave will actually do a reload if conf files are being replicated - so it's sure to open a new searcher currently. They generally shouldn't change every time though. It's a surprise that the master wouldn't open a new searcher - not sure where that would get short circuited - would have to dig a little.
          Hide
          Hoss Man added a comment -

          Assuming i'm correct about their being an optimization to not open a new searcher if the commits are identical,

          I did some more experimenting and confirmed i was wrong about this – from Solr's perspective a new searcher is definitely getting opened and warmed.

          I'm not sure, but skimming the code I think this discrepancy between the commit point in use and the commit point on disk may just be a result of using DirectoryReader.openIfChanged in SolrCore.openNewSearcher ... i got lost a bit in the code, but is it possible the reader DirectoryReader is returning null because only the generation changed by the comment, but not any of the actual indexed data?

          Show
          Hoss Man added a comment - Assuming i'm correct about their being an optimization to not open a new searcher if the commits are identical, I did some more experimenting and confirmed i was wrong about this – from Solr's perspective a new searcher is definitely getting opened and warmed. I'm not sure, but skimming the code I think this discrepancy between the commit point in use and the commit point on disk may just be a result of using DirectoryReader.openIfChanged in SolrCore.openNewSearcher ... i got lost a bit in the code, but is it possible the reader DirectoryReader is returning null because only the generation changed by the comment, but not any of the actual indexed data?
          Hide
          Mark Miller added a comment -

          It seems possible:

              // If in fact no changes took place, return null:
              if (reader.getVersion() == segmentInfos.getVersion()) {
                reader.decRef();
                return null;
              }
          

          version described as:

            /**
             * Version number when this IndexReader was opened.
             *
             * <p>This method
             * returns the version recorded in the commit that the
             * reader opened.  This version is advanced every time
             * a change is made with {@link IndexWriter}.</p>
             */
          

          version is incremented when:

            /** Call this before committing if changes have been made to the
             *  segments. */
            public void changed() {
              version++;
            }
          
          Show
          Mark Miller added a comment - It seems possible: // If in fact no changes took place, return null : if (reader.getVersion() == segmentInfos.getVersion()) { reader.decRef(); return null ; } version described as: /** * Version number when this IndexReader was opened. * * <p>This method * returns the version recorded in the commit that the * reader opened. This version is advanced every time * a change is made with {@link IndexWriter}.</p> */ version is incremented when: /** Call this before committing if changes have been made to the * segments. */ public void changed() { version++; }
          Hide
          Hoss Man added a comment -

          Some more (manual) experimentation (doing empty commits, or commits with openSearcher=false on the master, triggering slave replication to get the slave "ahead" and then doing subsequent changes on master & replicating again) seems to confirm that this situation doesn't cause any actual problem in replication – just confusion when looking at the replication details.

          we can add automated tests for this type of situation to future proof ourselves against the risk that something like this causes problems w/replication in the future, but let's do that in SOLR-4629 ... i think the crux of this issue is really one of mitigating user confusion.

          I suggest we update the admin UI to de-emphasize the comparison of the slave(version+generation) to master(version+generation), and instead emphasize a comparison of the slave(version+generation) with the master(replicateableVersion+replicatableGeneration).

          replicatableGeneration is already available in the details, we should be able to add replicatableVersion easy enough.

          Show
          Hoss Man added a comment - Some more (manual) experimentation (doing empty commits, or commits with openSearcher=false on the master, triggering slave replication to get the slave "ahead" and then doing subsequent changes on master & replicating again) seems to confirm that this situation doesn't cause any actual problem in replication – just confusion when looking at the replication details. we can add automated tests for this type of situation to future proof ourselves against the risk that something like this causes problems w/replication in the future, but let's do that in SOLR-4629 ... i think the crux of this issue is really one of mitigating user confusion. I suggest we update the admin UI to de-emphasize the comparison of the slave(version+generation) to master(version+generation), and instead emphasize a comparison of the slave(version+generation) with the master(replicateableVersion+replicatableGeneration). replicatableGeneration is already available in the details, we should be able to add replicatableVersion easy enough.
          Hide
          Hoss Man added a comment -

          Here's a patch that updates the Admin UI Replication screen to distinguish the difference between what is currently searchable on the master, and what is currently replicatable.

          differences are highlighted on the slave only if it differs from what is replicatable on the master (not what is searchable) so this should hopefully help reduce confusion.

          Concerns i still have with this patch...

          1) Stefan Matheis (steffkes): please ensure i didn't butcher the admin UI js too badly ... i think i've covered all the edge cases.
          2) "Replicatable" is not a word, but using "Replicable" doesn't have the right definition ... is there a different word we should use here, or should we just do our part to make Replicatable a word?

          Show
          Hoss Man added a comment - Here's a patch that updates the Admin UI Replication screen to distinguish the difference between what is currently searchable on the master, and what is currently replicatable. differences are highlighted on the slave only if it differs from what is replicatable on the master (not what is searchable) so this should hopefully help reduce confusion. Concerns i still have with this patch... 1) Stefan Matheis (steffkes) : please ensure i didn't butcher the admin UI js too badly ... i think i've covered all the edge cases. 2) "Replicatable" is not a word, but using "Replicable" doesn't have the right definition ... is there a different word we should use here, or should we just do our part to make Replicatable a word?
          Hide
          Hoss Man added a comment -

          update summary & description based on investigation

          Show
          Hoss Man added a comment - update summary & description based on investigation
          Hide
          Hoss Man added a comment -

          updated patch to trunk and switched terminology to "replicable" (the more i thought about it, the less i thought that the definition wasn't applicable)

          Show
          Hoss Man added a comment - updated patch to trunk and switched terminology to "replicable" (the more i thought about it, the less i thought that the definition wasn't applicable)
          Hide
          Stefan Matheis (steffkes) added a comment -

          Updated Patch, includes changes from the replication page for the dashboard (were the same information also appear).

          Missing Information/Empty Versions are shown as '-' so it should be clear for everyone

          Show
          Stefan Matheis (steffkes) added a comment - Updated Patch, includes changes from the replication page for the dashboard (were the same information also appear). Missing Information/Empty Versions are shown as '-' so it should be clear for everyone
          Hide
          Joel Bernstein added a comment -

          I'm concerned about the empty commits opening a new index searcher on the slave. In my testing I found that empty commits do not cause a new searcher to open on master, but do open a new searcher on the slave. This feels like a bug. I'll investigate further. Wondering what peoples thoughts are on this?

          Show
          Joel Bernstein added a comment - I'm concerned about the empty commits opening a new index searcher on the slave. In my testing I found that empty commits do not cause a new searcher to open on master, but do open a new searcher on the slave. This feels like a bug. I'll investigate further. Wondering what peoples thoughts are on this?
          Hide
          Hoss Man added a comment -

          Joel: please note my earlier comments...

          I did some more experimenting and confirmed i was wrong about this – from Solr's perspective a new searcher is definitely getting opened and warmed.

          ...you can clearly see in the logs that an "empty" commit causes a new Searcher to be opened in any situation – but when it happens on the master the underlying call to "DirectoryReader.openIfChanged" recognizes that the commit point in the directory is identical. On the slave side it may not recognize this because the physical directory itself can change (ie: index vs index.XXXXXXXX)

          Even if there is a bug here (or room for optimization on the slave side) we should track it separately since the crux of thies issue (the UI is comparing apples and oranges) would still be true – notably in the case where someone may do an openSearcher=false commit on the master, creating a new replicable version for slaves w/o opening a new searcher on the master.

          Show
          Hoss Man added a comment - Joel: please note my earlier comments... I did some more experimenting and confirmed i was wrong about this – from Solr's perspective a new searcher is definitely getting opened and warmed. ...you can clearly see in the logs that an "empty" commit causes a new Searcher to be opened in any situation – but when it happens on the master the underlying call to "DirectoryReader.openIfChanged" recognizes that the commit point in the directory is identical. On the slave side it may not recognize this because the physical directory itself can change (ie: index vs index.XXXXXXXX) Even if there is a bug here (or room for optimization on the slave side) we should track it separately since the crux of thies issue (the UI is comparing apples and oranges) would still be true – notably in the case where someone may do an openSearcher=false commit on the master, creating a new replicable version for slaves w/o opening a new searcher on the master.
          Hide
          Shawn Heisey added a comment -

          That assignment was done accidentally with keyboard shortcuts. I tried to load a new URL in the Firefox tab by pressing Ctrl-L and typing the domain name, but apparently Jira's keyboard shortcuts take precedence over the browser.

          Show
          Shawn Heisey added a comment - That assignment was done accidentally with keyboard shortcuts. I tried to load a new URL in the Firefox tab by pressing Ctrl-L and typing the domain name, but apparently Jira's keyboard shortcuts take precedence over the browser.
          Hide
          Hoss Man added a comment -

          Committed revision 1469073.
          Committed revision 1469074.
          Committed revision 1469076.

          Show
          Hoss Man added a comment - Committed revision 1469073. Committed revision 1469074. Committed revision 1469076.
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Aditya
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development