Solr / SOLR-4260

Inconsistent numDocs between leader and replica

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6.1, 6.0
    • Component/s: SolrCloud
    • Labels: None
    • Environment: 5.0.0.2013.01.04.15.31.51

      Description

      After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer we see inconsistencies between the leader and replica for some shards.

      Each core holds about 330k documents. For some reason 5 out of 10 shards have a small deviation in the number of documents. The leader and replica deviate by roughly 10-20 documents, not more.

      Results hopping ranks in the result set for identical queries got my attention: there were small IDF differences for exactly the same record, causing it to shift positions in the result set. No records were indexed during those tests. Consecutive catch-all queries also return different values for numDocs.

      We're running a 10 node test cluster with 10 shards and a replication factor of two, and we frequently reindex using a fresh build from trunk. I hadn't seen this issue for quite some time until a few days ago.
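
      For reference, a minimal sketch of the kind of CloudSolrServer indexing path described above, assuming SolrJ 4.x; the ZooKeeper ensemble, collection name, and field names are illustrative:

      import org.apache.solr.client.solrj.impl.CloudSolrServer;
      import org.apache.solr.common.SolrInputDocument;

      public class CloudIndexingSketch {
          public static void main(String[] args) throws Exception {
              // Illustrative ZooKeeper ensemble and collection name.
              CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
              server.setDefaultCollection("collection1");

              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "http://example.com/");  // unique key, e.g. the crawled URL
              doc.addField("title_nl", "Amsterdam");       // example field from this issue
              server.add(doc);                             // routed to the shard leader, then forwarded to its replica

              server.commit();    // hard commit; the reported setup also autocommits
              server.shutdown();
          }
      }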

      Attachments

      1. 192.168.20.102-replica1.png
        28 kB
        Yago Riveiro
      2. 192.168.20.104-replica2.png
        28 kB
        Yago Riveiro
      3. clusterstate.png
        74 kB
        Yago Riveiro
      4. demo_shard1_replicas_out_of_sync.tgz
        4.03 MB
        Timothy Potter
      5. SOLR-4260.patch
        4 kB
        Mark Miller

        Issue Links

          Activity

          Yonik Seeley added a comment -

          For some reason 5 out of 10 shards have a small deviation in the number of documents.

          numDocs or maxDoc?
          You can expect variations in maxDoc (due to the same document being added more than once in recovery scenarios). numDocs should be identical of course.

          Oh, and that variation in maxDoc will cause small differences in IDF too.

          Mark Miller added a comment -

          What's fresh? We fixed a couple of issues related to this within the last couple of weeks.

          Mark Miller added a comment -

          Looks like fresh means from today?

          Markus Jelsma added a comment - edited

          Yonik, I explicitly looked at numDocs. Shard_B has Num Docs: 335986, Max Doc: 336079 on one node and Num Docs: 335976, Max Doc: 336091 on the other.

          Mark: yes, this is today, 5.0.0.2013.01.04.15.31.51

          Markus Jelsma added a comment -

          I tried twice to reproduce this with today's checkout of trunk but failed to reproduce it consistently. I did see a small deviation while no data was coming in and no recovery was reported to be in progress, but in the end all replicas were in sync and had, according to Luke, identical numDocs.

          I'll keep an eye on this in the next couple of days.

          btw: in this test cluster we use docCount and not maxDoc for our IDF calculation.
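
          Rather than Luke, a plain catch-all query against each core with distrib=false gives the same numDocs comparison. A minimal sketch, assuming SolrJ 4.x's HttpSolrServer; the host and core names are illustrative:

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;

          public class CompareReplicaCounts {
              public static void main(String[] args) throws Exception {
                  // Illustrative core URLs for the leader and replica of one shard.
                  String[] coreUrls = {
                      "http://node1:8983/solr/collection1_shard_b_replica1",
                      "http://node2:8983/solr/collection1_shard_b_replica2"
                  };
                  SolrQuery q = new SolrQuery("*:*");
                  q.setRows(0);
                  q.set("distrib", "false");  // query only this core, do not fan out across the cloud

                  for (String url : coreUrls) {
                      HttpSolrServer core = new HttpSolrServer(url);
                      long numFound = core.query(q).getResults().getNumFound();
                      System.out.println(url + " numDocs=" + numFound);
                      core.shutdown();
                  }
              }
          }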

          Mark Miller added a comment -

          Thanks - would be good to make sure this is okay. FYI, seeing things off for a short time is somewhat expected - there is some 'eventual' consistency here - but it should quickly become consistent again; under heavy bulk indexing is when you are most likely to see the most "eventualness".

          Markus Jelsma added a comment -

          Yes, we see that behaviour all the time and it is indeed expected. The problem with this issue is that it was not a dozen seconds or a few minutes; it was consistently wrong for almost an hour. Then I wrote this issue and left the office.

          Markus Jelsma added a comment -

          I've got it again. This time numDocs is consistent but some facet counts are not consistent between leader and replica. Here are two facet counts for one node:

             <lst name="domain">
                <int name="domain_a">238620</int>
                <int name="domain_b">218</int>
          

          and the other:

             <lst name="domain">
                <int name="domain_a">238621</int>
                <int name="domain_b">217</int>
          
          Mark Miller added a comment -

          Not sure I'd say it's 'it' again yet - numDocs and facet counts are quite different.

          Interesting though - does that persist after a hard commit / open new searcher?

          Markus Jelsma added a comment -

          You're right, it is different, my bad. A hard commit is issued automatically once in a while, and the issue persists after a manual hard commit as well.

          Interestingly, we see that the docCount returned by CollectionStatistics.docCount() is inconsistent between leader and replica for each shard. As Yonik said, that's normal when using maxDoc, but we don't use maxDoc in this setup; docCount should be correct. Since it isn't, our IDF is sometimes skewed, causing docs to jump position in the result set.
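
          For context on the custom similarity mentioned above, a minimal sketch of how such an override might look, assuming Lucene 4.x's BM25Similarity where idfExplain(...) is the overridable extension point; this is an illustration, not the exact class used in this setup:

          import org.apache.lucene.search.CollectionStatistics;
          import org.apache.lucene.search.Explanation;
          import org.apache.lucene.search.TermStatistics;
          import org.apache.lucene.search.similarities.BM25Similarity;

          // BM25 variant whose IDF uses docCount (documents that actually have the field)
          // instead of maxDoc, as described in this issue.
          public class DocCountBM25Similarity extends BM25Similarity {
              @Override
              public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
                  long df = termStats.docFreq();
                  long docCount = collectionStats.docCount();  // instead of collectionStats.maxDoc()
                  // Standard BM25 idf formula, computed against docCount.
                  float idf = (float) Math.log(1 + (docCount - df + 0.5D) / (df + 0.5D));
                  return new Explanation(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
              }
          }

          In Solr such a class would be wired in through a custom SimilarityFactory referenced from the schema; that wiring is omitted here.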

          Markus Jelsma added a comment -

          I've removed domain_b from the index and, as I expected, numDocs is now indeed inconsistent. By coincidence, what was missing from domain_a in one replica was made up for by an extra doc from domain_b, and vice versa.

          The collection of a couple of million records has one replica that's missing one document.

          Markus Jelsma added a comment -

          Here's the debug output of the same query executed on both the leader and the replica. This setup uses an overridden BM25Similarity that returns docCount() for IDF instead of maxDoc. In this case both have an equal number of documents in the index, so something else doesn't seem right. Facet counts add up; both leader and replica have the same number of documents per domain.

          43.960983 = (MATCH) sum of:
            43.960983 = (MATCH) max plus 0.35 times others of:
              29.059849 = (MATCH) weight(title_nl:amsterdam^6.4 in 14437) [], result of:
                29.059849 = score(doc=14437,freq=3.0 = termFreq=3.0
          ), product of:
                  6.4 = boost
                  2.889473 = idf(docFreq=18368, docCount=330335)
                  1.5714288 = tfNorm, computed from:
                    3.0 = termFreq=3.0
                    1.2 = parameter k1
                    0.0 = parameter b (norms omitted for field)
              4.651832 = (MATCH) weight(content_nl:amsterdam^1.6 in 14437) [], result of:
                4.651832 = score(doc=14437,freq=1.0 = termFreq=1.0
          ), product of:
                  1.6 = boost
                  2.9073951 = idf(docFreq=18039, docCount=330285)
                  1.0 = tfNorm, computed from:
                    1.0 = termFreq=1.0
                    1.2 = parameter k1
                    0.0 = parameter b (norms omitted for field)
              32.161896 = (MATCH) weight(url:amsterdam^3.64 in 14437) [], result of:
                32.161896 = score(doc=14437,freq=2.0 = termFreq=2.0
          ), product of:
                  3.64 = boost
                  6.328843 = idf(docFreq=608, docCount=341068)
                  1.396098 = tfNorm, computed from:
                    2.0 = termFreq=2.0
                    1.2 = parameter k1
                    0.75 = parameter b
                    4.227131 = avgFieldLength
                    4.0 = fieldLength
          
          45.993042 = (MATCH) sum of:
            45.993042 = (MATCH) max plus 0.35 times others of:
              28.35725 = (MATCH) weight(title_nl:amsterdam^6.4 in 170479) [], result of:
                28.35725 = score(doc=170479,freq=3.0 = termFreq=3.0
          ), product of:
                  6.4 = boost
                  2.8196125 = idf(docFreq=16736, docCount=280676)
                  1.5714288 = tfNorm, computed from:
                    3.0 = termFreq=3.0
                    1.2 = parameter k1
                    0.0 = parameter b (norms omitted for field)
              4.577688 = (MATCH) weight(content_nl:amsterdam^1.6 in 170479) [], result of:
                4.577688 = score(doc=170479,freq=1.0 = termFreq=1.0
          ), product of:
                  1.6 = boost
                  2.8610551 = idf(docFreq=16054, docCount=280631)
                  1.0 = tfNorm, computed from:
                    1.0 = termFreq=1.0
                    1.2 = parameter k1
                    0.0 = parameter b (norms omitted for field)
              34.465813 = (MATCH) weight(url:amsterdam^3.64 in 170479) [], result of:
                34.465813 = score(doc=170479,freq=2.0 = termFreq=2.0
          ), product of:
                  3.64 = boost
                  6.798851 = idf(docFreq=323, docCount=290119)
                  1.3926809 = tfNorm, computed from:
                    2.0 = termFreq=2.0
                    1.2 = parameter k1
                    0.75 = parameter b
                    4.189095 = avgFieldLength
                    4.0 = fieldLength
          

          It's clear that not only docCount is different but also docFreq, while both should be equal on the leader and the replica. This makes a mess of the final score!

          Has anyone else here seen this issue?

          Markus Jelsma added a comment -

          OK, this is not a SolrCloud issue; I can also reproduce it in stand-alone and multi-core setups. It is also not a problem of BM25, since TF-IDF has the same problem. Nor does docCount vs. maxDoc seem to be the problem.

          I now have two identical cores set up and index the same data to both: no problem, everything is very consistent. Then I reindex the same data again to only one of the two cores, and the trouble starts. There is a small variation in maxDoc, which is expected, but there is also a variation in docFreq, which is very unexpected; docFreq must not change at all if I reindex the same data.

          Here's a debug snippet from the first core, which did not receive reindexed data:

          910.47974 = (MATCH) sum of:
            910.47974 = (MATCH) max plus 0.35 times others of:
              793.99835 = (MATCH) weight(title_en:groningen^6.4 in 5132) [], result of:
                793.99835 = score(doc=5132,freq=1.0 = termFreq=1.0
          ), product of:
                  71.28527 = queryWeight, product of:
                    6.4 = boost
                    11.138323 = idf(docFreq=1, maxDocs=50588)
                    1.0 = queryNorm
                  11.138323 = fieldWeight in 5132, product of:
                    1.0 = tf(freq=1.0), with freq of:
                      1.0 = termFreq=1.0
                    11.138323 = idf(docFreq=1, maxDocs=50588)
                    1.0 = fieldNorm(doc=5132)
              312.06528 = (MATCH) weight(content_en:groningen^1.6 in 5132) [], result of:
                312.06528 = score(doc=5132,freq=1.0 = termFreq=1.0
          ), product of:
                  17.172573 = queryWeight, product of:
                    1.6 = boost
                    10.732858 = idf(docFreq=2, maxDocs=50588)
                    1.0 = queryNorm
                  18.172308 = fieldWeight in 5132, product of:
                    1.6931472 = tf(freq=1.0), with freq of:
                      1.0 = termFreq=1.0
                    10.732858 = idf(docFreq=2, maxDocs=50588)
                    1.0 = fieldNorm(doc=5132)
              20.73867 = (MATCH) weight(domain_grams:groningen^3.7 in 5132) [], result of:
                20.73867 = score(doc=5132,freq=1.0 = termFreq=1.0
          ), product of:
                  26.48697 = queryWeight, product of:
                    3.7 = boost
                    7.158641 = idf(docFreq=106, maxDocs=50588)
                    1.0 = queryNorm
                  0.7829763 = fieldWeight in 5132, product of:
                    1.0 = tf(freq=1.0), with freq of:
                      1.0 = termFreq=1.0
                    7.158641 = idf(docFreq=106, maxDocs=50588)
                    0.109375 = fieldNorm(doc=5132)
          

          Here's the debug of the same doc on the core which I reindexed the same data to:

          928.31537 = (MATCH) sum of:
            928.31537 = (MATCH) max plus 0.35 times others of:
              815.29694 = (MATCH) weight(title_en:groningen^6.4 in 31881) [], result of:
                815.29694 = score(doc=31881,freq=1.0 = termFreq=1.0
          ), product of:
                  72.23504 = queryWeight, product of:
                    6.4 = boost
                    11.286724 = idf(docFreq=1, maxDocs=58681)
                    1.0 = queryNorm
                  11.286724 = fieldWeight in 31881, product of:
                    1.0 = tf(freq=1.0), with freq of:
                      1.0 = termFreq=1.0
                    11.286724 = idf(docFreq=1, maxDocs=58681)
                    1.0 = fieldNorm(doc=31881)
              304.0185 = (MATCH) weight(content_en:groningen^1.6 in 31881) [], result of:
                304.0185 = score(doc=31881,freq=1.0 = termFreq=1.0
          ), product of:
                  16.949724 = queryWeight, product of:
                    1.6 = boost
                    10.593577 = idf(docFreq=3, maxDocs=58681)
                    1.0 = queryNorm
                  17.936485 = fieldWeight in 31881, product of:
                    1.6931472 = tf(freq=1.0), with freq of:
                      1.0 = termFreq=1.0
                    10.593577 = idf(docFreq=3, maxDocs=58681)
                    1.0 = fieldNorm(doc=31881)
              18.891369 = (MATCH) weight(domain_grams:groningen^3.7 in 31881) [], result of:
                18.891369 = score(doc=31881,freq=1.0 = termFreq=1.0
          ), product of:
                  25.279795 = queryWeight, product of:
                    3.7 = boost
                    6.832377 = idf(docFreq=171, maxDocs=58681)
                    1.0 = queryNorm
                  0.7472912 = fieldWeight in 31881, product of:
                    1.0 = tf(freq=1.0), with freq of:
                      1.0 = termFreq=1.0
                    6.832377 = idf(docFreq=171, maxDocs=58681)
                    0.109375 = fieldNorm(doc=31881)
          

          As you can see, docFreq has changed but the number of documents is still the same. Since I now suspect the merging of segments has something to do with it, I'll send an optimize command to the node that I reindexed data to.

          After optimizing (or forcing all segments to be merged) I get the same debug output as I had for the first node, the one I didn't reindex to!
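
          The same optimize can be issued programmatically; a minimal sketch, assuming SolrJ 4.x and an illustrative core URL:

          import org.apache.solr.client.solrj.impl.HttpSolrServer;

          public class ForceMergeSketch {
              public static void main(String[] args) throws Exception {
                  // Illustrative URL of the core that received the reindexed data.
                  HttpSolrServer core = new HttpSolrServer("http://localhost:8983/solr/core1");
                  // Merges all segments down to one, expunging deleted documents, after which
                  // the docFreq/maxDoc statistics match the core that was not reindexed.
                  core.optimize();
                  core.shutdown();
              }
          }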

          Markus Jelsma added a comment -

          I updated the title to reflect the new issue. Can anyone confirm that with trunk they see different values for CollectionStatistics.docCount and docFreq?

          I can confirm that the variations disappear after optimizing one of our SolrCloud clusters. The only time results swap places is when the score is identical and the docID is used as the tie-breaker.

          Yonik Seeley added a comment -

          There is a small variation in maxDoc, which is expected, but there is also a variation in docFreq, which is very unexpected; docFreq must not change at all if I reindex the same data.

          Unfortunately, deletions don't change index statistics like docFreq (this has been the case since the first version of Lucene). This means that reindexing a document can artificially increase the docFreq until the deletion is really removed via merging/optimize.
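
          A small self-contained Lucene sketch of the effect described here, assuming Lucene 4.6-era APIs and illustrative field names: reindexing the same document leaves the old, now deleted, posting in place, so docFreq stays inflated until the segments are merged.

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.document.StringField;
          import org.apache.lucene.document.TextField;
          import org.apache.lucene.index.DirectoryReader;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.IndexWriterConfig;
          import org.apache.lucene.index.Term;
          import org.apache.lucene.store.RAMDirectory;
          import org.apache.lucene.util.Version;

          public class DocFreqAfterReindex {
              public static void main(String[] args) throws Exception {
                  RAMDirectory dir = new RAMDirectory();
                  IndexWriter writer = new IndexWriter(dir,
                      new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46)));

                  Document doc = new Document();
                  doc.add(new StringField("id", "1", Field.Store.YES));
                  doc.add(new TextField("title", "amsterdam", Field.Store.NO));
                  writer.addDocument(doc);
                  writer.commit();
                  printDocFreq(dir, "after first index");   // docFreq=1

                  writer.updateDocument(new Term("id", "1"), doc);  // "reindex" the same document
                  writer.commit();
                  printDocFreq(dir, "after reindex");        // docFreq=2: the deleted copy still counts

                  writer.forceMerge(1);                      // merge away the deleted copy
                  writer.commit();
                  printDocFreq(dir, "after forceMerge");     // docFreq=1 again
                  writer.close();
              }

              static void printDocFreq(RAMDirectory dir, String label) throws Exception {
                  DirectoryReader reader = DirectoryReader.open(dir);
                  System.out.println(label + ": docFreq=" + reader.docFreq(new Term("title", "amsterdam")));
                  reader.close();
              }
          }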

          Markus Jelsma added a comment -

          Of course, you're right! I got distracted by that fact and in the process renamed this issue when I shouldn't have. The inconsistency between leader and replica on one shard is still here.

          I'll rename it back and raise the docFreq and docCount issue on the list. Sorry for the mess.

          Markus Jelsma added a comment -

          Here's the index information for two cores of the same shard, running on different nodes.

          <?xml version="1.0" encoding="UTF-8"?>
          <response>
          
          <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">1</int>
          </lst>
          <lst name="index">
            <int name="numDocs">117744</int>
            <int name="maxDoc">118160</int>
            <int name="deletedDocs">416</int>
            <long name="version">3802</long>
            <int name="segmentCount">15</int>
            <bool name="current">true</bool>
            <bool name="hasDeletions">true</bool>
            <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_h/data/index.20130211094737738 lockFactory=org.apache.lucene.store.NativeFSLockFactory@2ca7563d; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
            <lst name="userData">
              <str name="commitTimeMSec">1361265544970</str>
            </lst>
            <date name="lastModified">2013-02-19T09:19:04.97Z</date>
          </lst>
          </response>
          
          
          <?xml version="1.0" encoding="UTF-8"?>
          <response>
          
          <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">0</int>
          </lst>
          <lst name="index">
            <int name="numDocs">117767</int>
            <int name="maxDoc">118181</int>
            <int name="deletedDocs">414</int>
            <long name="version">3772</long>
            <int name="segmentCount">13</int>
            <bool name="current">true</bool>
            <bool name="hasDeletions">true</bool>
            <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/cores/shard_h/data/index.20130211105622621 lockFactory=org.apache.lucene.store.NativeFSLockFactory@684b4388; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
            <lst name="userData">
              <str name="commitTimeMSec">1361265544937</str>
            </lst>
            <date name="lastModified">2013-02-19T09:19:04.937Z</date>
          </lst>
          </response>
          

          We send updates/deletes to the cluster every 10-15 minutes. The shard will not become synchronized unless I remove the index of one of the nodes.

          Mark Miller added a comment -

          No interesting exceptions in the logs? Perhaps dial them up to warn and run?

          Markus Jelsma added a comment - edited

          Nothing peculiar in the WARN logs. We usually don't log INFO unless something is really broken; that's too much data.

          Markus Jelsma added a comment -

          FYI: we're still seeing major inconsistencies; facet counts are off, and when inspecting leaders and replicas we notice not all are in sync. This is on yesterday's trunk, starting from an empty index. There were no node failures during indexing. Shard_b's stats, for example:

          node 2 shard b

          Last Modified:    about a minute ago
          Num Docs:    158964
          Max Doc:    158964
          Deleted Docs:    0
          Version:    4479
          Segment Count:    1
          

          node 3 shard b

          Last Modified:    2 minutes ago
          Num Docs:    158298
          Max Doc:    158298
          Deleted Docs:    0
          Version:    2886
          

          Size and version are also different. The cluster has been optimized/forceMerged, but as expected that doesn't change the facts. At least one other shard also has differences between its two replicas; I haven't manually checked the others.

          Mark Miller added a comment -

          See anything in the logs about zk expirations?

          Markus Jelsma added a comment -

          I've already restarted the job and enabled logging! It's going to take a while.

          Markus Jelsma added a comment -

          Alright, nothing that looks like ZooKeeper expirations; I grepped for expirations in the error log but there's nothing there. This indexing session did not produce as many inconsistencies as the previous one; there is only 1 shard of which one replica has 2 more documents. It won't fix itself.

          During indexing there were, as usual, errors such as autocommit causing too many searchers and timeouts talking to other nodes.

          Only 2 nodes report a 'Stopping recovery for' message, of which one actually has a replica of the inconsistent core. The other shard seems fine; both replicas have the same numDocs.

          Yago Riveiro added a comment -

          Hi, I hit this bug with Solr 4.5.1.

          replica 1:

          lastModified:20 minutes ago
          version:80616
          numDocs:6072661
          maxDoc:6072841
          deletedDocs:180

          replica 2 (leader)

          lastModified:20 minutes ago
          version:77595
          numDocs:6072575
          maxDoc:6072771
          deletedDocs:196

          I don't know when this happened, therefore I have no time frame for finding valuable information in the logs.

          Yago Riveiro added a comment - edited

          I attached some screenshots.

          The shard in question is shard11:

          1 - clusterstate: this screenshot shows replica2 (192.168.20.104) as the leader
          2 - replica2 has a lower gen than replica1 and is the leader; is this correct?

          Markus Jelsma added a comment -

          I can confirm as well that this issue still exists. Since yesterday one of a shard's replicas has one document less than it should have. Solr doesn't notice this and makes no attempt to recover from it. Around the time I first noticed it we were shutting down and restarting nodes; it's likely that some documents got indexed at that time as well.

          Mark Miller added a comment -

          Yeah, I know that this can still happen - unfortunately, the debugging and testing required to make continued improvements requires a lot of time, so I don't personally know when I can work on hardening it. Currently, if shards eventually get out of whack, the best you can do is trigger a new recovery against the leader.
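
          For reference, a minimal sketch of triggering that recovery on an out-of-sync core via the core admin REQUESTRECOVERY action, assuming SolrJ 4.x's CoreAdminRequest.RequestRecovery helper; the node URL and core name are illustrative, and the equivalent plain HTTP call is /admin/cores?action=REQUESTRECOVERY&core=<coreName>:

          import org.apache.solr.client.solrj.impl.HttpSolrServer;
          import org.apache.solr.client.solrj.request.CoreAdminRequest;
          import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

          public class RequestRecoverySketch {
              public static void main(String[] args) throws Exception {
                  // Illustrative: the node hosting the out-of-sync replica and that core's name.
                  HttpSolrServer node = new HttpSolrServer("http://node2:8983/solr");

                  CoreAdminRequest.RequestRecovery req = new CoreAdminRequest.RequestRecovery();
                  req.setAction(CoreAdminAction.REQUESTRECOVERY);
                  req.setCoreName("collection1_shard_b_replica2");
                  req.process(node);  // the core re-syncs from its leader (peer sync or full replication)

                  node.shutdown();
              }
          }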

          Mark Miller added a comment -

          I have done a bit of work on making it easier to turn on some debug logging that I use for this type of thing recently. I also have some specific local Jenkins jobs I run on dedicated hardware to help track down problems - I have been collecting a lot of logs over the last few weeks. There is a lot more that needs to be done though. I'm hoping to start a wiki page on how I have gone about tracking this type of thing down in the past so that perhaps it's easier for others to get involved. Hopefully Yonik can add any of his useful tricks to that as well.

          Yago Riveiro added a comment -

          Currently, if shards eventually get out of whack, the best you can do is trigger a new recovery against the leader.

          What happens when the leader is the shard with fewer docs? Is the replication done the right way?

          Mark Miller added a comment -

          What happens when the leader is the shard with fewer docs? Is the replication done the right way?

          It really depends on how careful you are trying to be and what you can count on - you don't know if the leader is behind if deletes are involved. The safest thing to do is to stop the cluster and start it again - that triggers a process that tries to pick the most up to date replica and trades up to 100 updates or so among each other if some are on some replicas and not others.

          If you are sure the leader is simply behind, you can just bounce it and let a replica take over as leader.

          Jessica Cheng Mallet added a comment -

          We're seeing the same thing, running v4.5.0.

          Mark, if you don't mind clarifying...

          The safest thing to do is to stop the cluster and start it again - that triggers a process that tries to pick the most up to date replica and trades up to 100 updates or so among each other if some are on some replicas and not others.

          What if the difference is greater than 100? Is there any other way to figure out who is the "truth" and force that state onto the other replicas by doing a full sync?

          If you are sure the leader is simply behind, you can just bounce it and let a replica take over as leader.

          Newbie question: Why would the leader be behind? Aren't all updates sent to the leader first and then the leader distribute it to the replicas? Also, I was under the impression that this update call is synchronous, so once an update request returns successfully to the client, why would any replica be behind?

          Yago Riveiro added a comment -

          Jessica,

          At some point in the process the leader can be downgraded to a replica and the other replica, the one with fewer documents, will become the leader. In this case the old leader (after recovery) can be updated as usual, and you end up with the leader behind the replica if the recovery doesn't fix the deviation.

          Mark Miller added a comment -

          What if the difference is greater than 100? Is there any other way to figure out who is the "truth" and force that state onto the other replicas by doing a full sync?

          That is basically what should happen - everyone in the leader line will "try" and become the leader by trying to peer sync with everyone else - either they will be ahead of everyone else and the sync will succeed or they will be behind by less than 100 updates and trade and the sync will succeed. If the sync fails, the next guy in line tries. Eventually the most up to date guy should succeed and he forces everyone else to match him. That is the idea anyway.

          Newbie question: Why would the leader be behind?

          ZooKeeper session timeouts (due to load, gc, whatever) can cause the leader to be bumped.

          You mainly only expect this stuff to happen if nodes go down (and perhaps come back) or session expirations.

          Unfortunately, for a while between 4.4 and 4.5, a couple of our important tests stopped working and I think a couple problems were introduced. I hope to have more time to look into it soon.

          Jessica Cheng Mallet added a comment -

          Thanks Yago and Mark. I really appreciate you guys spending time to answer my questions--it's definitely helping me understand more.

          I understand how leader roles can change, but not how replicas can be behind--I thought the updates are synchronously distributed (i.e. solrj's request doesn't return successfully until an update has been distributed to all replicas). Is this not the case?

          If any replica can fall behind, and it can be elected leader without having caught up, wouldn't we possibly end up in the following situation (where let's pretend docs are represented by monotonically increasing numbers):

          Old Leader: 1 2 3 4 5 (GC)
          New Leader: 1 2 3 .. (elected) 6 7

          In this case, who's considered the most up-to-date guy? Would they figure out among themselves that Old Leader is missing 6 7 but the New Leader is missing 4 5? If so, how do they do the right "merge" if the difference is greater than 100 and they have to resort to full sync?

          Yago Riveiro added a comment -

          I thought the updates are synchronously distributed

          My knowledge about how replication is done is very limited; to me replication is a distributed HTTP request to all replicas, and if all responses return code 200, the insertion was successful. I don't know whether internally the 200 is returned when the document is written to the tlog or to the open segment.

          Up to date in this case is neither: your data is compromised and you can't guarantee which is the correct replica. The logic could be to pick the replica with more docs and build a new replica from it, but you still can't know, without checking document by document, whether you have all the data. An extreme case would be a full reindex of the data (if you can).

          Mark Miller added a comment -

          I understand how leader roles can change, but not how replicas can be behind--I thought the updates are synchronously distributed (i.e. solrj's request doesn't return successfully until an update has been distributed to all replicas). Is this not the case?

          They could get behind because of a bug generally

          Yago Riveiro added a comment -

          Mark,

          I can confirm that I had session expirations in my logs at some point in time. My indexing rate is high and sometimes my boxes are under some "pressure".

          My problem is that I don't know how to deal with the situation. I'm using a non-Java client and I don't know how to debug this, or which tools I can use to provide information to help debug this issue.

          Jessica Cheng Mallet added a comment -

          They could get behind because of a bug generally

          I see, so as designed, the scenario I described shouldn't really happen, because New Leader wouldn't have been missing 4 5?

          BTW, for our case, turns out the version numbers across the replicas matched even though numDocs didn't. That seems to suggest that there’s at least an issue somewhere in the replication flow where it’s possible to update the version but not have the matching documents. Not sure if this is a useful piece of information to you but thought I'd mention.

          Mark Miller added a comment -

          I think of two failure scenarios:

          1. A replica goes down. Design solution: it comes back and uses peer sync or replication to catch up.

          2. A leader goes down. It might have been in the middle of sending updates and somehow a couple didn't make it to a replica. Design solution: that leader peer sync dance I talk about above.

          Basically, no one should ever be behind by more than 100 docs (a leader sends updates to replicas in parallel), and new leaders should always end up up to date. Obviously a bit more hardening to do though. Could also use more targeted testing - we count a lot on the chaosmonkey tests for this (those are the tests that stopped working correctly for a while).

          A wrinkle is that zookeeper session timeouts also trigger the same thing as if the node had died - it comes back when the session is reestablished.

          Mark Miller added a comment -

          Basically, no one should ever be behind by more than 100 docs

          Of course, if they are over 100 updates behind, they won't successfully sync and become leader; someone else will, and the node that is behind will be asked to catch up to the leader via replication.

          Mark Miller added a comment -

          My problem is that I don't know how to deal with the situation. I'm using a non-Java client and I don't know how to debug this, or which tools I can use to provide information to help debug this issue.

          I'm hoping to put together a guide on debugging some of the tests for this sort of thing soon. Perhaps some of that will also be useful for debugging a live installation. I sure could use the help - I have a lot on my plate at the moment.

          Jessica Cheng Mallet added a comment -

          Looking at the code a bit, I realized that the scenario I described can in fact happen if the Old Leader dies (or somehow becomes unreachable, for example due to tripping the kernel SYN flood detection, as ours did), because it looks like the sync that's run during runLeaderProcess() is called with cantReachIsSuccess=true. Since the New Leader can't reach the Old Leader, it won't find out about 4 and 5 (assuming no other replicas have them either), but will successfully "sync" and become the new leader. This could be remedied if the "// TODO: optionally fail if n replicas are not reached..." in DistributedUpdateProcessor.doFinish() were implemented, so that at least one other replica must have 4 and 5 before the request is ack'd to the user - but of course if the New Leader can't reach that other replica either, then it's not much help.

          I feel like in general the code may be trying too hard to find a new leader to take over, thereby compromising data consistency. This is probably the right thing to do for many, if not most, search solutions. However, if Solr is indeed moving toward being a possible NoSQL solution, or toward use cases where reindexing the entire corpus is extremely expensive, then maybe a more consistent mode could be implemented where the user can choose to trade availability for consistency.

          Mark Miller added a comment -

          Since the New Leader can't reach Old Leader, it won't find out about 4 5 (assuming no other replicas have it either)

          This shouldn't be the case, because those updates will only have been ack'd if each replica received them. And if they were not ack'd as a success, we don't care whether we keep them - we just want to get consistent.

          I feel like in general the code may be trying too hard to find a new leader to take over

          A further protection is that a node will not become leader unless its last state was active.

          I'm not convinced it's too loose - I do know that the impl could use additional love and tests.

          Jessica Cheng Mallet added a comment -

          This shouldn't be the case, because those updates will only have been ack'd if each replica received them.

          That's what I thought too, but that doesn't seem to be the case in the code. If you take a look at DistributedUpdateProcessor.doFinish():

          // if its a forward, any fail is a problem -
          // otherwise we assume things are fine if we got it locally
          // until we start allowing min replication param
          if (errors.size() > 0) {
            // if one node is a RetryNode, this was a forward request
            if (errors.get(0).req.node instanceof RetryNode) {
              rsp.setException(errors.get(0).e);
            }
            // else
            // for now we don't error - we assume if it was added locally, we
            // succeeded
          }

          It then starts a thread to urge the replica to recover, but if that fails, it just completely gives up.
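          For illustration, a minimal sketch of the kind of minimum-replication check that the "// TODO: optionally fail if n replicas are not reached" comment hints at might look like the following. All names (finishUpdate, minRf, and so on) are hypothetical, and this is not how DistributedUpdateProcessor.doFinish() is actually written:

          // Hypothetical sketch of a minimum-replication-factor check after fanning out an update.
          // None of these names exist in Solr; the point is only to show where such a check would sit.
          public class MinReplicationCheck {

              // totalReplicas includes the leader itself; failedReplicas is how many sends failed
              static void finishUpdate(int totalReplicas, int failedReplicas, int minRf) {
                  int acked = totalReplicas - failedReplicas;
                  if (acked < minRf) {
                      // fail the request so the client knows the update is not safely replicated
                      throw new RuntimeException("update only reached " + acked
                              + " of the " + minRf + " required replicas");
                  }
                  // otherwise succeed, but still ask the failed replicas to recover in the background
              }

              public static void main(String[] args) {
                  finishUpdate(3, 1, 2); // 2 acks >= minRf of 2: request succeeds
                  try {
                      finishUpdate(2, 1, 2); // only the leader's ack: request is rejected
                  } catch (RuntimeException e) {
                      System.out.println("rejected: " + e.getMessage());
                  }
              }
          }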

          Mark Miller added a comment -

          Right - but that's just impl, not design. The idea is that, since we add locally first, there is not much reason it should fail on a replica - unless that replica has crashed or lost connectivity or something really bad. In that case, it will have to reconnect to zk and recover or restart and recover. Just in case, as a precaution, we try and tell it to recover - then if it's still got connectivity or it was an intermittent problem, it won't run around acting active. I think I have a note about perhaps doing more retries in background threads for that recovery request, but I've never gotten to it.

          If you are finding a scenario that eludes that, we should strengthen the impl.
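          A rough sketch of the "more retries in background threads" idea mentioned above, assuming a hypothetical sendRecoveryCommand() helper standing in for the HTTP core-admin call; this is only an illustration of the retry pattern, not the actual Solr code:

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          // Hypothetical sketch: retry the "please recover" request to a replica a few times with
          // backoff in a background thread, instead of giving up after a single failed attempt.
          public class RequestRecoveryWithRetry {
              static final ExecutorService pool = Executors.newSingleThreadExecutor();

              static void requestRecovery(final String replicaUrl) {
                  pool.submit(new Runnable() {
                      public void run() {
                          for (int attempt = 1; attempt <= 5; attempt++) {
                              try {
                                  sendRecoveryCommand(replicaUrl); // assumed helper, see below
                                  return;                          // replica acknowledged, we're done
                              } catch (Exception e) {
                                  try {
                                      Thread.sleep(attempt * 2000L); // simple linear backoff
                                  } catch (InterruptedException ie) {
                                      Thread.currentThread().interrupt();
                                      return;
                                  }
                              }
                          }
                          // after all retries the replica is presumed dead; it must recover when it restarts
                      }
                  });
              }

              // Stand-in for the HTTP call that asks a replica's core to enter recovery.
              static void sendRecoveryCommand(String replicaUrl) throws Exception {
                  System.out.println("asking " + replicaUrl + " to recover");
              }

              public static void main(String[] args) {
                  requestRecovery("http://cloud85:8985/solr/cloud_shard1_replica2");
                  pool.shutdown();
              }
          }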

          Mark Miller added a comment -

          This could be related to SOLR-5397

          Mark Miller added a comment -

          Would love if you guys could try with 4.6 and report back. SOLR-5397 was introduced when we fixed a similar issue, so that has really been an issue for a few releases.

          Markus Jelsma added a comment -

          I updated our machines to include SOLR-5397. Everything works fine now, though it may take quite some time before we can say it is fixed.

          Yago Riveiro added a comment -

          Is it safe to upgrade from 4.5.1 to 4.6? I have docValues, I read that the index upgrade is not seamless for docValues, and I can't reindex the data.

          Mark Miller added a comment -

          According to the wiki, it depends on the doc values impl you are using - the default one will upgrade fine. Others require that you forceMerge your index to rewrite it with the default and then upgrade, then I guess you can forceMerge back to that impl. Honestly, I have not had a chance to play with doc values yet though.

          Yago Riveiro added a comment -

          I'm using <codecFactory class="solr.SchemaCodecFactory"/> to enable per-field DocValues formats.

          I don't think this aspect of docValues is explained properly on the wiki. There is no example of how to switch to the default, do the forceMerge, and then switch back to the original implementation.

          If I can't be sure that everything will work fine, I can't do the upgrade.

          Mark Miller added a comment -

          Should probably bring it up on the user list - we need someone like Robert Muir to weigh in. I assume it all works the same way - you merge each field to the default impl and then back to what they were.

          Markus Jelsma added a comment -

          I've got some bad news: it happened again on one of our clusters using a build of November 19th. Three replicas went out of sync.

          Rafał Kuć added a comment -

          Happened to me too, collection with two four shards, each having a single replica. The replicas were out of sync.

          Mark Miller added a comment -

          What's the exact version / checkout?

          Mark Miller added a comment -

          Markus Jelsma, hopefully that's SOLR-5516 then.

          Markus Jelsma added a comment -

          I'll check it out!

          Markus Jelsma added a comment -

          I'm sorry, I've got three replicas having one document less than the leader. We're on a December 3rd build.

          Yago Riveiro added a comment -

          Replicas are still losing docs in Solr 4.6.

          I'm wondering if we couldn't keep a (version, numDocs) pair to track the increase in docs between versions. We could also keep the last 10 tlogs on each replica as backups after they are committed, and make a diff to see what is missing in case the replicas go out of sync, then replay those transactions - avoiding an unsynchronized replica and a full recovery that is probably heavier than doing the diff.

          It's only an idea, and of course finding the bug must be the priority.

          This issue compromises Solr as "the main" storage. If re-indexing the data is not possible, we can't guarantee that no data is missing, and worse, the data is lost forever.
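          As a stop-gap for detecting the problem (rather than fixing it), one can compare per-core counts directly by querying each replica with distrib=false. A small SolrJ sketch, assuming the 4.x SolrJ API and placeholder core URLs:

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;

          // Rough consistency check: ask each replica of a shard for its own numFound using
          // distrib=false and flag any mismatch. The core URLs below are placeholders.
          public class ShardConsistencyCheck {
              public static void main(String[] args) throws Exception {
                  String[] replicaUrls = {
                      "http://cloud84:8984/solr/cloud_shard1_replica1",
                      "http://cloud85:8985/solr/cloud_shard1_replica2"
                  };
                  long expected = -1;
                  for (String url : replicaUrls) {
                      HttpSolrServer server = new HttpSolrServer(url);
                      SolrQuery q = new SolrQuery("*:*");
                      q.setRows(0);
                      q.set("distrib", false); // only count docs on this core, no fan-out
                      long numFound = server.query(q).getResults().getNumFound();
                      System.out.println(url + " -> " + numFound);
                      if (expected == -1) {
                          expected = numFound;
                      } else if (expected != numFound) {
                          System.out.println("OUT OF SYNC: " + url);
                      }
                      server.shutdown();
                  }
              }
          }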

          Mark Miller added a comment -

          I've fixed some things since 4.6 - I only had time to focus on the leader-not-going-down case for 4.6, and I spent a bunch more time on this case after 4.6 was released. Unfortunately, I think there are a couple of issues at play here - some of the new changes make existing holes easier to spot, and the chaos monkey tests were accidentally disabled for some time, so small issues may have crept in.

          I think the remaining issue is mostly around SOLR-5516. Need to come up with a better idea than a really long wait though - but if someone wants to help test, putting in a long wait and stressing this would be useful to see if it is indeed the main remaining issue.

          I recently put in a lot of time improving the situation and I need to focus on other things for a bit, but I'll keep coming back to this as I can.

          Timothy Potter added a comment -

          I have some cycles to work on this issue over the next couple of days. I'm starting by trying to reproduce it in my environment. Please let me know of any tasks that I can help out on (beyond the long wait stuff you mentioned above).

          Timothy Potter added a comment -

          I don't have a fix yet, but I wanted to post an update here to get some feedback on what I'm seeing ...

          I have a simple SolrCloud configuration setup locally: 1 collection named "cloud" with 1 shard and replicationFactor 2, i.e. here's what I use to create it:
          curl "http://localhost:8984/solr/admin/collections?action=CREATE&name=cloud&replicationFactor=$REPFACT&numShards=1&collection.configName=cloud"

          The collection gets distributed on two nodes: cloud84:8984 and cloud85:8985 with cloud84 being assigned the leader.

          Here's an outline of the process I used to get my collection out-of-sync during indexing:

          1) start indexing docs using CloudSolrServer in SolrJ - direct updates go to the leader and replica remains in sync for as long as I let this process run
          2) kill -9 the process for the replica cloud85
          3) let indexing continue against cloud84 for a few seconds (just to get the leader and replica out-of-sync once I bring the replica back online)
          4) kill -9 the process for the leader cloud84 ... indexing halts of course as there are no running servers
          5) start the replica cloud85 but do not start the previous leader cloud84

          Here are some key log messages as cloud85 - the replica - fires up ... my annotations of the log messages are prefixed by [TJP >>

          2013-12-11 11:43:22,076 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 1)
          2013-12-11 11:43:23,370 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139841

          [TJP >> This looks good and is expected because cloud85 was not the leader before it died, so it should not immediately assume it is the leader until it sees more replicas

          6) now start the previous leader cloud84 ...

          Here are some key log messages from cloud85 as the previous leader cloud84 is coming up ...

          2013-12-11 11:43:24,085 [main-EventThread] INFO common.cloud.ZkStateReader - Updating live nodes... (2)
          2013-12-11 11:43:24,136 [main-EventThread] INFO solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
          2013-12-11 11:43:24,137 [Thread-13] INFO common.cloud.ZkStateReader - Updating cloud state from ZooKeeper...
          2013-12-11 11:43:24,138 [Thread-13] INFO solr.cloud.Overseer - Update state numShards=1 message=

          { "operation":"state", "state":"down", "base_url":"http://cloud84:8984/solr", "core":"cloud_shard1_replica2", "roles":null, "node_name":"cloud84:8984_solr", "shard":"shard1", "shard_range":null, "shard_state":"active", "shard_parent":null, "collection":"cloud", "numShards":"1", "core_node_name":"core_node1"}

          [TJP >> state of cloud84 looks correct as it is still initializing ...

          2013-12-11 11:43:24,140 [main-EventThread] INFO solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
          2013-12-11 11:43:24,141 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 2)

          2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.

          [TJP >> hmmmm ... cloud84 is listed in /live_nodes but it isn't "active" yet or even recovering (see state above - it's currently "down") ... My thinking here is that the ShardLeaderElectionContext needs to take the state of the replica into account before deciding it should continue.

          2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
          2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - Sync replicas to http://cloud85:8985/solr/cloud_shard1_replica1/
          2013-12-11 11:43:25,880 [coreLoadExecutor-3-thread-1] INFO solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr START replicas=http://cloud84:8984/solr/cloud_shard1_replica2/ nUpdates=100
          2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr couldn't connect to http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success

          [TJP >> whoops! of course it couldn't connect to cloud84 as it's still initializing ...

          2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] INFO solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr DONE. sync succeeded
          2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - Sync Success - now sync replicas to me
          2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - http://cloud85:8985/solr/cloud_shard1_replica1/: try and ask http://cloud84:8984/solr/cloud_shard1_replica2/ to sync
          2013-12-11 11:43:25,938 [coreLoadExecutor-3-thread-1] ERROR solr.cloud.SyncStrategy - Sync request error: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://cloud84:8984/solr/cloud_shard1_replica2

          [TJP >> ayep, cloud84 is still initializing so it can't respond to you Mr. Impatient cloud85!

          2013-12-11 11:43:25,939 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - http://cloud85:8985/solr/cloud_shard1_replica1/: Sync failed - asking replica (http://cloud84:8984/solr/cloud_shard1_replica2/) to recover.
          2013-12-11 11:43:25,940 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - I am the new leader: http://cloud85:8985/solr/cloud_shard1_replica1/ shard1

          [TJP >> oh no! the collection is now out-of-sync ... my test harness periodically polls the replicas for their doc counts and at this point, we ended up with:
          shard1:

          {
            http://cloud85:8985/solr/cloud_shard1_replica1/ = 300800  LEADER
            http://cloud84:8984/solr/cloud_shard1_replica2/ = 447600  diff: -146800  <--- this should be the real leader!
          }

          Which of course is expected because cloud85 should NOT be the leader

          So all that is interesting, but how to fix???

          My first idea was to go tackle the decision making process ShardLeaderElectionContext uses to decide if it has enough replicas to continue.

          It's easy enough to do something like the following:
          int notDownCount = 0;
          Map<String,Replica> replicasMap = slices.getReplicasMap();
          for (Replica replica : replicasMap.values()) {
            ZkCoreNodeProps replicaCoreProps = new ZkCoreNodeProps(replica);
            String replicaState = replicaCoreProps.getState();
            log.warn(">>>> State of replica " + replica.getName() + " is " + replicaState + " <<<<");
            if ("active".equals(replicaState) || "recovering".equals(replicaState)) {
              ++notDownCount;
            }
          }

          Was thinking I could use the notDownCount to make a better decision, but then I ran into another issue related to replica state being stale. In my cluster, if I have /clusterstate.json:

          {"cloud":{
          "shards":{"shard1":{
          "range":"80000000-7fffffff",
          "state":"active",
          "replicas":{
          "core_node1":

          { "state":"active", "base_url":"http://cloud84:8984/solr", "core":"cloud_shard1_replica2", "node_name":"cloud84:8984_solr", "leader":"true"}

          ,
          "core_node2":{
          "state":"active",
          "base_url":"http://cloud85:8985/solr",
          "core":"cloud_shard1_replica1",
          "node_name":"cloud85:8985_solr"}}}},
          "maxShardsPerNode":"1",
          "router":

          {"name":"compositeId"}

          ,
          "replicationFactor":"2"}}

          If I kill the process using kill -9 PID for the Solr running on 8985 (the replica), core_node2's state remains "active" in /clusterstate.json

          When tailing the log on core_node1, I do see one notification coming in the watcher setup by ZkStateReader from ZooKeeper about live nodes having changed:
          2013-12-11 15:42:46,010 [main-EventThread] INFO common.cloud.ZkStateReader - Updating live nodes... (1)

          So after killing the process, /live_nodes is updated to only have one node, but /clusterstate.json still thinks there are 2 healthy replicas for shard1, instead of just 1.

          Of course, if I restart 8985, then it goes through a series of state changes until it is marked active again, which looks correct.

          Bottom line ... it seems there is something in SolrCloud that does not update a replica's state when the node is killed. If a change to /live_nodes doesn't trigger a refresh of replica state, what does?

          I'm seeing this stale replica state issue in Solr 4.6.0 and in revision 1550300 of branch_4x - the latest from svn.

          Not having a fresh state of a replica prevents my idea for fixing ShardLeaderElectionContext's decision making process. I'm also curious about the decision to register a node under /live_nodes before it is fully initialized, but maybe that is a discussion for another time.

          In any case, I wanted to get some feedback on my findings before moving forward with a solution.

          Mark Miller added a comment -

          A lot there! I'll respond to most of it later.

          As far as the stale state, that is expected. You cannot tell the state just from clusterstate.json - it is a mix of clusterstate.json and the live_nodes list. If the live node for anything in clusterstate.json is missing, it's considered not up. This is just currently by design - without live_nodes, you don't know the state.
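          In other words, the effective "is this replica up?" check is roughly "its recorded state is active AND its node appears under /live_nodes". A tiny standalone sketch of that combination (plain Java with hypothetical inputs, not the real ZkStateReader API):

          import java.util.Arrays;
          import java.util.HashSet;
          import java.util.Set;

          // Sketch of how "is this replica really up?" combines the replica's recorded state with
          // the /live_nodes list, as described above. Not the real ZkStateReader code.
          public class EffectiveReplicaState {
              static boolean isUp(String stateInClusterstate, String nodeName, Set<String> liveNodes) {
                  // a replica that claims to be active but whose node is not live is considered down
                  return "active".equals(stateInClusterstate) && liveNodes.contains(nodeName);
              }

              public static void main(String[] args) {
                  Set<String> liveNodes = new HashSet<String>(Arrays.asList("cloud84:8984_solr"));
                  System.out.println(isUp("active", "cloud84:8984_solr", liveNodes)); // true
                  System.out.println(isUp("active", "cloud85:8985_solr", liveNodes)); // false: stale entry
              }
          }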

          Timothy Potter added a comment -

          ok - cool ... just wanted to make sure that "stale" situation was expected ...

          The more I dig into ShardLeaderElectionContext's decision-making process, the more I think looking at state won't work, because both replicas are in the "down" state while this is happening. I think some determination of whether the node is "reachable", so that PeerSync can get good information from it, is what needs to be factored into ShardLeaderElectionContext. Or maybe there is another state: "trying to figure out my role in the world as I come back up".

          Mark Miller added a comment -

          The other issue is expected as well. It's the safety mechanism - we don't let you just start one node and let it become leader - ideally you want all replicas to be involved in the election to prevent data loss. You have to be explicit if you want this to work with no wait. It might be nice if we added a startup sys prop that caused it not to wait on first startup.

          Timothy Potter added a comment -

          Agreed on the wait being necessary (which I actually annotated in the comment above). The crux of the issue here is that the replica (cloud85) can't sync with the previous leader (cloud84) because they are waiting on each other, much like a deadlock. Eventually, they both give up and one wins; unfortunately in my test case, cloud85 wins, which leads to the shard being out-of-sync because the wrong leader is selected in this scenario (cloud84 should have been selected).

          I'm continuing to dig into this but have come to the conclusion that tweaking the waitForReplicasToComeUp process is a dead end and it's working as well as it can.

          Mark Miller added a comment -

          because they are waiting on each other;

          That doesn't make sense to me - the wait should be until all the replicas for a shard are up - so what exactly are they both waiting on? If they are both waiting, there should be enough replicas up to continue...

          Timothy Potter added a comment -

          I'm sorry for being unclear; "waiting" was probably the wrong term ... and they definitely continue right on down the path of selecting the wrong leader.

          Here's what I know so far, which admittedly isn't much:

          As cloud85 (replica before it crashed) is initializing, it enters the wait process in ShardLeaderElectionContext#waitForReplicasToComeUp; this is expected and a good thing.

          A short time later, cloud84 (leader before it crashed) begins initializing and gets to a point where it adds itself as a possible leader for the shard (by creating a znode under /collections/cloud/leader_elect/shard1/election), which leads to cloud85 being able to return from waitForReplicasToComeUp and try to determine who should be the leader.

          cloud85 then tries to run the SyncStrategy, which can never work because in this scenario the Jetty HTTP listener is not active yet on either node, so all replication work that uses HTTP requests fails on both nodes ... PeerSync treats these failures as indicators that the other replicas in the shard are unavailable (or whatever) and assumes success. Here's the log message:

          2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr couldn't connect to http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success

          The Jetty HTTP listener doesn't start accepting connections until long after this process has completed and already selected the wrong leader.

          From what I can see, we seem to have a leader recovery process that is based partly on HTTP requests to the other nodes, but the HTTP listener on those nodes isn't active yet. We need a leader recovery process that doesn't rely on HTTP requests. Perhaps, leader recovery for a shard w/o a current leader may need to work differently than leader election in a shard that has replicas that can respond to HTTP requests? All of what I'm seeing makes perfect sense for leader election when there are active replicas and the current leader fails.

          All this aside, I'm not asserting that this is the only cause for the out-of-sync issues reported in this ticket, but it definitely seems like it could happen in a real cluster.
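          One way to picture a tighter rule than "couldn't connect, counting as success" is to treat an unreachable peer as ignorable only when it is also absent from /live_nodes - if the node is registered as live but unreachable, it is probably still starting up and the sync should not be counted as a success. The sketch below is only an illustration of that idea with hypothetical names; it is not a patch against PeerSync:

          import java.util.Arrays;
          import java.util.HashSet;
          import java.util.Set;

          // Illustrative policy: should a failed connection to a peer be counted as a successful sync?
          // The liveNodes check is hypothetical; the real PeerSync only has a cantReachIsSuccess flag.
          public class SyncFailurePolicy {
              static boolean countUnreachablePeerAsSuccess(String peerNodeName, Set<String> liveNodes) {
                  // if the peer is registered as live, "couldn't connect" probably means it is still
                  // starting up (no HTTP listener yet) - do not silently count the sync as a success
                  return !liveNodes.contains(peerNodeName);
              }

              public static void main(String[] args) {
                  Set<String> live = new HashSet<String>(
                          Arrays.asList("cloud84:8984_solr", "cloud85:8985_solr"));
                  System.out.println(countUnreachablePeerAsSuccess("cloud84:8984_solr", live)); // false
                  System.out.println(countUnreachablePeerAsSuccess("cloud86:8986_solr", live)); // true
              }
          }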

          Mark Miller added a comment -

          Ah, thanks for the explanation. I think we should roll that specific issue into a new JIRA issue.

          Timothy Potter added a comment -

          Mark -> https://issues.apache.org/jira/browse/SOLR-5552

          Mark Miller added a comment -

          SOLR-5552 investigation has also led to SOLR-5569 and SOLR-5568

          Timothy Potter added a comment -

          Found another interesting case that may or may not be valid depending on whether we think HTTP requests between a leader and replica can fail even if the ZooKeeper session on the replica does not drop?

          Specifically, what I'm seeing is that if an update request between the leader and replica fails, but the replica doesn't lose its session with ZK, then the replica can get out of sync with the leader. In a real network partition, the ZK connection would also likely be lost and the replica would get marked as down. So as long as the HTTP connection timeout between the leader and replica exceeds the ZK client timeout, the replica would probably recover correctly, rendering this test case invalid. So maybe the main question here is whether we think it's possible for HTTP requests between a leader and replica to fail even though the ZooKeeper connection stays alive?

          Here are the steps I used to reproduce this case (all using revision 1553150 in branch_4x):

          > STEP 1: Setup a collection named “cloud” containing 1 shard and 2 replicas on hosts: cloud84 (127.0.0.1:8984) and cloud85 (127.0.0.1:8985)

          SOLR_TOP=/home/ec2-user/branch_4x/solr
          $SOLR_TOP/cloud84/cloud-scripts/zkcli.sh -zkhost $ZK_HOST -cmd upconfig -confdir $SOLR_TOP/cloud84/solr/cloud/conf -confname cloud
          API=http://localhost:8984/solr/admin/collections
          curl -v "$API?action=CREATE&name=cloud&replicationFactor=2&numShards=1&collection.configName=cloud"

          Replica on cloud84 is elected as the initial leader. /clusterstate.json looks like:

          {"cloud":{
          "shards":{"shard1":{
          "range":"80000000-7fffffff",
          "state":"active",
          "replicas":{
          "core_node1":

          { "state":"active", "base_url":"http://cloud84:8984/solr", "core":"cloud_shard1_replica1", "node_name":"cloud84:8984_solr", "leader":"true"}

          ,
          "core_node2":{
          "state":"active",
          "base_url":"http://cloud85:8985/solr",
          "core":"cloud_shard1_replica2",
          "node_name":"cloud85:8985_solr"}}}},
          "maxShardsPerNode":"1",
          "router":

          {"name":"compositeId"}

          ,
          "replicationFactor":"2"}}

          > STEP 2: Simulate network partition

          sudo iptables -I INPUT 1 -i lo -p tcp --sport 8985 -j DROP; sudo iptables -I INPUT 2 -i lo -p tcp --dport 8985 -j DROP

          Various ways to do this, but to keep it simple, I'm just dropping inbound traffic on localhost to port 8985.

          > STEP 3: Send document with ID “doc1” to leader on cloud84

          curl "http://localhost:8984/solr/cloud/update" -H 'Content-type:application/xml' \
          --data-binary '<add><doc><field name="id">doc1</field><field name="foo_s">bar</field></doc></add>'

          The update request takes some time because the replica is down but ultimately succeeds on the leader. In the logs on the leader, we have (some stack trace lines removed for clarity):

          2013-12-23 10:59:33,688 [updateExecutor-1-thread-1] ERROR solr.update.StreamingSolrServers - error
          org.apache.http.conn.HttpHostConnectException: Connection to http://cloud85:8985 refused
          at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:190)
          ...
          at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:232)
          ...
          Caused by: java.net.ConnectException: Connection timed out
          ...
          2013-12-23 10:59:33,695 [qtp1073932139-16] INFO update.processor.LogUpdateProcessor - [cloud_shard1_replica1] webapp=/solr path=/update params={}

          {add=[doc1 (1455228778490888192)]}

          0 63256
          2013-12-23 10:59:33,702 [updateExecutor-1-thread-2] INFO update.processor.DistributedUpdateProcessor - try and ask http://cloud85:8985/solr to recover
          2013-12-23 10:59:48,718 [updateExecutor-1-thread-2] ERROR update.processor.DistributedUpdateProcessor - http://cloud85:8985/solr: Could not tell a replica to recover:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://cloud85:8985/solr
          at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:507)
          at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
          at org.apache.solr.update.processor.DistributedUpdateProcessor$1.run(DistributedUpdateProcessor.java:657)
          ...
          Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to cloud85:8985 timed out
          ...

          Of course these log messages are expected. The key is that the leader accepted the update and now has one doc with ID "doc1"

          > STEP 4: Heal the network partition

          sudo service iptables restart (undoes the DROP rules we added above)

          > STEP 5: Send document with ID “doc2” to leader on cloud84

          curl "http://localhost:8984/solr/cloud/update" -H 'Content-type:application/xml' \
          --data-binary '<add><doc><field name="id">doc2</field><field name="foo_s">bar</field></doc></add>'

          Of course this time the update gets sent successfully to replica ... here are some log messages ...

          from the log on cloud84:
          2013-12-23 11:00:46,982 [qtp1073932139-18] INFO update.processor.LogUpdateProcessor - [cloud_shard1_replica1] webapp=/solr path=/update params={}

          {add=[doc2 (1455228921389776896)]} 0 162

          from the log on cloud85 (out-of-sync replica):
          2013-12-23 10:47:26,363 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 2)

          ...

          2013-12-23 11:00:46,979 [qtp2124890785-12] INFO update.processor.LogUpdateProcessor - [cloud_shard1_replica2] webapp=/solr path=/update params={distrib.from=http://cloud84:8984/solr/cloud_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2} {add=[doc2 (1455228921389776896)]} 0 142

          Notice that there is no logged activity on cloud85 between 10:47 and 11:00

          > STEP 6: Commit updates

          curl "http://localhost:8984/solr/cloud/update" -H 'Content-type:application/xml' --data-binary "<commit waitSearcher=\"true\"/>"

          > STEP 7: Send non-distributed queries to each replica

          curl "http://localhost:8984/solr/cloud/select?q=foo_s:bar&rows=0&wt=json&distrib=false"

          {"responseHeader":{"status":0,"QTime":1,"params":{"q":"foo_s:bar","distrib":"false","wt":"json","rows":"0"}},"response":{"numFound":2,"start":0,"docs":[]}}

          curl "http://localhost:8985/solr/cloud/select?q=foo_s:bar&rows=0&wt=json&distrib=false"

          {"responseHeader":{"status":0,"QTime":1,"params":{"q":"foo_s:bar","distrib":"false","wt":"json","rows":"0"}},"response":{"numFound":1,"start":0,"docs":[]}}

          Observe that the leader has 2 docs and the replica on cloud85 only has 1, but should have 2.

          From what I can tell, the replica that missed some updates because of a temporary network partition doesn't get any notification that it missed some documents. In other words, the out-of-sync replica doesn't know it's out of sync, and its state in ZooKeeper is active. As you can see from the log messages I posted in step 3 above, the leader tried to tell the replica to recover, but due to the network partition, that request got dropped too.

          I'm wondering if the leader should send a state version tracking ID along with each update request so that a replica can detect that its view of state is stale? I could see the process working as follows (a rough sketch follows the steps below):

          1. Shard leader now keeps track of a Slice state version tracking identifier that gets sent with every update request
          2. Leader tries to send an update request (including the state version ID) to a replica and send fails
          3. Leader updates the state version to a different value
          4. Leader sends another update request to the replica; request includes the updated version ID; replica accepts the request but realizes its state version ID is out-of-date from what the leader sent
          5. Replica enters recovery
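          Here is a very rough sketch of steps 1-5, with hypothetical classes and fields rather than an actual patch against DistributedUpdateProcessor:

          import java.util.concurrent.atomic.AtomicLong;

          // Very rough sketch of the "slice state version" idea in steps 1-5 above.
          // Everything here is hypothetical; it is not how DistributedUpdateProcessor works today.
          public class SliceStateVersionSketch {
              // leader side: bumped whenever a send to a replica fails (step 3)
              static final AtomicLong leaderStateVersion = new AtomicLong(0);

              static long onReplicaSendFailed() {
                  return leaderStateVersion.incrementAndGet();
              }

              // replica side: the last state version seen from the leader
              static long replicaStateVersion = 0;

              // returns true if the replica should enter recovery (steps 4 and 5)
              static boolean onUpdateFromLeader(long stateVersionFromLeader) {
                  boolean stale = stateVersionFromLeader != replicaStateVersion;
                  replicaStateVersion = stateVersionFromLeader;
                  return stale;
              }

              public static void main(String[] args) {
                  // step 2: a send to the partitioned replica fails, so the leader bumps the version
                  long v = onReplicaSendFailed();
                  // the partition heals; the next update carries the new version and the replica notices
                  System.out.println(onUpdateFromLeader(v) ? "replica enters recovery" : "replica in sync");
              }
          }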

          Hide
          Mark Miller added a comment -

          Yeah, that's currently expected. We don't expect the case where you can talk to ZooKeeper but not your replicas to be common, so we kind of punted on this scenario for the first phase.

          Some related JIRA issues:

          SOLR-5482
          SOLR-5450
          SOLR-5495

          I think we should do all that, but the key in this case is that we need to pass the order to recover through ZooKeeper to the partitioned-off replica. With an eventually consistent model, it can be off for a short time, but it needs to recover in a timely manner.

          I think this is the right solution because the replica is sure to either get the information to recover from ZooKeeper or lose its connection to ZooKeeper, in which case it will have to recover anyway.
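
          As a rough sketch of what "pass the order to recover through ZooKeeper" could look like, something along these lines would work; the znode path used here is made up for illustration and is not what Solr actually uses.

          import org.apache.zookeeper.*;

          /** Hypothetical sketch: leader asks a partitioned-off replica to recover via a ZK marker node. */
          public class RecoveryViaZk {
              // Leader side: create a marker node naming the replica that should recover.
              static void requestRecovery(ZooKeeper zk, String coreNodeName) throws Exception {
                  String path = "/collections/cloud/recovery-requests/" + coreNodeName; // made-up path
                  if (zk.exists(path, false) == null) {
                      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                  }
              }

              // Replica side: watch for the marker. As long as the ZK session is alive the watch fires;
              // if the session is lost, the replica has to recover on reconnect anyway.
              static void watchForRecoveryRequest(ZooKeeper zk, String coreNodeName, final Runnable startRecovery)
                      throws Exception {
                  String path = "/collections/cloud/recovery-requests/" + coreNodeName;
                  zk.exists(path, new Watcher() {
                      public void process(WatchedEvent event) {
                          if (event.getType() == Event.EventType.NodeCreated) {
                              startRecovery.run();
                          }
                      }
                  });
              }
          }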

          Hide
          Mark Miller added a comment -

          so we kind of punted

          The other thing to note is that if you restart the shard or that node or the cluster, you should be able to do it without losing any data. It will recover from the leader when everything else is working correctly.

          Hide
          Timothy Potter added a comment -

          Thanks Mark, I suspected my test case was a little cherry-picked ... something interesting happened when I also severed the connection between the replica and ZK (i.e. the same test as above, but I also dropped the ZK connection on the replica).

          2013-12-23 15:39:57,170 [main-EventThread] INFO common.cloud.ConnectionManager - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
          2013-12-23 15:39:57,170 [main-EventThread] INFO common.cloud.ConnectionManager - zkClient has disconnected

          >>> fixed the connection between replica and ZK here <<<

          2013-12-23 15:40:45,579 [main-EventThread] INFO common.cloud.ConnectionManager - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 got event WatchedEvent state:Expired type:None path:null path:null type:None
          2013-12-23 15:40:45,579 [main-EventThread] INFO common.cloud.ConnectionManager - Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
          2013-12-23 15:40:45,580 [main-EventThread] INFO common.cloud.DefaultConnectionStrategy - Connection expired - starting a new one...
          2013-12-23 15:40:45,586 [main-EventThread] INFO common.cloud.ConnectionManager - Waiting for client to connect to ZooKeeper
          2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager - Watcher org.apache.solr.common.cloud.ConnectionManager@4f857c62 name:ZooKeeperConnection Watcher:ec2-54-197-0-103.compute-1.amazonaws.com:2181 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
          2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager - Client is connected to ZooKeeper
          2013-12-23 15:40:45,595 [main-EventThread] INFO common.cloud.ConnectionManager - Connection with ZooKeeper reestablished.
          2013-12-23 15:40:45,596 [main-EventThread] WARN solr.cloud.RecoveryStrategy - Stopping recovery for zkNodeName=core_node3core=cloud_shard1_replica3
          2013-12-23 15:40:45,597 [main-EventThread] INFO solr.cloud.ZkController - publishing core=cloud_shard1_replica3 state=down
          2013-12-23 15:40:45,597 [main-EventThread] INFO solr.cloud.ZkController - numShards not found on descriptor - reading it from system property
          2013-12-23 15:40:45,905 [qtp2124890785-14] INFO handler.admin.CoreAdminHandler - It has been requested that we recover
          2013-12-23 15:40:45,906 [qtp2124890785-14] INFO solr.servlet.SolrDispatchFilter - [admin] webapp=null path=/admin/cores params=

          {action=REQUESTRECOVERY&core=cloud_shard1_replica3&wt=javabin&version=2}

          status=0 QTime=2
          2013-12-23 15:40:45,909 [Thread-17] INFO solr.cloud.ZkController - publishing core=cloud_shard1_replica3 state=recovering
          2013-12-23 15:40:45,909 [Thread-17] INFO solr.cloud.ZkController - numShards not found on descriptor - reading it from system property
          2013-12-23 15:40:45,920 [Thread-17] INFO solr.update.DefaultSolrCoreState - Running recovery - first canceling any ongoing recovery
          2013-12-23 15:40:45,921 [RecoveryThread] INFO solr.cloud.RecoveryStrategy - Starting recovery process. core=cloud_shard1_replica3 recoveringAfterStartup=false
          2013-12-23 15:40:45,924 [RecoveryThread] INFO solr.cloud.ZkController - publishing core=cloud_shard1_replica3 state=recovering
          2013-12-23 15:40:45,924 [RecoveryThread] INFO solr.cloud.ZkController - numShards not found on descriptor - reading it from system property
          2013-12-23 15:40:48,613 [qtp2124890785-15] INFO solr.core.SolrCore - [cloud_shard1_replica3] webapp=/solr path=/select params=

          {q=foo_s:bar&distrib=false&wt=json&rows=0}

          hits=0 status=0 QTime=1
          2013-12-23 15:42:42,770 [qtp2124890785-13] INFO solr.core.SolrCore - [cloud_shard1_replica3] webapp=/solr path=/select params=

          {q=foo_s:bar&distrib=false&wt=json&rows=0}

          hits=0 status=0 QTime=1
          2013-12-23 15:42:45,650 [main-EventThread] ERROR solr.cloud.ZkController - There was a problem making a request to the leader:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state down for cloud86:8986_solr but I still do not see the requested state. I see state: recovering live:false
          at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
          at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
          at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1434)
          at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:347)
          at org.apache.solr.cloud.ZkController.access$100(ZkController.java:85)
          at org.apache.solr.cloud.ZkController$1.command(ZkController.java:225)
          at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:118)
          at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:56)
          at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:93)
          at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
          at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)

          2013-12-23 15:42:45,963 [RecoveryThread] ERROR solr.cloud.RecoveryStrategy - Error while trying to recover. core=cloud_shard1_replica3:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state recovering for cloud86:8986_solr but I still do not see the requested state. I see state: recovering live:false
          at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
          at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
          at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:224)
          at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371)
          at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:247)

          2013-12-23 15:42:45,964 [RecoveryThread] ERROR solr.cloud.RecoveryStrategy - Recovery failed - trying again... (0) core=cloud_shard1_replica3
          2013-12-23 15:42:45,964 [RecoveryThread] INFO solr.cloud.RecoveryStrategy - Wait 2.0 seconds before trying to recover again (1)
          2013-12-23 15:42:47,964 [RecoveryThread] INFO solr.cloud.ZkController - publishing core=cloud_shard1_replica3 state=recovering

          Hide
          Mark Miller added a comment -

          That's interesting. The logging makes it look like it's not creating its new ephemeral live node for some reason...or the leader is not getting an updated view of the live node...

          Hide
          Markus Jelsma added a comment -

          Ok, I followed all the great work here and in the related tickets, and yesterday I had the time to rebuild Solr and check for this issue. I hadn't seen it yesterday, but it is right in front of me again, using a fresh build from January 6th.

          Leader has Num Docs: 379659
          Replica has Num Docs: 379661

          Hide
          Timothy Potter added a comment -

          While doing some other testing of SolrCloud (branch4x - 4.7-SNAPSHOT rev. 1556055), I hit this issue and here's the kicker ... there were no errors in my replica's log, the tlogs are identical, and there was no significant GC activity during the time where the replica got out of sync with the leader. I'm attaching the data directories (index + tlog) for both replicas (demo_shard1_replica1 [leader], and demo_shard1_replica2) and their log files. When I do a doc-by-doc comparison of the two indexes, here's the result:

          >> finished querying replica1, found 33537 documents (33537)
          >> finished querying replica2, found 33528 documents
          Doc [82995] not found in replica2: <doc boost="1.0"><field name="id">82995</field><field name="string_s">test</field><field name="int_i">-274468088</field><field name="float_f">0.90338105</field><field name="double_d">0.6949391474539932</field><field name="text_en">this is a test</field><field name="version">1456683668206518274</field></doc>
          Doc [82997] not found in replica2: <doc boost="1.0"><field name="id">82997</field><field name="string_s">test</field><field name="int_i">301737117</field><field name="float_f">0.6746266</field><field name="double_d">0.26034065188918565</field><field name="text_en">this is a test</field><field name="version">1456683668206518276</field></doc>
          Doc [82996] not found in replica2: <doc boost="1.0"><field name="id">82996</field><field name="string_s">test</field><field name="int_i">-1768315588</field><field name="float_f">0.6641093</field><field name="double_d">0.23708033183534993</field><field name="text_en">this is a test</field><field name="version">1456683668206518275</field></doc>
          Doc [82991] not found in replica2: <doc boost="1.0"><field name="id">82991</field><field name="string_s">test</field><field name="int_i">-2057280061</field><field name="float_f">0.27617514</field><field name="double_d">0.7885214691953506</field><field name="text_en">this is a test</field><field name="version">1456683668206518273</field></doc>
          Doc [82987] not found in replica2: <doc boost="1.0"><field name="id">82987</field><field name="string_s">test</field><field name="int_i">1051456320</field><field name="float_f">0.51863414</field><field name="double_d">0.7881255443862878</field><field name="text_en">this is a test</field><field name="version">1456683668206518272</field></doc>
          Doc [82986] not found in replica2: <doc boost="1.0"><field name="id">82986</field><field name="string_s">test</field><field name="int_i">-1356807889</field><field name="float_f">0.2762279</field><field name="double_d">0.003657816979820372</field><field name="text_en">this is a test</field><field name="version">1456683668205469699</field></doc>
          Doc [82984] not found in replica2: <doc boost="1.0"><field name="id">82984</field><field name="string_s">test</field><field name="int_i">732678870</field><field name="float_f">0.31199205</field><field name="double_d">0.9848865821766198</field><field name="text_en">this is a test</field><field name="version">1456683668205469698</field></doc>
          Doc [82970] not found in replica2: <doc boost="1.0"><field name="id">82970</field><field name="string_s">test</field><field name="int_i">283693979</field><field name="float_f">0.6119651</field><field name="double_d">0.04142006867388914</field><field name="text_en">this is a test</field><field name="version">1456683668205469696</field></doc>
          Doc [82973] not found in replica2: <doc boost="1.0"><field name="id">82973</field><field name="string_s">test</field><field name="int_i">1343103920</field><field name="float_f">0.5855809</field><field name="double_d">0.6575904716584224</field><field name="text_en">this is a test</field><field name="version">1456683668205469697</field></doc>

          No amount of committing or reloading of these cores helps. Also, restarting replica2 doesn't lead to it being in sync either, most likely because the tlog is identical to the leader's? Here are the log messages on replica2 after restarting it.

          2014-01-08 13:28:20,112 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore - [demo_shard1_replica2] Registered new searcher Searcher@4345de8a main

          {StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}

          2014-01-08 13:28:21,298 [RecoveryThread] INFO solr.cloud.RecoveryStrategy - Attempting to PeerSync from http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/ core=demo_shard1_replica2 - recoveringAfterStartup=true
          2014-01-08 13:28:21,302 [RecoveryThread] INFO solr.update.PeerSync - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr START replicas=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/ nUpdates=100
          2014-01-08 13:28:21,330 [RecoveryThread] INFO solr.update.PeerSync - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr Received 99 versions from ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/
          2014-01-08 13:28:21,331 [RecoveryThread] INFO solr.update.PeerSync - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr Our versions are newer. ourLowThreshold=1456683689417113603 otherHigh=1456683689602711553
          2014-01-08 13:28:21,331 [RecoveryThread] INFO solr.update.PeerSync - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr DONE. sync succeeded
          2014-01-08 13:28:21,332 [RecoveryThread] INFO solr.update.UpdateHandler - start commit

          {,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}

          2014-01-08 13:28:21,332 [RecoveryThread] INFO solr.update.LoggingInfoStream - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
          2014-01-08 13:28:21,333 [RecoveryThread] INFO solr.update.UpdateHandler - No uncommitted changes. Skipping IW.commit.
          2014-01-08 13:28:21,334 [RecoveryThread] INFO solr.update.LoggingInfoStream - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
          2014-01-08 13:28:21,334 [RecoveryThread] INFO solr.update.LoggingInfoStream - [IW][RecoveryThread]: nrtIsCurrent: infoVersion matches: true; DW changes: false; BD changes: false
          2014-01-08 13:28:21,335 [RecoveryThread] INFO solr.update.LoggingInfoStream - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
          2014-01-08 13:28:21,335 [RecoveryThread] INFO solr.search.SolrIndexSearcher - Opening Searcher@5fc2a9d main
          2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore - QuerySenderListener sending requests to Searcher@5fc2a9d main

          {StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}

          2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore - QuerySenderListener done.
          2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore - [demo_shard1_replica2] Registered new searcher Searcher@5fc2a9d main

          {StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}

          2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.update.UpdateHandler - end_commit_flush
          2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.cloud.RecoveryStrategy - PeerSync Recovery was successful - registering as Active. core=demo_shard1_replica2
          2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.cloud.ZkController - publishing core=demo_shard1_replica2 state=active
          2014-01-08 13:28:21,370 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)

          Thus, it would seem there might be some code that's outright losing documents (it almost feels like a last-batch-not-flushed error, but more subtle, as it's not easy to reproduce this all the time).
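
          A doc-by-doc comparison like the one above can be sketched with plain SolrJ as follows; the core URLs and the row cap are placeholders, and this is not the exact tool that produced the output above.

          import java.util.HashSet;
          import java.util.Set;
          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;
          import org.apache.solr.common.SolrDocument;

          /** Sketch: diff the id sets of two replicas using non-distributed queries. */
          public class ReplicaDiff {
              static Set<String> ids(String coreUrl) throws Exception {
                  HttpSolrServer server = new HttpSolrServer(coreUrl);
                  SolrQuery q = new SolrQuery("*:*");
                  q.setRows(100000);          // placeholder cap; page through for bigger cores
                  q.setFields("id");
                  q.set("distrib", "false");  // query only this core, no fan-out
                  Set<String> ids = new HashSet<String>();
                  for (SolrDocument doc : server.query(q).getResults()) {
                      ids.add(String.valueOf(doc.getFieldValue("id")));
                  }
                  server.shutdown();
                  return ids;
              }

              public static void main(String[] args) throws Exception {
                  Set<String> leader  = ids("http://host1:8984/solr/demo_shard1_replica1");
                  Set<String> replica = ids("http://host2:8984/solr/demo_shard1_replica2");
                  leader.removeAll(replica);  // docs present on the leader but missing on the replica
                  System.out.println("Missing on replica: " + leader);
              }
          }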

          Hide
          Mark Miller added a comment -

          I've noticed something like this too - but nothing I could reproduce easily. I imagine it's likely an issue in SolrCmdDistributor.

          Hide
          Mark Miller added a comment -

          Although that doesn't really jibe with the transaction logs being identical...hmm...

          Hide
          Mark Miller added a comment -

          No, wait, it could jibe. We only check the last 99 docs on peer sync - if a bunch of docs just didn't show up well before that, it wouldn't be detected by peer sync. I still think SolrCmdDistributor is the first place to look.
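
          A toy illustration of why that window matters (this is not Solr's PeerSync code, just the shape of the check): each side only exposes its most recent ~100 update versions, so a gap older than that window can never be detected by comparing the two lists.

          import java.util.List;

          /** Toy sketch: compare only the newest WINDOW versions of each replica. */
          public class PeerSyncWindowSketch {
              static final int WINDOW = 100; // roughly the "last 99 docs" mentioned above

              static boolean looksInSync(List<Long> ourVersionsNewestFirst, List<Long> otherVersionsNewestFirst) {
                  List<Long> ours   = ourVersionsNewestFirst.subList(0, Math.min(WINDOW, ourVersionsNewestFirst.size()));
                  List<Long> theirs = otherVersionsNewestFirst.subList(0, Math.min(WINDOW, otherVersionsNewestFirst.size()));
                  // Updates dropped earlier than the newest WINDOW versions never appear in either
                  // list, so this check can pass even though the full indexes differ.
                  return ours.containsAll(theirs);
              }
          }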

          Hide
          Timothy Potter added a comment - - edited

          Still digging into it ... I'm curious why a batch of 34 adds on the leader gets processed as several sub-batches on the replica? Here's what I'm seeing in the logs around the documents that are missing from the replica. Basically, there are 34 docs on the leader and only 25 processed in 4 separate batches (from my counting of the logs) on the replica. Why wouldn't it just be one for one? The docs are all roughly the same size ... and what's breaking it up? Having trouble seeing that in the logs / code.

          On the leader:

          2014-01-08 12:23:21,501 [qtp604104855-17] INFO update.processor.LogUpdateProcessor -
          [demo_shard1_replica1] webapp=/solr path=/update params=

          {wt=javabin&version=2} {add=[82900 (1456683668174012416), 82901 (1456683668181352448), 82903 (1456683668181352449), 82904 (1456683668181352450), 82912 (1456683668187643904), 82913 (1456683668188692480), 82914 (1456683668188692481), 82916 (1456683668188692482), 82917 (1456683668188692483), 82918 (1456683668188692484), ... (34 adds)]} 0 34

          >>>> NOT ALL OF THE 34 DOCS MENTIONED ABOVE MAKE IT TO THE REPLICA <<<<<

          2014-01-08 12:23:21,600 [qtp604104855-17] INFO update.processor.LogUpdateProcessor -
          [demo_shard1_replica1] webapp=/solr path=/update params={wt=javabin&version=2}

          {add=[83002 (1456683668280967168), 83005 (1456683668286210048), 83008 (1456683668286210049), 83011 (1456683668286210050), 83012 (1456683668286210051), 83013 (1456683668287258624), 83018 (1456683668287258625), 83019 (1456683668289355776), 83023 (1456683668289355777), 83024 (1456683668289355778), ... (43 adds)]}

          0 32

          On the replica:

          2014-01-08 12:23:21,126 [qtp604104855-22] INFO update.processor.LogUpdateProcessor -
          [demo_shard1_replica2] webapp=/solr path=/update params=

          {distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}

          {add=[82900 (1456683668174012416), 82901 (1456683668181352448), 82903 (1456683668181352449)]}

          0 1

          2014-01-08 12:23:21,134 [qtp604104855-22] INFO update.processor.LogUpdateProcessor -
          [demo_shard1_replica2] webapp=/solr path=/update params=

          {distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}

          {add=[82904 (1456683668181352450), 82912 (1456683668187643904), 82913 (1456683668188692480), 82914 (1456683668188692481), 82916 (1456683668188692482), 82917 (1456683668188692483), 82918 (1456683668188692484), 82919 (1456683668188692485), 82922 (1456683668188692486)]}

          0 2

          2014-01-08 12:23:21,139 [qtp604104855-22] INFO update.processor.LogUpdateProcessor -
          [demo_shard1_replica2] webapp=/solr path=/update params=

          {distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}

          {add=[82923 (1456683668188692487), 82926 (1456683668190789632), 82928 (1456683668190789633), 82932 (1456683668190789634), 82939 (1456683668192886784), 82945 (1456683668192886785), 82946 (1456683668192886786), 82947 (1456683668193935360), 82952 (1456683668193935361), 82962 (1456683668193935362), ... (12 adds)]}

          0 3

          2014-01-08 12:23:21,144 [qtp604104855-22] INFO update.processor.LogUpdateProcessor -
          [demo_shard1_replica2] webapp=/solr path=/update params=

          {distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}

          {add=[82967 (1456683668199178240)]}

          0 0

          >>>> 9 Docs Missing here <<<<

          2014-01-08 12:23:21,227 [qtp604104855-22] INFO update.processor.LogUpdateProcessor -
          [demo_shard1_replica2] webapp=/solr path=/update params=

          {distrib.from=http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/&update.distrib=FROMLEADER&wt=javabin&version=2}

          {add=[83002 (1456683668280967168), 83005 (1456683668286210048), 83008 (1456683668286210049), 83011 (1456683668286210050), 83012 (1456683668286210051), 83013 (1456683668287258624)]}

          0 2

          Note the add log message starting with doc ID 83002 is just included here for context to show where the leader / replica got out of sync.

          Hide
          Yonik Seeley added a comment - - edited

          Basically, there are 34 docs on the leader and only 25 processed in 4 separate batches (from my counting of the logs) on the replica. Why wouldn't it just be one for one? The docs are all roughly the same size ... and what's breaking it up?

          ConcurrentUpdateSolrServer? If another doc doesn't come in quickly enough (250ms by default), it ends the batch.
          I thought there used to be a doc count limit too or something... but after a quick scan, I'm not seeing it.
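
          For anyone tracing the forwarding path, the leader streams to each replica through ConcurrentUpdateSolrServer, which can be exercised directly with SolrJ roughly like this (the URL and sizes are placeholders); docs are drained from an internal queue by a runner thread, and a request simply ends when the queue momentarily runs dry, which is why one client-side batch can show up as several smaller batches on the replica.

          import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
          import org.apache.solr.common.SolrInputDocument;

          /** Sketch: feed docs through ConcurrentUpdateSolrServer the way a leader streams to a replica. */
          public class CussExample {
              public static void main(String[] args) throws Exception {
                  // queueSize=10 and threadCount=1 mirror the mostly single-threaded streaming use
                  // mentioned later in this thread; the URL is a placeholder.
                  ConcurrentUpdateSolrServer cuss =
                      new ConcurrentUpdateSolrServer("http://localhost:8985/solr/cloud_shard1_replica2", 10, 1);
                  for (int i = 0; i < 34; i++) {
                      SolrInputDocument doc = new SolrInputDocument();
                      doc.addField("id", "doc" + i);
                      doc.addField("foo_s", "bar");
                      cuss.add(doc);          // enqueued; the runner thread drains and streams the docs
                  }
                  cuss.blockUntilFinished();  // wait until the queue is empty and requests have completed
                  cuss.commit();
                  cuss.shutdown();
              }
          }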

          Hide
          Timothy Potter added a comment - - edited

          Cuss on CUSS! Thanks, I sometimes forget that the client-side batch gets broken into individual AddUpdateCommands when sending on to the replicas.

          Hide
          Mark Miller added a comment -

          In this case there is no wait due to the massive penalty it puts on doc per request speed.

          Hide
          Mark Miller added a comment -

          SOLR-5625: Add to testing for SolrCmdDistributor

          Hide
          Timothy Potter added a comment -

          Makes sense about not waiting because of the penalty, now that I've had a chance to get into the details of that code.

          I spent a lot of time on Friday and over the weekend trying to track down the docs getting dropped. Unfortunately, I have not been able to track down the source of the issue yet. I'm fairly certain the issue happens before docs get submitted to CUSS, meaning the lost docs never seem to hit the queue in ConcurrentUpdateSolrServer. My original thinking was that, given the complex nature of CUSS, there might be some sort of race condition, but after adding a log of what hit the queue, it seems that the documents that get lost never hit the queue. Not to mention that the actual use of CUSS is mostly single-threaded, because StreamingSolrServers constructs them with a threadCount of 1.

          As a side note, one thing I noticed is that direct updates don't necessarily hit the correct core initially when a Solr node hosts more than one shard per collection. In other words, if host X had shard1 and shard3 of collection foo, then some update requests would hit shard1 on host X when they should go to shard3 on the same host; shard1 correctly forwards them on, but it's still an extra hop. Of course that is probably not a big deal in production, as it would be rare to host multiple shards of the same collection on the same Solr host unless you are over-sharding.

          In terms of this issue, here's what I'm seeing:

          Assume a SolrCloud environment with shard1 having replicas on hosts A and B; A is the current leader.
          1. The client sends a direct update request to shard1 on host A containing 3 docs (1, 2, 3, for example)
          2. The batch from the client gets broken up into individual docs (during request parsing)
          3. Docs 1, 2, and 3 get indexed on host A (the leader)
          4. Docs 1 and 2 get queued into CUSS and sent on to the replica on host B (sometimes in the same request, sometimes in separate requests)
          5. Doc 3 never makes it and, from what I can tell, never hits the queue

          This may be anecdotal, but from what I can tell it's always docs at the end of a batch and not in the middle; meaning I haven't seen a case where 1 and 3 make it and 2 does not ... maybe useful, maybe not. The only other thing I'll mention is that it does seem timing / race-condition related, as it's almost impossible to reproduce this on my Mac when running 2 shards across 2 nodes, but much easier to trigger if I ramp up to, say, 8 shards on 2 nodes, i.e. the busier my CPU is, the easier it is to see docs getting dropped.
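
          For reference, a minimal sketch of the kind of single-threaded indexing loop being described here (the ZK host, collection name, and batch size are placeholders; this is not the actual test code). After indexing, per-core counts can be compared with distrib=false queries like the curl examples earlier in this issue.

          import java.util.ArrayList;
          import java.util.List;
          import org.apache.solr.client.solrj.impl.CloudSolrServer;
          import org.apache.solr.common.SolrInputDocument;

          /** Sketch: one client thread sending modest batches through CloudSolrServer. */
          public class IndexLoop {
              public static void main(String[] args) throws Exception {
                  CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181,zkhost3:2181"); // placeholder
                  server.setDefaultCollection("demo"); // placeholder collection name
                  List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                  for (int id = 0; id < 100000; id++) {
                      SolrInputDocument doc = new SolrInputDocument();
                      doc.addField("id", Integer.toString(id));
                      doc.addField("string_s", "test");
                      batch.add(doc);
                      if (batch.size() == 40) {   // roughly the batch sizes seen in the leader logs above
                          server.add(batch);
                          batch.clear();
                      }
                  }
                  if (!batch.isEmpty()) server.add(batch);
                  server.commit();
                  server.shutdown();
                  // Afterwards, compare per-core counts with distrib=false queries as shown earlier in this issue.
              }
          }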

          Hide
          Mark Miller added a comment -

          How many threads are you using to load docs? How large are the batches?

          Hide
          Timothy Potter added a comment -

          Oddly enough, just 1 indexing thread on the client side and batches of around 30-40 docs per shard (i.e. I set my batch size so that direct updates send about 30-40 per shard to the leaders from the client side).

          Hide
          Mark Miller added a comment -

          Markus Jelsma, are you loading docs via the bulk methods or CUSS, or what?

          Timothy Potter, I think I'm seeing your issue. I have not gotten to the bottom of it yet, but if I am seeing the same thing, it seems those docs are being set up to send to 0 replicas. Trying to figure out why/how.

          Hide
          Mark Miller added a comment -

          FYI, I also had to overshard to see anything.

          Hide
          Markus Jelsma added a comment -

          Mark - We use CloudSolrServer and send batches of around 380 documents from Nutch. I am not sure what actual implementation we get back when connecting.

          Hide
          Markus Jelsma added a comment -

          I also think I'm seeing this happening right now with a trunk build from yesterday. I have been slowly indexing a few hundred docs every few minutes for quite some time while fixing a Nutch issue. Looks like I can restart it, because the replicas are already out of sync.

          Mark Miller added a comment -

          Markus Jelsma, are you indexing to an oversharded cluster?

          Markus Jelsma added a comment - - edited

          Mark, no, each node holds a single JVM and a single core. We have five leaders and five replicas, so ten nodes.

          Mark Miller added a comment -

          Well, the effects I was seeing related to having a control collection with a core named collection1 and another collection called collection1. Overshard, and that causes some similar-looking effects.

          I've addressed that and will see if ramping up my tests can spot anything - so far I cannot replicate it in a test though.

          Markus Jelsma added a comment - - edited

          Did something crucial change recently? Since at least the 13th, maybe earlier, when indexing small segments from Nutch in several cycles (a few hundred docs per cycle), some shards get out of sync really quickly! I did lots of tests before that but didn't see it happening.

          Ok, something did change. I reverted to a build from the 6th and everything is fine. I can run the index process from Nutch for many cycles and more data. No shard is going out of sync with itself.

          Markus Jelsma added a comment -

          Correction: it happens on a build from the 6th as well, although it doesn't look as bad as when indexing to a build from the 13th.

          Markus Jelsma added a comment -

          Seems autocommit has something to do with triggering the problem, at least in my case.

          • 13th build without autocommit: out of sync very soon
          • 13th build with autocommit: out of sync after a while
          • 6th build without autocommit: out of sync after a while
          • 6th build with autocommit: out of sync after many more documents
          Joel Bernstein added a comment -

          The commit behavior is interesting. I'm seeing docs flushing from the leader to the replica following a manual hard commit issued long after indexing has stopped. That means that somewhere along the way docs are buffered and waiting for an event to flush them to the replica. I haven't figured out yet where the buffering is occurring, but I'm trying to track it down.

          Mark Miller added a comment -

          I spent some time a while back trying to find a fault in ConcurrentUpdateSolrServer#blockUntilFinished - didn't uncover anything yet though.

          Joel Bernstein added a comment -

          I'm betting it's something in the streaming. This afternoon I'm going to put some debugging in to see if the docs being flushed by the commit were already written to the stream. My bet is that they were, and that the commit is pushing them all the way through to the replica.

          Timothy Potter added a comment -

          I was able to reproduce this issue on EC2 without any over-sharding (on latest rev on branch_4x) ... basically 6 Solr nodes with 3 shards and RF=2, i.e. each replica gets its own Solr instance. Here's the output from my client app that traps the inconsistency:

          >>>>>>
          Found 1 shards with mis-matched doc counts.
          At January 16, 2014 12:18:08 PM MST
          shard2:

          { http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2/ = 62984 LEADER http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1/ = 62980 diff:4 }

          Details:
          shard2
          >> finished querying leader, found 62984 documents (62984)
          >> finished querying http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test_shard2_replica1/, found 62980 documents
          Doc [182866] not found in replica: <doc boost="1.0"><field name="id">182866</field><field name="string_s">test</field><field name="int_i">-1257345242</field><field name="float_f">0.92657363</field><field name="double_d">0.5259114828332452</field><field name="text_en">this is a test</field><field name="version">1457415570117885953</field></doc>
          Doc [182859] not found in replica: <doc boost="1.0"><field name="id">182859</field><field name="string_s">test</field><field name="int_i">991366909</field><field name="float_f">0.5311716</field><field name="double_d">0.10846350752086309</field><field name="text_en">this is a test</field><field name="version">1457415570117885952</field></doc>
          Doc [182872] not found in replica: <doc boost="1.0"><field name="id">182872</field><field name="string_s">test</field><field name="int_i">824512897</field><field name="float_f">0.830366</field><field name="double_d">0.6560223698806142</field><field name="text_en">this is a test</field><field name="version">1457415570117885954</field></doc>
          Doc [182876] not found in replica: <doc boost="1.0"><field name="id">182876</field><field name="string_s">test</field><field name="int_i">-1657831473</field><field name="float_f">0.4877965</field><field name="double_d">0.9214420679315872</field><field name="text_en">this is a test</field><field name="version">1457415570117885955</field></doc>
          Sending hard commit after mis-match and then will wait for user to handle it ...
          <<<<<<

          So four missing docs: 182866, 182859, 182872, 182876

          Now I'm thinking this might be in the ConcurrentUpdateSolrServer logic. I added some detailed logging to show when JavabinLoader unmarshals a doc and when it is offered on the CUSS queue (to be sent to the replica). On the leader, here's the log around some messages that were lost:

          2014-01-16 14:16:37,534 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182857
          2014-01-16 14:16:37,534 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182857
          /////////////////////////////////////
          2014-01-16 14:16:37,552 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182859
          2014-01-16 14:16:37,552 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182859
          2014-01-16 14:16:37,552 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182866
          2014-01-16 14:16:37,552 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182866
          2014-01-16 14:16:37,552 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182872
          2014-01-16 14:16:37,552 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182872
          2014-01-16 14:16:37,552 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182876
          2014-01-16 14:16:37,552 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182876
          2014-01-16 14:16:37,558 [qtp417447538-17] INFO update.processor.LogUpdateProcessor - [test_shard2_replica2] webapp=/solr path=/update params=

          {wt=javabin&version=2}

          {add=[182704 (1457415570048679936), 182710 (1457415570049728512), 182711 (1457415570049728513), 182717 (1457415570056019968), 182720 (1457415570056019969), 182722 (1457415570057068544), 182723 (1457415570057068545), 182724 (1457415570058117120), 182730 (1457415570058117121), 182735 (1457415570059165696), ... (61 adds)]}

          0 72
          /////////////////////////////////////
          2014-01-16 14:16:37,764 [qtp417447538-17] INFO handler.loader.JavabinLoader - test_shard2_replica2 add: 182880
          2014-01-16 14:16:37,764 [qtp417447538-17] INFO solrj.impl.ConcurrentUpdateSolrServer - test_shard2_replica2 queued: 182880

          As you can see, the leader received the doc with ID 182859 at 2014-01-16 14:16:37,552 and then queued it on the CUSS queue to be sent to the replica. On the replica, the log shows it receiving 182857 and then 182880 ... the 4 missing docs (182866, 182859, 182872, 182876) were definitely queued in CUSS on the leader. I've checked the logs on all the other replicas and the docs didn't go there either.

          2014-01-16 14:16:37,292 [qtp417447538-14] INFO handler.loader.JavabinLoader - test_shard2_replica1 add: 182857
          2014-01-16 14:16:37,293 [qtp417447538-14] INFO update.processor.LogUpdateProcessor - [test_shard2_replica1] webapp=/solr path=/update params=

          {distrib.from=http://ec2-54-236-245-61.compute-1.amazonaws.com:8985/solr/test_shard2_replica2/&update.distrib=FROMLEADER&wt=javabin&version=2}

          {add=[182841 (1457415570096914432), 182842 (1457415570096914433), 182843 (1457415570096914434), 182844 (1457415570096914435), 182846 (1457415570097963008), 182848 (1457415570097963009), 182850 (1457415570099011584), 182854 (1457415570099011585), 182857 (1457415570099011586)]}

          0 2
          2014-01-16 14:16:37,521 [qtp417447538-14] INFO handler.loader.JavabinLoader - test_shard2_replica1 add: 182880

          So it seems like a "batch" of docs queued on the leader just got missed ...

          Shikhar Bhushan added a comment -

          This may be unrelated - I have not done much digging or looked at the full context, but was just looking at CUSS out of curiosity.

          Why do we flush() the OutputStream, but then write() stuff like ending tags? Shouldn't the flush come after all of those write()s?

          https://github.com/apache/lucene-solr/blob/lucene_solr_4_6/solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java#L205

          Mark Miller added a comment -

          I've checked the logs on all the other replicas and the docs didn't go there either.

          So strange - it would be a different CUSS instance used for each server....

          Joel Bernstein added a comment -

          That was a blind alley, a faulty test was causing the effect I described above.

          Mark Miller added a comment - - edited

          I have various theories, but without a test that fails, it's hard to test out anything - so I've been putting most of my efforts into a unit test that can get this, but it's been surprisingly difficult for me to trigger in a test.

          Timothy Potter added a comment - - edited

          So strange - it would be a different CUSS instance used for each server....

          Right, I was just mentioning that I did check to make sure there wasn't a bug in the routing logic or anything like that, but I now see that was silly: the docs couldn't have gone to the other replicas anyway, because the messages were on the correct queue.

          Agreed on the need for a unit test to reproduce this; I'm working on the same.

          Timothy Potter added a comment -

          Added some more logging on the leader ... as a bit of context, the replica received the doc with ID 41029 and then 41041, and didn't receive 41033 and 41038 in between ... here's the log of activity on the leader between 41029 and 41041.

          2014-01-16 16:03:02,523 [updateExecutor-1-thread-1] INFO solrj.impl.ConcurrentUpdateSolrServer - sent docs to http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1 , 41003, 41005, 41007, 41010, 41014, 41015, 41026, 41029
          2014-01-16 16:03:02,527 [qtp417447538-16] INFO handler.loader.JavabinLoader - test3_shard3_replica2 add: 41033
          2014-01-16 16:03:02,527 [qtp417447538-16] INFO update.processor.DistributedUpdateProcessor - doLocalAdd 41033
          2014-01-16 16:03:02,527 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - test3_shard3_replica2 queued (to: http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1): 41033
          2014-01-16 16:03:02,528 [qtp417447538-16] INFO handler.loader.JavabinLoader - test3_shard3_replica2 add: 41038
          2014-01-16 16:03:02,528 [qtp417447538-16] INFO update.processor.DistributedUpdateProcessor - doLocalAdd 41038
          2014-01-16 16:03:02,528 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - test3_shard3_replica2 queued (to: http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1): 41038
          2014-01-16 16:03:02,559 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - blockUntilFinished starting http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
          2014-01-16 16:03:02,559 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - blockUntilFinished is done for http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
          2014-01-16 16:03:02,559 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - shutting down CUSS for http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1
          2014-01-16 16:03:02,559 [qtp417447538-16] INFO solrj.impl.ConcurrentUpdateSolrServer - shut down CUSS for http://ec2-107-21-55-0.compute-1.amazonaws.com:8985/solr/test3_shard3_replica1

          Not quite sure what this means, but I think your hunch about blockUntilFinished being involved is getting warmer.

          Mark Miller added a comment -

          For a long time, I've wanted to try putting in a check that the queue is empty as well for blockUntilFinished when we use it in this case - I just need a test that sees this so I can check whether it works.

          Without that, it seems there is a window where we can bail before we are done sending everything in the queue. Shutdown doesn't help much, because it can't even wait for the executor to shut down in this case.

          Mark Miller added a comment -

          Patch attached that does the above.

          Mark Miller added a comment -

          That's all I have come up with so far - though I'm not even completely sold on it. Because we are using CUSS with a single thread, all the previous doc adds should have hit the request method and so a Runner should be going for them if necessary.

          It's all pretty tricky logic to understand clearly though.

          Timothy Potter added a comment -

          So far so good, Mark! I applied the patch to the latest rev of branch_4x and have indexed about 3M docs without hitting the issue; before the patch, I would see this issue within a few minutes. So the jury is still out and I'll keep stress testing it, but it looks promising. Nice work!

          Timothy Potter added a comment -

          Did another couple of million docs in an oversharded env. 24 replicas on 6 nodes (m1.mediums so I didn't want to overload them too much) ... still looking good.

          Joel Bernstein added a comment -

          I installed the patch and ran it.

          I'm getting some intermittent null pointers:

          1578995 [qtp433857665-17] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.NullPointerException
          at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:401)
          at org.apache.solr.update.StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:99)
          at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:69)
          at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:606)
          at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1449)
          at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179)
          at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
          at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
          at org.apache.solr.core.SolrCore.execute(SolrCore.java:1915)
          at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:764)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
          at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:203)
          at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)

          Joel Bernstein added a comment - - edited

          In the code snippet below it looks like this line is the culprit:

          if ((runner == null && queue.isEmpty()) || scheduler.isTerminated())
          
           public synchronized void blockUntilFinished(boolean waitForEmptyQueue) {
              lock = new CountDownLatch(1);
              try {
                // Wait until no runners are running
                for (;;) {
                  Runner runner;
                  synchronized (runners) {
                    runner = runners.peek();
                  }
                  if (waitForEmptyQueue) {
                    if ((runner == null && queue.isEmpty()) || scheduler.isTerminated())
                      break;
                  } else {
                    if (runner == null || scheduler.isTerminated())
                      break;
                  }
                  runner.runnerLock.lock();
                  runner.runnerLock.unlock();
                }
              } finally {
                lock.countDown();
                lock = null;
              }
            }
          
          Mark Miller added a comment -

          Strange Joel - queue and scheduler are both final and set in the constructor.

          Joel Bernstein added a comment -

          It's actually the runner that is null:

          runner.runnerLock.lock();
          

          The conditions in this statement have changed, and I think that made it possible for the null pointer to appear.

          if ((runner == null && queue.isEmpty()) || scheduler.isTerminated())
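
          For illustration, here is a minimal, self-contained toy model of the loop in question (it is not Solr source code and not the committed patch; the class and field names are invented, and the scheduler check is omitted). It shows the failure mode described above: with waitForEmptyQueue set, the loop can stay alive while runner is null because the queue still holds entries, so an unguarded runnerLock access throws the NPE.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Toy model of the blockUntilFinished loop; not ConcurrentUpdateSolrServer itself.
public class BlockUntilFinishedSketch {
  static class Runner { final ReentrantLock runnerLock = new ReentrantLock(); }

  final Deque<Runner> runners = new ArrayDeque<>();
  final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

  void blockUntilFinished(boolean waitForEmptyQueue) {
    for (;;) {
      Runner runner;
      synchronized (runners) { runner = runners.peek(); }
      if (waitForEmptyQueue) {
        // Queue not empty and no runner: we do NOT break here...
        if (runner == null && queue.isEmpty()) break;
      } else {
        if (runner == null) break;
      }
      if (runner == null) {
        // ...so without this guard the two lines below would throw an NPE.
        // Skipping the lock merely spins; the later commits go further and
        // start a new Runner when the queue is non-empty with no runners.
        continue;
      }
      runner.runnerLock.lock();
      runner.runnerLock.unlock();
    }
  }

  public static void main(String[] args) {
    BlockUntilFinishedSketch s = new BlockUntilFinishedSketch();
    s.blockUntilFinished(true);  // no runners and an empty queue: returns immediately
    // If the queue held updates while no Runner was alive, the unguarded version
    // would NPE; the guarded version would spin instead, which is why the follow-up
    // change starts a new Runner in that situation.
  }
}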
          
          ASF subversion and git services added a comment -

          Commit 1558978 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1558978 ]

          SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all previously added updates have finished. This could cause distributed updates meant for replicas to be lost.

          Mark Miller added a comment -

          Well, this is important for 4.6.1 - given Potter's feedback, in it goes. Please help test and review this, guys, especially around the possible NPE.

          ASF subversion and git services added a comment -

          Commit 1558979 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558979 ]

          SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all previously added updates have finished. This could cause distributed updates meant for replicas to be lost.

          ASF subversion and git services added a comment -

          Commit 1558980 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1558980 ]

          SOLR-4260: Add name to CHANGES

          ASF subversion and git services added a comment -

          Commit 1558981 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558981 ]

          SOLR-4260: Add name to CHANGES

          Mark Miller added a comment -

          The conditions in this statement have changed and I think made it possible for the null pointer to appear.

          Ah, nice - thanks. I had already made some changes so couldn't line up the src lines - thought you meant the line that was the culprit was the one that the NPE came from.

          I'll take a closer look.

          ASF subversion and git services added a comment -

          Commit 1558982 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
          [ https://svn.apache.org/r1558982 ]

          SOLR-4260: ConcurrentUpdateSolrServer#blockUntilFinished can return before all previously added updates have finished. This could cause distributed updates meant for replicas to be lost.

          ASF subversion and git services added a comment -

          Commit 1558983 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
          [ https://svn.apache.org/r1558983 ]

          SOLR-4260: Add name to CHANGES

          ASF subversion and git services added a comment -

          Commit 1558985 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1558985 ]

          SOLR-4260: Guard against NPE.

          ASF subversion and git services added a comment -

          Commit 1558986 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558986 ]

          SOLR-4260: Guard against NPE.

          ASF subversion and git services added a comment -

          Commit 1558988 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
          [ https://svn.apache.org/r1558988 ]

          SOLR-4260: Guard against NPE.

          Mark Miller added a comment -

          ChaosMonkeyNothingIsSafeTest is exposing an issue now with ConcurrentUpdateSolrServer - it looks like it's getting stuck in blockUntilFinished because the queue is not empty and no runners are being spawned to empty it.

          It may be that the NPE that would have occurred before in this case just kept the docs from being lost 'silently', and that this is closer to the actual bug?

          ASF subversion and git services added a comment -

          Commit 1558996 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1558996 ]

          SOLR-4260: If in blockUntilFinished and there are no Runners running and the queue is not empty, start a new Runner.

          ASF subversion and git services added a comment -

          Commit 1558997 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1558997 ]

          SOLR-4260: If in blockUntilFinished and there are no Runners running and the queue is not empty, start a new Runner.

          Mark Miller added a comment -

          Committed something for that.

          As a separate issue, it seems to me that CUSS#shutdown should probably call blockUntilFinished as its first order of business.
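
          A minimal sketch of that suggestion, for illustration only (the class and method names here are invented stand-ins, not the actual CUSS code); the point is simply the ordering: drain whatever is already queued before tearing the executor down.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of the suggested shutdown ordering.
public class ShutdownOrderSketch {
  private final ExecutorService scheduler = Executors.newSingleThreadExecutor();

  public void blockUntilFinished() {
    // Stand-in for CUSS#blockUntilFinished: wait until queued updates have been sent.
  }

  public void shutdown() {
    blockUntilFinished();   // first drain anything already queued or in flight
    scheduler.shutdown();   // only then stop the executor
  }

  public static void main(String[] args) {
    new ShutdownOrderSketch().shutdown();
  }
}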

          ASF subversion and git services added a comment -

          Commit 1558998 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
          [ https://svn.apache.org/r1558998 ]

          SOLR-4260: If in blockUntilFinished and there are no Runners running and the queue is not empty, start a new Runner.

          Mark Miller added a comment -

          This is a fine fix for SolrCloud, especially for 4.6.1 - but there may be a better general fix hidden still - what seems to happen is that we have docs that enter the queue that don't spawn a runner. The current fix means docs can be added that will sit in the queue until you call blockUntilFinished.
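
          For illustration, here is a rough, self-contained sketch of the failsafe described in the commits above (it is not the committed code; the names and the sleep-based waiting are invented for the toy model). The idea is that blockUntilFinished, when asked to wait for an empty queue, spawns a new Runner if updates are still queued but no Runner is alive to send them.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy model of the "start a new Runner if the queue is not empty" failsafe.
public class RunnerFailsafeSketch {
  class Runner implements Runnable {
    public void run() {
      while (queue.poll() != null) {
        // stand-in for actually streaming the update to the replica
      }
      synchronized (runners) { runners.remove(this); }
    }
  }

  final Deque<Runner> runners = new ArrayDeque<>();
  final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
  final ExecutorService scheduler = Executors.newSingleThreadExecutor();

  void blockUntilFinished() throws InterruptedException {
    for (;;) {
      Runner runner;
      synchronized (runners) { runner = runners.peek(); }
      if (runner == null && queue.isEmpty()) break;        // truly done
      if (runner == null) {                                // queue has work but no Runner:
        runner = new Runner();                             // the failsafe spawns one so the
        synchronized (runners) { runners.add(runner); }    // queued updates still get sent
        scheduler.execute(runner);
      }
      Thread.sleep(10);                                    // toy stand-in for waiting on runnerLock
    }
  }

  public static void main(String[] args) throws Exception {
    RunnerFailsafeSketch s = new RunnerFailsafeSketch();
    s.queue.add("doc-1");       // an update sits in the queue with no Runner alive
    s.blockUntilFinished();     // the failsafe spawns a Runner; returns once the queue is drained
    s.scheduler.shutdown();
    s.scheduler.awaitTermination(5, TimeUnit.SECONDS);
  }
}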

          Shawn Heisey added a comment -

          This might be old news by now, but I noticed it while updating my test system, so I'm reporting it.

          The lucene_solr_4_6 branch fails to compile with these fixes committed. One of the changes removes the import for RemoteSolrException from SolrCmdDistributor, but the doRetries method still uses this exception. That method is very different in 4.6 than it is in branch_4x. Everything's good on branch_4x. Re-adding the import fixes the problem, but the discrepancy between the two branches needs some investigation.

          The specific code that fails to compile with the removed import seems to have been initially added to trunk by revision 1545464 (2013/11/25) and removed from trunk by revision 1546670 (2013/11/29). It was then re-added to lucene_solr_4_6 by revision 1554122 (2013/12/29).

          Markus Jelsma added a comment -

          I believe the whole building now knows I cannot reproduce the problem!

          Mikhail Khludnev added a comment -

          What a great hunt, guys! Thanks a lot!

          Joel Bernstein added a comment -

          Ok, just had two clean test runs with trunk. The NPE is no longer occurring and the leaders and replicas are in sync. Running through some more stress tests this morning, but so far so good.

          ASF subversion and git services added a comment -

          Commit 1559125 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
          [ https://svn.apache.org/r1559125 ]

          SOLR-4260: Bring back import still used on 4.6 branch.

          Mark Miller added a comment -

          Thanks Shawn - fixed.

          Mark Miller added a comment - - edited

          Thanks everyone. I'll make a new JIRA issue to properly fix this. I'm not sure we should remove this logic; it's a good failsafe. But ideally, we don't want to run out of runners when there are still updates in the queue. Calling blockUntilFinished is not supposed to be required to make sure the queue is emptied.

          Mark Miller added a comment -

          Calling this done for 4.6.1. Let's open a new issue for anything further.

          Markus Jelsma added a comment - - edited

          Ignore this; apparently one node did not receive the update. Forgive my stupidity.

          Hari Sekhon added a comment - - edited

          I've seen much larger discrepancies between leader and followers on newer versions of Solr than in this ticket when running on HDFS; it might be a separate issue, raised as SOLR-7395.

          Mark Miller added a comment -

          This ticket addressed specific issues - please open a new ticket for any further reports.


            People

            • Assignee:
              Mark Miller
            • Reporter:
              Markus Jelsma
            • Votes:
              6
            • Watchers:
              23
