Cassandra / CASSANDRA-1093

BinaryMemtable interface silently dropping data.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 0.6.4
    • Component/s: Core
    • Labels:
      None
    • Environment:

      Linux CentOS 5, Fedora Core 4. Java HotSpot Server 1.6.0_14. See readme for more details.

      Description

      I've been attempting to use the Binary Memtable (BMT) interface to load a large number of rows. During my testing, I discovered that on larger loads (~1 million rows), occasionally some of the data never appears in the database. This happens in a non-deterministic manner, as sometimes all the data loads fine, and other times a significant chunk goes missing. No errors are ever logged to indicate a problem. I'm attaching some sample code that approximates my application's usage of Cassandra and explains this bug in more detail.
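
      (For reference, a minimal sketch of the kind of BMT load loop described above, modeled loosely on the contrib/bmt_example bulk loader. The class and method names below are assumptions approximating the 0.6-era internal API rather than verbatim Cassandra code, and the fire-and-forget send at the end is exactly where data can vanish silently.)

      // Sketch only; imports and names approximate Cassandra 0.6 internals and are assumptions.
      import java.net.InetAddress;
      import org.apache.cassandra.db.ColumnFamily;
      import org.apache.cassandra.db.RowMutation;
      import org.apache.cassandra.net.Message;
      import org.apache.cassandra.net.MessagingService;
      import org.apache.cassandra.service.StorageService;

      public class BmtLoadSketch
      {
          public void loadRow(String keyspace, String key, ColumnFamily cf) throws Exception
          {
              // Pre-assemble the row client-side and wrap it in a binary (BMT) mutation.
              RowMutation rm = new RowMutation(keyspace, key);
              rm.add(cf);
              Message message = createBinaryMessage(rm); // hypothetical helper: serialize rm for the BINARY verb

              // Fire-and-forget: if an endpoint drops the message, nothing reports back.
              for (InetAddress endpoint : StorageService.instance.getNaturalEndpoints(key))
                  MessagingService.instance.sendOneWay(message, endpoint);
          }

          // Hypothetical helper; the real bmt_example serializes the mutation into a Message
          // addressed to the binary verb handler.
          private Message createBinaryMessage(RowMutation rm) throws Exception
          {
              throw new UnsupportedOperationException("sketch only");
          }
      }
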

      Attachments

      1. cassandra_bmt_test.tar.gz
        11 kB
        Toby Jungen
      2. 1093.txt
        3 kB
        Jonathan Ellis

        Activity

        Jonathan Ellis added a comment -

        committed, with additional note that wait-for-acks can reduce throughput

        Toby Jungen added a comment -

        Applied and tested the patch; it appears to solve the problem. I haven't run multiple tests yet to make sure, but it looks good so far. This obviously slows down the write, but that's an acceptable loss, and it's likely still faster and more efficient than using the Thrift API.

        I'll be out of my office for the next three weeks, but I'll try to test more when I get back. Feel free to mark this as resolved in the meantime.

        Jonathan Ellis added a comment -

        Patch to add a response from BinaryVerbHandler and update bmt_example to use sendRR.
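
        (A minimal sketch of the acked-send pattern this patch describes: replace the one-way send with sendRR and block on the responses before treating the batch as written. The names below approximate the 0.6-era MessagingService API and are assumptions rather than the literal patch; the timeout is illustrative.)

        // Sketch: send one binary mutation message to its replica endpoints and wait for acks.
        void sendWithAcks(Message message, List<InetAddress> endpoints) throws Exception
        {
            List<IAsyncResult> acks = new ArrayList<IAsyncResult>();
            for (InetAddress endpoint : endpoints)
                acks.add(MessagingService.instance.sendRR(message, endpoint)); // previously sendOneWay

            // Blocking on the acks is what closes the silent-drop window; as the commit
            // comment above notes, it also reduces throughput.
            for (IAsyncResult ack : acks)
                ack.get(30 * 1000, TimeUnit.MILLISECONDS); // illustrative timeout
        }
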

        Jonathan Ellis added a comment -

        As it happens, Riptano has a client that is running into this too, so I'll take a stab at fixing it.

        Brandon Williams added a comment -

        If a node is being errantly marked down, then in 0.6.3 or later you can try increasing the PhiConvictThreshold configuration directive and see if that helps. EC2 users are setting it to 10 or 11; the default is 8.
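
        (In 0.6.x this is set in storage-conf.xml; a hedged sketch with an illustrative value, assuming the directive name quoted above:)

        <!-- storage-conf.xml: raise the failure detector's conviction threshold.
             The default is 8; EC2 users reportedly use 10 or 11. The value below is illustrative. -->
        <PhiConvictThreshold>10</PhiConvictThreshold>
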

        Toby Jungen added a comment -

        Thanks for the insight Jonathan. That was my intuition as well, and I observed my cluster periodically marking nodes as down for a second or two. I figured it was random network hiccups, since our network hardware is rather old. It would make sense that these periodic interruptions caused the BMT to lose data.

        While looking through the code, I did try to see if I could use BMT with the blocking MessagingService API (in the way the Thrift API works unless ConsistencyLevel.ZERO is specified), but it looks like BMT is hardcoded to be asynchronous. It might be nice for that option to be there, but since this issue appears to only affect me (and I no longer need to use BMT for my purposes), it's a super-low priority suggestion.
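
        (For contrast, a minimal sketch of a blocking write through the 0.6 Thrift API, where any consistency level other than ZERO waits for acknowledgement before returning. Host, port, keyspace, and column family names are placeholders, and the generated-class package is assumed to be org.apache.cassandra.thrift.)

        // Sketch: a single acked insert via Thrift; placeholders throughout.
        import org.apache.cassandra.thrift.Cassandra;
        import org.apache.cassandra.thrift.ColumnPath;
        import org.apache.cassandra.thrift.ConsistencyLevel;
        import org.apache.thrift.protocol.TBinaryProtocol;
        import org.apache.thrift.transport.TSocket;
        import org.apache.thrift.transport.TTransport;

        public class ThriftInsertSketch
        {
            public static void main(String[] args) throws Exception
            {
                TTransport transport = new TSocket("localhost", 9160);
                Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
                transport.open();

                ColumnPath path = new ColumnPath("Standard1");
                path.setColumn("field".getBytes("UTF-8"));

                // Timestamps are supplied by the client, as noted later in this thread.
                client.insert("Keyspace1", "row-key", path, "value".getBytes("UTF-8"),
                              System.currentTimeMillis(), ConsistencyLevel.ONE);

                transport.close();
            }
        }
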

        Jonathan Ellis added a comment -

        BMT is a fire-and-forget API, so any failure condition will cause messages to be dropped with no way of knowing.

        The most likely cause: under heavy load (network and/or CPU), it's reasonably common for one node in the cluster to be incorrectly marked "down" by the other nodes. This causes any messages queued in MessagingService for that node to be dropped summarily, and the pool connection to be re-attempted once the failure detector believes the node is "up" again. (See OutboundTcpConnectionPool.reset.)

        Brandon Williams added a comment -

        No, that shouldn't happen since the timestamps for columns are supplied by the client.

        Toby Jungen added a comment -

        Yep, I'm waiting until I see the flush message in the log. It reads something like "BinaryMemtable@7a82b flushed to disk".

        One thing I'm thinking may be causing problems is that my nodes are out of sync time-wise. I'll have to verify their clocks, but is it possible that values get lost if the clocks differ significantly?

        Brandon Williams added a comment -

        I did another 100K run and it passed:

        Processed 4547169 values.
        Done.
        Missing documents: 0
        Mismatched documents: 0
        Missing index entries: 0
        Wrong-sized index entries: 0
        Mismatched index entries: 0

        Is it possible you aren't waiting long enough for the flush to complete? (nodetool doesn't block on the flush command; you have to watch the system.log.)

        Toby Jungen added a comment - edited

        Looks like everything passed. You may want to re-run from start to finish one or two more times (my error didn't occur consistently), but if it still passes at that point then close this issue as CannotReproduce and I'll attribute the problem to my hardware setup. As mentioned I've found somewhat of a workaround. I'll gladly donate my test code as a possible unit test for BMT if needed.

        Brandon Williams added a comment -

        I used 100K "documents":

        Processed 4547169 values.
        Done.
        Missing documents: 0
        Mismatched documents: 0
        Missing index entries: 0
        Wrong-sized index entries: 0
        Mismatched index entries: 0

        Toby Jungen added a comment -

        I've been able to observe the error with a generate parameter of 25,000. Note that the generate step creates the entire randomized data set in memory before writing it to disk, so this test is limited by memory. With a parameter of 25,000 I ran fine with 512MB of heap space; at 100,000 I'd expect you to need around 2GB of heap space.

        The parameter for the generate step corresponds to the number of "documents", and each document results in roughly 100 rows.

        Brandon Williams added a comment -

        I can't get it past the generate step without an OOM:

        cassandra_bmt_test# java -jar -Xmx4096m build/cassandra-bmt-test.jar generate foo 1000000
        Generating data...
        Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.ArrayList.<init>(ArrayList.java:132)
        at java.util.ArrayList.<init>(ArrayList.java:139)
        at CassandraBMTTest.generateData(Unknown Source)
        at CassandraBMTTest.main(Unknown Source)

        I'm going to try with < 1M and see if that works.

        Jonathan Ellis added a comment -

        can you reproduce using Toby's code, Brandon?

        Toby Jungen added a comment -

        Yes, I'm flushing each node after the import. I've also tried flushing the system keyspace (no effect). As noted in the readme, I would not be surprised if this problem is unique to my hardware/software configuration and isn't an inherent problem with Cassandra's BMT interface.

        For what it's worth, I've hacked together a "workaround" for this problem by writing SSTables directly (using o.a.c.io.SSTableWriter), copying the generated files to appropriate directories on the nodes, and then restarting the nodes. This solution is bound to result in other bugs, but for now I've verified that there is no lost data with this method.

        Chris Goffinet added a comment -

        Tonight I imported 1M rows and verified all rows existed.

        Chris Goffinet added a comment -

        I've never seen this happen and I've done many imports. At the very end of the import, are you calling nodetool flush <Keyspace>?

        Toby Jungen added a comment -

        Sample code and instructions for how to run. See readme.txt in the archive.


          People

          • Assignee:
            Jonathan Ellis
          • Reporter:
            Toby Jungen
          • Votes:
            0
          • Watchers:
            4

            Dates

            • Created:
            • Updated:
            • Resolved:
