Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-2755

ColumnFamilyRecordWriter fails to throw a write exception encountered after the user begins to close the writer

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 0.7.7, 0.8.2
    • Component/s: None
    • Labels:
      None

      Description

      There appears to be a race condition in ColumnFamilyRecordWriter that can result in the loss of an exception. Here is how it can happen (W stands for the RangeClient's worker thread; U stands for the ColumnFamilyRecordWriter user's thread):

      1. W: RangeClient's run method catches an exception originating in the Thrift client/socket, but doesn't get a chance to set it on the lastException field before it the thread is preempted.
      2. U: The user calls close which calls stopNicely. Because the lastException field is null, stopNicely does not throw anything. close then joins on the worker thread.
      3. W: The RangeClient's run method sets the lastException field and exits.
      4. U: Although the thread in close is waiting for the worker thread to exit, it has already checked the lastException field so it doesn't detect the presence of the last exception. Instead, close returns without throwing anything.

      This race condition means that intermittently write failures will go undetected.

      1. CASSANDRA-2755.patch
        3 kB
        mck
      2. 2755-v2.txt
        4 kB
        Jonathan Ellis

        Activity

        Hide
        michaelsembwever mck added a comment -

        In RangeClient i cannot see why close() needs to be called before lastException is assigned. The following patch should work: I have tested it against various jobs but i have no reproducible testcase to confirm this bug against.

        Also in the patch is a slight cleanup to ColumnFamilyRecordWriter's close() methods: keeping implementation out of deprecated methods.

        Show
        michaelsembwever mck added a comment - In RangeClient i cannot see why close() needs to be called before lastException is assigned. The following patch should work: I have tested it against various jobs but i have no reproducible testcase to confirm this bug against. Also in the patch is a slight cleanup to ColumnFamilyRecordWriter's close() methods: keeping implementation out of deprecated methods.
        Hide
        jbellis Jonathan Ellis added a comment -

        It looks to me that as long as we check for the exception before calling join, there will be a window to miss one.

        v2 encapsulates RangeClient.close better to avoid this.

        Show
        jbellis Jonathan Ellis added a comment - It looks to me that as long as we check for the exception before calling join, there will be a window to miss one. v2 encapsulates RangeClient.close better to avoid this.
        Hide
        michaelsembwever mck added a comment -

        The check for the exception also occurs in ColumnFamilyRecordWriter.write(buf, value) -> RangeClient.put(pair)
        Isn't it possible the put(..) is being called while the RangeClient thread is inside close() ?
        (isn't write(..) called more often than close() ?)

        For this reason inside RangeClient.run() i assigned lastException before calling close()

        Show
        michaelsembwever mck added a comment - The check for the exception also occurs in ColumnFamilyRecordWriter.write(buf, value) -> RangeClient.put(pair) Isn't it possible the put(..) is being called while the RangeClient thread is inside close() ? (isn't write(..) called more often than close() ?) For this reason inside RangeClient.run() i assigned lastException before calling close()
        Hide
        jbellis Jonathan Ellis added a comment -

        Isn't it possible the put(..) is being called while the RangeClient thread is inside close?

        old close, new closeInternal?

        Yes, but I don't see how that changes things. I.e., it's always possible that the last put() will happen before an exception is set; hence, the extra check on close.

        Show
        jbellis Jonathan Ellis added a comment - Isn't it possible the put(..) is being called while the RangeClient thread is inside close? old close, new closeInternal? Yes, but I don't see how that changes things. I.e., it's always possible that the last put() will happen before an exception is set; hence, the extra check on close.
        Hide
        michaelsembwever mck added a comment -

        it's always possible that the last put() will happen before an exception is set; hence, the extra check on close.

        Quite right.

        Show
        michaelsembwever mck added a comment - it's always possible that the last put() will happen before an exception is set; hence, the extra check on close. Quite right.
        Hide
        michaelsembwever mck added a comment -

        Jonathan: Is your patch being applied?

        Show
        michaelsembwever mck added a comment - Jonathan: Is your patch being applied?
        Hide
        jbellis Jonathan Ellis added a comment -

        waiting for a +1, wasn't clear if your last comment was intended that way.

        Show
        jbellis Jonathan Ellis added a comment - waiting for a +1, wasn't clear if your last comment was intended that way.
        Hide
        michaelsembwever mck added a comment -

        Yes it was a +1

        Show
        michaelsembwever mck added a comment - Yes it was a +1
        Hide
        jbellis Jonathan Ellis added a comment -

        committed, thanks!

        Show
        jbellis Jonathan Ellis added a comment - committed, thanks!
        Hide
        hudson Hudson added a comment -

        Integrated in Cassandra-0.7 #515 (See https://builds.apache.org/job/Cassandra-0.7/515/)
        fix race that could result in Hadoopwriter failing to throw exception for encountered error
        patch by Mck SembWever and jbellis for CASSANDRA-2755

        jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1140565
        Files :

        • /cassandra/branches/cassandra-0.7/CHANGES.txt
        • /cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordWriter.java
        Show
        hudson Hudson added a comment - Integrated in Cassandra-0.7 #515 (See https://builds.apache.org/job/Cassandra-0.7/515/ ) fix race that could result in Hadoopwriter failing to throw exception for encountered error patch by Mck SembWever and jbellis for CASSANDRA-2755 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1140565 Files : /cassandra/branches/cassandra-0.7/CHANGES.txt /cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordWriter.java

          People

          • Assignee:
            michaelsembwever mck
            Reporter:
            gregkatz Greg Katz
            Reviewer:
            Jonathan Ellis
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development