Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-4402

Atomicity violation bugs because of misusing concurrent collections

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Fix Version/s: 1.2.0
    • Component/s: None
    • Labels:

      Description

      My name is Yu Lin. I'm a Ph.D. student in the CS department at
      UIUC. I'm currently doing research on mining Java concurrent library
      misusages. I found some misusages of ConcurrentHashMap in Cassandra
      1.1.1, which may result in potential atomicity violation bugs or harm
      the performance.

      The code below is a snapshot of the code in file
      src/java/org/apache/cassandra/db/Table.java from line 348 to 369

      L348 if (columnFamilyStores.containsKey(cfId))
      L349

      { L350 // this is the case when you reset local schema L351 // just reload metadata L352 ColumnFamilyStore cfs = columnFamilyStores.get(cfId); L353 assert cfs.getColumnFamilyName().equals(cfName); ... L364 }

      L365 else
      L366

      { L367 columnFamilyStores.put(cfId, ColumnFamilyStore.createColumnFamilyStore(this, cfName)); L368 }

      In the code above, an atomicity violation may occur between line 348
      and 352. Suppose thread T1 executes line 348 and finds that the
      concurrent hashmap "columnFamilyStores" contains the key
      "cfId". Before thread T1 executes line 352, another thread T2 removes
      the "cfId" key from "columnFamilyStores". Now thread T1 resumes
      execution at line 352 and will get a null value for "cfs". Then the
      next line will throw a NullPointerException when invoking the method
      on "cfs".

      Second, the snapshot above has another atomicity violation. Let's look
      at lines 348 and 367. Suppose a thread T1 executes line 348 and finds
      out the concurrent hashmap does not contain the key "cfId". Before it
      gets to execute line 367, another thread T2 puts a pair <cfid, v> in
      the concurrent hashmap "columnFamilyStores". Now thread T1 resumes
      execution and it will overwrite the value written by thread T2. Thus,
      the code no longer preserves the "put-if-absent" semantics.

      I found some similar misusages in other files:

      In src/java/org/apache/cassandra/gms/Gossiper.java, similar atomicity
      violation may occur if thread T2 puts a value to map
      "endpointStateMap" between lines <1094 and 1099>, <1173 and 1178>. Another
      atomicity violation may occur if thread T2 removes the value on key
      "endpoint" between lines <681 and 683>.

      1. cassandra-1.1.1-4402.txt
        4 kB
        Yu Lin
      2. 4402-v2.txt
        6 kB
        Jonathan Ellis

        Activity

        Hide
        jacklondongood Yu Lin added a comment -

        This is the patch that may fix the atomicity violation problem.

        Show
        jacklondongood Yu Lin added a comment - This is the patch that may fix the atomicity violation problem.
        Hide
        jbellis Jonathan Ellis added a comment -

        Thanks for the patch, Yu. The getExpireTimeForEndpoint is the most serious since we can fairly easily have concurrent contains with remove calls. maybeInitializeLocalState shouldn't be called concurrently but it doesn't hurt to clean that up too.

        I'm not sure what to do about Table.initCf, though. It seems like CASSANDRA-2963 has made that inherently unsafe. A "mechanical" fix like the one here won't help since if two CFS objects are created for the same CF, they will conflict on the mbean definition as well. I'll follow up on that over on the 2963 ticket.

        In the meantime, v2 is attached with some cleanup.

        Show
        jbellis Jonathan Ellis added a comment - Thanks for the patch, Yu. The getExpireTimeForEndpoint is the most serious since we can fairly easily have concurrent contains with remove calls. maybeInitializeLocalState shouldn't be called concurrently but it doesn't hurt to clean that up too. I'm not sure what to do about Table.initCf, though. It seems like CASSANDRA-2963 has made that inherently unsafe. A "mechanical" fix like the one here won't help since if two CFS objects are created for the same CF, they will conflict on the mbean definition as well. I'll follow up on that over on the 2963 ticket. In the meantime, v2 is attached with some cleanup.
        Hide
        slebresne Sylvain Lebresne added a comment -

        That v2 patch lgtm on principle from reading the patch, but it'll need to be rebased.

        Show
        slebresne Sylvain Lebresne added a comment - That v2 patch lgtm on principle from reading the patch, but it'll need to be rebased.
        Hide
        jbellis Jonathan Ellis added a comment -

        committed

        Show
        jbellis Jonathan Ellis added a comment - committed
        Hide
        jbellis Jonathan Ellis added a comment -

        (with comments explaining the CASSANDRA-2963 situation)

        Show
        jbellis Jonathan Ellis added a comment - (with comments explaining the CASSANDRA-2963 situation)

          People

          • Assignee:
            jacklondongood Yu Lin
            Reporter:
            jacklondongood Yu Lin
            Reviewer:
            Jonathan Ellis
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 504h
              504h
              Remaining:
              Remaining Estimate - 504h
              504h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development