Kudu / KUDU-2376

SIGSEGV while adding and dropping the same range partition and concurrently writing

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.0
    • None
    • None
    • None

    Description

      While adding a test to https://gerrit.cloudera.org/#/c/9393/, I ran into the problem that writing while doing a replace tablet operation caused the client to segfault. After inspecting the client code, it looked like the same problem could occur if the same range partition was added and dropped with concurrent writes.

      Attached is a patch that adds a test to alter_table-test that reliably reproduces the segmentation fault.

      I don't totally understand what's happening, but here's what I think I have figured out:

      Suppose the range partition P=[0, 100) is dropped and re-added in a single alter. This causes the tablet X for hash bucket 0 and range partition P to be dropped, and a new tablet Y to be created for the same partition. There is a batch pending to X, which the client attempts to send to each of the replicas of X in turn. Once the replicas are exhausted, the client attempts to find a new leader with MetaCacheServerPicker::PickLeader, which triggers a master lookup to get the latest consensus info for X (#5 in the big comment in PickLeader). This calls LookupTabletByKey, which attempts a fast-path lookup. Assuming other metadata operations have already cached a tablet for Y, the tablet for X will have been removed from the by-table-and-by-key map, and the fast lookup will return an entry for Y. The client code can't tell the difference because the code paths only look at partition boundaries, which match for X and Y. The master lookup never happens, and the client ends up in a pretty tight loop repeating the above process until the segfault.
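
      To make the suspected mechanism concrete, here is a minimal, self-contained sketch of a cache keyed only by partition bounds. It is not the Kudu client code: ToyMetaCache, TabletEntry, FastLookup and HitIsForTablet are hypothetical names invented for illustration, and integer keys stand in for encoded partition keys. It shows how an entry for Y silently replaces the entry for X when both cover [0, 100), and how comparing tablet IDs (one conceivable guard, not necessarily the right fix) would expose the mismatch.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-in for a cached tablet entry.
struct TabletEntry {
  std::string tablet_id;
  int lower;  // inclusive lower bound of the range partition
  int upper;  // exclusive upper bound of the range partition
};

// Toy cache indexed only by partition lower bound, mimicking a
// by-key map that knows nothing about tablet identity.
class ToyMetaCache {
 public:
  // Caching a tablet with the same lower bound replaces the old entry.
  void Put(const TabletEntry& e) { by_key_[e.lower] = e; }

  // Fast-path lookup: returns the entry whose range covers `key`,
  // or nullptr on a miss (the real client would then ask the master).
  const TabletEntry* FastLookup(int key) const {
    std::map<int, TabletEntry>::const_iterator it = by_key_.upper_bound(key);
    if (it == by_key_.begin()) return nullptr;
    --it;
    if (key >= it->second.upper) return nullptr;
    return &it->second;
  }

 private:
  std::map<int, TabletEntry> by_key_;
};

// Assumed guard: a cache hit only counts if it is for the tablet the
// pending batch was addressed to, not merely for the same key range.
bool HitIsForTablet(const ToyMetaCache& cache, int key,
                    const std::string& expected_tablet_id) {
  const TabletEntry* hit = cache.FastLookup(key);
  return hit != nullptr && hit->tablet_id == expected_tablet_id;
}
```

      With X and then Y cached for [0, 100), FastLookup(42) returns Y's entry, so a caller that only checks the bounds sees a valid hit and skips the master lookup, while HitIsForTablet(cache, 42, "X") reports false.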

      I'm not sure exactly what the segmentation fault is. I looked at it a bit in gdb and the segfault was a few calls deep into STL maps in release mode and inside a refcount increment in debug mode. I'll try to attach some gdb output showing that later.

      The problem is also hinted at in a TODO in PickLeader:

      // TODO: When we support tablet splits, we should let the lookup shift
      // the write to another tablet (i.e. if it's since been split).
      

      Attachments

        1. alter_table-test.patch
          3 kB
          William Berkeley

          People

            Assignee: Unassigned
            Reporter: William Berkeley (wdberkeley)