Giraph / GIRAPH-462

Multithreading breaks out-of-core graph

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Claudio Martella pointed out this issue: using multithreaded computation together with the out-of-core graph causes a race condition. The compute threads share the same DiskBackedPartitionStore, whose getPartition() method was not designed to be thread-safe. When two threads request two different out-of-core partitions concurrently, both try to load their partition into the same slot.
      As a result, we can lose the reference to one of the two partitions (which will then never be written back to disk), and we can hit a NullPointerException when both threads try to offload the currently loaded partition to disk.
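To illustrate the failure mode, here is a minimal sketch of a store with a single in-memory slot in which the whole "offload current partition, load requested partition, use it" sequence runs under one lock. This is not the Giraph code: SingleSlotStore, addPartition, withPartition and snapshot are hypothetical names, and the "disk" is just a map. Without the synchronized keyword, two threads could interleave between the write-back and the load, which is the situation described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical stand-in for a disk-backed store with one in-memory slot.
// None of these names come from the Giraph code base.
class SingleSlotStore {
    private final Map<Integer, List<Integer>> disk = new HashMap<>(); // simulated disk
    private Integer loadedId;      // id of the partition currently in the slot
    private List<Integer> loaded;  // contents of that partition

    synchronized void addPartition(int id, List<Integer> vertexValues) {
        disk.put(id, new ArrayList<>(vertexValues));
    }

    // Offload, load and the caller's action all happen while holding the
    // store's lock, so no other thread can swap the slot in between.
    synchronized void withPartition(int id, Consumer<List<Integer>> action) {
        if (loadedId == null || loadedId != id) {
            if (!disk.containsKey(id)) {
                throw new IllegalArgumentException("unknown partition " + id);
            }
            if (loadedId != null) {
                disk.put(loadedId, loaded); // write the current partition back
            }
            loaded = disk.remove(id);       // bring the requested one into the slot
            loadedId = id;
        }
        action.accept(loaded);
    }

    // Returns a copy of the partition's values (loads it if necessary).
    synchronized List<Integer> snapshot(int id) {
        List<Integer> copy = new ArrayList<>();
        withPartition(id, copy::addAll);
        return copy;
    }
}
```

Holding one lock for the whole operation serializes the compute threads, so this only shows where the critical section lies; it is not a design that keeps multithreading useful.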

      I ran this test to confirm the issue:
      https://gist.github.com/4429628
      All tests pass except the one that uses both out-of-core graph and multiple compute threads.
      The error is the following:
      https://gist.github.com/4429650

      1. GIRAPH-461.patch
        45 kB
        Claudio Martella

          Activity

          Gustavo Salazar Torres added a comment -

          What if a publish/subscribe model were used instead of this pull model? Instead of workers calling the getPartition() method directly, another object, let's call it PartitionCoordinator, would receive subscribe events from workers, and each worker would then expect a publish event from PartitionCoordinator once its partition is available.
          Workers would have to block until they receive the publish event.
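A rough sketch of that idea, assuming one future per partition id: a worker subscribes and blocks on the future, and whoever loads the partition publishes it by completing the future. Only the name PartitionCoordinator comes from the comment above; the methods subscribe, publish and awaitPartition are made up for illustration, and eviction of partitions that have already been published is left out entirely.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical publish/subscribe coordinator; not part of Giraph.
class PartitionCoordinator<P> {
    // One future per partition id, shared by all subscribers of that id.
    private final ConcurrentMap<Integer, CompletableFuture<P>> pending =
        new ConcurrentHashMap<>();

    // Subscribe event: register interest and get a handle to wait on.
    CompletableFuture<P> subscribe(int partitionId) {
        return pending.computeIfAbsent(partitionId, id -> new CompletableFuture<>());
    }

    // Publish event: wake up every worker waiting for this partition.
    void publish(int partitionId, P partition) {
        subscribe(partitionId).complete(partition);
    }

    // Convenience for workers: subscribe and block until the publish arrives.
    P awaitPartition(int partitionId) {
        return subscribe(partitionId).join();
    }
}
```

A worker would call awaitPartition(id) where it previously called getPartition(id), while a loader thread calls publish(id, partition) once the partition is in memory.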

          Claudio Martella added a comment -

          This is exactly how I have implemented the LRU PartitionStore. Unfortunately, I have not been able to test it in pseudo-distributed mode, as I cannot get trunk to run on Hadoop 1.0.4. My PartitionStore currently passes the local tests and my ad-hoc tests for the PartitionStore. I am attaching the current patch here.
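The attached patch is not reproduced here. Purely as an illustration of what an LRU-evicting, thread-safe store could look like, the sketch below keeps at most a fixed number of partitions in memory and spills the least recently used one to a simulated disk. All names (LruPartitionCache, addPartition, getPartition, isInMemory) are hypothetical, and coarse synchronized methods stand in for whatever concurrency scheme the real patch uses.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU partition cache; an assumption-laden sketch, not the patch.
class LruPartitionCache<P> {
    private final Map<Integer, P> disk = new HashMap<>(); // simulated disk
    private final LinkedHashMap<Integer, P> memory;

    LruPartitionCache(int capacity) {
        // accessOrder = true makes iteration order least-recently-used first.
        this.memory = new LinkedHashMap<Integer, P>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, P> eldest) {
                if (size() > capacity) {
                    disk.put(eldest.getKey(), eldest.getValue()); // spill to disk
                    return true;
                }
                return false;
            }
        };
    }

    synchronized void addPartition(int id, P partition) {
        disk.remove(id);
        memory.put(id, partition);
    }

    // Returns the partition, loading it from disk (and possibly evicting the
    // least recently used one) if it is not in memory; null if unknown.
    synchronized P getPartition(int id) {
        P partition = memory.get(id);
        if (partition == null) {
            partition = disk.remove(id);
            if (partition != null) {
                memory.put(id, partition);
            }
        }
        return partition;
    }

    synchronized boolean isInMemory(int id) {
        return memory.containsKey(id);
    }
}
```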

          Claudio Martella added a comment -

          The LRU PartitionStore is concurrent


            People

            • Assignee: Unassigned
            • Reporter: Alessandro Presta
            • Votes: 0
            • Watchers: 3
