Uploaded image for project: 'Apache Celeborn'
  1. Apache Celeborn
  2. CELEBORN-1669

Fix NullPointerException for PartitionFilesSorter#updateSortedShuffleFiles after cleaning up expired shuffle key

    XMLWordPrintableJSON

Details

    Description

      `PartitionFilesSorter` sorts shuffle files in `worker-file-sorter-executor` thread and cleans up expired key in `worker-expired-shuffle-cleaner` thread. There is a case that after `worker-expired-shuffle-cleaner` cleaning up expired shuffle key, `worker-file-sorter-executor` updates sorted shuffle files, which causes NullPointerException at present.

      2024-10-23 17:26:17,162 [INFO] [worker-expired-shuffle-cleaner] - org.apache.celeborn.service.deploy.worker.Worker -Logging.scala(51) -Cleaned up expired shuffle application_1724141892576_3843182_1-0
      2024-10-23 17:26:17,392 [ERROR] [worker-file-sorter-executor-237572] - org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter -PartitionFilesSorter.java(752) -Sorting shuffle file for application_1724141892576_3843182_1-0-1875-0-0 /mnt/storage02/celeborn-worker/shuffle_data/application_1724141892576_3843182_1/0/1875-0-0 failed, detail: 
      java.lang.NullPointerException: null
          at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.updateSortedShuffleFiles(PartitionFilesSorter.java:455) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
          at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter$FileSorter.sort(PartitionFilesSorter.java:747) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
          at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.lambda$new$1(PartitionFilesSorter.java:164) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162]
          at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162]
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
          at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162] 

      Attachments

        Issue Links

          Activity

            People

              nicholasjiang Nicholas Jiang
              nicholasjiang Nicholas Jiang
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m