Apache Tez / TEZ-3478

Cleanup fetcher data for failing task attempts (Unordered fetcher)


Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved

    Description

      Env: 3-node AWS cluster with the entire dataset in S3. Since the data is in S3, there is no additional storage provisioned for HDFS (it uses whatever space is available in the VMs). Tez version is 0.7.

      With some workloads (e.g. q29 in TPC-DS), unordered fetchers download data in parallel for different vertices and run out of disk space. However, the downloaded
      data belonging to these failed task attempts is not cleared, so subsequent task attempts hit the same situation and fail with a "No space" exception. Example stack trace:

      , errorMessage=Fetch failed:org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
              at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:261)
              at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
              at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
              at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
              at java.io.DataOutputStream.write(DataOutputStream.java:107)
              at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:426)
              at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:206)
              at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:124)
              at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:110)
              at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
              at java.io.DataOutputStream.write(DataOutputStream.java:107)
              at org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToDisk(ShuffleUtils.java:146)
              at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:771)
              at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:497)
              at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:396)
              at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:195)
              at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:70)
              at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
              at java.util.concurrent.FutureTask.run(FutureTask.java:262)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.IOException: No space left on device
              at java.io.FileOutputStream.writeBytes(Native Method)
              at java.io.FileOutputStream.write(FileOutputStream.java:345)
              at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSy
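
      The failure happens while the fetcher spills the fetched data to local disk (ShuffleUtils.shuffleToDisk in the trace above). A minimal sketch of the kind of cleanup this ticket asks for, assuming a hypothetical helper (fetchToDisk and spillFile are illustrative names, not existing Tez APIs) that removes the partial spill file when the local write fails:

      // Sketch only: drop the partially written spill file when the local disk
      // write fails, so a failed fetch does not leave data behind.
      import java.io.IOException;
      import java.io.InputStream;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IOUtils;

      public class FetchSpillExample {
        static void fetchToDisk(InputStream shuffleStream, Path spillFile, Configuration conf)
            throws IOException {
          FileSystem localFs = FileSystem.getLocal(conf);
          boolean copied = false;
          try {
            // Copy the shuffle stream into the local spill file (analogous to the
            // write done by ShuffleUtils.shuffleToDisk in the stack trace above).
            IOUtils.copyBytes(shuffleStream, localFs.create(spillFile, true), conf, true);
            copied = true;
          } finally {
            if (!copied) {
              // The write failed (e.g. "No space left on device"); remove the
              // partial file instead of leaving it for later attempts to hit.
              localFs.delete(spillFile, false);
            }
          }
        }
      }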
      

      This would also affect any other job running on the cluster at the same time. It would be helpful to clean up the data downloaded for the failed task attempts (a rough sketch follows the file listing below).
      This ticket is mainly about the unordered fetcher case, though the ordered shuffle case is likely similar.

      e.g. leftover files:

      17M	/hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_0_10023_src_62_spill_-1.out
      18M	/hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_0_10023_src_63_spill_-1.out
      16M	/hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_0_10023_src_64_spill_-1.out
      ..
      ..
      
      18M	/hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_2_10003_src_0_spill_-1.out
      17M	/hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_2_10003_src_13_spill_-1.out
      16M	/hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_2_10003_src_15_spill_-1.out
      16M	/hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_000028_2_10003_src_17_spill_-1.ou
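
      In addition to dropping the partial file for the current fetch, spill files already left behind by earlier failed attempts (like the ones listed above) could be removed when an attempt fails. A rough best-effort sketch, assuming a hypothetical helper (cleanupSpillFiles and its parameters are illustrative only, not the attached patch):

      // Sketch only: best-effort removal of "<attemptIdPrefix>*_spill_*.out" files
      // that a failed task attempt left in the application's local dirs.
      import java.io.File;
      import java.io.FilenameFilter;

      public class FailedAttemptCleanup {
        static void cleanupSpillFiles(String[] localDirs, final String attemptIdPrefix) {
          for (String dir : localDirs) {
            File[] leftovers = new File(dir).listFiles(new FilenameFilter() {
              public boolean accept(File d, String name) {
                return name.startsWith(attemptIdPrefix) && name.contains("_spill_");
              }
            });
            if (leftovers == null) {
              continue; // directory missing or unreadable; nothing to clean
            }
            for (File f : leftovers) {
              // Best effort: if the delete fails, the NodeManager still removes
              // the app cache once the application finishes.
              f.delete();
            }
          }
        }
      }

      For the listing above this would be invoked with the application's local dirs and a prefix such as attempt_1476667862449_0043_1_07_000028_0.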
      

      Attachments

        1. TEZ-3479.1.patch
          8 kB
          Rajesh Balamohan
        2. TEZ-3479.branch-0.7.001.patch
          8 kB
          Rajesh Balamohan

        Activity

          People

            Assignee: Rajesh Balamohan (rajesh.balamohan)
            Reporter: Rajesh Balamohan (rajesh.balamohan)
            Votes: 0
            Watchers: 4
