[TEZ-1808] Job can fail because intermediate file names can become too long in specific situations

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.3
    • Component/s: None
    • Labels: None

    Description

      I ran Hive 0.14 on Tez 0.5.2 and on master with MemToMemMerger disabled; this configuration change is the difference between TEZ-1807 and this JIRA. The input data is 100 GB of text generated by RandomTextWriter.

      create external table randomText100GB(
        text string
      ) location 'hdfs:///user/ozawa/randomText100GB';
      
      CREATE TABLE wordcount AS
      SELECT word, count(1) AS count
      FROM (SELECT EXPLODE(SPLIT(LCASE(REGEXP_REPLACE(text,'[\\p{Punct},\\p{Cntrl}]','')),' '))
      AS word FROM randomText100GB) words
      GROUP BY word;
      

      As a result, an exception is thrown:

      --------------------------------------------------------------------------------
      VERTICES        STATUS    TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
      --------------------------------------------------------------------------------
      Map 1 ......... KILLED      115        110        0        5       0       5
      Reducer 2       FAILED        3          0        0        3       1       2
      --------------------------------------------------------------------------------
      VERTICES: 00/02 [========================>>--] 93% ELAPSED TIME: 110.95 s
      --------------------------------------------------------------------------------
      Status: Failed
      Vertex failed, vertexName=Reducer 2, vertexId=vertex_1417036912823_0073_1_01, diagnostics=[Task failed, taskId=task_1417036912823_0073_1_01_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: exceptionThrown=org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in DiskToDiskMerger [Map_1]
      at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:338)
      at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:319)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.FileNotFoundException: /hadoop1/tmp/nm-local-dir/usercache/ozawa/appcache/application_1417036912823_0073/attempt_1417036912823_0073_1_01_000000_0_10026_spill_215.out.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged (File name too long)
      at java.io.FileOutputStream.open(Native Method)
      at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
      at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:211)
      at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:207)
      at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:270)
      at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:257)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:773)
      at org.apache.tez.runtime.library.common.sort.impl.IFile$Writer.<init>(IFile.java:129)
      at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager$OnDiskMerger.merge(MergeManager.java:702)
      at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeThread.run(MergeThread.java:89)
      , errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in DiskToDiskMerger [Map_1]
      at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:338)
      at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.call(Shuffle.java:319)
      at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      at java.lang.Thread.run(Thread.java:745)
      Caused by: java.io.FileNotFoundException: /hadoop1/tmp/nm-local-dir/usercache/ozawa/appcache/application_1417036912823_0073/attempt_1417036912823_0073_1_01_000000_0_10026_spill_215.out.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged (File name too long)
      at java.io.FileOutputStream.open(Native Method)
      at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
      at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:211)
      at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:207)
      at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:270)
      at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:257)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
      at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:773)
      at org.apache.tez.runtime.library.common.sort.impl.IFile$Writer.<init>(IFile.java:129)
      at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager$OnDiskMerger.merge(MergeManager.java:702)
      at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeThread.run(MergeThread.java:89)
      ]], Vertex failed as one or more tasks failed. failedTasks:1, Vertex vertex_1417036912823_0073_1_01 [Reducer 2] killed/failed due to:null]
      Vertex killed, vertexName=Map 1, vertexId=vertex_1417036912823_0073_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0, Vertex vertex_1417036912823_0073_1_00 [Map 1] killed/failed due to:null]
      DAG failed due to vertex failure. failedVertices:1 killedVertices:1
      FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask

      The file name in this line of the log looks strange: ".merged" is appended again and again until the name is rejected as too long:

      Caused by: java.io.FileNotFoundException: /hadoop1/tmp/nm-local-dir/usercache/ozawa/appcache/application_1417036912823_0073/attempt_1417036912823_0073_1_01_000000_0_10026_spill_215.out.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged.merged (File name too long)
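
The growth of the file name can be illustrated with a small sketch. It assumes a per-name limit of 255 bytes, which is common on Linux file systems such as ext4 but is not stated in this report. The class and method names are hypothetical and are not part of Tez. The sketch only counts how many times a suffix can be appended to the spill file name before the assumed limit is exceeded.

```java
import java.nio.charset.StandardCharsets;

public class MergedNameDemo {
    // Assumption: a single file name may be at most 255 bytes (typical NAME_MAX on Linux).
    static final int ASSUMED_NAME_MAX_BYTES = 255;

    /**
     * Returns the number of times the suffix must be appended to baseName
     * before the resulting name first exceeds maxBytes (measured in UTF-8 bytes).
     * Returns 0 if baseName is already too long.
     */
    static int appendsUntilTooLong(String baseName, String suffix, int maxBytes) {
        if (suffix.isEmpty()) {
            throw new IllegalArgumentException("suffix must not be empty");
        }
        String name = baseName;
        int appends = 0;
        // Keep appending while the current name still fits within the limit.
        while (name.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
            name = name + suffix;
            appends++;
        }
        return appends;
    }

    public static void main(String[] args) {
        // File name component taken from the exception message above.
        String base = "attempt_1417036912823_0073_1_01_000000_0_10026_spill_215.out";
        int appends = appendsUntilTooLong(base, ".merged", ASSUMED_NAME_MAX_BYTES);
        System.out.println("Base name length: " + base.length() + " bytes");
        System.out.println("Appends of \".merged\" until the name exceeds "
                + ASSUMED_NAME_MAX_BYTES + " bytes: " + appends);
    }
}
```

Under this assumption, any scheme that adds a fixed suffix on every on-disk merge pass will eventually hit the limit once enough passes touch the same file. A name that stays bounded in length regardless of the number of passes would avoid that.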

      Attachments

        1. TEZ-1808.1.patch
          3 kB
          Tsuyoshi Ozawa
        2. TEZ-1808.2.patch
          3 kB
          Tsuyoshi Ozawa
        3. TEZ-1808.3.patch
          2 kB
          Tsuyoshi Ozawa
        4. TEZ-1808-wip.1.patch
          1.0 kB
          Tsuyoshi Ozawa

          People

            Assignee: Tsuyoshi Ozawa (ozawa)
            Reporter: Tsuyoshi Ozawa (ozawa)