  1. Hadoop Map/Reduce
  2. MAPREDUCE-7086

Add config to allow FileInputFormat to ignore directories when recursive=false


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.0, 3.1.1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      We are trying to create a split in Hive that reads only the files directly inside a directory, not those in its subdirectories.
      This fails with the error below.
      Given how the error arises (two pieces of code interact: one explicitly adds directories to the results without failing, and the other fails on any directory in the results), this looks like a bug.

      Caused by: java.io.IOException: Not a file: file:/,...warehouse/simple_to_mm_text/delta_0000001_0000001_0000
      	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:329) ~[hadoop-mapreduce-client-core-3.1.0.jar:?]
      	at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:553) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:754) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
      	at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:203) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
      

      When recursion is disabled, this code adds directories to the results:

       
      if (recursive && stat.isDirectory()) {
        result.dirsNeedingRecursiveCalls.add(stat);
      } else {
        result.locatedFileStatuses.add(stat);
      }
      

      However, the getSplits code that runs afterwards computes the total size like this:

      long totalSize = 0;                           // compute total size
      for (FileStatus file: files) {                // check we have valid files
        if (file.isDirectory()) {
          throw new IOException("Not a file: "+ file.getPath());
        }
        totalSize +=
      

      Combined with the code above, this fails whenever recursion is disabled and the input directory contains a subdirectory.
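
      To make the interaction concrete, here is a minimal, self-contained sketch of the two steps and of what an "ignore directories" switch could look like. It is a model only, not Hadoop's code: the Entry class, the list and totalSize methods, the ignoreDirs flag, and the example paths are all hypothetical stand-ins, and the actual configuration key is not shown here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two interacting steps; none of these are Hadoop classes.
class SplitSizeSketch {

    /** Minimal stand-in for a file status entry. */
    static final class Entry {
        final String path;
        final boolean directory;
        final long length;

        Entry(String path, boolean directory, long length) {
            this.path = path;
            this.directory = directory;
            this.length = length;
        }
    }

    /** Listing step: a directory is only set aside when recursion is enabled. */
    static List<Entry> list(List<Entry> children, boolean recursive) {
        List<Entry> located = new ArrayList<>();
        for (Entry stat : children) {
            if (recursive && stat.directory) {
                continue; // the real code would queue this for a recursive listing
            }
            located.add(stat); // with recursive == false, directories end up here
        }
        return located;
    }

    /** Size step: rejects directories unless the (assumed) ignoreDirs flag is set. */
    static long totalSize(List<Entry> files, boolean ignoreDirs) throws IOException {
        long totalSize = 0;
        for (Entry file : files) {
            if (file.directory) {
                if (ignoreDirs) {
                    continue; // proposed behavior: skip the directory instead of failing
                }
                throw new IOException("Not a file: " + file.path);
            }
            totalSize += file.length;
        }
        return totalSize;
    }

    public static void main(String[] args) throws IOException {
        List<Entry> children = Arrays.asList(
            new Entry("/t/000000_0", false, 10),  // hypothetical data file
            new Entry("/t/delta_1", true, 0));    // hypothetical subdirectory
        List<Entry> listed = list(children, false);

        // With the flag on, the subdirectory is skipped and only the file is counted.
        System.out.println(totalSize(listed, true));

        // With the flag off, the model reproduces the reported failure.
        try {
            totalSize(listed, false);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

      In this model the non-recursive listing returns both entries, so the size step throws as soon as it reaches the subdirectory, unless the flag tells it to skip directories.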

      Attachments

        1. HADOOP-15403.patch
          2 kB
          Sergey Shelukhin
        2. MAPREDUCE-7086.01.patch
          7 kB
          Sergey Shelukhin
        3. MAPREDUCE-7086.patch
          2 kB
          Sergey Shelukhin

            People

              Assignee: Sergey Shelukhin (sershe)
              Reporter: Sergey Shelukhin (sershe)
              Votes: 0
              Watchers: 7
