Hive / HIVE-1083

allow sub-directories for an external table/partition

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.6.0
    • Fix Version/s: None
    • Component/s: Query Processor

      Description

      Sometimes users want to define an external table/partition based on all files (recursively) inside a directory.

      Currently, most Hadoop InputFormat classes do not support this. We should enumerate all files recursively under the directory and add them to the job's input paths.
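
      A minimal sketch of that enumeration step, assuming Hadoop's old mapred API (FileSystem, FileStatus, and FileInputFormat.addInputPath are the real APIs; the recursive helper itself is hypothetical):

          import java.io.IOException;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.mapred.FileInputFormat;
          import org.apache.hadoop.mapred.JobConf;

          public class RecursiveInputPaths {
            // Walk the tree rooted at dir and add every plain file to the job
            // input, so FileInputFormat never sees a sub-directory.
            static void addFilesRecursively(JobConf job, FileSystem fs, Path dir)
                throws IOException {
              for (FileStatus stat : fs.listStatus(dir)) {
                if (stat.isDir()) {
                  addFilesRecursively(job, fs, stat.getPath()); // descend
                } else {
                  FileInputFormat.addInputPath(job, stat.getPath());
                }
              }
            }
          }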


          Activity

          Harsh J added a comment -

          Hi,

          Can you confirm you used MR2, and that the config toggle specified in MAPREDUCE-1501 was turned on?

          Jean-Marc Spaggiari added a comment -

          Hi. I don't think MAPREDUCE-1501 fixes this issue; I'm still getting errors like:

          java.io.IOException: Not a file: hdfs://namenode/user/abcd/data/efgh/logs/2013/07/02
          at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:212)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:292)
          at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:329)
          at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1090)
          at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1082)
          
          Harsh J added a comment -

          Given that MAPREDUCE-1501 is in MR2 today, and Hive can make use of it, should we close this out now?

          John Sichi added a comment -

          Correction: the local file system is probably OK; I just realized that when I tested, I was using the stock Hadoop 0.20 release, which does not include MAPREDUCE-1501.

          John Sichi added a comment -

          Clarification: you can already get the desired behavior using HDFS, MAPREDUCE-1501, and mapred.input.dir.recursive=true, as long as your query doesn't hit one of the corner cases enumerated by Zheng below. (A sketch of enabling the toggle follows the list.)

          What remains for this task are the following:

          (1) support local file system as well (this failed when I tested it, but I didn't look into why)

          (2) deal with HIVE-1133 to fix the corner cases
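
          For reference, a minimal sketch of enabling the toggle from job-submission code, assuming the old mapred API (the property name is the one given above; the class is purely illustrative):

              import org.apache.hadoop.mapred.JobConf;

              public class EnableRecursiveInput {
                public static void main(String[] args) {
                  JobConf job = new JobConf();
                  // The MAPREDUCE-1501 toggle: with it set, FileInputFormat
                  // descends into sub-directories instead of failing on them.
                  job.setBoolean("mapred.input.dir.recursive", true);
                }
              }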

          Zheng Shao added a comment -

          Corner cases:
          C1. We have 4 external tables: abc_recursive, abc, abc_def_recursive, abc_def.
          abc_recursive and abc both point to /abc.
          abc_def and abc_def_recursive both point to /abc/def.
          abc_recursive and abc_def_recursive have the "recursive" bit set.

          In ExecDriver, given all tables, we need to find all paths that need to be added to the input path.
          In MapOperator, given the current input path, we need to find all the aliases that it corresponds to (see the sketch below).
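
          A hypothetical illustration of that MapOperator-side lookup for case C1 (class and method names are invented for this sketch, not Hive's actual code):

              import java.util.ArrayList;
              import java.util.LinkedHashMap;
              import java.util.List;
              import java.util.Map;

              public class AliasResolver {
                // Root directory of an alias, plus its "recursive" bit.
                static class Root {
                  final String path;
                  final boolean recursive;
                  Root(String path, boolean recursive) {
                    this.path = path;
                    this.recursive = recursive;
                  }
                }

                static String parentOf(String file) {
                  int slash = file.lastIndexOf('/');
                  return slash <= 0 ? "/" : file.substring(0, slash);
                }

                // Non-recursive aliases match only direct children of their root;
                // recursive aliases match any descendant.
                static List<String> aliasesFor(Map<String, Root> roots, String file) {
                  List<String> out = new ArrayList<String>();
                  for (Map.Entry<String, Root> e : roots.entrySet()) {
                    Root r = e.getValue();
                    boolean direct = parentOf(file).equals(r.path);
                    boolean below = file.startsWith(r.path + "/");
                    if (direct || (r.recursive && below)) {
                      out.add(e.getKey());
                    }
                  }
                  return out;
                }

                public static void main(String[] args) {
                  Map<String, Root> roots = new LinkedHashMap<String, Root>();
                  roots.put("abc", new Root("/abc", false));
                  roots.put("abc_recursive", new Root("/abc", true));
                  roots.put("abc_def", new Root("/abc/def", false));
                  roots.put("abc_def_recursive", new Root("/abc/def", true));
                  // A file under /abc/def belongs to three of the four aliases:
                  // prints [abc_recursive, abc_def, abc_def_recursive]
                  System.out.println(aliasesFor(roots, "/abc/def/part-00000"));
                }
              }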

          Zheng Shao added a comment -

          This can be done as part of HIVE-951.
          Basically, by default, a table's file name pattern will be "*".
          If users want to include all files recursively, they just need to say something like "*/".

          We can either use Java regex or globbing (*, ?). There are pros and cons to both; I don't see a reason why one is significantly better than the other. A small illustration of the two styles follows.
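
          For instance (plain Java, not Hive code; the glob patterns are hand-translated to their regex equivalents):

              import java.util.regex.Pattern;

              public class PatternStyles {
                public static void main(String[] args) {
                  String name = "part-00000";
                  // Java regex: the full regular-expression syntax is available.
                  System.out.println(Pattern.matches("part-\\d+", name));  // true
                  // Glob "part-*" corresponds to the regex "part-.*".
                  System.out.println(Pattern.matches("part-.*", name));    // true
                  // Glob "?" (any single character) corresponds to regex ".".
                  System.out.println(Pattern.matches("part-.0000", name)); // true
                }
              }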

          Raghotham Murthy added a comment -

          One use case where this will be helpful is creating external tables on existing directory trees. Currently we need to create one partition per lowest-level directory. Instead, it would be great if Hive allowed creating a partition on a top-level directory and picked up all files within that directory tree.


            People

            • Assignee:
              Zheng Shao
            • Reporter:
              Namit Jain
            • Votes:
              15
            • Watchers:
              21
