HIVE-951

Selectively include EXTERNAL TABLE source files via REGEX

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels:
      None
    • Tags:
      external table regex

      Description

      CREATE EXTERNAL TABLE should allow users to cherry-pick files via regular expression.
      CREATE EXTERNAL TABLE was designed to allow users to access data that exists outside of Hive, and
      currently makes the assumption that all of the files located under the supplied path should be included
      in the new table. Users frequently encounter directories containing multiple
      datasets, or directories that contain data in heterogeneous schemas, and it's often
      impractical or impossible to adjust the layout of the directory to meet the requirements of
      CREATE EXTERNAL TABLE. A good example of this problem is creating an external table based
      on the contents of an S3 bucket.

      One way to solve this problem is to extend the syntax of CREATE EXTERNAL TABLE
      as follows:

      CREATE EXTERNAL TABLE
      ...
      LOCATION path [file_regex]
      ...

      For example:

      CREATE EXTERNAL TABLE mytable1 ( a string, b string, c string )
      STORED AS TEXTFILE
      LOCATION 's3://my.bucket/' 'folder/2009.*\.bz2$';
      

      Creates mytable1, which includes all files under s3://my.bucket/ whose paths match the regular expression 'folder/2009.*\.bz2$'.

      CREATE EXTERNAL TABLE mytable2 ( d string, e int, f int, g int )
      STORED AS TEXTFILE 
      LOCATION 'hdfs://data/' 'xyz.*2009....\.bz2$';
      

      Creates mytable2, which includes all files under hdfs://data/ whose names match the regular expression 'xyz.*2009....\.bz2$'.

      Attachments

      1. HIVE-951.patch
        94 kB
        Carl Steinbach

          Activity

          Namit Jain added a comment -

          I am not sure this is a good idea - it will make things very complicated - what if new files get added later?
          Isn't a possible workaround to move/copy the relevant files into a new directory and create an external table on top
          of that?

          Carl Steinbach added a comment -

          If the files you are trying to access are in an S3 bucket, copying them to a new location
          can be extremely inconvenient. I think most people in this position would gladly accept
          a little extra complexity if it allowed them to access their data without first spending an hour
          or two staging it. I'm also not sure why you think this is complex. Are you concerned about
          details of the implementation or additional demands that this places on the user?

          Namit Jain added a comment -

          I don't think the implementation will present a major problem; I am just concerned about the
          demands this places on users. Users often use Hive and Hadoop interchangeably to access
          the data, and this might be difficult for them.

          Avram Aelony added a comment -

          I want to echo Carl's point about copying to a new location being *extremely* inconvenient.

          Sometimes the copy is simply impossible, which is a clear show stopper.

          In my mind, Namit's point about interchangeability can be resolved by clear documentation, whereas it is not always possible to copy data around from bucket to bucket, especially if the data is quite large. In my case, being forced to stage large volumes of existing S3 data to other S3 buckets (copying the same data) for the purpose of Hive analysis has really slowed Hive adoption.

          Engineers output data to S3 with their own map/reduce efficiencies in mind without regard for Hive's preferred data organization so analysts are looking forward to having this feature so we can use Hive again.

          Can't wait for this feature!

          Namit Jain added a comment -

          There was a jira on virtual columns - I will pull out the jira number soon.
          One of the virtual columns proposed was the filename.

          Do you think the same use case could be solved if a filename virtual column were present -
          create an external table on the directory, and filter using the file name?
          The performance should be exactly the same - pruning should take care of that
          (after some changes).
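
          [Editor's sketch: a minimal illustration of what Namit proposes, assuming the filename virtual column ends up named INPUT__FILE__NAME (the name used when virtual columns later landed in Hive); the table and path names are illustrative, not from this issue.]

          CREATE EXTERNAL TABLE logs (a string, b string, c string)
          STORED AS TEXTFILE
          LOCATION 's3://my.bucket/folder/';

          -- Filter rows to the files that would have matched the regex,
          -- relying on pruning for performance:
          SELECT a, b, c
          FROM logs
          WHERE INPUT__FILE__NAME RLIKE '2009.*\\.bz2$';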

          Namit Jain added a comment -

          https://issues.apache.org/jira/browse/HIVE-837

          is the jira number for virtual columns I was referring to above.

          Avram Aelony added a comment -

          I think the filename can contain important information (e.g. a datestamp, the type of data it represents, etc.) that it is desirable to parse out and then group by.

          Imagine a few year's worth of data where there are 4 or more filetypes (each filetype having a different set of columns) output to a bucket every day (e.g. 20091125_type_A.gz, 20091125_type_B.gz, 20091125_type_C.gz, 20091125_type_D.gz). In fact, each day can contain 20 or more large files per filetype (e.g. 20091125_type_A_01.gz, 20091125_type_A_02.gz, 20091125_type_A_03.gz, ..., 20091125_type_A_20.gz, repeat for B,C,D, etc... ).

          It would be nice to be able to parse out new variables for date, type, and type_number (e.g. 01, 02, ..., 20) and to compute various aggregated metrics via a GROUP BY of these variables parsed from the filenames. Hopefully this parsing would not be too much of a performance bottleneck.

          So, I think there is a need both for a way to select certain files that match a regex from an S3 bucket, and also a need for capturing filename information such that it can subsequently be available for parsing and grouping. It may be possible to achieve both needs in one use case, but I don't know enough about Hive/Hadoop internals to judge myself.
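
          [Editor's sketch: the filename parsing Avram describes, again assuming an INPUT__FILE__NAME virtual column and the hypothetical logs table from the sketch above; regexp_extract(subject, pattern, index) is an existing Hive UDF.]

          -- Pull the date and filetype out of names like
          -- '20091125_type_A_01.gz', then aggregate on them:
          SELECT
            regexp_extract(INPUT__FILE__NAME, '(\\d{8})_type_([A-D])_(\\d{2})\\.gz$', 1) AS dt,
            regexp_extract(INPUT__FILE__NAME, '(\\d{8})_type_([A-D])_(\\d{2})\\.gz$', 2) AS filetype,
            count(*) AS cnt
          FROM logs
          GROUP BY
            regexp_extract(INPUT__FILE__NAME, '(\\d{8})_type_([A-D])_(\\d{2})\\.gz$', 1),
            regexp_extract(INPUT__FILE__NAME, '(\\d{8})_type_([A-D])_(\\d{2})\\.gz$', 2);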

          Prasad Chakka added a comment -

          @namit, I agree with others that this would be very convenient and useful functionality for external tables. External table data and location consistency is left to be managed by users, so changing the fileset in that location is not a Hive problem. Hive will query the location and get the list of paths at the time the query is executed. If a file gets removed or added and the query fails because of a deleted file, it is the user's responsibility.

          Avram Aelony added a comment -

          Another consideration that motivates selecting files by regex is the unfortunate fact that different filetypes (within a bucket) may not share the same record delimiter within a row of a file.

          I have seen cases where some filetypes are tab delimited and others are comma delimited, and even cases where files of a type are not columns at all but key-value pairings, requiring a map structure at Hive create-table time.

          Selecting files by regex will be quite powerful here, since it is flexible enough to let you create one table per filetype.

          Namit Jain added a comment -

          I am not questioning the utility of this - I don't like the fact that only some of the files in a directory belong to one table.
          What if Hive supported views, and you could create a view over the required files?
          Just a thought.

          Having said that, I know it would take more time to implement views than the proposed feature, but I just wanted
          to see everyone's opinions.

          Avram Aelony added a comment -

          I am not sure I understand the benefit of implementing a view here. Would it impact performance?

          Namit Jain added a comment -

          No, it is only at compile time - we can find the files either when the view is created or when the query referencing the view is
          executed. The same decision needs to be made for the current proposal as well.

          The only benefit of a view is that it seems cleaner.

          Avram Aelony added a comment -

          Interesting. As a user, I suppose I have no preference other than what can be available sooner to fill the current need.

          Carl Steinbach added a comment -

          HIVE-951.patch implements filename input filtering for external tables using the following syntax:

          CREATE EXTERNAL TABLE ext_tab (
          ...
          )
          STORED AS TEXTFILE
          LOCATION "/user/bob/ext_table_dir"
          FILENAME_FILTER '^b.*';
          

          In this example ext_tab will refer to all files in the directory /user/bob/ext_table_dir that begin with the letter 'b'.

          Namit Jain added a comment -

          Can you regenerate the patch? It is not applying cleanly.

          Zheng Shao added a comment -

          Can we extend this patch to solve the issue for HIVE-1083 as well?

          Namit Jain added a comment -

          It would be a good idea to let the user specify a partition on a bunch of files as well.

          Zheng Shao added a comment -

          With regex, the user can just give us a list like this: "(file1|file2|file3)"
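
          [Editor's sketch: Zheng's explicit file list expressed with the FILENAME_FILTER syntax from the attached patch; the anchors and table name are illustrative additions.]

          CREATE EXTERNAL TABLE three_files (a string)
          STORED AS TEXTFILE
          LOCATION '/user/bob/ext_table_dir'
          FILENAME_FILTER '^(file1|file2|file3)$';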

          Namit Jain added a comment -

          I meant:

          alter table T add partition (..) files (regex)

          Carl Steinbach added a comment -

          Namit's suggestion makes sense to me. I will add the ability to specify regex
          filename filters per partition. I will assume that a partition inherits the parent
          table's filter if it exists and none is specified for the partition. How does that
          sound?
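
          [Editor's sketch: one hypothetical shape for the per-partition filter Carl describes; this syntax is not in the attached patch, so the keyword placement is an assumption.]

          -- If no FILENAME_FILTER is given, the partition would inherit
          -- the table-level filter (when one exists):
          ALTER TABLE ext_tab ADD PARTITION (ds = '2009-11-25')
          LOCATION '/user/bob/ext_table_dir/2009-11-25'
          FILENAME_FILTER '^b.*';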

          Zheng Shao added a comment -

          Sounds good to me. Can we support multiple levels as well? ("[^/]/[^/]/.*")

          Zheng Shao added a comment -

          If it's not easy to do, we can still work on HIVE-1083 separately by adding a "recursive" property for a table.

          Zheng Shao added a comment -

          Hi Carl, please ignore the comment about HIVE-1083 - we can do that by adding a "recursive" boolean attribute to the storage descriptor.

          I assume now we can have 2 partitions of the same external table (or 2 external tables) pointing to the same directory with different filename filters (which may overlap in files). I also assume that in the future we want to support tables that are not based on files, e.g., HBase tables.

          Do you agree with these 2 assumptions?
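
          [Editor's sketch: Zheng's first assumption in the patch syntax: two external tables over the same directory with filters that may overlap in files; table names, location, and patterns are illustrative.]

          CREATE EXTERNAL TABLE type_a (a string)
          STORED AS TEXTFILE
          LOCATION '/data/shared'
          FILENAME_FILTER '.*_type_A.*\.gz$';

          -- Overlaps with type_a on November 2009 type_A files:
          CREATE EXTERNAL TABLE nov_2009 (raw string)
          STORED AS TEXTFILE
          LOCATION '/data/shared'
          FILENAME_FILTER '^200911.*\.gz$';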

          Namit Jain added a comment -

          The different tables would be linked to each other implicitly here, in a way that is not known to the metastore.

          A house-cleaning operation, like archiving a partition or something else, could invalidate the new table.
          Unless the metastore tracks this, it seems risky.

          Edward Capriolo added a comment -

          I think this is a pretty cool feature, but it is very close to the symlink input format.

          Avram Aelony added a comment -

          I am searching for sufficient documentation on the symlink input format to construct an example where this functionality would work absent HIVE-951, but I am not finding docs or examples.

          Is there documentation on the symlink input format for the case where both the symlink file and the data bucket are on S3?
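
          [Editor's sketch: how Hive's SymlinkTextInputFormat (org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat) is typically wired up. Whether the link files can themselves live on S3 and point at s3:// paths depends on the cluster's filesystem configuration, so treat that part as an assumption to verify.]

          -- The table's LOCATION holds small text files whose lines are
          -- paths to the real data files (the "symlinks"):
          CREATE EXTERNAL TABLE sym_tab (a string, b string, c string)
          STORED AS
            INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
            OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
          LOCATION 's3://my.bucket/symlinks/';

          -- A link file under s3://my.bucket/symlinks/ might contain:
          --   s3://my.bucket/folder/20091125_type_A_01.gz
          --   s3://my.bucket/folder/20091125_type_A_02.gz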

          Chunzhi Meng added a comment -

          I also think this feature is very necessary.

          pandeeswaran added a comment -

          Yes, this feature will be really useful for loading data from S3.

          Edward Capriolo added a comment -

          This is being handled via another jira, I think. It will be done via virtual column support: the file name will be a string that can be used as part of a WHERE clause.

          indrajit added a comment -

          CREATE EXTERNAL TABLE allows users to use a table on top of HDFS.
          It's a good feature, and it does not check whether the path exists;
          after creating the table you can lazily create the path.

          indrajit added a comment -

          External tables really give you the power to use different tools on top of the table, so you get a chance to do data mining. They are also very fast and easy to create.

          Kingshuk Chatterjee added a comment -

          Agreed. I came across this ticket while trying to solve exactly the same problem. I have GBs worth of files in a folder on S3, and moving them into individual subfolders is housekeeping I would gladly trade for the added complexity in the CREATE TABLE command.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Carl Steinbach
            • Votes:
              21
            • Watchers:
              29