HIVE-1133

Refactor InputFormat and OutputFormat for Hive

    Details

      Description

      We have run into several problems with the FileInputFormat/OutputFormat in Hive.

      The requirements are:
      R1. We want to support HBase: HIVE-806
      R2. We want to selectively include files based on file names: HIVE-951
      R3. We want to optionally choose to recurse on the directory structure: HIVE-1083
      R4. We want to pass the filter condition into the storage layer (very useful for HBase and for indexed data formats)
      R5. We want to pass the column selection information into the storage (already done as part of the RCFile, but we can do it better)

      We need to organize these requirements, and the code that implements them, so that the design stays extensible.
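
      As an illustration of R2 and R3 only, the following is a minimal sketch that uses plain `java.nio.file` rather than Hadoop's `FileSystem` API. The class `InputFileLister` and its `list` method are hypothetical names chosen for this example and are not part of Hive. A file-name predicate covers R2 and a boolean flag covers R3.

      ```java
      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.util.ArrayList;
      import java.util.List;
      import java.util.function.Predicate;
      import java.util.stream.Stream;

      /** Hypothetical helper (not part of Hive): lists input files under a root directory. */
      final class InputFileLister {
          private InputFileLister() {}

          /**
           * @param root       directory to scan
           * @param nameFilter a file is kept only when this accepts its file name (R2)
           * @param recursive  when true, subdirectories are scanned as well (R3)
           */
          static List<Path> list(Path root, Predicate<String> nameFilter, boolean recursive)
                  throws IOException {
              // Depth 1 visits only the direct children of root.
              int depth = recursive ? Integer.MAX_VALUE : 1;
              List<Path> result = new ArrayList<>();
              try (Stream<Path> paths = Files.walk(root, depth)) {
                  paths.filter(Files::isRegularFile)
                       .filter(p -> nameFilter.test(p.getFileName().toString()))
                       .sorted()
                       .forEach(result::add);
              }
              return result;
          }
      }
      ```

      Keeping the filter and the recursion flag as separate parameters means either requirement can be switched on without affecting the other.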

          Activity

          Joel Bondurant added a comment -

          A workaround for S3 is to port all Hive code to Pig.

          Joel Bondurant made changes -
          Affects Version/s: 0.6.0
          Owen O'Malley made changes -
          Link: This issue is related to HIVE-3660
          Carl Steinbach made changes -
          Component/s: HBase Handler
          Component/s: Serializers/Deserializers
          John Sichi made changes -
          Link: This issue blocks HIVE-1226
          John Sichi made changes -
          Link: This issue blocks HIVE-1222
          Jeff Hammerbacher made changes -
          Link: This issue relates to HIVE-705
          He Yongqiang added a comment -

          Another possible requirement: add support for Zebra's file format.

          Namit Jain added a comment -

          R4. We can even exploit the sort order of the data. We know when a table is sorted/bucketed,
          but we never make use of it.
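
          One way to picture how sortedness could be used is sketched below: when rows are known to be in ascending key order, a range filter can locate its start and end by binary search instead of scanning every row. This is an in-memory illustration only. `SortedScan`, `lowerBound`, and `range` are hypothetical names and do not correspond to existing Hive code.

          ```java
          import java.util.List;

          /** Hypothetical helper (not part of Hive): range lookup over ascending data. */
          final class SortedScan {
              private SortedScan() {}

              /** Index of the first element >= key, or sorted.size() if there is none. */
              static <T extends Comparable<? super T>> int lowerBound(List<T> sorted, T key) {
                  int lo = 0;
                  int hi = sorted.size();
                  while (lo < hi) {
                      int mid = (lo + hi) >>> 1;
                      if (sorted.get(mid).compareTo(key) < 0) {
                          lo = mid + 1;   // everything up to mid is too small
                      } else {
                          hi = mid;       // mid may be the answer
                      }
                  }
                  return lo;
              }

              /** View of the elements with from <= value < to; empty if from > to. */
              static <T extends Comparable<? super T>> List<T> range(List<T> sorted, T from, T to) {
                  int start = lowerBound(sorted, from);
                  int end = lowerBound(sorted, to);
                  return sorted.subList(start, Math.max(start, end));
              }
          }
          ```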

          Ning Zhang added a comment -

          R4 (pushing down simple predicates) is also useful for RCFile or any FileFormat internal to Hive since we can implement a faster "search-based" HiveRecordReader that takes a set of predicates and only returns satisfying records.
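
          A minimal sketch of such a predicate-aware reader follows, assuming rows arrive through a plain `java.util.Iterator`. `FilteringRecordReader` is a hypothetical class written for this illustration. It is not Hive's `HiveRecordReader` and does not implement Hadoop's `RecordReader` interface. It only shows the idea of handing the predicates to the reader so that non-matching rows never reach the caller.

          ```java
          import java.util.Iterator;
          import java.util.List;
          import java.util.NoSuchElementException;
          import java.util.function.Predicate;

          /** Hypothetical reader (not Hive's API): yields only rows that satisfy every predicate. */
          final class FilteringRecordReader<R> implements Iterator<R> {
              private final Iterator<R> source;
              private final List<Predicate<R>> predicates;
              private R nextRow;
              private boolean hasNextRow;

              FilteringRecordReader(Iterator<R> source, List<Predicate<R>> predicates) {
                  this.source = source;
                  this.predicates = predicates;
                  advance();
              }

              /** Moves to the next row that passes all predicates, if one exists. */
              private void advance() {
                  hasNextRow = false;
                  nextRow = null;
                  while (source.hasNext()) {
                      R candidate = source.next();
                      if (predicates.stream().allMatch(p -> p.test(candidate))) {
                          nextRow = candidate;
                          hasNextRow = true;
                          return;
                      }
                  }
              }

              @Override
              public boolean hasNext() {
                  return hasNextRow;
              }

              @Override
              public R next() {
                  if (!hasNextRow) {
                      throw new NoSuchElementException();
                  }
                  R result = nextRow;
                  advance();
                  return result;
              }
          }
          ```

          A storage format with an index could replace the linear loop in `advance()` with a seek, which is where the "search-based" speedup would come from.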

          Zheng Shao added a comment -

          Thanks for the note, Bennie. In the future, please assign the issue to yourself and click "submit patch" so that we know it's ready for review (we will "cancel patch" if we have comments).

          Bennie Schut added a comment -

          This could conflict with some changes I made for HIVE-1019. A patch is available for that one.

          Zheng Shao added a comment -

          Functions related to this refactoring:

          ExecDriver.addInputPaths
          HiveInputFormat.getSplits
          CombineHiveInputFormat.getSplits
          ExecMap.configure
          
          Zheng Shao made changes -
          Link This issue blocks HIVE-806 [ HIVE-806 ]
          Zheng Shao made changes -
          Link This issue blocks HIVE-951 [ HIVE-951 ]
          Zheng Shao made changes -
          Link This issue blocks HIVE-1083 [ HIVE-1083 ]
          Zheng Shao made changes -
          Field: Description
          Original Value: the same text as the description above, except that R3 referenced HIVE-108.
          New Value: the description above, with R3 referencing HIVE-1083.
          Zheng Shao created issue -

            People

            • Assignee: Unassigned
            • Reporter: Zheng Shao
            • Votes: 2
            • Watchers: 13
