Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      This patch adds support for the 'TABLESAMPLE(x PERCENT)' clause.

      Description

      We need a better input sampling to serve at least two purposes:
      1. test their queries against a smaller data set
      2. understand more about how the data look like without scanning the whole table.
      A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

      1. HIVE-2121.8.patch
        754 kB
        Siying Dong
      2. HIVE-2121.7.patch
        266 kB
        Siying Dong
      3. HIVE-2121.6.patch
        266 kB
        Siying Dong
      4. HIVE-2121.5.patch
        218 kB
        Siying Dong
      5. HIVE-2121.4.patch
        209 kB
        Siying Dong
      6. HIVE-2121.3.patch
        210 kB
        Siying Dong
      7. HIVE-2121.2.patch
        208 kB
        Siying Dong
      8. HIVE-2121.1.patch
        37 kB
        Siying Dong

        Activity

        Hide
        Namit Jain added a comment -

        Committed. Thanks Siying

        Show
        Namit Jain added a comment - Committed. Thanks Siying
        Hide
        Namit Jain added a comment -

        running tests

        Show
        Namit Jain added a comment - running tests
        Hide
        Siying Dong added a comment -

        fix tests or test outputs.

        Show
        Siying Dong added a comment - fix tests or test outputs.
        Hide
        Namit Jain added a comment -

        Some tests are failing:

        split_sample.q

        Some tests in TestMinimrCliDriver (log updates)

        almost all test in TestParse (log updates)

        Show
        Namit Jain added a comment - Some tests are failing: split_sample.q Some tests in TestMinimrCliDriver (log updates) almost all test in TestParse (log updates)
        Hide
        Namit Jain added a comment -

        +1

        Show
        Namit Jain added a comment - +1
        Hide
        Siying Dong added a comment -

        move instanceof InputSplitShim to assert.

        Show
        Siying Dong added a comment - move instanceof InputSplitShim to assert.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/#review605
        -----------------------------------------------------------

        trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java
        <https://reviews.apache.org/r/633/#comment1249>

        talked to siying offline -

        the check:

        if (split instanceof Hadoop20Shims.InputSplitShim)

        is not needed - this can be replaced by an assert.

        Same in Hadoop20SShims.

        Otherwise looks good

        • namit

        On 2011-04-28 08:32:17, Siying Dong wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/633/

        -----------------------------------------------------------

        (Updated 2011-04-28 08:32:17)

        Review request for hive, Ning Zhang and namit jain.

        Summary

        -------

        We need a better input sampling to serve at least two purposes:

        1. test their queries against a smaller data set

        2. understand more about how the data look like without scanning the whole table.

        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.

        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs

        -----

        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852

        trunk/conf/hive-default.xml 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852

        trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION

        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION

        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852

        trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852

        trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852

        trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852

        Diff: https://reviews.apache.org/r/633/diff

        Testing

        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review605 ----------------------------------------------------------- trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java < https://reviews.apache.org/r/633/#comment1249 > talked to siying offline - the check: if (split instanceof Hadoop20Shims.InputSplitShim) is not needed - this can be replaced by an assert. Same in Hadoop20SShims. Otherwise looks good namit On 2011-04-28 08:32:17, Siying Dong wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-28 08:32:17) Review request for hive, Ning Zhang and namit jain. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs ----- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 trunk/conf/hive-default.xml 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852 trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852 trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        Namit Jain added a comment -

        minor comments in review-board.

        Show
        Namit Jain added a comment - minor comments in review-board.
        Hide
        Siying Dong added a comment -

        forgot a file.

        Show
        Siying Dong added a comment - forgot a file.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/
        -----------------------------------------------------------

        (Updated 2011-04-28 08:32:17.534107)

        Review request for hive, Ning Zhang and namit jain.

        Changes
        -------

        Two changes made according to Namit's comments:
        1. explain will print out some about the sampling. (It might not be the best way to print but it follows the framework)
        2. the granularity of sampling is down from split-level to HDFS block level.

        Summary
        -------

        We need a better input sampling to serve at least two purposes:
        1. test their queries against a smaller data set
        2. understand more about how the data look like without scanning the whole table.
        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.
        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs (updated)


        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852
        trunk/conf/hive-default.xml 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852
        trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION
        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
        trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION
        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852
        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852
        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852
        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852
        trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852
        trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852
        trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852

        Diff: https://reviews.apache.org/r/633/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-28 08:32:17.534107) Review request for hive, Ning Zhang and namit jain. Changes ------- Two changes made according to Namit's comments: 1. explain will print out some about the sampling. (It might not be the best way to print but it follows the framework) 2. the granularity of sampling is down from split-level to HDFS block level. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs (updated) trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 trunk/conf/hive-default.xml 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852 trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852 trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        Siying Dong added a comment -

        Two changes made according to Namit's comments:
        1. explain will print out some about the sampling. (It might not be the best way to print but it follows the framework)
        2. the granularity of sampling is down from split-level to HDFS block level.

        Show
        Siying Dong added a comment - Two changes made according to Namit's comments: 1. explain will print out some about the sampling. (It might not be the best way to print but it follows the framework) 2. the granularity of sampling is down from split-level to HDFS block level.
        Hide
        Namit Jain added a comment -

        A few other comments:

        The percentage of data read is currently a function
        of the split size - we should mark the last split
        specially, and only read the required data at runtime.
        I mean, if we have only 1 split, and we need to sample
        10% of the data, there should be a way to do so. Currently,
        it seems impossible.

        Show
        Namit Jain added a comment - A few other comments: The percentage of data read is currently a function of the split size - we should mark the last split specially, and only read the required data at runtime. I mean, if we have only 1 split, and we need to sample 10% of the data, there should be a way to do so. Currently, it seems impossible.
        Hide
        Namit Jain added a comment -

        A few other comments:

        1. The behavior should be reflected in explain plan extended.
        Yongqiang recently added a similar plan change in pathtoAlias etc.
        in HIVE-2126. Take a look as an example.
        2. It would increase the utility if I can add a new session variable
        where I specify the split percentage - this will be applicable to
        all tables (this can also be done in a follow-up)

        Show
        Namit Jain added a comment - A few other comments: 1. The behavior should be reflected in explain plan extended. Yongqiang recently added a similar plan change in pathtoAlias etc. in HIVE-2126 . Take a look as an example. 2. It would increase the utility if I can add a new session variable where I specify the split percentage - this will be applicable to all tables (this can also be done in a follow-up)
        Hide
        Namit Jain added a comment -

        Add the new configuration variables in the name of the jira

        Show
        Namit Jain added a comment - Add the new configuration variables in the name of the jira
        Hide
        Siying Dong added a comment -

        remove the switch.
        I ran it against about 800 partitions input and didn't see obvious client time increase.

        Show
        Siying Dong added a comment - remove the switch. I ran it against about 800 partitions input and didn't see obvious client time increase.
        Hide
        Siying Dong added a comment -

        remove the switch

        Show
        Siying Dong added a comment - remove the switch
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/#review567
        -----------------------------------------------------------

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/633/#comment1205>

        This function could be quite expensive inside the loops. You may want to test a case where there are large # of partitions and each partition contains a large # of small files.

        • Ning

        On 2011-04-26 21:19:18, Siying Dong wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/633/

        -----------------------------------------------------------

        (Updated 2011-04-26 21:19:18)

        Review request for hive, Ning Zhang and namit jain.

        Summary

        -------

        We need a better input sampling to serve at least two purposes:

        1. test their queries against a smaller data set

        2. understand more about how the data look like without scanning the whole table.

        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.

        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs

        -----

        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852

        trunk/conf/hive-default.xml 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852

        trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION

        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION

        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852

        Diff: https://reviews.apache.org/r/633/diff

        Testing

        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review567 ----------------------------------------------------------- trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/633/#comment1205 > This function could be quite expensive inside the loops. You may want to test a case where there are large # of partitions and each partition contains a large # of small files. Ning On 2011-04-26 21:19:18, Siying Dong wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-26 21:19:18) Review request for hive, Ning Zhang and namit jain. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs ----- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 trunk/conf/hive-default.xml 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2011-04-26 20:50:30, Siying Dong wrote:

        > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498

        > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>

        >

        > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?

        Ning Zhang wrote:

        I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand. I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly.

        Siying Dong wrote:

        OK. I'll remove this switch. We need to document and communicate very well to users. People will easily misunderstand this.

        Sounds good. Yes, we should clearly document it in the wiki by what it does and does not.

        On 2011-04-26 20:50:30, Siying Dong wrote:

        > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392

        > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>

        >

        > limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.

        Ning Zhang wrote:

        I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it.

        Siying Dong wrote:

        I think it is the misunderstanding here. Limit works with split sampling well. No exception will be thrown in any of those two combination and the result will be what we expected.

        This condition only disabled the optimization that runs against a smaller data set for some limit queries. With split sampling, user already specific what a percentage to sample, there is no need to further run the query against a small subset of the inputs. From test case split_sample.q, you can already see how they work together well.

        oh I see. Thanks for the clarification!

        • Ning

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/#review561
        -----------------------------------------------------------

        On 2011-04-26 21:19:18, Siying Dong wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/633/

        -----------------------------------------------------------

        (Updated 2011-04-26 21:19:18)

        Review request for hive, Ning Zhang and namit jain.

        Summary

        -------

        We need a better input sampling to serve at least two purposes:

        1. test their queries against a smaller data set

        2. understand more about how the data look like without scanning the whole table.

        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.

        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs

        -----

        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852

        trunk/conf/hive-default.xml 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852

        trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION

        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION

        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852

        Diff: https://reviews.apache.org/r/633/diff

        Testing

        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - On 2011-04-26 20:50:30, Siying Dong wrote: > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498 > < https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498 > > > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think? Ning Zhang wrote: I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand. I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly. Siying Dong wrote: OK. I'll remove this switch. We need to document and communicate very well to users. People will easily misunderstand this. Sounds good. Yes, we should clearly document it in the wiki by what it does and does not. On 2011-04-26 20:50:30, Siying Dong wrote: > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392 > < https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392 > > > limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit. Ning Zhang wrote: I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it. Siying Dong wrote: I think it is the misunderstanding here. Limit works with split sampling well. No exception will be thrown in any of those two combination and the result will be what we expected. This condition only disabled the optimization that runs against a smaller data set for some limit queries. With split sampling, user already specific what a percentage to sample, there is no need to further run the query against a small subset of the inputs. From test case split_sample.q, you can already see how they work together well. oh I see. Thanks for the clarification! Ning ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review561 ----------------------------------------------------------- On 2011-04-26 21:19:18, Siying Dong wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-26 21:19:18) Review request for hive, Ning Zhang and namit jain. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs ----- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 trunk/conf/hive-default.xml 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2011-04-26 20:50:30, Siying Dong wrote:

        > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498

        > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>

        >

        > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?

        Ning Zhang wrote:

        I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand. I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly.

        OK. I'll remove this switch. We need to document and communicate very well to users. People will easily misunderstand this.

        On 2011-04-26 20:50:30, Siying Dong wrote:

        > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392

        > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>

        >

        > limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.

        Ning Zhang wrote:

        I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it.

        I think it is the misunderstanding here. Limit works with split sampling well. No exception will be thrown in any of those two combination and the result will be what we expected.
        This condition only disabled the optimization that runs against a smaller data set for some limit queries. With split sampling, user already specific what a percentage to sample, there is no need to further run the query against a small subset of the inputs. From test case split_sample.q, you can already see how they work together well.

        • Siying

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/#review561
        -----------------------------------------------------------

        On 2011-04-26 21:19:18, Siying Dong wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/633/

        -----------------------------------------------------------

        (Updated 2011-04-26 21:19:18)

        Review request for hive, Ning Zhang and namit jain.

        Summary

        -------

        We need a better input sampling to serve at least two purposes:

        1. test their queries against a smaller data set

        2. understand more about how the data look like without scanning the whole table.

        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.

        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs

        -----

        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852

        trunk/conf/hive-default.xml 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852

        trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION

        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION

        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852

        Diff: https://reviews.apache.org/r/633/diff

        Testing

        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - On 2011-04-26 20:50:30, Siying Dong wrote: > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498 > < https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498 > > > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think? Ning Zhang wrote: I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand. I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly. OK. I'll remove this switch. We need to document and communicate very well to users. People will easily misunderstand this. On 2011-04-26 20:50:30, Siying Dong wrote: > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392 > < https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392 > > > limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit. Ning Zhang wrote: I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it. I think it is the misunderstanding here. Limit works with split sampling well. No exception will be thrown in any of those two combination and the result will be what we expected. This condition only disabled the optimization that runs against a smaller data set for some limit queries. With split sampling, user already specific what a percentage to sample, there is no need to further run the query against a small subset of the inputs. From test case split_sample.q, you can already see how they work together well. Siying ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review561 ----------------------------------------------------------- On 2011-04-26 21:19:18, Siying Dong wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-26 21:19:18) Review request for hive, Ning Zhang and namit jain. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs ----- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 trunk/conf/hive-default.xml 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        jiraposter@reviews.apache.org added a comment -

        On 2011-04-26 20:50:30, Siying Dong wrote:

        > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498

        > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>

        >

        > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?

        I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand. I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly.

        On 2011-04-26 20:50:30, Siying Dong wrote:

        > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392

        > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>

        >

        > limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.

        I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it.

        • Ning

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/#review561
        -----------------------------------------------------------

        On 2011-04-26 21:19:18, Siying Dong wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/633/

        -----------------------------------------------------------

        (Updated 2011-04-26 21:19:18)

        Review request for hive, Ning Zhang and namit jain.

        Summary

        -------

        We need a better input sampling to serve at least two purposes:

        1. test their queries against a smaller data set

        2. understand more about how the data look like without scanning the whole table.

        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.

        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs

        -----

        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852

        trunk/conf/hive-default.xml 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852

        trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION

        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION

        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852

        trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852

        Diff: https://reviews.apache.org/r/633/diff

        Testing

        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - On 2011-04-26 20:50:30, Siying Dong wrote: > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 498 > < https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498 > > > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think? I think if we specify clearly the semantics of block-level sample in the wiki/documentation, there shouldn't be much confusion. In fact I think it is much easier to explain than the bucket-based sampling. In addition if the user has confusions about the semantics, throwing an SemanticException won't help them understand. I think the only use case for this parameter is to act as a gatekeeper to this feature if we found a bug in it and want to disable the feature quickly. That should be able to be achieved by switching branches quickly. If we have a gatekeeper parameter for each feature, the conf will grow unnecessarily large quickly. On 2011-04-26 20:50:30, Siying Dong wrote: > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, line 6392 > < https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392 > > > limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit. I think combining these two still makes sense: 1) as you mentioned block sampling is not limiting on the # of rows, but limit is. Combining these two allows the users to get approximately N rows quickly. 2) this restriction makes an exception in terms of the query language composition. From the language syntax, it is allowed and makes senses to combine block-sampling and limit, but the user will get a SemanticException if they do. I think SemanticException should be thrown only when there is a legitimate semantic error (e.g., the percentage number is negative). If you feel that it is not a major use case and would rather do it in a follow-up JIRA, we should document it in TODO and file a JIRA for it. Ning ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review561 ----------------------------------------------------------- On 2011-04-26 21:19:18, Siying Dong wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-26 21:19:18) Review request for hive, Ning Zhang and namit jain. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs ----- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 trunk/conf/hive-default.xml 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/
        -----------------------------------------------------------

        (Updated 2011-04-26 21:19:18.557345)

        Review request for hive, Ning Zhang and namit jain.

        Changes
        -------

        Addressing most of Ning's comments.

        Summary
        -------

        We need a better input sampling to serve at least two purposes:
        1. test their queries against a smaller data set
        2. understand more about how the data look like without scanning the whole table.
        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.
        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs (updated)


        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852
        trunk/conf/hive-default.xml 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852
        trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION
        trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION
        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
        trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION
        trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION
        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852
        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852
        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852
        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852
        trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852

        Diff: https://reviews.apache.org/r/633/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-26 21:19:18.557345) Review request for hive, Ning Zhang and namit jain. Changes ------- Addressing most of Ning's comments. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs (updated) trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 trunk/conf/hive-default.xml 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        Siying Dong added a comment -

        addressing most of Ning's comments, except the two places I commented on review board.

        Show
        Siying Dong added a comment - addressing most of Ning's comments, except the two places I commented on review board.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/#review561
        -----------------------------------------------------------

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
        <https://reviews.apache.org/r/633/#comment1196>

        I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think?

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
        <https://reviews.apache.org/r/633/#comment1197>

        limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit.

        • Siying

        On 2011-04-20 18:28:29, Siying Dong wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/633/

        -----------------------------------------------------------

        (Updated 2011-04-20 18:28:29)

        Review request for hive, Ning Zhang and namit jain.

        Summary

        -------

        We need a better input sampling to serve at least two purposes:

        1. test their queries against a smaller data set

        2. understand more about how the data look like without scanning the whole table.

        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.

        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs

        -----

        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244

        trunk/conf/hive-default.xml 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244

        trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION

        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION

        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244

        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244

        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244

        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244

        Diff: https://reviews.apache.org/r/633/diff

        Testing

        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review561 ----------------------------------------------------------- trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java < https://reviews.apache.org/r/633/#comment1196 > I feel like it is a little hard to explain what this sample guarantees. It basically only guarantees that we fetch at least the sampled percentage of source data. Not exact number, nor guarantee for #rows. I think an option to disable it is a way to avoid confusion in some ways. How do you think? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java < https://reviews.apache.org/r/633/#comment1197 > limit can be combined with block sampling. Just this optimization for limit doesn't make sense when users already sample the input data and we won't get much benefit. Siying On 2011-04-20 18:28:29, Siying Dong wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-20 18:28:29) Review request for hive, Ning Zhang and namit jain. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs ----- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 trunk/conf/hive-default.xml 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/#review558
        -----------------------------------------------------------

        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
        <https://reviews.apache.org/r/633/#comment1178>

        The naming of this parameter is a little bit confusing: the parameter key is called "randomnumber" but the value of it is a fixed number. Do you mean this number is actually the seed to generate samples?

        trunk/conf/hive-default.xml
        <https://reviews.apache.org/r/633/#comment1179>

        same as above.

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/633/#comment1195>

        better add a comment for this function explain what the rationale behind sampling splits.

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/633/#comment1192>

        we should declare retLists as interface (List) rather than implementation (ArrayList)

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/633/#comment1193>

        same here, should declare it as Map rather than HashMap.

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
        <https://reviews.apache.org/r/633/#comment1194>

        can you add comments here?

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
        <https://reviews.apache.org/r/633/#comment1182>

        can you add a comment on what this Map is used for and what are the key and value of the Map?

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
        <https://reviews.apache.org/r/633/#comment1184>

        I think we don't need to introduce this parameter at all. For one it is a new feature rather than a different code path for an old feature. We don't need the "fallback" protection by a new parameter. Secondly, throwing a SemanticException here can only make the user asking how to solve this problem, which is to set the parameter to true. So it seems that it doesn't make sense to set the parameter to false in any cases. So why not remove the this parameter to make it cleaner.

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
        <https://reviews.apache.org/r/633/#comment1189>

        you may want to check the percentage number (if it is a valid double and within the range [0,100]) and throw SemanticException if it is invalid before creating a SplitSample object.

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
        <https://reviews.apache.org/r/633/#comment1190>

        why limit cannot be combined with block sampling?

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java
        <https://reviews.apache.org/r/633/#comment1191>

        This comment doesn't belong to this class.

        • Ning

        On 2011-04-20 18:28:29, Siying Dong wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/633/

        -----------------------------------------------------------

        (Updated 2011-04-20 18:28:29)

        Review request for hive, Ning Zhang and namit jain.

        Summary

        -------

        We need a better input sampling to serve at least two purposes:

        1. test their queries against a smaller data set

        2. understand more about how the data look like without scanning the whole table.

        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.

        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs

        -----

        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244

        trunk/conf/hive-default.xml 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244

        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION

        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244

        trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION

        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION

        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION

        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION

        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244

        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244

        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244

        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244

        trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244

        Diff: https://reviews.apache.org/r/633/diff

        Testing

        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/#review558 ----------------------------------------------------------- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java < https://reviews.apache.org/r/633/#comment1178 > The naming of this parameter is a little bit confusing: the parameter key is called "randomnumber" but the value of it is a fixed number. Do you mean this number is actually the seed to generate samples? trunk/conf/hive-default.xml < https://reviews.apache.org/r/633/#comment1179 > same as above. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/633/#comment1195 > better add a comment for this function explain what the rationale behind sampling splits. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/633/#comment1192 > we should declare retLists as interface (List) rather than implementation (ArrayList) trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/633/#comment1193 > same here, should declare it as Map rather than HashMap. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java < https://reviews.apache.org/r/633/#comment1194 > can you add comments here? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java < https://reviews.apache.org/r/633/#comment1182 > can you add a comment on what this Map is used for and what are the key and value of the Map? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java < https://reviews.apache.org/r/633/#comment1184 > I think we don't need to introduce this parameter at all. For one it is a new feature rather than a different code path for an old feature. We don't need the "fallback" protection by a new parameter. Secondly, throwing a SemanticException here can only make the user asking how to solve this problem, which is to set the parameter to true. So it seems that it doesn't make sense to set the parameter to false in any cases. So why not remove the this parameter to make it cleaner. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java < https://reviews.apache.org/r/633/#comment1189 > you may want to check the percentage number (if it is a valid double and within the range [0,100] ) and throw SemanticException if it is invalid before creating a SplitSample object. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java < https://reviews.apache.org/r/633/#comment1190 > why limit cannot be combined with block sampling? trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java < https://reviews.apache.org/r/633/#comment1191 > This comment doesn't belong to this class. Ning On 2011-04-20 18:28:29, Siying Dong wrote: ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- (Updated 2011-04-20 18:28:29) Review request for hive, Ning Zhang and namit jain. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs ----- trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 trunk/conf/hive-default.xml 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        Ning Zhang added a comment -

        I will take a look.

        Show
        Ning Zhang added a comment - I will take a look.
        Hide
        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/633/
        -----------------------------------------------------------

        Review request for hive, Ning Zhang and namit jain.

        Summary
        -------

        We need a better input sampling to serve at least two purposes:
        1. test their queries against a smaller data set
        2. understand more about how the data look like without scanning the whole table.
        A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling.

        This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs.

        This addresses bug HIVE-2121.
        https://issues.apache.org/jira/browse/HIVE-2121

        Diffs


        trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244
        trunk/conf/hive-default.xml 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244
        trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
        trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244
        trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION
        trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
        trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
        trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION
        trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
        trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244
        trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244
        trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244
        trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244
        trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244

        Diff: https://reviews.apache.org/r/633/diff

        Testing
        -------

        TestCliDriver TestNegativeCliDriver, manual tests on real clusters.

        Thanks,

        Siying

        Show
        jiraposter@reviews.apache.org added a comment - ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/633/ ----------------------------------------------------------- Review request for hive, Ning Zhang and namit jain. Summary ------- We need a better input sampling to serve at least two purposes: 1. test their queries against a smaller data set 2. understand more about how the data look like without scanning the whole table. A simple function that gives a subset splits will help in those cases. It doesn't have to be strict sampling. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples input splits with size at least n% of the original inputs. This addresses bug HIVE-2121 . https://issues.apache.org/jira/browse/HIVE-2121 Diffs trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244 trunk/conf/hive-default.xml 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244 trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244 trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244 trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244 trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244 Diff: https://reviews.apache.org/r/633/diff Testing ------- TestCliDriver TestNegativeCliDriver, manual tests on real clusters. Thanks, Siying
        Hide
        Siying Dong added a comment -

        changed test outputs. test outputs of sample tests changed as I chanced token name for sampling using buckets from TOK_TABLESAMPLE to TOK_TABLEBUCKETSAMPLE (as I am adding a TOK_TABLESPLITSAMPLE). Other than that, there should be no change.

        This patch has a limit: even if we only sample one split out, execution is not switched to local mode if possible, as getSplits() is called in job submit part, which already passed the step to choose local mode.

        Show
        Siying Dong added a comment - changed test outputs. test outputs of sample tests changed as I chanced token name for sampling using buckets from TOK_TABLESAMPLE to TOK_TABLEBUCKETSAMPLE (as I am adding a TOK_TABLESPLITSAMPLE). Other than that, there should be no change. This patch has a limit: even if we only sample one split out, execution is not switched to local mode if possible, as getSplits() is called in job submit part, which already passed the step to choose local mode.
        Hide
        Ning Zhang added a comment -

        Siying, can you create a review request on the review board?

        Show
        Ning Zhang added a comment - Siying, can you create a review request on the review board?
        Hide
        Siying Dong added a comment -

        still need to modify some test results.

        Show
        Siying Dong added a comment - still need to modify some test results.

          People

          • Assignee:
            Siying Dong
            Reporter:
            Siying Dong
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development