Hive
  1. Hive
  2. HIVE-2780

Implement more restrictive table sampler

    Details

    • Type: Improvement Improvement
    • Status: Patch Available
    • Priority: Trivial Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Current table sampling scans whole block, making more rows included than expected especially for small tables.

      1. ASF.LICENSE.NOT.GRANTED--HIVE-2780.D1623.1.patch
        9 kB
        Phabricator
      2. ASF.LICENSE.NOT.GRANTED--HIVE-2780.D1623.2.patch
        9 kB
        Phabricator
      3. HIVE-2780.D1623.3.patch
        11 kB
        Phabricator
      4. HIVE-2780.D1623.4.patch
        17 kB
        Phabricator

        Issue Links

          Activity

          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2780 [jira] Implement more restrictive table sampler".

          Addressed comments & rebased to trunk

          Reviewers: ashutoshc, JIRA

          REVISION DETAIL
          https://reviews.facebook.net/D1623

          CHANGE SINCE LAST DIFF
          https://reviews.facebook.net/D1623?vs=25479&id=28587#toc

          AFFECTED FILES
          common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          conf/hive-default.xml.template
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
          ql/src/java/org/apache/hadoop/hive/ql/io/DefaultSplitSampler.java
          ql/src/java/org/apache/hadoop/hive/ql/io/ShirinkSplitSampler.java
          ql/src/java/org/apache/hadoop/hive/ql/io/SplitSampler.java
          ql/src/test/queries/clientpositive/split_sample_custom.q
          ql/src/test/results/clientpositive/split_sample_custom.q.out

          To: JIRA, ashutoshc, navis

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2780 [jira] Implement more restrictive table sampler". Addressed comments & rebased to trunk Reviewers: ashutoshc, JIRA REVISION DETAIL https://reviews.facebook.net/D1623 CHANGE SINCE LAST DIFF https://reviews.facebook.net/D1623?vs=25479&id=28587#toc AFFECTED FILES common/src/java/org/apache/hadoop/hive/conf/HiveConf.java conf/hive-default.xml.template ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java ql/src/java/org/apache/hadoop/hive/ql/io/DefaultSplitSampler.java ql/src/java/org/apache/hadoop/hive/ql/io/ShirinkSplitSampler.java ql/src/java/org/apache/hadoop/hive/ql/io/SplitSampler.java ql/src/test/queries/clientpositive/split_sample_custom.q ql/src/test/results/clientpositive/split_sample_custom.q.out To: JIRA, ashutoshc, navis
          Hide
          Phabricator added a comment -

          navis has commented on the revision "HIVE-2780 [jira] Implement more restrictive table sampler".

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:489 ok.
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:583 ok.
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:657 I remember the code is copied from CombineHiveInputFormat. I'll check that.
          ql/src/java/org/apache/hadoop/hive/ql/io/SplitSampler.java:34 ok.
          ql/src/test/results/clientpositive/split_sample_sampler.q.out:27 Original implementation provided split level granularity and the purpose of this patch is making it smaller (per row). This means underlying files should be splittable, which you pointed out previously.

          REVISION DETAIL
          https://reviews.facebook.net/D1623

          BRANCH
          DPAL-722

          To: JIRA, ashutoshc, navis

          Show
          Phabricator added a comment - navis has commented on the revision " HIVE-2780 [jira] Implement more restrictive table sampler". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:489 ok. ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:583 ok. ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:657 I remember the code is copied from CombineHiveInputFormat. I'll check that. ql/src/java/org/apache/hadoop/hive/ql/io/SplitSampler.java:34 ok. ql/src/test/results/clientpositive/split_sample_sampler.q.out:27 Original implementation provided split level granularity and the purpose of this patch is making it smaller (per row). This means underlying files should be splittable, which you pointed out previously. REVISION DETAIL https://reviews.facebook.net/D1623 BRANCH DPAL-722 To: JIRA, ashutoshc, navis
          Hide
          Phabricator added a comment -

          ashutoshc has requested changes to the revision "HIVE-2780 [jira] Implement more restrictive table sampler".

          Few comments.

          INLINE COMMENTS
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:489 This config needs to be added to HiveConf.java and in hive-site.xml.template with description. Also, indicate that alternate sampler is available if someone wants to use it.
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:583 Instead of growing this file further, I think it will make sense to put this class in its own java file. Also, can you please also add comments on algorithm which this sampler follows.
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:620 Instead of growing this file further, I think it will make sense to put this class in its own java file. Also, can you please also add comments on algorithm which this sampler follows.
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:657 I assume split is splitable only if its either FileInputFormat or uncompressed TextInputFormat. Is that correct ? If so, I think it will be easier to read this logic if its written as follows:

          if ( if instanceof FileIF || if instanceof mapreduce.FileIF || (if instanceof TextIF && !uncompressed))
          return true
          else
          return false
          ql/src/java/org/apache/hadoop/hive/ql/io/SplitSampler.java:34 Please document the contract of this interface.
          ql/src/test/results/clientpositive/split_sample_sampler.q.out:27 Just because sampler is different, value of count( * ) should not change ? But you got 8 with HeadSampler, but 118 with Default Sampler ?
          ql/src/test/results/clientpositive/split_sample_sampler.q.out:36 Just because sampler is different, value of count( * ) should not change ? But you got 8 with HeadSampler, but 20 with Default Sampler ? Also, default sampler generated same number 118 for both percent as well as Bytes, but Head sampler got different values. Whats the reason for that ?

          REVISION DETAIL
          https://reviews.facebook.net/D1623

          BRANCH
          DPAL-722

          To: JIRA, ashutoshc, navis

          Show
          Phabricator added a comment - ashutoshc has requested changes to the revision " HIVE-2780 [jira] Implement more restrictive table sampler". Few comments. INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:489 This config needs to be added to HiveConf.java and in hive-site.xml.template with description. Also, indicate that alternate sampler is available if someone wants to use it. ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:583 Instead of growing this file further, I think it will make sense to put this class in its own java file. Also, can you please also add comments on algorithm which this sampler follows. ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:620 Instead of growing this file further, I think it will make sense to put this class in its own java file. Also, can you please also add comments on algorithm which this sampler follows. ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java:657 I assume split is splitable only if its either FileInputFormat or uncompressed TextInputFormat. Is that correct ? If so, I think it will be easier to read this logic if its written as follows: if ( if instanceof FileIF || if instanceof mapreduce.FileIF || (if instanceof TextIF && !uncompressed)) return true else return false ql/src/java/org/apache/hadoop/hive/ql/io/SplitSampler.java:34 Please document the contract of this interface. ql/src/test/results/clientpositive/split_sample_sampler.q.out:27 Just because sampler is different, value of count( * ) should not change ? But you got 8 with HeadSampler, but 118 with Default Sampler ? ql/src/test/results/clientpositive/split_sample_sampler.q.out:36 Just because sampler is different, value of count( * ) should not change ? But you got 8 with HeadSampler, but 20 with Default Sampler ? Also, default sampler generated same number 118 for both percent as well as Bytes, but Head sampler got different values. Whats the reason for that ? REVISION DETAIL https://reviews.facebook.net/D1623 BRANCH DPAL-722 To: JIRA, ashutoshc, navis
          Hide
          Ashutosh Chauhan added a comment -

          Some comments on phabricator.

          Show
          Ashutosh Chauhan added a comment - Some comments on phabricator.
          Hide
          Ashutosh Chauhan added a comment -

          Thanks for updating patch. Marking Patch Available. Reviewing it now.

          Show
          Ashutosh Chauhan added a comment - Thanks for updating patch. Marking Patch Available. Reviewing it now.
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2780 [jira] Implement more restrictive table sampler".
          Reviewers: JIRA

          Rebased to trunk & add support for total length sampling

          REVISION DETAIL
          https://reviews.facebook.net/D1623

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
          ql/src/java/org/apache/hadoop/hive/ql/io/SplitSampler.java
          ql/src/test/queries/clientpositive/split_sample_sampler.q
          ql/src/test/results/clientpositive/split_sample_sampler.q.out

          To: JIRA, navis

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2780 [jira] Implement more restrictive table sampler". Reviewers: JIRA Rebased to trunk & add support for total length sampling REVISION DETAIL https://reviews.facebook.net/D1623 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java ql/src/java/org/apache/hadoop/hive/ql/io/SplitSampler.java ql/src/test/queries/clientpositive/split_sample_sampler.q ql/src/test/results/clientpositive/split_sample_sampler.q.out To: JIRA, navis
          Hide
          Ashutosh Chauhan added a comment -

          My manually conflict-resolved patch resulted in failure in split_sample.q

              [junit] java.lang.NullPointerException
              [junit]     at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$DefaultPercentSampler.sampling(CombineHiveInputFormat.java:596)
              [junit]     at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.sampling(CombineHiveInputFormat.java:496)
              [junit]     at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.sampleSplits(CombineHiveInputFormat.java:477)
              [junit]     at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:403)
              [junit]     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
              [junit]     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
              [junit]     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
              [junit]     at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
              [junit]     at org.apache.hadoop.hive.ql.exec.ExecDriver.main(ExecDriver.java:682)
              [junit]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              [junit]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
              [junit]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              [junit]     at java.lang.reflect.Method.invoke(Method.java:597)
              [junit]     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
              [junit] Job Submission failed with exception 'java.lang.NullPointerException(null)'
          

          Either my resolution wasn't correct or trunk has moved significantly. Navis, if you rebase the patch, I will take a look at this one quickly so that it doesnt go stale again.

          Show
          Ashutosh Chauhan added a comment - My manually conflict-resolved patch resulted in failure in split_sample.q [junit] java.lang.NullPointerException [junit] at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$DefaultPercentSampler.sampling(CombineHiveInputFormat.java:596) [junit] at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.sampling(CombineHiveInputFormat.java:496) [junit] at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.sampleSplits(CombineHiveInputFormat.java:477) [junit] at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:403) [junit] at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) [junit] at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) [junit] at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) [junit] at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448) [junit] at org.apache.hadoop.hive.ql.exec.ExecDriver.main(ExecDriver.java:682) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [junit] at java.lang.reflect.Method.invoke(Method.java:597) [junit] at org.apache.hadoop.util.RunJar.main(RunJar.java:156) [junit] Job Submission failed with exception 'java.lang.NullPointerException( null )' Either my resolution wasn't correct or trunk has moved significantly. Navis, if you rebase the patch, I will take a look at this one quickly so that it doesnt go stale again.
          Hide
          Ashutosh Chauhan added a comment -

          I believe this new head sampler can only be used for FileInputFormat and only when its uncompressed. Correct? I think since thats a pretty common setup, its useful even in its restricted form. I applied the patch, but there were conflicts, I manually resolved the conflicts. Running tests now.

          Show
          Ashutosh Chauhan added a comment - I believe this new head sampler can only be used for FileInputFormat and only when its uncompressed. Correct? I think since thats a pretty common setup, its useful even in its restricted form. I applied the patch, but there were conflicts, I manually resolved the conflicts. Running tests now.
          Hide
          Phabricator added a comment -

          navis updated the revision "HIVE-2780 [jira] Implement more restrictive table sampler".
          Reviewers: JIRA

          Rebased on trunk

          REVISION DETAIL
          https://reviews.facebook.net/D1623

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
          ql/src/java/org/apache/hadoop/hive/ql/io/PercentSampler.java
          ql/src/test/queries/clientpositive/tablesamples.q
          ql/src/test/results/clientpositive/tablesamples.q.out

          To: JIRA, navis

          Show
          Phabricator added a comment - navis updated the revision " HIVE-2780 [jira] Implement more restrictive table sampler". Reviewers: JIRA Rebased on trunk REVISION DETAIL https://reviews.facebook.net/D1623 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java ql/src/java/org/apache/hadoop/hive/ql/io/PercentSampler.java ql/src/test/queries/clientpositive/tablesamples.q ql/src/test/results/clientpositive/tablesamples.q.out To: JIRA, navis
          Hide
          Phabricator added a comment -

          navis requested code review of "HIVE-2780 [jira] Implement more restrictive table sampler".
          Reviewers: JIRA

          DPAL-722 Implement head sampler for each blocks

          Current table sampling scans whole block, making more rows included than expected especially for small tables.

          TEST PLAN
          EMPTY

          REVISION DETAIL
          https://reviews.facebook.net/D1623

          AFFECTED FILES
          ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
          ql/src/java/org/apache/hadoop/hive/ql/io/PercentSampler.java
          ql/src/test/queries/clientpositive/tablesamples.q
          ql/src/test/results/clientpositive/tablesamples.q.out

          MANAGE HERALD DIFFERENTIAL RULES
          https://reviews.facebook.net/herald/view/differential/

          WHY DID I GET THIS EMAIL?
          https://reviews.facebook.net/herald/transcript/3453/

          Tip: use the X-Herald-Rules header to filter Herald messages in your client.

          Show
          Phabricator added a comment - navis requested code review of " HIVE-2780 [jira] Implement more restrictive table sampler". Reviewers: JIRA DPAL-722 Implement head sampler for each blocks Current table sampling scans whole block, making more rows included than expected especially for small tables. TEST PLAN EMPTY REVISION DETAIL https://reviews.facebook.net/D1623 AFFECTED FILES ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java ql/src/java/org/apache/hadoop/hive/ql/io/PercentSampler.java ql/src/test/queries/clientpositive/tablesamples.q ql/src/test/results/clientpositive/tablesamples.q.out MANAGE HERALD DIFFERENTIAL RULES https://reviews.facebook.net/herald/view/differential/ WHY DID I GET THIS EMAIL? https://reviews.facebook.net/herald/transcript/3453/ Tip: use the X-Herald-Rules header to filter Herald messages in your client.

            People

            • Assignee:
              Navis
              Reporter:
              Navis
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:

                Development