Details

    • Type: Task
    • Status: Resolved
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.12.0
    • Labels: None
    • Old issue number: 22

      Description

      Support the standard SQL TABLESAMPLE clause by implementing a filter that uses a skip-next hint, based on the region boundaries of the table, to return only n rows per region.

      When the TABLESAMPLE clause is used, Phoenix samples the requested percentage of the HBase table with only O(M) runtime complexity, where M is the size of the stats table (as opposed to O(N), where N is the size of the data table).

      [Update]
      Usage:
      https://phoenix.apache.org/tablesample.html

      Table sampling syntax:
      select * from PERSON TABLESAMPLE(45);
      select count(*) from PERSON TABLESAMPLE (49) LIMIT 2;
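
      The clause goes through the standard JDBC path like any other query. As a minimal usage sketch (the connection URL and table name here are illustrative, not part of this patch):

      ```java
      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class TableSampleExample {
          public static void main(String[] args) throws Exception {
              // Illustrative URL; point it at your ZooKeeper quorum.
              try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
                   Statement stmt = conn.createStatement();
                   // Scan roughly 45% of PERSON's guidepost-delimited chunks.
                   ResultSet rs = stmt.executeQuery("SELECT * FROM PERSON TABLESAMPLE(45)")) {
                  while (rs.next()) {
                      System.out.println(rs.getString(1));
                  }
              }
          }
      }
      ```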

      Source Code:
      https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=5e33dc12bc088bd0008d89f0a5cd7d5c368efa25

          Activity

          hanzhizhang hanzhi added a comment -

          Awesome!!!!!!

          lhofhansl Lars Hofhansl added a comment -

          Nice job Ethan Wang!

          aertoria Ethan Wang added a comment -

          Thanks James Taylor!

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Phoenix-master #1725 (See https://builds.apache.org/job/Phoenix-master/1725/)
          PHOENIX-153 Implement TABLESAMPLE clause (Ethan Wang) (jamestaylor: rev 5e33dc12bc088bd0008d89f0a5cd7d5c368efa25)

          • (edit) phoenix-core/src/main/java/org/apache/phoenix/parse/ParseNodeFactory.java
          • (edit) phoenix-core/src/main/java/org/apache/phoenix/optimize/QueryOptimizer.java
          • (edit) phoenix-core/src/main/java/org/apache/phoenix/parse/FilterableStatement.java
          • (edit) phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java
          • (add) phoenix-core/src/main/java/org/apache/phoenix/iterate/TableSamplerPredicate.java
          • (add) phoenix-core/src/it/java/org/apache/phoenix/end2end/QueryWithTableSampleIT.java
          • (edit) phoenix-core/src/main/java/org/apache/phoenix/compile/JoinCompiler.java
          • (edit) phoenix-core/src/main/java/org/apache/phoenix/parse/SelectStatement.java
          • (edit) phoenix-core/src/main/java/org/apache/phoenix/parse/DeleteStatement.java
          • (edit) phoenix-core/src/main/java/org/apache/phoenix/parse/ConcreteTableNode.java
          • (edit) phoenix-core/src/main/antlr3/PhoenixSQL.g
          • (edit) phoenix-core/src/main/java/org/apache/phoenix/parse/NamedTableNode.java
          jamestaylor James Taylor added a comment -

          Pushed to master and 4.x branches. Great first commit, Ethan Wang!

          jamestaylor James Taylor added a comment -

          +1. Nice work, Ethan Wang. Will let you know if the patch doesn't apply cleanly to the other 4.x branches.

          githubbot ASF GitHub Bot added a comment -

          Github user aertoria commented on the issue:

          https://github.com/apache/phoenix/pull/262

          Commit 0507d4f change list:
          1, Explain plan output reworked (thanks for the suggestion).
          2, Previous four commits squashed into one.
          3, Commit messages revised to start with "PHOENIX-153" plus a space.

          Preview on a single select:
          `CLIENT 3-CHUNK 30 ROWS 2370 BYTES PARALLEL 1-WAY 0.2-SAMPLED ROUND ROBIN FULL SCAN OVER PERSON`

          On a join select:
          ```
          CLIENT 1-CHUNK 0 ROWS 0 BYTES PARALLEL 1-WAY 0.65-SAMPLED ROUND ROBIN FULL SCAN OVER INX_ADDRESS_PERSON
          SERVER FILTER BY FIRST KEY ONLY
          PARALLEL INNER-JOIN TABLE 0
          CLIENT 1-CHUNK 1 ROWS 32 BYTES PARALLEL 1-WAY 0.15-SAMPLED ROUND ROBIN FULL SCAN OVER US_POPULATION
          DYNAMIC SERVER FILTER BY TO_CHAR("INX_ADDRESS_PERSON.0:ADDRESS") IN (US_POPULATION.STATE)
          ```

          aertoria Ethan Wang added a comment -

          Makes sense. Thanks James Taylor.

          jamestaylor James Taylor added a comment -

          Seems like review comments aren't appearing here in JIRA (maybe because your commit message doesn't include the JIRA number in the expected format), so I'll repeat it here:

          Let's move the explain for the sampling into the first line, before we recurse down for the other steps. You can put it on the same line, after the "-WAY " like this:

          CLIENT PARALLEL 1-WAY 0.48-SAMPLED ...

          Otherwise, users will interpret the sampling as happening after the scan/filtering, which isn't the case.

          githubbot ASF GitHub Bot added a comment -

          Github user JamesRTaylor commented on the issue:

          https://github.com/apache/phoenix/pull/262

          Ping @aertoria - would you have a few spare cycles to make that last change? Also, please squash all commits into one and amend your commit message to be prefixed with PHOENIX-153 (i.e. include the dash). Otherwise, the pull request isn't tied to the JIRA.

          lhofhansl Lars Hofhansl added a comment -

          The default guidepost width is 300MB. Maybe we could go down to 10MB once we have guidepost combining; less than that would be a huge management burden on the system.

          Still a good thing to do! On small tables you do not need to sample in the first place, and for large tables - where it matters - we'll have sufficiently many guideposts. (A 1TB table has over 3000 300MB guideposts, i.e. you'll have a resolution of 0.03%, which is plenty good!)
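
          Spelling out the arithmetic behind that resolution figure:

          ```
          guideposts ≈ 1 TB / 300 MB ≈ 3,500
          resolution ≈ 1 / 3,500 ≈ 0.03% of the table per guidepost
          ```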

          aertoria Ethan Wang added a comment - edited

          Valid point.

          In addition, by design, this coarseness problem is magnified when any of three things happen (and vice versa):
          1, The table is too small.
          2, The guidepost width is set too wide, or no stats are collected at all.
          3, The user specifies not to use the stats table for parallelization.

          In testing on a table with 400K rows and GUIDE_POSTS_WIDTH set to 10KB or 200KB, the sampled size was usually within ±5% of the expected size. Accuracy improves as the guideposts become more granular (detailed chart attached).

          Note that in this chart:
          1, The test environment is a single-node HBase cluster (1.2.2). The test table uses a random integer as PK and has about 400K rows.
          2, GUIDE_POSTS_WIDTH was pre-set to 10K and 200K, respectively.
          3, FNV is used as the consistent hashing algorithm in TableSamplerPredicate (a.k.a. the "dice", used to hash each guidepost and decide whether it is selected); a sketch of the idea follows.
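
          To make the "dice" concrete, here is a minimal, self-contained sketch of consistent FNV-1a-based guidepost selection. The class and method names are illustrative only, not Phoenix's actual TableSamplerPredicate API:

          ```java
          import java.nio.charset.StandardCharsets;

          /**
           * Illustrative sketch of the "dice": FNV-1a hashes each guidepost key
           * to a value in [0, 100) and keeps the guidepost iff that value falls
           * below the sampling rate. Being a pure function of the key, the
           * decision is consistent across servers and repeated query runs.
           */
          public final class GuidepostDice {
              private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
              private static final long FNV_PRIME = 0x100000001b3L;

              /** 64-bit FNV-1a hash of the guidepost's row key bytes. */
              private static long fnv1a(byte[] key) {
                  long hash = FNV_OFFSET_BASIS;
                  for (byte b : key) {
                      hash ^= (b & 0xff);
                      hash *= FNV_PRIME;
                  }
                  return hash;
              }

              /** @return true iff this guidepost's chunk joins the sample. */
              public static boolean isSelected(byte[] guidepostKey, double ratePercent) {
                  // Map the top 53 hash bits uniformly onto [0, 100) and roll.
                  double roll = (fnv1a(guidepostKey) >>> 11) / (double) (1L << 53) * 100.0;
                  return roll < ratePercent;
              }

              public static void main(String[] args) {
                  byte[] key = "guidepost-3".getBytes(StandardCharsets.UTF_8);
                  // Same key, same answer, on every run and every server.
                  System.out.println(isSelected(key, 45.0));
              }
          }
          ```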

          lhofhansl Lars Hofhansl added a comment -

          Good idea. Skipping whole guideposts is pretty coarse, though.
          At the same time I cannot think of anything else efficient.

          jamestaylor James Taylor added a comment -

          Yes, +1 to following Calcite syntax

          aertoria Ethan Wang added a comment -

          +1. After some study of calcite/parse.jj and calcite/SqlValidatorFeatureTest.java, my understanding is that Calcite is very close to the Postgres TABLESAMPLE syntax (which PHOENIX-153 is also designed to resemble).

          I'd like to sum up the two differences below (please correct me if I'm mistaken, Julian Hyde).

          1, Calcite's table sampling rate input is 0 to 100 (PHOENIX-153's currently is 0 to 1).
          2, Syntax difference:
          Calcite: select name from dept TABLESAMPLE system(58)
          PHOENIX-153: select name from dept TABLESAMPLE 0.58

          Proposed change for PHOENIX-153: let's change the Phoenix side to be
          select name from dept TABLESAMPLE(0.58)

          Thoughts?

          Reference:
          https://github.com/apache/calcite/blob/d619304070bf2874ab760c92ec2573ee6c19f536/piglet/src/main/javacc/PigletParser.jj

          https://github.com/apache/calcite/blob/0938c7b6d767e3242874d87a30d9112512d9243a/core/src/test/java/org/apache/calcite/test/SqlValidatorFeatureTest.java

          jamestaylor James Taylor added a comment -

          +1. Do you have something you can point us to for the Calcite TABLESAMPLE syntax?

          julianhyde Julian Hyde added a comment -

          Since Calcite already supports TABLESAMPLE let's save ourselves a headache and make sure that the 4.x syntax is compatible with Calcite's syntax.

          aertoria Ethan Wang added a comment -
          P.R. link: https://github.com/apache/phoenix/pull/262
          aertoria Ethan Wang added a comment - edited

          Spec of this patch. Feedback, please.

          Updated Jun 25. Syntax made similar to Calcite and PostgreSQL.

          ++
          ++Below are SUPPORTED
          ++
          ===BASE CASE====
          select * from Person;
          select * from PERSON TABLESAMPLE(45);
          select * from PERSON TABLESAMPLE (45);

          ===WHERE CLAUSE====
          select * from PERSON where ADDRESS = 'CA' OR name>'tina3';
          select * from PERSON TABLESAMPLE (49) where ADDRESS = 'CA' OR name>'tina3';
          select * from PERSON TABLESAMPLE (49) where ADDRESS = 'CA' OR name>'tina3' ORDER BY ADDRESS LIMIT 2;

          ===Wired Table===
          select * from LOCAL_ADDRESS TABLESAMPLE (79);
          select * from SYSTEM.STATS TABLESAMPLE (41);

          ===CORNER CASE===
          select * from PERSON TABLESAMPLE (0);
          select * from PERSON TABLESAMPLE (100.0);
          select * from PERSON TABLESAMPLE (145.99); (out of boundary exception)
          select * from PERSON TABLESAMPLE ko; (syntax error exception)

          ===AGGREGATION===
          select count(*) from PERSON TABLESAMPLE (49) LIMIT 2;
          select count(*) from (select NAME from PERSON TABLESAMPLE (49) limit 20);

          ===SUB QUERY===
          select * from (select /*+ NO_INDEX */ * from PERSON tablesample (10) where Name > 'tina10') where ADDRESS = 'CA'

          ===JOINS===
          select * from PERSON1, PERSON2 tablesample (70) where PERSON1.Name = PERSON2.NAME
          select /*+ NO_INDEX */ count(*) from PERSON tableSample (19), US_POPULATION tableSample (28) where PERSON.Name > US_POPULATION.STATE

          ===QUERY being OPTIMIZED===
          select * from PERSON tablesample (80) (may go to the IDX_ADDRESS_PERSON index table, but the table sampling rate carries over)

          ===INSERT SELECT====
          upsert into personbig(ID, ADDRESS) select id, address from personbig tablesample (1);

          aertoria Ethan Wang added a comment -

          Implementation proposal (feedback, please):
          Proposing table sampling on a per-table basis (in the FROM part of the query). The sample size is decided by applying the input sampling rate to the primary key frequency.

          Syntax:
          `select name from person SAMPLE(0.10) where name='ethan'`

          Returns:
          The `person SAMPLE(0.10)` part returns roughly 10% of the PERSON table's rows, reducing the performance cost from a full PERSON table scan to a scan driven by the PERSON-STATS table.

          Implementation detail:
          For table PERSON, assume STATS is populated with a guidepost inserted at every other PK (50% coverage).
          Step 1, within the query's scan key range, iterate through the STATS table.
          Step 2, for every guidepost encountered, consult a random number generator to decide whether this guidepost is included in or excluded from the sample; this dice has a 10% chance of winning.
          Step 3, once a guidepost is included, every PK in the original PERSON table between this guidepost and the next guidepost is included in the final sample.
          Repeat this process until all guideposts are visited.

          Example:

          PERSON table, ID (PK): 1, 2, 3, 4, 5, 6
          STATS table, guideposts: 1, 3, 5

          During the dice-rolling process, guidepost 3 is included, so every PK in [3, 5) is included. The final result is the rows with PK 3 and 4.
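
          A hypothetical sketch of Steps 1-3 in code (the names here are illustrative, not Phoenix internals): walk the guideposts covering the scan range and, whenever the dice accepts one, keep the half-open chunk [thisGuidePost, nextGuidePost). Plugging in an FNV dice such as the one sketched earlier (e.g. `k -> GuidepostDice.isSelected(k, 10.0)`) reproduces the example above: if guidepost 3 wins the roll, the planner emits [3, 5), i.e. rows 3 and 4.

          ```java
          import java.util.ArrayList;
          import java.util.List;
          import java.util.function.Predicate;

          /** Hypothetical planner sketch, not Phoenix's actual implementation. */
          public final class SampleRangePlanner {
              /** Simple half-open key range [start, end). */
              public record KeyRange(byte[] start, byte[] end) {}

              /**
               * Steps 1-3: iterate the guideposts and keep the chunk behind each
               * accepted one. (The tail chunk after the last guidepost is omitted
               * here for brevity.)
               */
              public static List<KeyRange> sampledRanges(List<byte[]> guideposts,
                                                         Predicate<byte[]> dice) {
                  List<KeyRange> ranges = new ArrayList<>();
                  for (int i = 0; i + 1 < guideposts.size(); i++) {
                      if (dice.test(guideposts.get(i))) {                  // Step 2
                          ranges.add(new KeyRange(guideposts.get(i),
                                                  guideposts.get(i + 1))); // Step 3
                      }
                  }
                  return ranges;
              }
          }
          ```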

          This implementation:
          a, like Microsoft SQL Server's TABLESAMPLE, focuses mainly on the performance benefit. It does not guarantee an even distribution of the sample over the original table (representativeness).
          b, works well with any GUIDE_POSTS_WIDTH and any input sample rate. However, if the table is too small, the sample output may include more or fewer rows than the expected count (sample_rate x table_size).

          --------------------------------------------------------------------------------

          Summary of other popular TABLESAMPLE implementations. Basically two categories:

          1, Sampling on a query basis.
          (Such as BlinkDB: https://sameeragarwal.github.io/blinkdb_eurosys13.pdf)
          This implementation bases the sampling process on the entire query, such as:
          `select name from person where name='ethan' SAMPLE WITH ERROR 10% CONFIDENCE 95%`

          BlinkDB does so by assuming that "the data used for similar grouping and filtering clause does not change over time across future queries". Based on heuristic experience, the query engine pre-builds certain stratified sample groups extracted from the actual table, caches them, and uses them to evaluate an approximate result for some expensive queries, thereby avoiding a full table scan.

          This approach:
          a, optimizes for the best performance-accuracy trade-off: given an accuracy tolerance, it automatically decides the sampling rate for the user.
          b, the engine takes filtering and grouping into consideration, which makes it powerful, but it may not perform at the same level for all kinds of queries.
          c, because it relies on heuristic information, there is a gradual machine-learning process.

          2, Sampling on a table basis.
          (Such as Postgres and MS SQL Server: https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation)
          This approach places TABLESAMPLE only in the FROM part of the query, such as:
          `select name from person TABLESAMPLE(10 PERCENT) where name='ethan'`

          This approach first samples the original table into a smaller 'view' based on the primary key frequency and a given sampling rate. That 'view' then participates in the rest of the query in place of the original table.

          Usually a random selection process is used during view creation. In MS SQL Server, a linear one-pass pointer travels through each "page" and asks a random generator to decide whether this page will be part of the sample. Once a page is accepted, every row on it becomes part of the new sample.

          This MS SQL Server tablesample:
          a, gives the flexibility to satisfy any sampling rate.
          b, gains performance by reducing the length of a table scan (but the big-O complexity stays the same).
          c, only cares about the performance gain; it doesn't care about sample distribution.

          James Taylor Geoffrey Jacoby
          Samarth Jain

          jamestaylor James Taylor added a comment -

          Another potential means of doing sampling (as pointed out by Lars Hofhansl) is via the salt byte when a table is salted. Given that a salt is limited to a single byte, this may not work that well. However, if multiple bytes could be specified (PHOENIX-3574), that would be a good option.


            People

            • Assignee: aertoria Ethan Wang
            • Reporter: jamestaylor James Taylor
            • Votes: 0
            • Watchers: 9
