Details

    • Type: Task
    • Status: Patch Available
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
    • old issue number: 22

Description

Support the standard SQL TABLESAMPLE clause by implementing a filter that uses a skip-next hint based on the table's region boundaries to return only n rows per region.
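For illustration, a minimal sketch of what such a filter might look like, assuming HBase's pre-2.0 FilterBase API; the class name and the early-out via filterAllRemaining (rather than a seek hint) are illustrative assumptions, not the committed implementation:

```java
import org.apache.hadoop.hbase.filter.FilterBase;

// Hypothetical sketch: keep at most N rows per region. A fresh filter
// instance serves each region's scan, so the counter is per region.
public class NRowsPerRegionFilter extends FilterBase {
    private final long maxRowsPerRegion;
    private long rowsSeen = 0;

    public NRowsPerRegionFilter(long maxRowsPerRegion) {
        this.maxRowsPerRegion = maxRowsPerRegion;
    }

    @Override
    public boolean filterRowKey(byte[] buffer, int offset, int length) {
        // Called once per row; returning true filters the row out.
        return ++rowsSeen > maxRowsPerRegion;
    }

    @Override
    public boolean filterAllRemaining() {
        // End the region's scan early once the quota has been returned.
        return rowsSeen > maxRowsPerRegion;
    }
}
```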

Activity

        jamestaylor James Taylor added a comment -

        Another potential means of doing sampling (as pointed out by Lars Hofhansl) is via the salt bytes when a table is salted. Given that a salt is limited to a single byte, this may not work that well. However, if multiple bytes could be specified (PHOENIX-3574), that would be a good option.

        aertoria Ethan Wang added a comment -

        Implementation proposal. (Feedback please.)
        Proposing table sampling on a per-table basis (in the FROM part of the query). The sample size is determined by the input sampling rate applied to the primary key frequency.

        Syntax:
        `select name from person SAMPLE(0.10) where name='ethan'`

        Returns:
        The `person SAMPLE(0.10)` part returns roughly 10% of the rows of the PERSON table, reducing the performance cost from a full PERSON table scan toward a scan of the PERSON STATS table.

        Implementation detail:
        For table PERSON, assume STATS is populated with a GuidePost inserted on every other PK (50% coverage).
        Step 1: within the query's scan key range, iterate through the STATS table.
        Step 2: for every GuidePost encountered, consult a random number generator to decide whether this GuidePost is included in or excluded from the sample. This dice roll has a 10% chance of winning.
        Step 3: once a GuidePost is included, every PK in the original PERSON table that falls between this GuidePost and the next GuidePost is included in the final sample.
        Repeat this process until all GuidePosts have been visited. (A code sketch follows the example below.)

        Example:

        PERSON table, ID (PK): 1, 2, 3, 4, 5, 6
        STATS table, GuidePost: 1, 3, 5

        During the dice-rolling process, GuidePost 3 is included, so PKs in [3, 5) are included. The final result is the rows with PK 3 and 4.
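As a rough illustration of Steps 1-3, a sketch only; the GuidePostSampler name and the byte[] key representation are assumptions for this example, not Phoenix classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class GuidePostSampler {
    // Returns the guideposts whose [gp, nextGp) key ranges should be scanned.
    static List<byte[]> sample(List<byte[]> guidePosts, double rate, long seed) {
        Random dice = new Random(seed);
        List<byte[]> accepted = new ArrayList<>();
        for (byte[] gp : guidePosts) {
            // Each guidepost independently "wins" with probability `rate`
            // (0.10 above: GuidePost 3 won, so PKs 3 and 4 are sampled).
            if (dice.nextDouble() < rate) {
                accepted.add(gp);
            }
        }
        return accepted;
    }
}
```

Seeding the generator would make the chosen intervals repeatable across the scans of a single query.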

        This implementation:
        a, like Microsoft SQL Server's TABLESAMPLE, focuses mainly on the performance benefit. It does not guarantee an even distribution of the sample over the original table (representativeness).
        b, works with any GUIDE_POSTS_WIDTH and any input sample rate. However, because inclusion is decided per GuidePost interval rather than per row, on a small table the sample may contain more or fewer rows than the expected count (sample_rate × table_size).

        --------------------------------------------------------------------------------

        Summary of other popular TABLESAMPLE implementations.
        There are basically two categories:

        1, Sampling on a query basis.
        (Such as BlinkDB: https://sameeragarwal.github.io/blinkdb_eurosys13.pdf)
        This implementation applies sampling to the entire query, such as:
        `select name from person where name='ethan' SAMPLE WITH ERROR 10% CONFIDENCE 95%`

        BlinkDB does so by assuming that "the data used for similar grouping and filtering clauses does not change over time across future queries". Based on heuristics, the query engine pre-builds certain stratified sample groups extracted from the actual table, caches them, and uses them to evaluate approximate results for some expensive queries, thereby avoiding a full table scan.

        This approach:
        a, optimizes for the best performance-accuracy trade-off. Given an accuracy tolerance, it automatically decides the sampling rate for the user.
        b, takes filtering and grouping into consideration, which makes it powerful; on the other hand, it may not perform at the same level for all kinds of queries.
        c, relies on heuristic information, so there is a gradual machine-learning process.

        2, Sampling on a table basis.
        (Such as Postgres and MS SQL Server: https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation)
        This approach places TABLESAMPLE only in the FROM part of the query, such as:
        `select name from person TABLESAMPLE(10 PERCENT) where name='ethan'`

        This approach first samples the original table down to a smaller 'view' based on the primary key frequency and the given sampling rate. That 'view' then participates in the rest of the query in place of the original table.

        Usually a random selection process is used during view creation. In MS SQL Server, a linear one-pass pointer travels through each "page" and asks a random generator to decide whether that page will be part of the sample. Once a page is accepted, every row on it becomes part of the new sample.

        This MS SQL Server TABLESAMPLE:
        a, gives the flexibility to satisfy any sampling rate.
        b, gains performance by reducing the length of a table scan (though the big-O complexity stays the same; see the worked numbers below).
        c, cares only about the performance gain, not about the sample distribution.
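For intuition, with illustrative numbers not taken from the source: on a table of 1,000 pages holding roughly 100 rows each, TABLESAMPLE(10 PERCENT) accepts about 100 pages and returns roughly 10,000 rows, but because acceptance happens per whole page, the actual count fluctuates around that expectation in steps of about 100 rows. The same effect applies per GuidePost interval in the proposal above.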

        James Taylor, Geoffrey Jacoby, Samarth Jain

        aertoria Ethan Wang added a comment - edited

        Spec of this patch. Feedback please.
        ++
        ++Below are SUPPORTED
        ++
        ===BASE CASE===
        select * from Person;
        select * from PERSON TABLESAMPLE 0.45;

        ===WHERE CLAUSE===
        select * from PERSON where ADDRESS = 'CA' OR name>'tina3';
        select * from PERSON TABLESAMPLE 0.49 where ADDRESS = 'CA' OR name>'tina3';
        select * from PERSON TABLESAMPLE 0.49 where ADDRESS = 'CA' OR name>'tina3' LIMIT 1;

        ===Weird Table===
        select * from LOCAL_ADDRESS TABLESAMPLE 0.79;
        select * from SYSTEM.STATS TABLESAMPLE 0.41;

        ===CORNER CASE===
        select * from PERSON TABLESAMPLE 0;
        select * from PERSON TABLESAMPLE 1.45;
        select * from PERSON TABLESAMPLE kko;

        ===AGGREGATION===
        select count(*) from PERSON TABLESAMPLE 0.5 LIMIT 2
        select count(*) from (select NAME from PERSON limit 20)

        ===SUB QUERY===
        select * from (select /*+ NO_INDEX */ * from PERSON tablesample 0.1 where Name > 'tina10') where ADDRESS = 'CA'

        ===JOINS===
        select * from PERSON1, PERSON2 tablesample 0.7 where PERSON1.Name = PERSON2.NAME

        ===QUERY being OPTIMIZED===
        select * from PERSON tablesample 0.8 (goes to the IDX_ADDRESS_PERSON index table; table sampling carries over)

        ===INSERT SELECT===
        upsert into personbig(ID, ADDRESS) select id, address from personbig tablesample 0.01;

        aertoria Ethan Wang added a comment -

        PR link: https://github.com/apache/phoenix/pull/262
        julianhyde Julian Hyde added a comment -

        Since Calcite already supports TABLESAMPLE, let's save ourselves a headache and make sure that the 4.x syntax is compatible with Calcite's syntax.

        jamestaylor James Taylor added a comment -

        +1. Do you have something you can point us to for the Calcite TABLESAMPLE syntax?

        aertoria Ethan Wang added a comment -

        +1. After some study of calcite/parse.jj and calcite/SqlValidatorFeatureTest.java, my understanding is that Calcite is very close to the Postgres TABLESAMPLE syntax (which PHOENIX-153 is also designed to resemble).

        I'd like to sum up the two differences below (please correct me if I'm mistaken, Julian Hyde).

        1, Calcite's table sampling rate input ranges from 0 to 100 (PHOENIX-153 currently uses 0 to 1).
        2, Syntax difference
        Calcite: select name from dept TABLESAMPLE system(58)
        PHOENIX-153: select name from dept TABLESAMPLE 0.58

        Proposed change for PHOENIX-153: let's change the Phoenix side to
        select name from dept TABLESAMPLE(0.58)

        Thoughts?

        Reference:
        https://github.com/apache/calcite/blob/d619304070bf2874ab760c92ec2573ee6c19f536/piglet/src/main/javacc/PigletParser.jj

        https://github.com/apache/calcite/blob/0938c7b6d767e3242874d87a30d9112512d9243a/core/src/test/java/org/apache/calcite/test/SqlValidatorFeatureTest.java


People

    • Assignee: aertoria Ethan Wang
    • Reporter: jamestaylor James Taylor
    • Votes: 0
    • Watchers: 5
