  1. IMPALA
  2. IMPALA-6117

Parallelization of sub execution plans can produce incorrect results


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.10.0
    • Fix Version/s: None
    • Component/s: Frontend

    Description

      Symptom:

      I noticed unexpected behavior from rand(...) when it is used in 'create table as select' or in an aggregation. The examples below compare Impala with PostgreSQL and MariaDB:

      • On Impala:
        > select rand(1) from t1;
        +--------------------+
        | rand(1)            |
        +--------------------+
        | 0.2219843274084778 |
        | 0.3161931793746507 |
        | 0.2793945173171323 |
        | 0.3648608677856908 |
        | 0.4869666437092082 |
        +--------------------+
        > create table t2 as select rand(1) from t1;
        +-------------------+
        | summary           |
        +-------------------+
        | Inserted 5 row(s) |
        +-------------------+
        > select * from t2;
        +--------------------+
        | _c0                |
        +--------------------+
        | 0.2219843274084778 |
        | 0.2219843274084778 |
        | 0.2219843274084778 |
        | 0.2219843274084778 |
        | 0.2219843274084778 |
        +--------------------+
        > select count(*), rand(1) from t1 group by rand(1);
        +----------+--------------------+
        | count(*) | rand(1)            |
        +----------+--------------------+
        | 5        | 0.2219843274084778 |
        +----------+--------------------+
        
      • On PostgreSQL:
        # select setseed(0.1);
        # select random() from t1;
               random
        --------------------
          0.727818949148059
         0.0379444309510291
          0.314393464010209
          0.900541861541569
          0.918851081747562
        # select setseed(0.1);
        # create table t2 as select random() from t1;
        SELECT 5
        # select * from t2;
               random
        --------------------
          0.727818949148059
         0.0379444309510291
          0.314393464010209
          0.900541861541569
          0.918851081747562
        # select setseed(0.1);
        # select random() from t1 group by random();
               random
        --------------------
          0.918851081747562
          0.727818949148059
          0.900541861541569
         0.0379444309510291
          0.314393464010209
        
      • On MariaDB:
        > select rand(1) from t1;
        +---------------------+
        | rand(1)             |
        +---------------------+
        | 0.40540353712197724 |
        |  0.8716141803857071 |
        |  0.1418603212962489 |
        | 0.09445909605776807 |
        | 0.04671454713373868 |
        +---------------------+
        > create table t2 as select rand(1) from t1;
        > select * from t2;
        +---------------------+
        | rand(1)             |
        +---------------------+
        | 0.40540353712197724 |
        |  0.8716141803857071 |
        |  0.1418603212962489 |
        | 0.09445909605776807 |
        | 0.04671454713373868 |
        +---------------------+
        > select rand(1) from t2 group by rand(1);
        +---------------------+
        | rand(1)             |
        +---------------------+
        | 0.04671454713373868 |
        | 0.09445909605776807 |
        |  0.1418603212962489 |
        | 0.40540353712197724 |
        |  0.8716141803857071 |
        +---------------------+
        

      Cause:

      The current implementation of the random expression does not account for parallelization of sub execution plans. Intermediate results are pulled up and then consumed on each query executor. Each executor performs the following steps:

      1) The scalar expression evaluator creates a FunctionContext object.
      2) In the preparation phase of the random expression, local storage is allocated to hold the seed value.
      3) Random values are generated repeatedly.
      4) The clean-up phase of the random expression runs.

      Developers should be aware of the scope of a FunctionContext: it is created separately in each executor, so it cannot hold a value that is shared across executors during expression evaluation when the query plan is distributed.
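
The per-executor steps above can be modeled with a small simulation. This is only a toy sketch in Python, not Impala's code: `run_executor` and `run_distributed` are hypothetical names, and Python's `random.Random` stands in for the seed state kept in each FunctionContext. It assumes the distributed plan hands one row to each executor, which is one way to end up with the repeated value seen in the symptom.

```python
import random


def run_executor(rows, seed):
    """Model one executor: steps 1-4 from the list above."""
    # Steps 1-2: the executor builds its own context and seeds it locally.
    rng = random.Random(seed)
    # Step 3: one random value is generated per row this executor sees.
    values = [rng.random() for _ in rows]
    # Step 4: the local state is discarded when the executor finishes.
    return values


def run_distributed(partitions, seed):
    """Model a distributed plan: every partition gets a fresh executor."""
    out = []
    for part in partitions:
        out.extend(run_executor(part, seed))
    return out


rows = list(range(5))

# Single executor: one seeded sequence covers all rows.
single = run_executor(rows, seed=1)

# Assumed worst case: each row lands on a different executor, and each
# executor restarts the sequence from the same seed.
distributed = run_distributed([[r] for r in rows], seed=1)

print(len(set(single)))       # number of distinct values, single executor
print(len(set(distributed)))  # number of distinct values, distributed
```

In this model the single-executor run yields five distinct values, while the distributed run repeats the first value of the sequence for every row, because no executor ever advances past the first draw from its own copy of the state.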

      Solution:

      My initial idea is to generate a non-distributed (sub) plan whenever the rand function is present. This guarantees a consistent random sequence whether or not a seed value is given, but it may hurt performance. If I have to choose between correctness and performance, I always choose correctness.
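
The idea of falling back to a non-distributed plan can be sketched as a check over the expression tree. This is a hedged illustration only: `Expr`, `contains_rand`, and `choose_plan_mode` are hypothetical names invented here, and the sketch does not reflect the structure of the Impala planner.

```python
from dataclasses import dataclass, field

# Assumption: function names whose results depend on per-context seed state.
SEED_DEPENDENT = {"rand", "random"}


@dataclass
class Expr:
    """Hypothetical expression node: a function or column name plus children."""
    name: str
    children: list = field(default_factory=list)


def contains_rand(expr):
    """Return True if the expression tree calls a seed-dependent function."""
    if expr.name.lower() in SEED_DEPENDENT:
        return True
    return any(contains_rand(child) for child in expr.children)


def choose_plan_mode(exprs):
    """Pick a non-distributed plan when any expression uses rand."""
    if any(contains_rand(e) for e in exprs):
        return "unpartitioned"
    return "distributed"


# count(*) alone can stay distributed.
plain = [Expr("count", [Expr("*")])]
# c1 + rand(1) forces the non-distributed fallback.
with_rand = [Expr("add", [Expr("c1"), Expr("rand", [Expr("1")])])]

print(choose_plan_mode(plain))
print(choose_plan_mode(with_rand))
```

The trade-off is the one stated above: any query that touches rand gives up parallelism for that (sub) plan in exchange for a single, consistent random sequence.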

      I believe the current behavior produces wrong results, as shown above. Please share a better idea if you have one.

            People

              Assignee: Unassigned
              Reporter: Jin Chul Kim (jinchul)