Details

      Description

      The cause of the issue below is that r is not materialized.

      select id, name, rand() r from countries order by r limit 10;
      Query: select id, name, rand() r from countries order by r limit 10
      Query finished, fetching results ...
      +----+----------------+-----------------------+
      | id | name           | r                     |
      +----+----------------+-----------------------+
      | 3  | Canada         | 0.0004714746030380365 |
      | 5  | Australia      | 0.5895895192351144    |
      | 1  | United States  | 0.4431900859080209    |
      | 4  | Ireland        | 0.0739258840093044    |
      | 6  | Netherlands    | 0.4621509646354946    |
      | 2  | United Kingdom | 0.6679162032287178    |
      | 9  | France         | 0.8352529978543767    |
      | 8  | Germany        | 0.1610932858479644    |
      | 7  | New Zealand    | 0.4815021690360746    |
      | 91 | Antigua        | 0.5511845208477156    |
      +----+----------------+-----------------------+
      Returned 10 row(s) in 0.48s
      

        Issue Links

          Activity

          Hide
          twmarshall Thomas Tauber-Marshall added a comment -

          commit 6cddb952cefedd373b2a1ce71a1b3cff2e774d70
          Author: Thomas Tauber-Marshall <tmarshall@cloudera.com>
          Date: Tue Jan 31 10:33:07 2017 -0800

          IMPALA-4731/IMPALA-397/IMPALA-4728: Materialize sort exprs

          Previously, exprs used in sorts were evaluated lazily. This can
          potentially be bad for performance if the exprs are expensive to
          evaluate, and it can lead to crashes if the exprs are
          non-deterministic, as this violates assumptions of our sorting
          algorithm.

          This patch addresses these issues by materializing ordering exprs.
          It does so when the expr is non-deterministic (including when it
          contains a UDF, which we cannot currently know if they are
          non-deterministic), or when its cost exceeds a threshold (or the
          cost is unknown).

          Testing:

          • Added e2e tests in test_sort.py.
          • Updated planner tests.

          Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
          Reviewed-on: http://gerrit.cloudera.org:8080/6322
          Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
          Tested-by: Impala Public Jenkins

          Show
          twmarshall Thomas Tauber-Marshall added a comment - commit 6cddb952cefedd373b2a1ce71a1b3cff2e774d70 Author: Thomas Tauber-Marshall <tmarshall@cloudera.com> Date: Tue Jan 31 10:33:07 2017 -0800 IMPALA-4731 / IMPALA-397 / IMPALA-4728 : Materialize sort exprs Previously, exprs used in sorts were evaluated lazily. This can potentially be bad for performance if the exprs are expensive to evaluate, and it can lead to crashes if the exprs are non-deterministic, as this violates assumptions of our sorting algorithm. This patch addresses these issues by materializing ordering exprs. It does so when the expr is non-deterministic (including when it contains a UDF, which we cannot currently know if they are non-deterministic), or when its cost exceeds a threshold (or the cost is unknown). Testing: Added e2e tests in test_sort.py. Updated planner tests. Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e Reviewed-on: http://gerrit.cloudera.org:8080/6322 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins
          Hide
          srus Silvius Rus added a comment -

          Correctness issues are generally blockers. While this is a correctness issue, given the low likelihood it will not be tracked as a blocker.

          Show
          srus Silvius Rus added a comment - Correctness issues are generally blockers. While this is a correctness issue, given the low likelihood it will not be tracked as a blocker.
          Hide
          jamiet_impala_2624 Jamie Thomson added a comment -

          Similar experience for us. We'd like to use rand() to produce a random sample of the dataset, e.g. to get 90% of the data:

          select *
          from (
            select rand() as a 
            from <table>
          )q
          where q.a < 0.9

          But it just returns 100% of the data in this case. Very bizarre. I asked about the issue on SO also: http://stackoverflow.com/questions/33780640/impala-behaviour-when-using-rand-is-strange

          Show
          jamiet_impala_2624 Jamie Thomson added a comment - Similar experience for us. We'd like to use rand() to produce a random sample of the dataset, e.g. to get 90% of the data: select * from ( select rand() as a from <table> )q where q.a < 0.9 But it just returns 100% of the data in this case. Very bizarre. I asked about the issue on SO also: http://stackoverflow.com/questions/33780640/impala-behaviour-when-using-rand-is-strange
          Hide
          serega_sheypak Sergey added a comment -

          The same for us.
          We've tried to put rand() into inner query (cheat optimizer) and reref it in outer query in order by clause.
          Doesn't help.

          Show
          serega_sheypak Sergey added a comment - The same for us. We've tried to put rand() into inner query (cheat optimizer) and reref it in outer query in order by clause. Doesn't help.

            People

            • Assignee:
              twmarshall Thomas Tauber-Marshall
              Reporter:
              alex.behm Alexander Behm
            • Votes:
              2 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development