Details
-
Improvement
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
Impala 1.2.1
-
None
Description
We document that rand() returns unpredictable values, unless preceded by a call to rand(seed). My expectation from other DBMSes like MySQL and Oracle, or by inferring from POSIX behavior, is that once the random number generator is seeded, there would be an infinite stream of random values that would stretch across multiple queries. However, in practice, Impala rand() resets to the same sequence after each query:
[localhost:21000] > select rand()*34000 from store limit cast (rand()*34000 as int);
Query: select rand()*34000 from store limit cast (rand()*34000 as int)
-------------------
rand() * 34000.0 |
-------------------
16.03013650329324 |
20046.04365399389 |
15068.46292087271 |
2513.48005631635 |
15713.13279760682 |
22709.15090977641 |
28398.60192704881 |
5477.171718830788 |
16371.07374722654 |
18740.27370882233 |
16465.31711354168 |
13467.14456540865 |
-------------------
Returned 12 row(s) in 0.24s
[localhost:21000] > select rand()*34000 from store limit cast (rand()*34000 as int);
Query: select rand()*34000 from store limit cast (rand()*34000 as int)
-------------------
rand() * 34000.0 |
-------------------
16.03013650329324 |
20046.04365399389 |
15068.46292087271 |
2513.48005631635 |
15713.13279760682 |
22709.15090977641 |
28398.60192704881 |
5477.171718830788 |
16371.07374722654 |
18740.27370882233 |
16465.31711354168 |
13467.14456540865 |
-------------------
Returned 12 row(s) in 0.22s
And if rand() is called multiple times in the same query, it gives the same value each time:
[localhost:21000] > select rand(), rand(), rand() from store;
Query: select rand(), rand(), rand() from store
-----------------------------------------------------------------
rand() | rand() | rand() |
-----------------------------------------------------------------
0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 |
0.5895895192351144 | 0.5895895192351144 | 0.5895895192351144 |
0.4431900859080209 | 0.4431900859080209 | 0.4431900859080209 |
0.0739258840093044 | 0.0739258840093044 | 0.0739258840093044 |
0.4621509646354946 | 0.4621509646354946 | 0.4621509646354946 |
0.6679162032287178 | 0.6679162032287178 | 0.6679162032287178 |
0.8352529978543767 | 0.8352529978543767 | 0.8352529978543767 |
0.1610932858479644 | 0.1610932858479644 | 0.1610932858479644 |
0.4815021690360746 | 0.4815021690360746 | 0.4815021690360746 |
0.5511845208477156 | 0.5511845208477156 | 0.5511845208477156 |
0.4842740327512259 | 0.4842740327512259 | 0.4842740327512259 |
0.3960924872179015 | 0.3960924872179015 | 0.3960924872179015 |
-----------------------------------------------------------------
Returned 12 row(s) in 0.23s
What I was expecting to happen was:
select rand(12345);
select rand() from t1 limit 100;
... 100 random values ...
select rand() from t1 limit 100;
... 100 different random values ...
select rand(), rand(), rand();
... 3 different random values ...
select rand(12345);
– Then the sequence of rand() queries as above would give the same results as before.
Otherwise, calling rand(seed) in a standalone query is kind of a no-op, it has no effect on subsequent queries:
[localhost:21000] > select rand(12345);
Query: select rand(12345)
--------------------
rand(12345) |
--------------------
0.4827902789613187 |
--------------------
Returned 1 row(s) in 0.11s
[localhost:21000] > select rand(), rand(), rand();
Query: select rand(), rand(), rand()
-----------------------------------------------------------------
rand() | rand() | rand() |
-----------------------------------------------------------------
0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 |
-----------------------------------------------------------------
Returned 1 row(s) in 0.11s
[localhost:21000] > select rand(23456);
Query: select rand(23456)
--------------------
rand(23456) |
--------------------
0.2273406601638257 |
--------------------
Returned 1 row(s) in 0.11s
[localhost:21000] > select rand(), rand(), rand();
Query: select rand(), rand(), rand()
-----------------------------------------------------------------
rand() | rand() | rand() |
-----------------------------------------------------------------
0.0004714746030380365 | 0.0004714746030380365 | 0.0004714746030380365 |
-----------------------------------------------------------------
Returned 1 row(s) in 0.12s
I hope what I'm asking for is not in contradiction to IMPALA-397, which seems to be asking for a rand()-based value to be the same when referenced multiple times through a query alias.