Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-660

Make rand() more non-deterministic

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • Impala 1.2.1
    • None
    • Frontend

    Description

      We document that rand() returns unpredictable values, unless preceded by a call to rand(seed). My expectation from other DBMSes like MySQL and Oracle, or by inferring from POSIX behavior, is that once the random number generator is seeded, there would be an infinite stream of random values that would stretch across multiple queries. However, in practice, Impala rand() resets to the same sequence after each query:

      [localhost:21000] > select rand()*34000 from store limit cast (rand()*34000 as int);
      Query: select rand()*34000 from store limit cast (rand()*34000 as int)
      -------------------

      rand() * 34000.0

      -------------------

      16.03013650329324
      20046.04365399389
      15068.46292087271
      2513.48005631635
      15713.13279760682
      22709.15090977641
      28398.60192704881
      5477.171718830788
      16371.07374722654
      18740.27370882233
      16465.31711354168
      13467.14456540865

      -------------------
      Returned 12 row(s) in 0.24s
      [localhost:21000] > select rand()*34000 from store limit cast (rand()*34000 as int);
      Query: select rand()*34000 from store limit cast (rand()*34000 as int)
      -------------------

      rand() * 34000.0

      -------------------

      16.03013650329324
      20046.04365399389
      15068.46292087271
      2513.48005631635
      15713.13279760682
      22709.15090977641
      28398.60192704881
      5477.171718830788
      16371.07374722654
      18740.27370882233
      16465.31711354168
      13467.14456540865

      -------------------
      Returned 12 row(s) in 0.22s

      And if rand() is called multiple times in the same query, it gives the same value each time:

      [localhost:21000] > select rand(), rand(), rand() from store;
      Query: select rand(), rand(), rand() from store
      -----------------------------------------------------------------

      rand() rand() rand()

      -----------------------------------------------------------------

      0.0004714746030380365 0.0004714746030380365 0.0004714746030380365
      0.5895895192351144 0.5895895192351144 0.5895895192351144
      0.4431900859080209 0.4431900859080209 0.4431900859080209
      0.0739258840093044 0.0739258840093044 0.0739258840093044
      0.4621509646354946 0.4621509646354946 0.4621509646354946
      0.6679162032287178 0.6679162032287178 0.6679162032287178
      0.8352529978543767 0.8352529978543767 0.8352529978543767
      0.1610932858479644 0.1610932858479644 0.1610932858479644
      0.4815021690360746 0.4815021690360746 0.4815021690360746
      0.5511845208477156 0.5511845208477156 0.5511845208477156
      0.4842740327512259 0.4842740327512259 0.4842740327512259
      0.3960924872179015 0.3960924872179015 0.3960924872179015

      -----------------------------------------------------------------
      Returned 12 row(s) in 0.23s

      What I was expecting to happen was:

      select rand(12345);
      select rand() from t1 limit 100;
      ... 100 random values ...
      select rand() from t1 limit 100;
      ... 100 different random values ...

      select rand(), rand(), rand();
      ... 3 different random values ...

      select rand(12345);
      – Then the sequence of rand() queries as above would give the same results as before.

      Otherwise, calling rand(seed) in a standalone query is kind of a no-op, it has no effect on subsequent queries:

      [localhost:21000] > select rand(12345);
      Query: select rand(12345)
      --------------------

      rand(12345)

      --------------------

      0.4827902789613187

      --------------------
      Returned 1 row(s) in 0.11s
      [localhost:21000] > select rand(), rand(), rand();
      Query: select rand(), rand(), rand()
      -----------------------------------------------------------------

      rand() rand() rand()

      -----------------------------------------------------------------

      0.0004714746030380365 0.0004714746030380365 0.0004714746030380365

      -----------------------------------------------------------------
      Returned 1 row(s) in 0.11s
      [localhost:21000] > select rand(23456);
      Query: select rand(23456)
      --------------------

      rand(23456)

      --------------------

      0.2273406601638257

      --------------------
      Returned 1 row(s) in 0.11s
      [localhost:21000] > select rand(), rand(), rand();
      Query: select rand(), rand(), rand()
      -----------------------------------------------------------------

      rand() rand() rand()

      -----------------------------------------------------------------

      0.0004714746030380365 0.0004714746030380365 0.0004714746030380365

      -----------------------------------------------------------------
      Returned 1 row(s) in 0.12s

      I hope what I'm asking for is not in contradiction to IMPALA-397, which seems to be asking for a rand()-based value to be the same when referenced multiple times through a query alias.

      Attachments

        Activity

          People

            Unassigned Unassigned
            jrussell John Russell
            Votes:
            2 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: