Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4586

Values of non-deterministic UDFs are cached in backend

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend
    • Labels:

      Description

      This increases the severity of a pre-existing problem, where UDFs are always assumed to be deterministics, so UDFs with only constant arguments were cached or constant folded. In most cases in Impala 2.7.0, this had no effect, e.g. both f and g in f() + g() were re-evaluated for each input row.

      The below commit added caching of constant arguments to ScalarFnCall expressions (used for UDFs, builtin functions and various operators), so f() and g() would not be re-evaluated for each input row.

        commit 10fa472fa6aa036be02748ae54daed1722449c68
        Author: Tim Armstrong <tarmstrong@cloudera.com>
        Date:   Wed Oct 26 10:55:23 2016 -0700
      
            IMPALA-4302,IMPALA-2379: constant expr arg fixes
      

      The ideal solution is to provide syntax for UDF declarations that specifies whether it is deterministic. As a short-term workaround we could add a query option that assumes that all UDFs are non-deterministic.

        Activity

        Hide
        tarmstrong Tim Armstrong added a comment -

        Alexander Behm What do you think about adding a query option like assume_udfs_deterministic?

        Or we could just assume that all UDFs are non-deterministic by default, which would limit constant folding but be safer.

        It still doesn't give exactly the old behaviour, which is that they are constant for partition pruning but non-constant otherwise. It's conceivable that someone is relying on that behaviour, but I'm unsure if it makes sense to add that quirk back in, since it would complicate things.

        Show
        tarmstrong Tim Armstrong added a comment - Alexander Behm What do you think about adding a query option like assume_udfs_deterministic? Or we could just assume that all UDFs are non-deterministic by default, which would limit constant folding but be safer. It still doesn't give exactly the old behaviour, which is that they are constant for partition pruning but non-constant otherwise. It's conceivable that someone is relying on that behaviour, but I'm unsure if it makes sense to add that quirk back in, since it would complicate things.
        Hide
        tarmstrong Tim Armstrong added a comment -

        I talked with Alexander Behm and Dan Hecht and we agreed on an alternate solution: we should rely on the frontend to do the constant folding, and only cache the values of literal expressions in the backend.

        Show
        tarmstrong Tim Armstrong added a comment - I talked with Alexander Behm and Dan Hecht and we agreed on an alternate solution: we should rely on the frontend to do the constant folding, and only cache the values of literal expressions in the backend.
        Hide
        tarmstrong Tim Armstrong added a comment -

        Change subject: IMPALA-4586: don't constant fold in backend
        ......................................................................

        IMPALA-4586: don't constant fold in backend

        This patch ensures that setting the query option
        enable_expr_rewrites=false will disable both constant folding in the
        frontend (which it did already) and constant caching in the backend
        (which is enabled in this patch). This gives a way for users to revert
        to the old behaviour of non-deterministic UDFs before these
        optimisations were added in Impala 2.8.

        Before this patch, the backend would cache values based on IsConstant().
        This meant that there was no way to override caching of values of
        non-deterministic UDFs, e.g. with enable_expr_rewrites.

        After this patch, we only cache literal values in the backend. This
        offers the same performance as before in the common case where the
        frontend will constant fold the expressions anyway.

        Also rename some functions to more cleanly separate the backend concepts
        of "constant" expressions and expressions that can be evaluated without
        a TupleRow. In a future change (IMPALA-4617) we should remove the
        IsConstant() analysis logic from the backend entirely and pass the
        information from the frontend. We should also fix isConstant() in the
        frontend so that it only returns true when it is safe to constant-fold
        the expression (IMPALA-4606). Once that is done, we could revert back
        to using IsConstant() instead of IsLiteral().

        Testing:
        Added targeted test to test constant folding of UDFs: we expect
        different results depending on whether constant folding is enabled.

        Also run TestUdfs with expr rewrites enabled and disabled, since this
        can exercise different code paths. Refactored test_udfs somewhat to
        avoid running uninteresting combinations of query options for
        targeted tests and removed some 'drop * if not exists' statements
        that aren't necessary when using unique_database.

        This change revealed flakiness in test_mem_limit, which seems
        to have only worked by coincidence. Updated TrackAllocation() to
        actually set the query status when a memory limit is exceeded.
        Looped this test for a while to make sure it isn't flaky any
        more.

        Also fix other test bugs where the vector argument is modified
        in-place, which can leak out to other tests.

        Change-Id: I0c76e3c8a8d92749256c312080ecd7aac5d99ce7
        Reviewed-on: http://gerrit.cloudera.org:8080/5391
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Tested-by: Impala Public Jenkins

        Show
        tarmstrong Tim Armstrong added a comment - Change subject: IMPALA-4586 : don't constant fold in backend ...................................................................... IMPALA-4586 : don't constant fold in backend This patch ensures that setting the query option enable_expr_rewrites=false will disable both constant folding in the frontend (which it did already) and constant caching in the backend (which is enabled in this patch). This gives a way for users to revert to the old behaviour of non-deterministic UDFs before these optimisations were added in Impala 2.8. Before this patch, the backend would cache values based on IsConstant(). This meant that there was no way to override caching of values of non-deterministic UDFs, e.g. with enable_expr_rewrites. After this patch, we only cache literal values in the backend. This offers the same performance as before in the common case where the frontend will constant fold the expressions anyway. Also rename some functions to more cleanly separate the backend concepts of "constant" expressions and expressions that can be evaluated without a TupleRow. In a future change ( IMPALA-4617 ) we should remove the IsConstant() analysis logic from the backend entirely and pass the information from the frontend. We should also fix isConstant() in the frontend so that it only returns true when it is safe to constant-fold the expression ( IMPALA-4606 ). Once that is done, we could revert back to using IsConstant() instead of IsLiteral(). Testing: Added targeted test to test constant folding of UDFs: we expect different results depending on whether constant folding is enabled. Also run TestUdfs with expr rewrites enabled and disabled, since this can exercise different code paths. Refactored test_udfs somewhat to avoid running uninteresting combinations of query options for targeted tests and removed some 'drop * if not exists' statements that aren't necessary when using unique_database. This change revealed flakiness in test_mem_limit, which seems to have only worked by coincidence. Updated TrackAllocation() to actually set the query status when a memory limit is exceeded. Looped this test for a while to make sure it isn't flaky any more. Also fix other test bugs where the vector argument is modified in-place, which can leak out to other tests. Change-Id: I0c76e3c8a8d92749256c312080ecd7aac5d99ce7 Reviewed-on: http://gerrit.cloudera.org:8080/5391 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins

          People

          • Assignee:
            tarmstrong Tim Armstrong
            Reporter:
            tarmstrong Tim Armstrong
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development