Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4266

Java udf expression returning string in group by can give incorrect results.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend
    • Labels:

      Description

      I have a simple Java UDF as follows (replaces each occurrence of 's' with 'ss').

      import org.apache.hadoop.hive.ql.exec.UDF;
      import java.text.ParseException;
      import org.apache.hadoop.io.Text;
      
      public class MyReplaceString extends UDF
      {
        public Text evaluate(Text para) throws ParseException {
          if ((null == para) || ("".equals(para.toString()))) {
            return new Text("");
          }
          return new Text(para.toString().replace("s", "ss"));
        }
      }
      
      [localhost:21000] > select * from test_replace_group_by;
      Query: select * from test_escape_group_by
      Query submitted at: 2016-10-10 09:44:33 (Coordinator: http://optimus:25000)
      Query progress can be monitored at: http://optimus:25000/query_plan?query_id=2a42edbc9837b8dc:8d3e5d1500000000
      +------------+
      | s          |
      +------------+
      | blehss     |
      | blahss     |
      | longstring |
      | short      |
      | tataehss   |
      +------------+
      Fetched 5 row(s) in 6.92s
      [localhost:21000] > 
      
      [localhost:21000] > create function my_replace_string(string) returns string location '/tmp/hive_udf_replace.jar' symbol='MyReplaceString';
      
      ------- CORRECT RESULT---------------
      [localhost:21000] > select my_replace_string(s) as es from test_replace_group_by;
      Query: select my_escape_string(s) as es from test_escape_group_by
      Query submitted at: 2016-10-10 09:46:17 (Coordinator: http://optimus:25000)
      Query progress can be monitored at: http://optimus:25000/query_plan?query_id=9149a3a3924604f9:dc97fcf200000000
      +-------------+
      | es          |
      +-------------+
      | blehssss    |
      | tataehssss  |
      | blahssss    |
      | longsstring |
      | sshort      |
      +-------------+
      Fetched 5 row(s) in 0.12s
      
      -------------- INCORRECT-------------
      [localhost:21000] > select my_replace_string(s) as es from test_replace_group_by group by es;
      Query: select my_escape_string(s) as es from test_escape_group_by group by es
      Query submitted at: 2016-10-10 09:46:24 (Coordinator: http://optimus:25000)
      Query progress can be monitored at: http://optimus:25000/query_plan?query_id=2c413bed4dccd122:cf81137600000000
      +-------------+
      | es          |
      +-------------+
      | sshorttring |  <------
      | tataehssss  |
      | sshort      |
      | blehssss    |
      | blahssss    |
      +-------------+
      Fetched 5 row(s) in 0.22s
      [localhost:21000] > 
      

      Tried disabling codegen/streaming preaggs but it didn't help.

        Issue Links

          Activity

          Hide
          bharathv bharath v added a comment -

          Looks like there is some memory corruption between partitioned-agg nodes and hive udfs in grouping exprs. For ex:

          [localhost:21000] > select * from test_groupby2;
          Query: select * from test_groupby2
          Query submitted at: 2016-10-10 19:07:21 (Coordinator: http://optimus:25000)
          Query progress can be monitored at: http://optimus:25000/query_plan?query_id=bb42112ae9cb4580:5fd5029e00000000
          +------------------+
          | a                |
          +------------------+
          | short            |
          | longstring       |
          | doublelongstring |
          +------------------+
          Fetched 3 row(s) in 0.11s
          [localhost:21000] > select my_replace_string(a) as es from test_groupby2 group by es;
          Query: select my_replace_string(a) as es from test_groupby2 group by es
          Query submitted at: 2016-10-10 19:07:31 (Coordinator: http://optimus:25000)
          Query progress can be monitored at: http://optimus:25000/query_plan?query_id=d044fa8616a25947:15cfc80a00000000
          +-------------------+
          | es                |
          +-------------------+
          | �@-               |
          | ing               |
          | doublelongsstring |
          | �u/              |
          +-------------------+
          Fetched 3 row(s) in 0.11s
          
          
          Show
          bharathv bharath v added a comment - Looks like there is some memory corruption between partitioned-agg nodes and hive udfs in grouping exprs. For ex: [localhost:21000] > select * from test_groupby2; Query: select * from test_groupby2 Query submitted at: 2016-10-10 19:07:21 (Coordinator: http://optimus:25000) Query progress can be monitored at: http://optimus:25000/query_plan?query_id=bb42112ae9cb4580:5fd5029e00000000 +------------------+ | a | +------------------+ | short | | longstring | | doublelongstring | +------------------+ Fetched 3 row(s) in 0.11s [localhost:21000] > select my_replace_string(a) as es from test_groupby2 group by es; Query: select my_replace_string(a) as es from test_groupby2 group by es Query submitted at: 2016-10-10 19:07:31 (Coordinator: http://optimus:25000) Query progress can be monitored at: http://optimus:25000/query_plan?query_id=d044fa8616a25947:15cfc80a00000000 +-------------------+ | es | +-------------------+ | �@- | | ing | | doublelongsstring | | �u/ | +-------------------+ Fetched 3 row(s) in 0.11s
          Hide
          tarmstrong Tim Armstrong added a comment -

          IMPALA-4266: Java udf returning string can give incorrect results

          The memory management of string results was wrong: strings returned from
          Exprs must live until the next time FreeLocalAllocations() is called.
          Otherwise the buffer holding the string is freed or reused by the next
          UDF call. The fix is to copy string values into a buffer with the
          right lifetime.

          Testing:
          Added a regression test based on Bharath's example that reproduced the
          bug reliably.

          Change-Id: I705d271814cb1143f67d8a12f4fd87bab7a8e161
          Reviewed-on: http://gerrit.cloudera.org:8080/4941
          Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
          Tested-by: Internal Jenkins

          Show
          tarmstrong Tim Armstrong added a comment - IMPALA-4266 : Java udf returning string can give incorrect results The memory management of string results was wrong: strings returned from Exprs must live until the next time FreeLocalAllocations() is called. Otherwise the buffer holding the string is freed or reused by the next UDF call. The fix is to copy string values into a buffer with the right lifetime. Testing: Added a regression test based on Bharath's example that reproduced the bug reliably. Change-Id: I705d271814cb1143f67d8a12f4fd87bab7a8e161 Reviewed-on: http://gerrit.cloudera.org:8080/4941 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins
          Hide
          aivanov_impala_e71b Antoni added a comment -

          This bug also affects v2.7.0-cdh5.9.0 using distinct
          We just upgraded and noticed that the following query returns incorrect result:

          SELECT
          count(distinct customer.aes_decrypt(serial))
          FROM my_table
          – return 40

          while if we do
          insert into my_table_with_materialized_column select serial, customer.aes_decrypt(serial) decrypted from my_table
          the select count(distinct decrypted) from my_table_with_materialized_column
          – returns 10,000 - which is the correct result

          Any advice on how to handle/workaround this ?

          Show
          aivanov_impala_e71b Antoni added a comment - This bug also affects v2.7.0-cdh5.9.0 using distinct We just upgraded and noticed that the following query returns incorrect result: SELECT count(distinct customer.aes_decrypt(serial)) FROM my_table – return 40 while if we do insert into my_table_with_materialized_column select serial, customer.aes_decrypt(serial) decrypted from my_table the select count(distinct decrypted) from my_table_with_materialized_column – returns 10,000 - which is the correct result Any advice on how to handle/workaround this ?
          Hide
          tarmstrong Tim Armstrong added a comment -

          I agree this is a long-standing issue that was definitely present in Impala 2.7 and earlier versions

          Show
          tarmstrong Tim Armstrong added a comment - I agree this is a long-standing issue that was definitely present in Impala 2.7 and earlier versions
          Hide
          tarmstrong Tim Armstrong added a comment -

          The second query with the temporary table is the workaround that I'd suggest, since that avoids the issue entirely (since you don't have the hive UDF evaluated along with the aggregation). I'm playing around now to see if I can rewrite the query in a way to avoid the problem without the intermediate table.

          Show
          tarmstrong Tim Armstrong added a comment - The second query with the temporary table is the workaround that I'd suggest, since that avoids the issue entirely (since you don't have the hive UDF evaluated along with the aggregation). I'm playing around now to see if I can rewrite the query in a way to avoid the problem without the intermediate table.
          Hide
          aivanov_impala_e71b Antoni added a comment -

          Are there are plans to be back-ported in CDH 5.9.1 for example ?

          Show
          aivanov_impala_e71b Antoni added a comment - Are there are plans to be back-ported in CDH 5.9.1 for example ?
          Hide
          aivanov_impala_e71b Antoni added a comment -

          One way for this query is to count distinct without the UDF e.g

          SELECT count(customer.aes_decrypt(serial))
          from (select distinct serial from my_table) subtable

          But that workaround won't work for all queries.

          Show
          aivanov_impala_e71b Antoni added a comment - One way for this query is to count distinct without the UDF e.g SELECT count(customer.aes_decrypt(serial)) from (select distinct serial from my_table) subtable But that workaround won't work for all queries.
          Hide
          tarmstrong Tim Armstrong added a comment -

          You can wrap the UDF call in a concat() to avoid the memory management issue. This forces it to copy the output stream into memory with the correct lifetime. E.g. rewrite my_hive_udf() as concat(my_hive_udf(), '')

          Show
          tarmstrong Tim Armstrong added a comment - You can wrap the UDF call in a concat() to avoid the memory management issue. This forces it to copy the output stream into memory with the correct lifetime. E.g. rewrite my_hive_udf() as concat(my_hive_udf(), '')
          Hide
          tarmstrong Tim Armstrong added a comment -

          Antoni putting on my Cloudera hat for a second - yes, "blocker" and many "critical" bugfixes get backported to point releases unless there's a specific reason not to do so (e.g. the fix is too large and therefore risky).

          Show
          tarmstrong Tim Armstrong added a comment - Antoni putting on my Cloudera hat for a second - yes, "blocker" and many "critical" bugfixes get backported to point releases unless there's a specific reason not to do so (e.g. the fix is too large and therefore risky).
          Hide
          jrussell John Russell added a comment -

          Added new "known issue" item in this gerrit review: https://gerrit.cloudera.org/#/c/5809/

          Show
          jrussell John Russell added a comment - Added new "known issue" item in this gerrit review: https://gerrit.cloudera.org/#/c/5809/

            People

            • Assignee:
              tarmstrong Tim Armstrong
              Reporter:
              bharathv bharath v
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development