Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-6154

NaN, Infinity issues

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.13.0
    • 1.13.0
    • None

    Description

      1.Issue

      AFFECTED_VERSION: drill-1.13.0-SNAPSHOT

      AFFECTED_FUNCTIONS:

      • sqrt
      • trunc

      ISSUE_DESCRIPTION: According to DRILL-5919/ new json number literals were added: NaN, Infinity, -Infinity. The new data types must be processed properly by existing functions. There are a few issues:
      1. SQRT function. Run the following test query:

      select sqrt(Nan) NaN, sqrt(Positive_Infinity) POS_INF, sqrt(Negative_Infinity) NEG_INF from dfs.tmp.`PN_Inf_NaN.json`

       

      • EXPECTED_RESULT: it was expected to get the following result: NaN, Infinity, NaN (expected result is based on java Math.sqrt() method)
      • ACTUAL_RESULT: the test query returned: NaN, Infinity, Infinity

      2. TRUNC function. According to DRILL docs (https://drill.apache.org/docs/math-and-trig/): TRUNC(x, y) : Truncates x to y decimal places. Specifying y is optional. Default is 1. So, the function must work properly without specifying y
      However an error message appears. Run test_query:

      select trunc(Nan) NaN, trunc(Positive_Infinity) POS_INF, trunc(Negative_Infinity) NEG_INF from dfs.tmp.`PN_Inf_NaN.json`
      • EXPECTED_RESULT: it was expected to get the following result NaN, NaN, NaN
      • ACTUAL_RESULT: it appears the following error message: Query Failed: An Error Occurred org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: NumberFormatException Fragment 0:0 [Error Id: 95e01fee-7433-4b0b-b913-32358b4a8f55 on node1:31010]

      Please investigate and fix, test file attached PN_Inf_NaN.json

      2. Issue
      AFFECTED_VERSION: drill-1.13.0-SNAPSHOT

      AFFECTED_FUNCTIONALITY: INNER JOIN

      ISSUE_DESCRIPTION: There were added new Json data types in DRILL-5919: NaN, Infinity, -Infinity.
      During testing activities, it was detected a bit strange behavior of INNER JOIN operator - different query results in almost the same queries.
      Query1

       select distinct t.name, tt.name from dfs.tmp.`ObjsX.json` t inner join dfs.tmp.`ObjsX.json` tt on t.attr4 = tt.attr4 

      Query2

       select distinct t.name from dfs.tmp.`ObjsX.json` t inner join dfs.tmp.`ObjsX.json` tt on t.attr4 = tt.attr4 

      Query1 differs from Query2 by 1 columns only:

      • In Query1 - 2 columns are selected - t.name, tt.name
      • In Query2 - 1 column is selected - t.name

      However Query1/Query2 return completely different results:

      • Query1 returns
        	name         name0
        	object2 	object2
        	object2 	object3
        	object2 	object4
        	object3 	object2
        	object3 	object3
        	object3 	object4
        	object4 	object2
        	object4 	object3
        	object4 	object4
        	

        This result seems to be correct.

      • Query2 returns No result found, not expected:
        EXPECTED_RESULT:
        	name
        	object2
        	object3
        	object4
        	

      ACTUAL_RESULT

      No result found

      NB!: the issue appears only if tables are JOINed by a column which contains newly-added data types (NaN, Infinity, -Infinity). The issue is not reproducible is a user is JOINing tables by a column containing other data types

      3. Issue
      AFFECTED_VERSION: drill-1.13.0-SNAPSHOT

      AFFECTED_FUNCTIONALITY: ORDER BY, DESC

      THIS ISSUE REFERS TO:DRILL-5919

      ISSUE_DESCRIPTION: 'ORDER BY/DESC' clause behaves in different ways when sorting columns containing NaN values. In one case it considers NaN to be the largest value, in another - the smallest one.
      Steps:

      • Select from the attached test file (orderBy.json, attached)
        SELECT name, attr4 from dfs.tmp.`orderBy.json` order by name, attr4
      • Check the attached screen shot (orderByIssue.jpg):
        EXPECTED_RESULT: It was expected the 'ORDER BY' clause to sort attr4 columns data in the same way (most probably NaN should be the largest, see NB)
        ACTUAL_RESULT: attr4 column's values were sorted in different ways: for 'obj1'/'obj3' NaN is the largest, for 'obj2'/'obj4' NaN is the smallest.

      NB: Postgres as well as Java's sorting (Collection.sort() / Arrays.sort() methods) treats NaN as the largest value

      4. Issue
      AFFECTED_VERSION: drill-1.13.0-SNAPSHOT

      AFFECTED_FUNCTIONALITY: max(column_name), min(column_name),

      ISSUE_DESCRIPTION: min/max aggregation functions return the same result if the selected column contains NaN value.

      SELECT name, max(attr4), min(attr4) from dfs.tmp.`minMax.json` group by name

      Result

      name	Min	Max
      obj1	NaN	NaN
      obj2	NaN	NaN
      obj3	NaN	NaN
      obj4	NaN	NaN
      

      As for me, this logic should be revised, current behavior is a bit confusing:

      • Postgres considers NaN to be the largest value, so MAX(col_withNaN) will return NaN, MIN(col_withNaN) - will return other value, so DRILL's min/max logic can be adjusted to the Postgres' one, see postgres.jpg
      • Or NAN can behave like NULL - DRILL's MIN/MAX functions ignore NULLs

      Attachments

        Issue Links

          Activity

            People

              volodymyr.tkach Volodymyr Tkach
              volodymyr.tkach Volodymyr Tkach
              Arina Ielchiieva Arina Ielchiieva
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: