[DRILL-6154] NaN, Infinity issues - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.13.0
Fix Version/s: 1.13.0
Component/s: None
Labels:
- ready-to-commit

Description

1.Issue

AFFECTED_VERSION: drill-1.13.0-SNAPSHOT

AFFECTED_FUNCTIONS:

sqrt
trunc

ISSUE_DESCRIPTION: According to ~~DRILL-5919~~/ new json number literals were added: NaN, Infinity, -Infinity. The new data types must be processed properly by existing functions. There are a few issues:
1. SQRT function. Run the following test query:

select sqrt(Nan) NaN, sqrt(Positive_Infinity) POS_INF, sqrt(Negative_Infinity) NEG_INF from dfs.tmp.`PN_Inf_NaN.json`

EXPECTED_RESULT: it was expected to get the following result: NaN, Infinity, NaN (expected result is based on java Math.sqrt() method)
ACTUAL_RESULT: the test query returned: NaN, Infinity, Infinity

2. TRUNC function. According to DRILL docs (https://drill.apache.org/docs/math-and-trig/): TRUNC(x, y) : Truncates x to y decimal places. Specifying y is optional. Default is 1. So, the function must work properly without specifying y
However an error message appears. Run test_query:

select trunc(Nan) NaN, trunc(Positive_Infinity) POS_INF, trunc(Negative_Infinity) NEG_INF from dfs.tmp.`PN_Inf_NaN.json`

EXPECTED_RESULT: it was expected to get the following result NaN, NaN, NaN
ACTUAL_RESULT: it appears the following error message: Query Failed: An Error Occurred org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: NumberFormatException Fragment 0:0 [Error Id: 95e01fee-7433-4b0b-b913-32358b4a8f55 on node1:31010]

Please investigate and fix, test file attached PN_Inf_NaN.json

2. Issue
AFFECTED_VERSION: drill-1.13.0-SNAPSHOT

AFFECTED_FUNCTIONALITY: INNER JOIN

ISSUE_DESCRIPTION: There were added new Json data types in ~~DRILL-5919~~: NaN, Infinity, -Infinity.
During testing activities, it was detected a bit strange behavior of INNER JOIN operator - different query results in almost the same queries.
Query1

 select distinct t.name, tt.name from dfs.tmp.`ObjsX.json` t inner join dfs.tmp.`ObjsX.json` tt on t.attr4 = tt.attr4

Query2

 select distinct t.name from dfs.tmp.`ObjsX.json` t inner join dfs.tmp.`ObjsX.json` tt on t.attr4 = tt.attr4

Query1 differs from Query2 by 1 columns only:

In Query1 - 2 columns are selected - t.name, tt.name
In Query2 - 1 column is selected - t.name

However Query1/Query2 return completely different results:

Query1 returns

	name         name0
	object2 	object2
	object2 	object3
	object2 	object4
	object3 	object2
	object3 	object3
	object3 	object4
	object4 	object2
	object4 	object3
	object4 	object4

This result seems to be correct.

Query2 returns No result found, not expected:
EXPECTED_RESULT:
```
	name
	object2
	object3
	object4
	
```

ACTUAL_RESULT

No result found

NB!: the issue appears only if tables are JOINed by a column which contains newly-added data types (NaN, Infinity, -Infinity). The issue is not reproducible is a user is JOINing tables by a column containing other data types

3. Issue
AFFECTED_VERSION: drill-1.13.0-SNAPSHOT

AFFECTED_FUNCTIONALITY: ORDER BY, DESC

THIS ISSUE REFERS TO:~~DRILL-5919~~

ISSUE_DESCRIPTION: 'ORDER BY/DESC' clause behaves in different ways when sorting columns containing NaN values. In one case it considers NaN to be the largest value, in another - the smallest one.
Steps:

Select from the attached test file (orderBy.json, attached)

SELECT name, attr4 from dfs.tmp.`orderBy.json` order by name, attr4

Check the attached screen shot (orderByIssue.jpg):
EXPECTED_RESULT: It was expected the 'ORDER BY' clause to sort attr4 columns data in the same way (most probably NaN should be the largest, see NB)
ACTUAL_RESULT: attr4 column's values were sorted in different ways: for 'obj1'/'obj3' NaN is the largest, for 'obj2'/'obj4' NaN is the smallest.

NB: Postgres as well as Java's sorting (Collection.sort() / Arrays.sort() methods) treats NaN as the largest value

4. Issue
AFFECTED_VERSION: drill-1.13.0-SNAPSHOT

AFFECTED_FUNCTIONALITY: max(column_name), min(column_name),

ISSUE_DESCRIPTION: min/max aggregation functions return the same result if the selected column contains NaN value.

SELECT name, max(attr4), min(attr4) from dfs.tmp.`minMax.json` group by name

Result

name	Min	Max
obj1	NaN	NaN
obj2	NaN	NaN
obj3	NaN	NaN
obj4	NaN	NaN

As for me, this logic should be revised, current behavior is a bit confusing:

Postgres considers NaN to be the largest value, so MAX(col_withNaN) will return NaN, MIN(col_withNaN) - will return other value, so DRILL's min/max logic can be adjusted to the Postgres' one, see postgres.jpg
Or NAN can behave like NULL - DRILL's MIN/MAX functions ignore NULLs

Attachments

Issue Links

links to

GitHub Pull Request #1123

Activity

People

Assignee:: Volodymyr Tkach

Reporter:: Volodymyr Tkach

Reviewer:: Arina Ielchiieva

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 13/Feb/18 14:23

Updated:: 24/Feb/18 01:19

Resolved:: 24/Feb/18 01:19