[DRILL-4614] Drill must appoint one data type per one column for self-describing data while querying directories - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.6.0
Fix Version/s: 1.7.0
Component/s: Execution - Data Types
Labels: None

Description

When Drill reads data from a directory and detects data types on the fly,
a single field can end up with several different data types.

For example:

1. Create an input file as follows
20K rows with the following -
{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
200 rows with the following -
{"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}

2. CTAS as follows

CREATE TABLE dfs.`tmp`.`tp` as select * from dfs.`data.json` t

In this case Drill creates the Parquet table as a folder containing two files.

3. Select the data

select t.others.additional from dfs.`tmp`.`tp` t

The query result is a mix of EXPR$0<INT(OPTIONAL)> and EXPR$0<VARCHAR(OPTIONAL)>.
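For anyone without the attached data.json, the input file from step 1 can be regenerated with a short script. This is a minimal sketch in Python: the function names and the output path are illustrative, and the script only follows the row counts and JSON shapes given in step 1, so it is not guaranteed to be byte-identical to the attachment.

```python
import json
from pathlib import Path

# Row shape used for the first 20K rows (copied from step 1).
BASE_ROW = {
    "some": "yes",
    "others": {"other": "true", "all": "false", "sometimes": "yes"},
}


def build_rows(n_base=20_000, n_extended=200):
    """Return the JSON lines: n_base plain rows, then n_extended rows with 'additional'."""
    extended_row = {
        **BASE_ROW,
        "others": {**BASE_ROW["others"], "additional": "last entries only"},
    }
    rows = [BASE_ROW] * n_base + [extended_row] * n_extended
    # Compact separators reproduce the spacing-free layout shown in step 1.
    return [json.dumps(row, separators=(",", ":")) for row in rows]


def write_data_file(path="data.json"):
    """Write one JSON object per line and return the number of rows written."""
    lines = build_rows()
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    print(write_data_file())
```

The rows carrying the "additional" field are written last, matching the order described in step 1.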

This happens because Drill determines the column data type per file.
The same result occurs with JSON files.
Because the streaming aggregate does not support schema changes, this issue makes it impossible to use aggregate functions on the query results.
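The per-file behavior can be illustrated with a toy model. The sketch below is not Drill code: `lookup` and `infer_column_type` are hypothetical helpers, the split of rows between the two files is assumed, and the fallback to INT(OPTIONAL) for a field that is absent from a whole file simply mirrors the types reported in this issue.

```python
def lookup(row, path):
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    value = row
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def infer_column_type(rows, path):
    """Toy per-file inference: the type is decided from this file's rows only."""
    if any(isinstance(lookup(row, path), str) for row in rows):
        return "VARCHAR(OPTIONAL)"
    # Assumption: a field absent from the whole file is reported as INT(OPTIONAL),
    # as seen in the query result described in this issue.
    return "INT(OPTIONAL)"


# Assumed split: one file holds only rows without 'additional', the other holds rows with it.
file_a = [{"some": "yes", "others": {"other": "true"}}]
file_b = [{"some": "yes", "others": {"other": "true", "additional": "last entries only"}}]

types = [infer_column_type(rows, "others.additional") for rows in (file_a, file_b)]
print(types)
```

Because each file is typed independently, the same column comes back with two different types, which is the schema change the streaming aggregate cannot handle.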

Attachments

data.json, 18/Apr/16 16:57, 1.45 MB, Vitalii Diravka

Issue Links

is related to: DRILL-3806 add metadata for untyped null and simple type promotion (Open)

relates to: DRILL-3577 Counting nested fields on CTAS-created-parquet file/s reports inaccurate results (Closed)

People

Assignee: Unassigned

Reporter: Vitalii Diravka

Votes: 0

Watchers: 1

Dates

Created: 18/Apr/16 16:49

Updated: 20/Apr/16 11:17

Resolved: 20/Apr/16 11:17