[DRILL-5551] `columns` changes meaning for CSV files depending on query - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.10.0
Fix Version/s: None
Component/s: None
Labels: None

Description

Drill's CSV column reader supports two forms of files:

Files with column headers as the first line of the file.
Files without column headers.

The CSV format configuration of a storage plugin specifies which of the two forms to expect for files accessed through that plugin.

Suppose we have a CSV file with headers:

a,b,c
10,foo,bar

Suppose we configure a storage plugin to use headers:

    TextFormatConfig csvFormat = new TextFormatConfig();
    csvFormat.fieldDelimiter = ',';
    csvFormat.skipFirstLine = false;
    csvFormat.extractHeader = true;

(The above can also be done using JSON when running Drill as a server.)

Suppose we execute this query:

SELECT columns FROM `dfs.data.example.csv`

The result is a single column, the special `columns` array, which contains all three fields.

Suppose we alter the query just a bit:

SELECT columns, a FROM `dfs.data.example.csv`

Now the result set is two non-nullable Varchar columns:

columns,a
,10

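It seems that the meaning of `columns` shifts depending on whether it appears by itself or alongside other columns in the SELECT list. By itself, it is the special array of all fields. Alongside `a`, it is apparently treated as an ordinary column named `columns`, which does not exist in this file and so comes back empty.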
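The two results above can be reproduced with a small model. This is only a sketch of the observed behavior, not Drill's reader code: the class `ColumnsProjectionSketch` and its `project` method are hypothetical names, and the rule it encodes (a lone `columns` yields the array of all fields, otherwise every name is matched against the header line and an unmatched name becomes an empty string) is an assumption inferred from the two queries in this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the behavior reported in this issue; not Drill code.
final class ColumnsProjectionSketch {

    // Projects one data row of a headered CSV file against a SELECT list.
    static Map<String, Object> project(List<String> headers,
                                       List<String> row,
                                       List<String> selectList) {
        Map<String, Object> result = new LinkedHashMap<>();

        // Assumed rule 1: `columns` on its own is the special array of all fields.
        if (selectList.size() == 1 && selectList.get(0).equals("columns")) {
            result.put("columns", new ArrayList<>(row));
            return result;
        }

        // Assumed rule 2: otherwise every name, including `columns`, is looked up
        // in the header line; a name with no match becomes an empty string,
        // mirroring the empty non-nullable Varchar seen in the second query.
        for (String name : selectList) {
            int index = headers.indexOf(name);
            result.put(name, index >= 0 ? row.get(index) : "");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("a", "b", "c");
        List<String> row = Arrays.asList("10", "foo", "bar");

        // SELECT columns FROM ...
        System.out.println(project(headers, row, Arrays.asList("columns")));
        // SELECT columns, a FROM ...
        System.out.println(project(headers, row, Arrays.asList("columns", "a")));
    }
}
```

Under this model the same identifier produces a list in the first call and an empty string in the second, which is the inconsistency this issue describes.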

Perhaps this handles the case of a file such as:

columns,values
a;b,10;10
c;d,20;30

That is fine, but what if I just wanted the first column:

SELECT columns FROM `dfs.data.strange.csv`

How would the code know whether `columns` refers to the special array or to the ordinary column named "columns"?

Perhaps one long-term solution is to make `columns` into a table function (as has been proposed for the implicit columns):

SELECT columns(t) FROM `dfs.data.strange.csv` AS t
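To illustrate why a function form removes the ambiguity, here is a minimal sketch using the file with a real "columns" header. Everything in it is hypothetical: `TableFunctionSketch`, `Row`, `column`, and `columns` are illustrative names, not Drill APIs, and raising an error for an unknown column name is a choice made for the sketch rather than something this issue proposes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the table-function idea; not a Drill API.
final class TableFunctionSketch {

    // One data row of a CSV file that has a header line.
    static final class Row {
        private final List<String> headers;
        private final List<String> fields;

        Row(List<String> headers, List<String> fields) {
            this.headers = headers;
            this.fields = fields;
        }

        // Models a plain identifier in the SELECT list: always a named column,
        // even when that name happens to be "columns".
        String column(String name) {
            int index = headers.indexOf(name);
            if (index < 0) {
                throw new IllegalArgumentException("No such column: " + name);
            }
            return fields.get(index);
        }
    }

    // Models `columns(t)`: always the array of all fields in the row.
    static List<String> columns(Row t) {
        return new ArrayList<>(t.fields);
    }

    public static void main(String[] args) {
        // First data row of the file with headers "columns,values".
        Row t = new Row(Arrays.asList("columns", "values"),
                        Arrays.asList("a;b", "10;10"));

        System.out.println(t.column("columns")); // the ordinary column
        System.out.println(columns(t));          // the array of all fields
    }
}
```

Because the array is reached only through the function and the named column only through the identifier, neither form depends on what else appears in the SELECT list.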

People

Assignee: Unassigned

Reporter: Paul Rogers

Votes: 0

Watchers: 1

Dates

Created: 29/May/17 22:28

Updated: 29/May/17 22:28