Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.10.0
Fix Version/s: None
Component/s: None
Labels: None
Description
Drill's CSV column reader supports two forms of files:
- Files with column headers as the first line of the file.
- Files without column headers.
The CSV storage plugin specifies which format to use for files accessed via that storage plugin config.
Suppose we have a CSV file with headers:
a,b,c
10,foo,bar
Suppose we configure a storage plugin to use headers:
TextFormatConfig csvFormat = new TextFormatConfig();
csvFormat.fieldDelimiter = ',';
csvFormat.skipFirstLine = false;
csvFormat.extractHeader = true;
(The above can also be done using JSON when running Drill as a server.)
Suppose we execute this query:
SELECT columns FROM `dfs.data.example.csv`
The result is a single column, the special columns array, that contains all three fields.
Suppose we alter the query just a bit:
SELECT columns, a FROM `dfs.data.example.csv`
Now the result set is two non-nullable Varchar columns:
columns,a
,10
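It seems that the meaning of `columns` shifts depending on whether it is the only item in the SELECT list or appears alongside other columns: alone, it is the special array of all fields; with other columns, it is treated as an ordinary (here, missing) column name.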
Perhaps this handles the case of a file such as:
columns,values
a;b,10;10
c;d,20;30
That is fine, but what if I just want the first column:
SELECT columns FROM `dfs.data.strange.csv`
How would the code know if columns was the special column vs. the normal column called "columns"?
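The ambiguity can be made concrete with a small toy model. This is only a sketch of the behavior observed above, not Drill's actual resolution code: the class `ColumnsAmbiguity`, the enum `Meaning`, and the method `observed` are hypothetical names, and the third case assumes the observed rule applies unchanged to a file whose header contains `columns`.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Toy model of the behavior described in this report.
 * NOT Drill's code: all names here are hypothetical.
 */
public class ColumnsAmbiguity {

  enum Meaning { ALL_FIELDS_ARRAY, HEADER_COLUMN, MISSING_COLUMN }

  // Mimics the observed rule: `columns` alone means the array of all fields;
  // alongside other items it is treated as an ordinary column name, which
  // either matches a header or is missing from the file.
  static Meaning observed(List<String> headers, List<String> selectList) {
    if (selectList.size() == 1 && selectList.get(0).equals("columns")) {
      return Meaning.ALL_FIELDS_ARRAY;
    }
    return headers.contains("columns")
        ? Meaning.HEADER_COLUMN
        : Meaning.MISSING_COLUMN;
  }

  public static void main(String[] args) {
    List<String> example = Arrays.asList("a", "b", "c");
    List<String> strange = Arrays.asList("columns", "values");

    // SELECT columns FROM example.csv
    System.out.println(observed(example, Arrays.asList("columns")));
    // SELECT columns, a FROM example.csv
    System.out.println(observed(example, Arrays.asList("columns", "a")));
    // SELECT columns FROM strange.csv (assumed: the same rule applies)
    System.out.println(observed(strange, Arrays.asList("columns")));
  }
}
```

Under this model the third query resolves to the array of all fields, so the header column named "columns" cannot be selected on its own. The select list alone does not carry enough information to tell the two meanings apart.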
Perhaps one long-term solution is to make columns into a table function (as has been proposed for the implicit columns):
SELECT columns(t) FROM `dfs.data.strange.csv` AS t