Apache Drill / DRILL-5548

SELECT * against an empty CSV file with headers produces error

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Component/s: Storage - Text & CSV
    • Labels: None

    Description

      Drill's CSV column reader supports two forms of files:

      • Files with column headers as the first line of the file.
      • Files without column headers.

      The CSV storage plugin configuration specifies which of these forms applies to the files accessed through that plugin.

      Suppose we have an empty file. When it is queried with the CSV plugin configured without headers, the query works: the schema returned is the `columns` Varchar array, and the result contains no rows. Good.
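      For reference, the no-headers behavior looks roughly like this (a sketch of the result shape, not captured output):

          SELECT * FROM `dfs.data.empty.csv`
          -- schema: a single `columns` column (repeated VARCHAR)
          -- rows:   none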

      Now, query the same file with the CSV plugin configured to use headers.

          TextFormatConfig csvFormat = new TextFormatConfig();
          csvFormat.fieldDelimiter = ',';
          csvFormat.skipFirstLine = false;
          csvFormat.extractHeader = true;
      

      (The above can also be done using JSON when running Drill as a server.)
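      For instance, the format section of the dfs storage plugin configuration would contain something like this (a sketch showing only the relevant properties):

          "csv": {
            "type": "text",
            "extensions": ["csv"],
            "delimiter": ",",
            "skipFirstLine": false,
            "extractHeader": true
          }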

      Then run the following query against the file:

      SELECT * FROM `dfs.data.empty.csv`
      

      We get the following exception:

      org.apache.drill.common.exceptions.UserRemoteException: 
      SYSTEM ERROR: IllegalStateException: 
      Incoming batch [#4, ProjectRecordBatch] has an empty schema. 
      This is not allowed.
      

      This particular case is a bit tricky. First, we want headers, but there are none. We can interpret this as an error (a file with headers must have headers). Or, we can treat it as a file that happens to have no columns. The latter choice is a bit more general.

      The file also has no data rows. This could be an error, or it too could just be treated as a result set of zero rows.

      Combined, the result set is one with no columns and no rows: an empty result set. This is actually a valid (if not very useful) result in SQL.

      A conversation with Jinfeng suggested that, in such a scenario, the reader is supposed to make up a dummy column so that the result is not empty. While this is a workaround, it just pushes the problem from the Project operator into each of the many record readers.

      Another alternative is to revert to the `columns` column: generate a result set with the `columns` array, but with no data. This solution avoids the empty-batch problem.
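      To make the two non-error interpretations concrete, here is a minimal, hypothetical sketch in plain Java (not Drill's actual reader code; the class and method names are made up for illustration):

          import java.io.BufferedReader;
          import java.io.IOException;
          import java.io.Reader;
          import java.util.Arrays;
          import java.util.Collections;
          import java.util.List;

          // Hypothetical helper, not part of Drill.
          public class EmptyCsvSchemaSketch {

            // Reads the header line when extractHeader is enabled. An empty file
            // yields an empty list: the "file that happens to have no columns"
            // interpretation (zero columns, zero rows). A reader could instead
            // fall back to the single `columns` array column here so that
            // downstream operators never see an empty schema.
            public static List<String> headerColumns(Reader in) throws IOException {
              BufferedReader reader = new BufferedReader(in);
              String header = reader.readLine();
              if (header == null) {
                return Collections.emptyList();
              }
              // Simplified split; a real CSV reader must handle quoting and escapes.
              return Arrays.asList(header.split(","));
            }
          }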


          People

            Assignee: Unassigned
            Reporter: Paul Rogers
