Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version: 1.10.0
- Labels: None
Description
The CSV format plugin allows two ways of reading data:
- As named columns
- As a single array, called columns, that holds all columns for a row
The named-columns feature corrupts the offset vectors if the last row of the file is truncated, that is, if it leaves off one or more columns.
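For context, the two access styles look like this in SQL (sketched here as query strings; the table paths are illustrative only, not from the original report):

// Illustrative query strings only; the file path is hypothetical.
String named = "SELECT a, b FROM `dfs.data`.`csv/file.csv`";                    // extractHeader = true
String array = "SELECT columns[0], columns[1] FROM `dfs.data`.`csv/file.csv`";  // headers not extracted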
To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:
h,u
abc,def
ghi
Note that the file is truncated: the comma and second field are missing on the last line.
Then, I created a simple test using the "cluster fixture" framework:
@Test
public void readerTest() throws Exception {
  FixtureBuilder builder = ClusterFixture.builder()
      .maxParallelization(1);
  try (ClusterFixture cluster = builder.build();
       ClientFixture client = cluster.clientFixture()) {
    // Configure a comma-delimited format that extracts column names
    // from the file's header line.
    TextFormatConfig csvFormat = new TextFormatConfig();
    csvFormat.fieldDelimiter = ',';
    csvFormat.skipFirstLine = false;
    csvFormat.extractHeader = true;
    cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);

    String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
    client.queryBuilder().sql(sql).printCsv();
  }
}
The results show we've got a problem:
Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: IllegalArgumentException: length: -3 (expected: >= 0)
If the last line had instead included the trailing comma (an explicit, empty second field):

ghi,
Then the offset vector should look like this:
[0, 3, 3]
Very likely we have an offset vector that looks like this instead:
[0, 3, 0]
When we compute the length of the second column in the second row, we should get:
length = offset[2] - offset[1] = 3 - 3 = 0
Instead we get:
length = offset[2] - offset[1] = 0 - 3 = -3
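To make the arithmetic concrete, here is a standalone sketch (plain Java with bare int arrays, not Drill's value-vector classes) of both cases:

// Offset vector for the second column across the two data rows.
public class OffsetDemo {
  public static void main(String[] args) {
    int[] expected = {0, 3, 3}; // "def" occupies [0, 3); the empty field occupies [3, 3)
    int[] actual   = {0, 3, 0}; // final entry never written; left at its initial 0

    int row = 1; // second row, zero-based
    System.out.println(expected[row + 1] - expected[row]); // 3 - 3 =  0
    System.out.println(actual[row + 1] - actual[row]);     // 0 - 3 = -3, the reported error
  }
}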
In summary, a premature EOF appears to cause the "missing" columns to be skipped: they are not filled with a blank value to "bump" the offset vectors for the last row. Instead, the final offset entries are left at 0, causing havoc downstream in the query.
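A plausible shape for a fix (a sketch only; the method and its setting are hypothetical, not Drill's actual reader code) is to detect EOF mid-row and record a zero-length value for each unfilled column, which carries the previous offset forward:

// Hypothetical sketch, not Drill code. On a premature EOF, copy the
// previous offset forward for each column that received no value; this
// is equivalent to writing an empty ("") field into that column.
static void patchOffsetsForTruncatedRow(int[][] offsets, int row, int columnsRead) {
  for (int col = columnsRead; col < offsets.length; col++) {
    offsets[col][row + 1] = offsets[col][row]; // length 0; offsets stay monotonic
  }
}

With the example above, the second column's offsets would be patched from [0, 3, 0] to [0, 3, 3], yielding a length of 0 instead of -3.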