[DRILL-5487] Vector corruption in CSV with headers and truncated last row - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.10.0
Fix Version/s: 1.17.0
Component/s: Storage - Text & CSV
Labels: None

Description

The CSV format plugin allows two ways of reading data:

As named columns
As a single array, called columns, that holds all columns for a row

The named columns feature corrupts the offset vectors if the last row of the file is truncated, that is, if it leaves off one or more columns.

To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:

h,u
abc,def
ghi

Note that the file is truncated: the comma and the second field are missing on the last line.
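One way to produce this file from Java is sketched below. The class `TruncatedCsvFile` and its `write` method are hypothetical helpers, not part of Drill. The `csv/test4.csv` location under the workspace root matches the query used in this report, and ending the file without a final newline is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Drill) that writes the truncated CSV file.
final class TruncatedCsvFile {

  // Header line, one complete row, and a last row missing its comma and second field.
  // Assumption: no newline after the last row.
  static final String CONTENT = "h,u\nabc,def\nghi";

  // Writes csv/test4.csv under the given workspace root and returns its path.
  static Path write(Path workspaceRoot) throws IOException {
    Path dir = Files.createDirectories(workspaceRoot.resolve("csv"));
    Path file = dir.resolve("test4.csv");
    Files.write(file, CONTENT.getBytes(StandardCharsets.UTF_8));
    return file;
  }

  public static void main(String[] args) throws IOException {
    // /tmp/data is the workspace root used by the test in this report.
    System.out.println(write(Paths.get("/tmp/data")));
  }
}
```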

Then, I created a simple test using the "cluster fixture" framework:

  @Test
  public void readerTest() throws Exception {
    // Limit parallelization to 1.
    FixtureBuilder builder = ClusterFixture.builder()
        .maxParallelization(1);

    try (ClusterFixture cluster = builder.build();
         ClientFixture client = cluster.clientFixture()) {
      // Comma-delimited text; take column names from the header line.
      TextFormatConfig csvFormat = new TextFormatConfig();
      csvFormat.fieldDelimiter = ',';
      csvFormat.skipFirstLine = false;
      csvFormat.extractHeader = true;
      // Map the dfs.data workspace to /tmp/data using the CSV format above.
      cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
      String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
      // Run the query and print the result as CSV.
      client.queryBuilder().sql(sql).printCsv();
    }
  }

The results show we've got a problem:

Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
IllegalArgumentException: length: -3 (expected: >= 0)

If the last line instead kept its comma, leaving an empty second field:

ghi,

Then the offset vector for the second column, u, should look like this:

[0, 3, 3]

Very likely we have an offset vector that looks like this instead:

[0, 3, 0]

When we compute the length of the second column's value in the second row, we should get:

length = offset[2] - offset[1] = 3 - 3 = 0

Instead we get:

length = offset[2] - offset[1] = 0 - 3 = -3
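The same arithmetic can be reproduced in a few lines of Java. This is only a sketch: a plain `int[]` stands in for Drill's offset vector, and `OffsetLengthSketch` and `valueLength` are hypothetical names, not Drill APIs.

```java
// Illustrative only: a plain int[] stands in for Drill's offset vector.
final class OffsetLengthSketch {

  // Length of the value in a 0-based row: end offset minus start offset.
  static int valueLength(int[] offsets, int row) {
    return offsets[row + 1] - offsets[row];
  }

  public static void main(String[] args) {
    int[] expected = {0, 3, 3};   // "def", then an empty value
    int[] corrupted = {0, 3, 0};  // last entry never written
    System.out.println(valueLength(expected, 1));   // prints 0
    System.out.println(valueLength(corrupted, 1));  // prints -3
  }
}
```

The negative result matches the `length: -3` reported in the exception message.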

In summary, a premature EOF appears to cause the "missing" columns to be skipped: they are not filled with a blank value to "bump" the offset vectors for the last row. Instead, their offsets are left at 0, causing havoc downstream in the query.
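The "bump" described here could look roughly like the sketch below, which repeats the current end offset for every row that received no value. It models a single variable-width column with a plain `int[]`; the class `OffsetFillSketch` and its methods are hypothetical and do not represent Drill's actual vector code or the fix that was applied.

```java
import java.util.Arrays;

// Illustrative only: models one variable-width column with a plain int[] offset array.
final class OffsetFillSketch {
  private final int[] offsets;
  private int lastWrittenRow = -1; // index of the last row that has an offset entry
  private int dataEnd = 0;         // end of the data written so far

  OffsetFillSketch(int rowCount) {
    offsets = new int[rowCount + 1]; // offsets[0] is always 0
  }

  // Records a value of the given byte length for a row, filling skipped rows first.
  void write(int row, int length) {
    fillEmpties(row);
    dataEnd += length;
    offsets[row + 1] = dataEnd;
    lastWrittenRow = row;
  }

  // Gives every skipped row before `row` a zero-length value
  // by repeating the current end offset.
  void fillEmpties(int row) {
    for (int r = lastWrittenRow + 1; r < row; r++) {
      offsets[r + 1] = dataEnd;
    }
    lastWrittenRow = Math.max(lastWrittenRow, row - 1);
  }

  // Call at end of batch: fills rows that never received a value.
  int[] finish(int rowCount) {
    fillEmpties(rowCount);
    return Arrays.copyOf(offsets, rowCount + 1);
  }

  public static void main(String[] args) {
    OffsetFillSketch u = new OffsetFillSketch(2);
    u.write(0, 3); // "def" in the first row
    // The second row is truncated, so column u gets no write.
    System.out.println(Arrays.toString(u.finish(2))); // prints [0, 3, 3]
  }
}
```

Without the call to `fillEmpties` in `finish`, the last entry would stay at its initial 0, giving the `[0, 3, 0]` vector suspected above.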

People

Assignee: Unassigned

Reporter: Paul Rogers

Votes: 0

Watchers: 3

Dates

Created: 08/May/17 21:10

Updated: 25/Jun/19 13:33

Resolved: 25/Jun/19 13:33