Apache Drill / DRILL-5487

Vector corruption in CSV with headers and truncated last row

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.0
    • Fix Version/s: 1.17.0
    • Component/s: Storage - Text & CSV
    • Labels:
      None

      Description

      The CSV format plugin allows two ways of reading data:

      • As named columns
      • As a single array, called columns, that holds all columns for a row

      The named columns feature corrupts the offset vectors if the last row of the file is truncated, that is, if it leaves off one or more columns.

      To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:

      h,u
      abc,def
      ghi
      

      Note that the file is truncated: the comma and the second field are missing on the last line.
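
      To reproduce the input, the file can be generated programmatically. The following is a minimal sketch, not part of the original report: the class and method names are hypothetical, the location is inferred from the workspace root (/tmp/data) and query path (csv/test4.csv) used in the test, and it assumes every line, including the last, ends with a newline.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Drill) that writes the truncated CSV file.
class TruncatedCsvWriter {

  // Writes test4.csv under <workspaceRoot>/csv and returns the file's path.
  static Path writeTestFile(Path workspaceRoot) throws IOException {
    Path dir = Files.createDirectories(workspaceRoot.resolve("csv"));
    // Header plus two rows; the last row lacks the comma and the second field.
    // Assumption: each line, including the last, ends with a newline.
    String content = "h,u\nabc,def\nghi\n";
    return Files.write(dir.resolve("test4.csv"),
        content.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    // "/tmp/data" is the workspace root used by the test in this report.
    System.out.println("Wrote " + writeTestFile(Paths.get("/tmp/data")));
  }
}
```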

      Then, I created a simple test using the "cluster fixture" framework:

        @Test
        public void readerTest() throws Exception {
          FixtureBuilder builder = ClusterFixture.builder()
              .maxParallelization(1);
      
          try (ClusterFixture cluster = builder.build();
               ClientFixture client = cluster.clientFixture()) {
            TextFormatConfig csvFormat = new TextFormatConfig();
            csvFormat.fieldDelimiter = ',';
            csvFormat.skipFirstLine = false;
            csvFormat.extractHeader = true;
            cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
            String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
            client.queryBuilder().sql(sql).printCsv();
          }
        }
      

      The results show we've got a problem:

      Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
      IllegalArgumentException: length: -3 (expected: >= 0)
      

      If the last line instead ended with a comma, giving an empty second field, for example:

      efg,
      

      Then the offset vector for the second column should look like this:

      [0, 3, 3]
      

      Very likely we have an offset vector that looks like this instead:

      [0, 3, 0]
      

      When we compute the length of the second column's value in the second row, we should compute:

      length = offset[2] - offset[1] = 3 - 3 = 0
      

      Instead we get:

      length = offset[2] - offset[1] = 0 - 3 = -3
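
      The arithmetic above can be checked with a small standalone sketch. It models the offset vector as a plain int array rather than using Drill's vector classes, and the class and method names are hypothetical.

```java
// Simplified model of a variable-width column's offsets; not Drill's vector classes.
class OffsetLengthDemo {

  // Length of the value at rowIndex: the difference between adjacent offsets.
  static int valueLength(int[] offsets, int rowIndex) {
    return offsets[rowIndex + 1] - offsets[rowIndex];
  }

  public static void main(String[] args) {
    int[] expected = {0, 3, 3};  // "def", then an empty value
    int[] corrupt = {0, 3, 0};   // final entry was never written
    System.out.println(valueLength(expected, 1)); // prints 0
    System.out.println(valueLength(corrupt, 1));  // prints -3
  }
}
```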
      

      The summary is that a premature EOF appears to cause the "missing" columns to be skipped; they are not filled with a blank value to "bump" the offset vectors to fill in the last row. Instead, they are left at 0, causing havoc downstream in the query.
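
      One way to picture the missing "bump" is a writer that back-fills skipped rows by carrying the previous end offset forward. The sketch below is a simplified illustration under that assumption. It is not Drill's implementation and not necessarily how the issue was fixed, and every name in it is hypothetical.

```java
import java.util.Arrays;

// Hypothetical, simplified column writer; names are illustrative, not Drill's API.
class BackfillingColumnWriter {
  private final int[] offsets;      // offsets[i + 1] is the end of row i's value
  private int lastWrittenRow = -1;  // highest row whose end offset is valid

  BackfillingColumnWriter(int rowCapacity) {
    offsets = new int[rowCapacity + 1]; // zero-initialized, like a fresh buffer
  }

  // Records a value of the given byte length for rowIndex.
  void write(int rowIndex, int byteLength) {
    fillEmpties(rowIndex);
    offsets[rowIndex + 1] = offsets[rowIndex] + byteLength;
    lastWrittenRow = rowIndex;
  }

  // Gives every skipped row before rowIndex a zero-length value by
  // copying the previous end offset forward.
  void fillEmpties(int rowIndex) {
    for (int row = lastWrittenRow + 1; row < rowIndex; row++) {
      offsets[row + 1] = offsets[row];
    }
    lastWrittenRow = Math.max(lastWrittenRow, rowIndex - 1);
  }

  // Finishes the batch: back-fills through the final row, then returns
  // only the offsets that are in use.
  int[] finish(int rowCount) {
    fillEmpties(rowCount);
    return Arrays.copyOf(offsets, rowCount + 1);
  }

  public static void main(String[] args) {
    BackfillingColumnWriter u = new BackfillingColumnWriter(4);
    u.write(0, 3); // row 0: "def"
    // Row 1 ("ghi" with no second field): nothing is written for column u.
    System.out.println(Arrays.toString(u.finish(2))); // prints [0, 3, 3]
  }
}
```

      Without the back-fill in finish(), the last entry would stay at 0 and reproduce the [0, 3, 0] vector described above.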

            People

            • Assignee: Unassigned
            • Reporter: Paul Rogers (paul-rogers)