  1. Apache Drill
  2. DRILL-5487

Vector corruption in CSV with headers and truncated last row

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.0
    • Fix Version/s: 1.17.0
    • Component/s: Storage - Text & CSV
    • Labels: None

    Description

      The CSV format plugin allows two ways of reading data:

      • As named columns
      • As a single array, called columns, that holds all columns for a row

      The named columns feature corrupts the offset vectors if the last row of the file is truncated, that is, if it leaves off one or more columns.

      To illustrate the CSV data corruption, I created a CSV file, test4.csv, of the following form:

      h,u
      abc,def
      ghi
      

      Note that the file is truncated: the comma and the second field are missing on the last line.
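For anyone reproducing this, the sample file can be generated rather than typed by hand. The following is a minimal sketch, not part of the original report: the class name `TruncatedCsvFile` is hypothetical, the target directory `/tmp/data/csv` is inferred from the workspace path and query used in the test, and the trailing newline after `ghi` is an assumption.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TruncatedCsvFile {

  /** Writes test4.csv into dir; the last row has no comma and no second field. */
  static Path write(Path dir) throws IOException {
    Files.createDirectories(dir);
    Path file = dir.resolve("test4.csv");
    // Header row, one complete row, one truncated row.
    // The trailing newline is an assumption about the original file.
    String content = "h,u\n" + "abc,def\n" + "ghi\n";
    Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    return file;
  }

  public static void main(String[] args) throws IOException {
    // Assumed location: workspace root /tmp/data plus the relative path csv/
    Path file = write(Paths.get("/tmp/data/csv"));
    System.out.println("Wrote " + file);
  }
}
```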

      Then, I created a simple test using the "cluster fixture" framework:

        @Test
        public void readerTest() throws Exception {
          // Run with a maximum parallelization of 1.
          FixtureBuilder builder = ClusterFixture.builder()
              .maxParallelization(1);

          try (ClusterFixture cluster = builder.build();
               ClientFixture client = cluster.clientFixture()) {
            // Read the first line as column headers (named columns)
            // rather than skipping it.
            TextFormatConfig csvFormat = new TextFormatConfig();
            csvFormat.fieldDelimiter = ',';
            csvFormat.skipFirstLine = false;
            csvFormat.extractHeader = true;
            // Define workspace dfs.data rooted at /tmp/data, using this CSV format config.
            cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
            String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
            client.queryBuilder().sql(sql).printCsv();
          }
        }
      

      The results show we've got a problem:

      Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
      IllegalArgumentException: length: -3 (expected: >= 0)
      

      If the last line were instead:

      ghi,
      

      Then the offset vector for the second column (u) should look like this:

      [0, 3, 3]
      

      Very likely we have an offset vector that looks like this instead:

      [0, 3, 0]
      

      When we compute the length of the second column in the second row, we should get:

      length = offset[2] - offset[1] = 3 - 3 = 0
      

      Instead we get:

      length = offset[2] - offset[1] = 0 - 3 = -3
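The arithmetic above can be checked with a few lines of code. This is a sketch using plain `int` arrays to stand in for the offset vector; it does not use Drill's vector classes, and the class name `OffsetLengths` is hypothetical.

```java
public class OffsetLengths {

  /**
   * Length of the value in the given row, where offsets has rowCount + 1
   * entries and offsets[row + 1] marks the end of that row's data.
   */
  static int length(int[] offsets, int row) {
    return offsets[row + 1] - offsets[row];
  }

  public static void main(String[] args) {
    int[] expected = {0, 3, 3}; // "def", then an empty value
    int[] corrupt = {0, 3, 0};  // last offset never written, left at 0

    System.out.println(length(expected, 1)); // prints 0
    System.out.println(length(corrupt, 1));  // prints -3
  }
}
```

The negative result is the same `-3` reported in the `IllegalArgumentException` above.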
      

      In summary, a premature EOF appears to cause the "missing" columns to be skipped; they are not filled with a blank value to "bump" the offset vectors and complete the last row. Instead, their offsets are left at 0, causing havoc downstream in the query.
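One way to picture the missing "bump" is a column writer that back-fills the offsets of any rows it skipped. The sketch below is a toy model under stated assumptions, not Drill's code and not necessarily how the issue was fixed: `BackfillSketch` and its methods are hypothetical names, the offset vector is a plain `int` array, and value length is taken from `String.length()`, which matches the byte length only for ASCII data.

```java
import java.util.Arrays;

public class BackfillSketch {
  private final int[] offsets;   // offsets[r + 1] is the end of row r's data
  private int lastWritten = -1;  // last row whose end offset has been set
  private int dataEnd = 0;       // current end of the data buffer

  BackfillSketch(int maxRows) {
    offsets = new int[maxRows + 1];
  }

  /** Writes a value for the given row, first back-filling any skipped rows. */
  void write(int row, String value) {
    fillEmpties(row);
    dataEnd += value.length();
    offsets[row + 1] = dataEnd;
    lastWritten = row;
  }

  /** Gives every unwritten row before `row` a zero-length value. */
  void fillEmpties(int row) {
    for (int r = lastWritten + 1; r < row; r++) {
      offsets[r + 1] = dataEnd; // repeat the end offset, so length is 0
    }
    lastWritten = Math.max(lastWritten, row - 1);
  }

  /** Completes the batch: back-fills trailing rows and returns the offsets. */
  int[] finish(int rowCount) {
    fillEmpties(rowCount);
    return Arrays.copyOf(offsets, rowCount + 1);
  }

  public static void main(String[] args) {
    BackfillSketch u = new BackfillSketch(10);
    u.write(0, "def");  // row "abc,def"
    // Row "ghi" never writes column u, so finish() must back-fill it.
    System.out.println(Arrays.toString(u.finish(2))); // prints [0, 3, 3]
  }
}
```

Without the call to `fillEmpties` in `finish`, the last entry would stay at 0 and the array would be `[0, 3, 0]`, the corrupt form described above.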

          People

            Assignee: Unassigned
            Reporter: Paul Rogers (paul-rogers)