Apache Drill / DRILL-5491 (sub-task of DRILL-5498: CSV text reader does not handle duplicate header names)

NPE when reading a CSV file, with headers, but blank header line

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0
    • Fix Version/s: 1.17.0
    • Component/s: None
    • Labels: None

    Description

      See DRILL-5490 for background.

      Try this unit test case:

        // Hypothetical method name; the original snippet omitted the declaration.
        @Test
        public void testBlankHeaderLine() throws Exception {
          FixtureBuilder builder = ClusterFixture.builder()
              .maxParallelization(1);

          try (ClusterFixture cluster = builder.build();
               ClientFixture client = cluster.clientFixture()) {
            // Read CSV with the first line treated as the header row.
            TextFormatConfig csvFormat = new TextFormatConfig();
            csvFormat.fieldDelimiter = ',';
            csvFormat.skipFirstLine = false;
            csvFormat.extractHeader = true;
            cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
            String sql = "SELECT * FROM `dfs.data`.`csv/test7.csv`";
            client.queryBuilder().sql(sql).printCsv();
          }
        }
      

      The test can also be run as a query using your favorite client.

      Using this input file:

      a,b,c
      d,e,f
      

      (The first line of the file is blank: an empty line precedes a,b,c.)
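
      To make the input unambiguous, here is a minimal sketch that writes such a file with the blank header line spelled out as an explicit "\n". The class name BlankHeaderInput and the write() helper are made up for illustration; the csv/test7.csv path mirrors the query in the test case above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of Drill: creates the input file for the test.
final class BlankHeaderInput {

  // Line 1 is empty (the blank header line); the data rows follow.
  static final String CONTENT = "\n" + "a,b,c\n" + "d,e,f\n";

  // Writes <dir>/csv/test7.csv and returns its path.
  static Path write(Path dir) throws IOException {
    Path csvDir = dir.resolve("csv");
    Files.createDirectories(csvDir);
    Path file = csvDir.resolve("test7.csv");
    Files.write(file, CONTENT.getBytes(StandardCharsets.UTF_8));
    return file;
  }

  public static void main(String[] args) throws IOException {
    // /tmp/data is the workspace root used in the test case.
    Path file = write(Paths.get("/tmp/data"));
    List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
    System.out.println("first line empty: " + lines.get(0).isEmpty());
  }
}
```

      Running main() once before the test should leave the file where the workspace definition expects it.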

      The following is the result:

      Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: 
      SYSTEM ERROR: NullPointerException
      

      The RepeatedVarCharOutput class tries (but fails for the reasons outlined in DRILL-5490) to detect this case.

      The code crashes here in CompliantTextRecordReader.extractHeader():

          String [] fieldNames = ((RepeatedVarCharOutput)hOutput).getTextOutput();
      

      Because of bad code in RepeatedVarCharOutput.getTextOutput():

        public String [] getTextOutput () throws ExecutionSetupException {
          if (recordCount == 0 || fieldIndex == -1) {
            return null;
          }
      
          if (this.recordStart != characterData) {
            throw new ExecutionSetupException("record text was requested before finishing record");
          }
      

      Since there is no text on the line, special code elsewhere (see DRILL-5490) elects not to increment recordCount. (BTW: recordCount is the total count across batches; the in-batch count, batchIndex, was probably what was wanted here.) Since the count is zero, we return null.

      The author probably expected a zero-length record in this case, for which the second if-statement would throw an exception. But see DRILL-5490 for why that code does not actually work.

      The result is one bug (not incrementing the record count) triggering another (returning null), which masks a third (recordStart is not set correctly, so the exception would not be thrown).

      All that bad code is just fun and games until we get an NPE, however.
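
      The chain above can be reproduced in miniature. The following is only a sketch under stated assumptions: HeaderOutput is a simplified stand-in written for this example, not Drill's RepeatedVarCharOutput, and the guarded caller shows one possible defensive approach, not necessarily the fix that shipped.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; none of these classes exist in Drill.
final class BlankHeaderSketch {

  // Simplified stand-in for the header output buffer.
  static final class HeaderOutput {
    private final List<String> fields = new ArrayList<>();
    private int recordCount;

    void finishRecord(String line) {
      if (line.isEmpty()) {
        // Models the first bug: a blank line does not bump the count.
        return;
      }
      for (String field : line.split(",", -1)) {
        fields.add(field);
      }
      recordCount++;
    }

    // Models the second bug: null is returned when no record was counted.
    String[] getTextOutput() {
      if (recordCount == 0) {
        return null;
      }
      return fields.toArray(new String[0]);
    }
  }

  // Caller that trusts the result, as extractHeader() does: NPE on a blank header.
  static int unsafeHeaderWidth(String headerLine) {
    HeaderOutput out = new HeaderOutput();
    out.finishRecord(headerLine);
    return out.getTextOutput().length;
  }

  // Guarded caller (an assumption of this sketch): treat null as "no header fields".
  static String[] safeHeader(String headerLine) {
    HeaderOutput out = new HeaderOutput();
    out.finishRecord(headerLine);
    String[] names = out.getTextOutput();
    return names == null ? new String[0] : names;
  }

  public static void main(String[] args) {
    System.out.println("fields with guard: " + safeHeader("").length);
    try {
      unsafeHeaderWidth("");
    } catch (NullPointerException e) {
      System.out.println("unguarded caller threw NullPointerException");
    }
  }
}
```

      Whether an empty header should yield zero columns or a descriptive user error is a design choice; the sketch only shows that the caller must not dereference the result blindly.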

          People

            Assignee and reporter: Paul Rogers (paul-rogers)