Apache Drill / DRILL-5491 (sub-task of DRILL-5498, "CSV text reader does not handle duplicate header names")

NPE when reading a CSV file, with headers, but blank header line



    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0
    • Fix Version/s: 1.17.0
    • Component/s: None
    • Labels: None


      See DRILL-5490 for background.

      Try this unit test case:

          FixtureBuilder builder = ClusterFixture.builder();
          try (ClusterFixture cluster = builder.build();
               ClientFixture client = cluster.clientFixture()) {
            TextFormatConfig csvFormat = new TextFormatConfig();
            csvFormat.fieldDelimiter = ',';
            csvFormat.skipFirstLine = false;
            csvFormat.extractHeader = true;
            cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
            String sql = "SELECT * FROM `dfs.data`.`csv/test7.csv`";
            // Run sql with the client here; the query fails as shown below.
          }

      The test can also be run as a query using your favorite client.

      Using this input file:


      (The first line is blank.)
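
      For convenience, a file of this shape can be generated with a small helper. This is only a sketch: the class name BlankHeaderFile and the data rows are hypothetical (the report specifies only that the first line is blank), while the path matches the workspace root and file name used in the test above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for reproducing the issue; not part of Drill.
class BlankHeaderFile {

  /** Writes a CSV file whose first (header) line is blank, followed by the given rows. */
  static Path write(Path file, List<String> dataRows) throws IOException {
    List<String> lines = new ArrayList<>();
    lines.add("");            // the blank header line that triggers the bug
    lines.addAll(dataRows);   // any rows; their content is an assumption
    Files.createDirectories(file.getParent());
    return Files.write(file, lines);
  }

  public static void main(String[] args) throws IOException {
    // /tmp/data is the workspace root from the test; csv/test7.csv is the queried file.
    Path file = Paths.get("/tmp/data/csv/test7.csv");
    write(file, Arrays.asList("10,foo", "20,bar"));
  }
}
```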

      The following is the result:

      Exception (no rows returned): org.apache.drill.common.exceptions.UserRemoteException: 
      SYSTEM ERROR: NullPointerException

      The RepeatedVarCharOutput class tries (but fails for the reasons outlined in DRILL-5490) to detect this case.

      The code crashes here in CompliantTextRecordReader.extractHeader():

          String [] fieldNames = ((RepeatedVarCharOutput)hOutput).getTextOutput();

      Because of bad code in RepeatedVarCharOutput.getTextOutput():

        public String [] getTextOutput () throws ExecutionSetupException {
          if (recordCount == 0 || fieldIndex == -1) {
            return null;
          }
          if (this.recordStart != characterData) {
            throw new ExecutionSetupException("record text was requested before finishing record");
          }
          // ...
        }

      Since there is no text on the line, special code elsewhere (see DRILL-5490) elects not to increment recordCount. (BTW: recordCount is the total count across batches; the in-batch count, batchIndex, was probably what was wanted here.) Since the count is zero, getTextOutput() returns null.

      The author probably expected a zero-length record in this case, for which the second if-statement would throw an exception. But see DRILL-5490 for why this code does not actually work.

      The result is one bug (not incrementing the record count), triggering another (returning a null), which masks a third (recordStart is not set correctly so the exception would not be thrown.)

      All that bad code is just fun and games until we get an NPE, however.
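
      The null-return-then-dereference chain described above can be sketched in isolation. This is a minimal illustration using hypothetical stand-in methods (HeaderSketch, unguardedFieldCount, guardedFieldNames), not Drill's classes; the guard shown is one possible defensive check and not necessarily the fix that was committed.

```java
import java.util.Arrays;

// Hypothetical stand-ins for illustration; these are not Drill's classes.
class HeaderSketch {

  /** Mimics the reported behavior: a blank header line yields null, not an empty array. */
  static String[] getTextOutput(String headerLine) {
    if (headerLine.isEmpty()) {
      return null; // stands in for the "recordCount == 0" early return
    }
    return headerLine.split(",");
  }

  /** Unguarded caller: dereferences the result and fails with an NPE on a blank header. */
  static int unguardedFieldCount(String headerLine) {
    String[] fieldNames = getTextOutput(headerLine);
    return fieldNames.length; // throws NullPointerException when fieldNames is null
  }

  /** Guarded caller: turns the null into a descriptive error instead of an NPE. */
  static String[] guardedFieldNames(String headerLine) {
    String[] fieldNames = getTextOutput(headerLine);
    if (fieldNames == null) {
      throw new IllegalStateException("CSV header line is blank; no column names found");
    }
    return fieldNames;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(guardedFieldNames("a,b,c")));
    try {
      unguardedFieldCount("");
    } catch (NullPointerException e) {
      System.out.println("unguarded caller: NullPointerException");
    }
    try {
      guardedFieldNames("");
    } catch (IllegalStateException e) {
      System.out.println("guarded caller: " + e.getMessage());
    }
  }
}
```

      Checking for null at the call site at least replaces the bare SYSTEM ERROR with a message that points the user at the blank header line.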




      Assignee: Paul Rogers (paul-rogers)
      Reporter: Paul Rogers (paul-rogers)