Apache Drill / DRILL-5548

SELECT * against an empty CSV file with headers produces error

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Component/s: Storage - Text & CSV
    • Labels: None

    Description

      Drill's CSV column reader supports two forms of files:

      • Files with column headers as the first line of the file.
      • Files without column headers.

      The CSV storage plugin configuration specifies which of these forms applies to the files accessed through that plugin.

      Suppose we have an empty file. When it is queried with the CSV plugin configured without headers, the query works: the schema returned is the `columns` Varchar array, and the result contains no rows. Good.
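      For reference, the no-headers behavior looks roughly like this (a sketch of the result shape, not captured output):

          SELECT * FROM `dfs.data.empty.csv`
          -- schema: a single `columns` column (repeated VARCHAR)
          -- rows:   none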

      Now, query the same file with the CSV plugin configured to use headers.

          TextFormatConfig csvFormat = new TextFormatConfig();
          csvFormat.fieldDelimiter = ',';
          csvFormat.skipFirstLine = false;
          csvFormat.extractHeader = true;
      

      (The above can also be done using JSON when running Drill as a server.)
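      For instance, the format section of the dfs storage plugin configuration would contain something like this (a sketch showing only the relevant properties):

          "csv": {
            "type": "text",
            "extensions": ["csv"],
            "delimiter": ",",
            "skipFirstLine": false,
            "extractHeader": true
          }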

      Then run the following query against the file:

      SELECT * FROM `dfs.data.empty.csv`
      

      We get the following exception:

      org.apache.drill.common.exceptions.UserRemoteException: 
      SYSTEM ERROR: IllegalStateException: 
      Incoming batch [#4, ProjectRecordBatch] has an empty schema. 
      This is not allowed.
      

      This particular case is a bit tricky. First, we want headers, but there are none. We can interpret this as an error (a file with headers must have headers). Or, we can treat it as a file that happens to have no columns. The latter choice is a bit more general.

      The file also has no data rows. This could be an error, or it too could just be treated as a result set of zero rows.

      Combined, the result set is one with no columns and no rows: an empty result set. This is actually a valid (if not very useful) result in SQL.

      A conversation with Jinfeng suggested that, in such a scenario, the reader is supposed to make up a dummy column so that the result is not empty. While this is a workaround, it just pushes the problem from the Project operator into each of the many record readers.

      Another alternative is to revert to the `columns` column: generate a result set with the `columns` array, but with no data. This solution avoids the empty-batch problem.
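      To make the two non-error interpretations concrete, here is a minimal, hypothetical sketch in plain Java (not Drill's actual reader code; the class and method names are made up for illustration):

          import java.io.BufferedReader;
          import java.io.IOException;
          import java.io.Reader;
          import java.util.Arrays;
          import java.util.Collections;
          import java.util.List;

          // Hypothetical helper, not part of Drill.
          public class EmptyCsvSchemaSketch {

            // Reads the header line when extractHeader is enabled. An empty file
            // yields an empty list: the "file that happens to have no columns"
            // interpretation (zero columns, zero rows). A reader could instead
            // fall back to the single `columns` array column here so that
            // downstream operators never see an empty schema.
            public static List<String> headerColumns(Reader in) throws IOException {
              BufferedReader reader = new BufferedReader(in);
              String header = reader.readLine();
              if (header == null) {
                return Collections.emptyList();
              }
              // Simplified split; a real CSV reader must handle quoting and escapes.
              return Arrays.asList(header.split(","));
            }
          }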


          People

            Assignee: Unassigned
            Reporter: Paul Rogers
