  1. Apache Drill
  2. DRILL-5551

`columns` changes meaning for CSV files depending on query


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • None
    • None
    • None

    Description

      Drill's CSV column reader supports two forms of files:

      • Files with column headers as the first line of the file.
      • Files without column headers.

      The format configuration in the storage plugin determines which of the two forms is assumed for every file accessed through that plugin.

      Suppose we have a CSV file with headers:

      a,b,c
      10,foo,bar
      

      Suppose we configure a storage plugin to use headers:

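          // Format config for comma-delimited text whose first line holds the column names.
          TextFormatConfig csvFormat = new TextFormatConfig();
          csvFormat.fieldDelimiter = ',';
          csvFormat.skipFirstLine = false;  // keep the first line rather than discarding it
          csvFormat.extractHeader = true;   // read the first line as the column headers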
      

      (The above can also be done using JSON when running Drill as a server.)

      Suppose we execute this query:

      SELECT columns FROM dfs.data.`example.csv`
      

      The result is a single column, the special `columns` array, which contains all three fields.

      Suppose we alter the query just a bit:

      SELECT columns, a FROM dfs.data.`example.csv`
      

      Now the result set is two non-nullable Varchar columns:

      columns,a
      ,10
      

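      It seems that the meaning of `columns` shifts depending on whether it is the only item in the SELECT list or appears alongside other columns: alone, it is the special array of all fields; with other columns, it is treated as an ordinary (here, nonexistent and therefore empty) column.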
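
      The two results above can be modelled with a small sketch. This is a hypothetical model of the reported behavior only, not Drill's reader code: the class and method names are invented for illustration, and the blank value shown for `columns` in the second result is assumed to be an empty string because the column is non-nullable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the behavior described in this report; not Drill code.
class ColumnsProjectionModel {

    // Projects one data row of a CSV file that has a header line.
    // headers: names from the first line; row: the field values;
    // projection: the names in the SELECT list, in order.
    static Map<String, Object> project(List<String> headers,
                                       List<String> row,
                                       List<String> projection) {
        Map<String, Object> result = new LinkedHashMap<>();

        // Case 1: `columns` is the only item, so it is treated as the
        // special array holding every field of the row.
        if (projection.equals(Arrays.asList("columns"))) {
            result.put("columns", new ArrayList<>(row));
            return result;
        }

        // Case 2: every name, `columns` included, is looked up in the header.
        // A name missing from the header yields an empty string (assumption).
        for (String name : projection) {
            int index = headers.indexOf(name);
            result.put(name, index >= 0 ? row.get(index) : "");
        }
        return result;
    }
}
```

      With the header `a,b,c` and the row `10,foo,bar`, projecting `columns` alone gives the three-element array, while projecting `columns, a` gives an empty string and `10`, matching the two results reported above.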

      Perhaps this handles the case of a file such as:

      columns,values
      a;b,10;10
      c;d,20;30
      

      That is fine, but what if I want just the first column?

      SELECT columns FROM dfs.data.`strange.csv`
      

      How would the code know whether `columns` refers to the special array of all fields or to the ordinary column that the header names "columns"?
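
      To make the ambiguity concrete, here is a minimal sketch of the two readings applied to the first data row of the file above. Both helper methods are hypothetical and exist only to show that the same query text has two defensible answers; neither is taken from Drill.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the two readings of `SELECT columns`; not Drill code.
class ColumnsAmbiguityDemo {

    // Reading A: `columns` is the special array of every field in the row.
    static List<String> asSpecialArray(List<String> headers, List<String> row) {
        return new ArrayList<>(row);
    }

    // Reading B: `columns` is the ordinary column whose header is "columns".
    static String asNamedColumn(List<String> headers, List<String> row) {
        int index = headers.indexOf("columns");
        if (index < 0) {
            throw new IllegalArgumentException("no header named 'columns'");
        }
        return row.get(index);
    }
}
```

      For the header `columns,values` and the row `a;b,10;10`, reading A returns both fields (`a;b` and `10;10`), while reading B returns only `a;b`. Nothing in the query text selects between them.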

      Perhaps one long-term solution is to turn `columns` into a table function (as has been proposed for the implicit columns):

      SELECT columns(t) FROM dfs.data.`strange.csv` AS t
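
      A function-call form would let a resolver tell the two meanings apart from syntax alone. The sketch below is purely illustrative: the `columns(alias)` syntax is only the proposal made here, and the class, enum, and pattern are hypothetical names, not part of Drill.

```java
import java.util.regex.Pattern;

// Hypothetical classifier for the proposed syntax; not Drill code.
class SelectItemResolver {

    enum Kind { ALL_FIELDS_ARRAY, NAMED_COLUMN }

    // Matches the proposed call form, for example columns(t),
    // where the argument is a table alias.
    private static final Pattern COLUMNS_CALL = Pattern.compile(
        "(?i)columns\\s*\\(\\s*[A-Za-z_][A-Za-z0-9_]*\\s*\\)");

    // A call form means the array of all fields; any bare identifier,
    // including one spelled "columns", is an ordinary column name.
    static Kind classify(String selectItem) {
        return COLUMNS_CALL.matcher(selectItem.trim()).matches()
            ? Kind.ALL_FIELDS_ARRAY
            : Kind.NAMED_COLUMN;
    }
}
```

      Under this scheme `columns(t)` always means the array of all fields and a bare `columns` always means the header-named column, so the result no longer depends on what else is in the SELECT list.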
      

          People

            Assignee: Unassigned
            Reporter: Paul Rogers