Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7083

Wrong data type for explicit partition column beyond file depth

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 1.15.0
    • 1.17.0
    • None
    • None

    Description

      Consider the simple case in DRILL-7082. That ticket talks about implicit partition columns created by the wildcard. Consider a very similar case:

      SELECT a, b, c, dir0, dir1 FROM `myTable`
      

      Where myTable is a directory of CSV files, each with schema (a, b, c):

      myTable
      |- file1.csv
      |- nested
         |- file2.csv
      

      If the query is run in "stock" Drill, the planner will place both files within a single scan operator as described in DRILL-7082. The result schema will be:

      (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT)
      

      Notice that last column: why is "dir1" a (nullable) INT? The partition mechanism only recognizes partitions that actually exist, leaving the Project operator to fill in (with a Nullable INT) any partitions that don't exist (any directory levels not actually seen by the scan operator.)

      Now, using the same trick as in DRILL-7082, try the query

      SELECT a, b, c, dir0 FROM `myTable`
      

      Again, the trick causes Drill to read each file in a separate scan operator (simulating what happens when queries run at scale.)

      The scan operator for file1.csv will see no partitions, so it will omit "dir0" and the Project operator will helpfully fill in a Nullable INT. The scan operator for file2.csv sees one level of partition, so sets dir0 to nested as a Nullable VARCHAR.

      What does the client see? Two records: one with "dir0" as a Nullable INT, the other as a Nullable VARCHAR. Client such as JDBC and ODBC see a hard schema change between the two records.

      The two cases described above are really two versions of the same issue. Clients expect that, if they use the "dir0", "dir1", ... columns, that the type is always Nullable Varchar so that the schema stays consistent across batches.

      Attachments

        Activity

          People

            Unassigned Unassigned
            Paul.Rogers Paul Rogers
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: