Details
- Type: Bug
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- Affects Version: 1.15.0
- Fix Version: None
- Component: None
Description
Consider the simple case in DRILL-7082. That ticket describes the implicit partition columns created by a wildcard query. Now consider a very similar case:
SELECT a, b, c, dir0, dir1 FROM `myTable`
Where myTable is a directory of CSV files, each with schema (a, b, c):
myTable
|- file1.csv
|- nested
   |- file2.csv
If the query is run in "stock" Drill, the planner will place both files within a single scan operator as described in DRILL-7082. The result schema will be:
(a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT)
Notice the last column: why is "dir1" a Nullable INT? The partition mechanism only recognizes partitions that actually exist; the Project operator fills in any partition that does not exist (any directory level not actually seen by the scan operator) with a Nullable INT.
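To make the rule concrete, here is a minimal Java sketch of the typing behavior described above. The names (MinorType, resolveDirColumnType, levelsSeen) are hypothetical illustrations, not Drill's actual internal classes:

// Hypothetical sketch of the partition-column typing described above;
// none of these names are Drill's real internals.
enum MinorType { VARCHAR, INT }

class PartitionTypeSketch {
    // requestedLevel is the N in "dirN"; levelsSeen is the directory depth
    // the scan operator actually observed beneath the table root.
    static MinorType resolveDirColumnType(int requestedLevel, int levelsSeen) {
        if (requestedLevel < levelsSeen) {
            // The scan saw this level, so the column carries the real
            // directory name as a Nullable VARCHAR.
            return MinorType.VARCHAR;
        }
        // The level was never seen; the Project operator fills the column
        // in with its default for unresolved columns: Nullable INT.
        return MinorType.INT;
    }

    public static void main(String[] args) {
        // Single-scan case above: the deepest file sits one level down,
        // so dir0 resolves but dir1 never exists.
        System.out.println(resolveDirColumnType(0, 1)); // VARCHAR
        System.out.println(resolveDirColumnType(1, 1)); // INT
    }
}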
Now, using the same trick as in DRILL-7082, try the query
SELECT a, b, c, dir0 FROM `myTable`
Again, the trick causes Drill to read each file in a separate scan operator, simulating what happens when queries run at scale.
The scan operator for file1.csv sees no partitions, so it omits "dir0", and the Project operator helpfully fills it in as a Nullable INT. The scan operator for file2.csv sees one level of partitioning, so it sets "dir0" to "nested" as a Nullable VARCHAR.
What does the client see? Two records: one with "dir0" as a Nullable INT, the other as a Nullable VARCHAR. Clients such as JDBC and ODBC see a hard schema change between the two records.
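This is easy to observe from a plain JDBC client. The sketch below is a hypothetical probe, not part of this ticket: it assumes a Drillbit reachable on localhost and the myTable layout above, uses only standard JDBC calls, and re-reads the column metadata on every row to report when the type of "dir0" flips (whether the driver refreshes metadata between batches is driver-specific):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical probe for the schema change described above; assumes the
// myTable layout from this ticket and a Drillbit on localhost.
public class Dir0TypeProbe {
    public static void main(String[] args) throws SQLException {
        try (Connection conn =
                 DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs =
                 stmt.executeQuery("SELECT a, b, c, dir0 FROM `myTable`")) {
            String lastType = null;
            while (rs.next()) {
                // Re-read metadata per row: with one scan per file, dir0 may
                // arrive as INTEGER in one batch and VARCHAR in the next.
                String type = rs.getMetaData().getColumnTypeName(4);
                if (!type.equals(lastType)) {
                    System.out.println("dir0 type: " + type);
                    lastType = type;
                }
            }
        }
    }
}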
The two cases described above are really two versions of the same issue. Clients expect that, if they use the "dir0", "dir1", ... columns, the type is always Nullable VARCHAR so that the schema stays consistent across batches.
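One possible client-side mitigation (an assumption here, not necessarily the resolution that was committed) is to pin the type explicitly with a CAST, so the Project-filled NULLs and the real directory names arrive with one type:

public class StableDir0Query {
    // Hypothetical workaround, not the committed fix: casting forces dir0
    // to VARCHAR in every batch, whether the scan saw the partition or not.
    public static final String SQL =
        "SELECT a, b, c, CAST(dir0 AS VARCHAR) AS dir0 FROM `myTable`";
}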