Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
In Spark, partitioned parquet output is written with directories like:
/column1=1
    /column2=hello
        /data.parquet
    /column2=world
        /moredata.parquet
/column1=2
However, when querying these files with Drill, the directory names are interpreted as plain strings, when they are really column name and value pairs. The data files themselves contain only the remaining columns. As a result, the partition columns can hold only a couple of data types (far fewer than Spark/Parquet supports) if operations on them are still to be correct.
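To make the requested interpretation concrete, here is a minimal Python sketch of reading `name=value` directory segments as typed columns. It is only an illustration, not Drill's or Spark's implementation: the function names `infer_value` and `parse_partition_path` are hypothetical, and the int/float/string inference is a deliberately naive assumption.

```python
from pathlib import PurePosixPath


def infer_value(raw):
    """Naive type inference (an assumption): try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_partition_path(path):
    """Split a file path into (partition columns, file name).

    Each directory segment of the form name=value becomes a column;
    segments without '=' are ignored.
    """
    parts = PurePosixPath(path).parts
    columns = {}
    for segment in parts[:-1]:  # every segment except the file name
        if "=" in segment:
            name, _, value = segment.partition("=")
            columns[name] = infer_value(value)
    return columns, parts[-1]


columns, file_name = parse_partition_path("/column1=1/column2=hello/data.parquet")
print(columns, file_name)
```

With the layout above, `column1` would come back as an integer and `column2` as a string, rather than both being opaque directory-name strings.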
Given the size of the data, I don't want to have to CTAS all the parquet files (especially as they are being periodically updated).
I think this would also be a nice addition for general file directory reads. Many people already encode meaning into their directory structure, but having self-describing directories is even better.