Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Drill does not appear to convert array values in Parquet records properly. I have created a simple example and will attach a small Parquet file to this issue. The records are written using the following Avro schema:
{
  "type": "record",
  "name": "Book",
  "fields": [
    { "name": "title", "type": "string" },
    { "name": "pages", "type": "int" },
    { "name": "authors", "type": { "type": "array", "items": "string" } }
  ]
}
I wrote two records using this schema into the attached Parquet file. When I then run SELECT * FROM dfs.`books.parquet`, I get the following result:
title | pages | authors |
---|---|---|
Physics of Waves | 477 | {"array":["William C. Elmore","Mark A. Heald"]} |
Foundations of Mathematical Analysis | 428 | {"array":["Richard Johnsonbaugh","W.E. Pfaffenberger"]} |
The authors column appears as a nested record with a single field named "array" rather than as a repeated value. If I change the SQL query to SELECT title,pages,t.authors.`array` FROM dfs.`/home/davek/src/drill-parquet-example/resources/books.parquet` t; then I get:
title | pages | EXPR$2 |
---|---|---|
Physics of Waves | 477 | ["William C. Elmore","Mark A. Heald"] |
Foundations of Mathematical Analysis | 428 | ["Richard Johnsonbaugh","W.E. Pfaffenberger"] |
Now the column behaves in Drill as a repeated-value column.
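For anyone who cannot change the query, the same unwrapping can be done on the client side after the rows come back. The following is only a minimal sketch in Python, not part of Drill: it assumes each row has already been decoded into a dictionary shaped like the first result table above, and the helper name unwrap_avro_arrays is hypothetical.

```python
from typing import Any, Dict, List


def unwrap_avro_arrays(row: Dict[str, Any], wrapper_key: str = "array") -> Dict[str, Any]:
    """Replace {"array": [...]} wrapper records with the bare list.

    Hypothetical helper: it only mirrors what t.authors.`array` does in SQL.
    """
    result: Dict[str, Any] = {}
    for column, value in row.items():
        # A wrapper is a record whose only field is wrapper_key and whose value is a list.
        is_wrapper = (
            isinstance(value, dict)
            and set(value) == {wrapper_key}
            and isinstance(value[wrapper_key], list)
        )
        result[column] = value[wrapper_key] if is_wrapper else value
    return result


# Rows as they appear in the SELECT * result from this report.
rows: List[Dict[str, Any]] = [
    {
        "title": "Physics of Waves",
        "pages": 477,
        "authors": {"array": ["William C. Elmore", "Mark A. Heald"]},
    },
    {
        "title": "Foundations of Mathematical Analysis",
        "pages": 428,
        "authors": {"array": ["Richard Johnsonbaugh", "W.E. Pfaffenberger"]},
    },
]

flattened = [unwrap_avro_arrays(row) for row in rows]
```

After this step, flattened[0]["authors"] is a plain list of author names, matching the shape of the second result table. Scalar columns such as title and pages pass through unchanged.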