Details
Type: New Feature
Status: In Progress
Priority: Major
Resolution: Unresolved
Description
This ticket tracks the work of defining the syntax for nested schemas, maps, arrays, and unions, and of adding that syntax to the parser. Initially, we can add stubs for the parser endpoints, which will then be fleshed out when support for each data type is actually implemented (see the other subtasks of TAJO-710).
I have a possible DDL syntax in mind for these types and would like your feedback on it. I considered simply adopting Hive's syntax, but I felt it was not the best fit for these types.
Instead of calling nested records "structs" as Hive does, I simply call them records and declare them with the same syntax used for the top-level record fields:
create table record_example (
  nested_field record (field1 int, field2 double),
  two_levels_nested record (
    inner_nested record (field3 string, field4 int),
    field5 int)
) using parquet;
For arrays, maps, and unions, I am using a syntax inspired by Scala's syntax for generics:
create table array_example (
  int_array array[int],
  record_array array[record (field1 int, field2 string)]
) using avro;

create table map_example (
  string_to_int map[string, int],
  int_to_record map[int, record (field1 string, field2 int)]
) using avro;

create table union_example (
  integers union[bit, smallint, integer, bigint]
) using parquet;
Of course, we may change the syntax when we actually implement these data types, but for now I think we should define an initial language. Once the initial syntax has stabilized, I will write a formal grammar for it.
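To make the proposal easier to discuss, here is a rough recursive-descent sketch in Python of how the type expressions above (record, array, map, union, and primitive names) could be parsed into a tree. It is only an illustration under my own assumptions, not Tajo's parser and not the formal grammar: the names `tokenize`, `TypeParser`, and `parse_type`, the tuple-based tree shape, and the rule that any other identifier is a primitive type are all hypothetical.

```python
import re

# Hypothetical token set: identifiers plus the punctuation used by the
# proposed syntax: ( ) for records, [ ] for array/map/union, and commas.
TOKEN_RE = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[\[\](),])")


def tokenize(text):
    """Split a type expression into a flat list of tokens."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class TypeParser:
    """Recursive-descent parser for a single type expression."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")
        self.pos += 1

    def next_name(self):
        token = self.peek()
        if token is None or not (token[0].isalpha() or token[0] == "_"):
            raise SyntaxError(f"expected a name, got {token!r}")
        self.pos += 1
        return token

    def parse_list(self, parse_item, closer):
        """Parse one or more comma-separated items, then the closing token."""
        items = [parse_item()]
        while self.peek() == ",":
            self.pos += 1
            items.append(parse_item())
        self.expect(closer)
        return items

    def parse_field(self):
        # A record field is "<name> <type>", as in the top-level columns.
        name = self.next_name()
        return (name, self.parse_type())

    def parse_type(self):
        name = self.next_name().lower()
        if name == "record":
            self.expect("(")
            return ("record", self.parse_list(self.parse_field, ")"))
        if name == "array":
            self.expect("[")
            element = self.parse_type()
            self.expect("]")
            return ("array", element)
        if name == "map":
            self.expect("[")
            key = self.parse_type()
            self.expect(",")
            value = self.parse_type()
            self.expect("]")
            return ("map", key, value)
        if name == "union":
            self.expect("[")
            return ("union", self.parse_list(self.parse_type, "]"))
        # Assumption: anything else is a primitive type name.
        return ("primitive", name)


def parse_type(text):
    """Parse a complete type expression and reject trailing tokens."""
    parser = TypeParser(text)
    result = parser.parse_type()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return result
```

For example, `parse_type("map[int, record (field1 string, field2 int)]")` yields a `("map", key, value)` tuple whose value is a `("record", [...])` node. Because records reuse the `name type` field form and the bracketed types all recurse through the same `parse_type` entry point, arbitrary nesting comes for free, which is the main property I would like the eventual grammar to keep.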