[TAJO-710] Add support for nested schemas and non-scalar types - ASF JIRA

Details

Type: New Feature
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Data Type
Labels: None

Description

Add support for nested schemas and non-scalar types (maps, arrays, enums, and unions). Here are some ways other systems handle nested schemas:

Pig and Hive use complex data types, such as bags, structs, and arrays.
Impala doesn't support nested schemas or non-scalar data types (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_unsupported.html) and disallows complex types in its Parquet support (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_parquet.html).
Presto also does not support non-scalar types (http://prestodb.io/docs/current/language/types.html).

From the discussion in ~~TAJO-30~~:

I have a plan for nested schemas. Currently, Tajo supports only a flat schema, like a relational DBMS. So even if Tajo is extended to a nested data model, compatibility will not break.

I'm thinking that Tajo should adopt the Parquet data model (equivalent to that of protobuf or BigQuery). When considering a nested data model, I had two main points in mind, and the Parquet data model satisfies both. The first point is the processing model for nested data. The Parquet data model is the same as BigQuery's, and BigQuery has already defined a concrete processing model, including flattening, cross products over repeated fields, and aggregation over repeated fields [1][2]. The second point is the file format. Parquet is a native file format for this model and already includes an efficient record assembly method. Besides, Parquet is already mature and widely used in many systems.

[1] http://research.google.com/pubs/pub36632.html
[2] https://developers.google.com/bigquery/docs/data

I'm thinking that we need three stages for this work. First, we can start with a small change to improve our schema system. Then, we will add a physical operator that simply flattens one nested row into a number of flat rows. Finally, we will solve some query optimization issues, such as projection/filter pushdown on nested schemas, and add physical operators that process nested rows directly.

If you have any ideas, feel free to share them with us.

Thanks,
Hyunsik
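
The second stage described above (flattening one nested row into a number of flat rows) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tajo's operator: rows are modeled as Python dicts, nested records as dicts, and repeated fields as lists. The function name `flatten`, the dotted column naming, and the choice to emit a single NULL for an empty repeated field are all hypothetical.

```python
from itertools import product
from typing import Any, Dict, List


def flatten(row: Dict[str, Any], prefix: str = "") -> List[Dict[str, Any]]:
    """Expand one nested row into a list of flat rows (hypothetical helper).

    Nested records (dicts) become dotted column names. Repeated fields
    (lists) multiply the output, one row per element. Sibling repeated
    fields are combined as a cross product.
    """
    # Each field contributes a list of partial flat rows.
    partials: List[List[Dict[str, Any]]] = []
    for name, value in row.items():
        column = f"{prefix}{name}"
        if isinstance(value, dict):
            partials.append(flatten(value, prefix=column + "."))
        elif isinstance(value, list):
            expanded: List[Dict[str, Any]] = []
            for element in value:
                if isinstance(element, dict):
                    expanded.extend(flatten(element, prefix=column + "."))
                else:
                    expanded.append({column: element})
            # Assumption: an empty repeated field yields one NULL value
            # so that the parent row is not dropped.
            partials.append(expanded or [{column: None}])
        else:
            partials.append([{column: value}])

    # Cross product over the per-field partial rows.
    rows: List[Dict[str, Any]] = []
    for combination in product(*partials):
        merged: Dict[str, Any] = {}
        for part in combination:
            merged.update(part)
        rows.append(merged)
    return rows


if __name__ == "__main__":
    nested = {"id": 1, "name": {"first": "a"}, "tags": ["x", "y"]}
    for flat in flatten(nested):
        print(flat)
```

A real operator would work on typed tuples and stream its output instead of materializing a list, and the exact semantics for multiple repeated fields would need to follow whatever processing model is finally chosen.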

This ticket may need to be broken up into multiple sub-tasks. Each sub-task will involve defining an extension to the query language to support the data type, implementing the new data type, then adding support for the data type in each of the storage types. I have opened tickets for each of these four tasks but not as subtasks because it is very likely that each of these tasks will have subtasks of their own:

TAJO-721: Adding support for nested records
TAJO-722: Adding support for maps
TAJO-723: Adding support for arrays
TAJO-724: Adding support for unions
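
To make the four complex types listed above more concrete, here is a minimal sketch of how a nested schema could be modeled as a recursive type tree. None of this reflects Tajo's actual schema classes: the class names (`Scalar`, `Array`, `Map`, `Record`, `UnionType`), the `render` helper, and the `RECORD<...>` / `ARRAY<...>` notation it prints are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Scalar:
    name: str  # illustrative scalar type name, e.g. "INT4" or "TEXT"


@dataclass(frozen=True)
class Array:
    element: "DataType"


@dataclass(frozen=True)
class Map:
    key: "DataType"
    value: "DataType"


@dataclass(frozen=True)
class Record:
    # Ordered (field name, field type) pairs; a record may nest any type.
    fields: Tuple[Tuple[str, "DataType"], ...]


@dataclass(frozen=True)
class UnionType:
    alternatives: Tuple["DataType", ...]


DataType = Union[Scalar, Array, Map, Record, UnionType]


def render(t: DataType) -> str:
    """Print a type tree in a hypothetical angle-bracket notation."""
    if isinstance(t, Scalar):
        return t.name
    if isinstance(t, Array):
        return f"ARRAY<{render(t.element)}>"
    if isinstance(t, Map):
        return f"MAP<{render(t.key)}, {render(t.value)}>"
    if isinstance(t, Record):
        inner = ", ".join(f"{name}: {render(ft)}" for name, ft in t.fields)
        return f"RECORD<{inner}>"
    if isinstance(t, UnionType):
        return "UNION<" + ", ".join(render(a) for a in t.alternatives) + ">"
    raise TypeError(f"unknown type node: {t!r}")


if __name__ == "__main__":
    schema = Record((
        ("id", Scalar("INT4")),
        ("tags", Array(Scalar("TEXT"))),
        ("attrs", Map(Scalar("TEXT"), UnionType((Scalar("INT4"), Scalar("TEXT"))))),
    ))
    print(render(schema))
```

Because every node can contain other nodes, a flat schema is just the special case of a `Record` whose fields are all `Scalar`, which is consistent with the point that extending to a nested model need not break compatibility.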

Support for the enum type can also be considered, but it is lower priority than the other four complex types. Neither Hive nor Pig currently has an enum type (even though storage formats such as Avro and Parquet do), and I believe both simply convert enum values to strings.

Issue Links

requires

TAJO-724 Add support for the union data type (Status: Open)
TAJO-721 Add support for nested tuple or struct type (Status: In Progress)
TAJO-722 Add support for the map data type (Status: In Progress)
TAJO-723 Add support for the array data type (Status: In Progress)
TAJO-809 Language extension for non-scalar types (Status: In Progress)

People

Assignee: Hyunsik Choi
Reporter: David Chen
Votes: 1
Watchers: 7

Dates

Created: 25/Mar/14 14:30
Updated: 03/Feb/15 08:17