Details
Description
Some systems (e.g. Spark) write Parquet files that annotate integral columns with logical types. Impala fails to handle these logical types when constructing a table from an existing Parquet file, although reading data from such files works fine.
For example, consider a file with the following Parquet schema:
[ec2-user@ip-172-31-61-61 ~]$ parquet-tools schema part-r-00000-a409eea5-3d4f-4172-b376-659005f65489.gz.parquet
message spark_schema {
  optional int32 id;
  optional int32 tinyint_col (INT_8);
  optional int32 smallint_col (INT_16);
  optional int32 int_col;
  optional int64 bigint_col;
}
A CREATE TABLE ... LIKE PARQUET statement against such a file fails with an error like the following:
ERROR: AnalysisException: Unsupported logical parquet type INT_8 (primitive type is INT32) for field tinyint_col
The schema conversion is performed by the convertLogicalParquetType method in the com.cloudera.impala.analysis.CreateTableLikeFileStmt class, which currently does not cover the integer logical types.
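A minimal sketch of the missing mapping, assuming a standalone helper (the class and method names below are illustrative only; the real convertLogicalParquetType returns Impala catalog types rather than SQL type-name strings, and the parquet-mr package prefix depends on the version bundled with Impala):

import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType;

public class ParquetIntegerTypeMapping {
  // Hypothetical helper: map a Parquet primitive type plus its logical
  // (converted) type annotation to the corresponding Impala column type name.
  public static String toImpalaTypeName(PrimitiveType parquetType) {
    OriginalType logical = parquetType.getOriginalType();
    if (logical == null) {
      // No annotation: fall back to the physical type.
      switch (parquetType.getPrimitiveTypeName()) {
        case INT32: return "INT";
        case INT64: return "BIGINT";
        default: return null;  // non-integer physical types handled elsewhere
      }
    }
    switch (logical) {
      case INT_8:  return "TINYINT";   // INT32 annotated as 8-bit signed
      case INT_16: return "SMALLINT";  // INT32 annotated as 16-bit signed
      case INT_32: return "INT";
      case INT_64: return "BIGINT";
      default:     return null;        // unsupported logical type -> analysis error
    }
  }
}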
See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#numeric-types for information about the mapping between logical types and encodings.
We should implement read and write support for this metadata, i.e. allow correct round-tripping of tinyint and smallint types.
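For reference, a round-tripped file would need to carry the same annotations shown in the schema above. A minimal sketch using the parquet-mr schema builder illustrates the metadata involved (package names depend on the parquet-mr version, and Impala's own Parquet writer is implemented in the C++ backend rather than through this API):

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class AnnotatedSchemaExample {
  public static void main(String[] args) {
    // Build a schema equivalent to the Spark-written file above:
    // INT32 columns annotated with INT_8 / INT_16 so readers can
    // reconstruct TINYINT / SMALLINT column types.
    MessageType schema = Types.buildMessage()
        .optional(PrimitiveTypeName.INT32).named("id")
        .optional(PrimitiveTypeName.INT32).as(OriginalType.INT_8).named("tinyint_col")
        .optional(PrimitiveTypeName.INT32).as(OriginalType.INT_16).named("smallint_col")
        .optional(PrimitiveTypeName.INT32).named("int_col")
        .optional(PrimitiveTypeName.INT64).named("bigint_col")
        .named("spark_schema");
    System.out.println(schema);
  }
}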