HIVE-2950: Hive should store the full table schema in partition storage descriptors


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem

    Description

      Hive tables have a schema, which is copied into the partition storage descriptor when a partition is added. Currently, only the columns stored in the table storage descriptor are copied; columns reported by the serde are not. Instead of copying only the table storage descriptor columns into the partition columns, the full table schema should be copied.
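
      As a rough illustration only (not the actual patch), the add-partition path could build the partition columns from the full table schema along these lines, using the metastore Thrift objects shown in the output below. The helper method and the way the deserializer is obtained are assumptions for the sake of the sketch; MetaStoreUtils.getFieldsFromDeserializer is, if I recall its signature correctly, the metastore utility that asks a serde for its columns.

      import java.util.ArrayList;
      import java.util.List;

      import org.apache.hadoop.hive.metastore.MetaStoreUtils;
      import org.apache.hadoop.hive.metastore.api.FieldSchema;
      import org.apache.hadoop.hive.metastore.api.Partition;
      import org.apache.hadoop.hive.metastore.api.Table;
      import org.apache.hadoop.hive.serde2.Deserializer;

      public class FullSchemaCopySketch {
        // Hypothetical helper: copy the full table schema (declared or
        // serde-reported columns, plus partition keys) into the partition's
        // storage descriptor instead of only tbl.getSd().getCols().
        public static void copyFullSchema(Table tbl, Deserializer serde, Partition part)
            throws Exception {
          List<FieldSchema> fullSchema = new ArrayList<FieldSchema>();
          if (!tbl.getSd().getCols().isEmpty()) {
            // Columns listed explicitly in the DDL.
            fullSchema.addAll(tbl.getSd().getCols());
          } else {
            // Columns the serde reports by reflecting on its serialization class.
            fullSchema.addAll(
                MetaStoreUtils.getFieldsFromDeserializer(tbl.getTableName(), serde));
          }
          // Partition columns are part of the table schema as well.
          fullSchema.addAll(tbl.getPartitionKeys());
          part.getSd().setCols(fullSchema);
        }
      }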

      DETAILS

      This is a little long, but it is necessary to show three things: the current behavior when columns are listed explicitly, the behavior with HIVE-2941 patched in and serde-reported columns, and finally the behavior with this patch (the full table schema copied into the partition storage descriptor).

      Here's an example of what currently happens. Note the following:

      • the two manually defined fields for the table are listed in the table storage descriptor.
      • both fields are present in the partition storage descriptor

      This works great because users who query for a partition can look at its storage descriptor and get the schema.

      hive> create external table foo_test (name string, age int) partitioned by (part_dt string);
      hive> describe extended foo_test;
      OK
      name	string	
      age	int	
      part_dt	string	
      	 	 
      Detailed Table Information	Table(tableName:foo_test, dbName:travis_test, owner:travis, createTime:1334256062, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/warehouse/travis_test.db/foo_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
      Time taken: 0.082 seconds
      
      hive> alter table foo_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
      hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
      OK
      name	string	
      age	int	
      part_dt	string	
      	 	 
      Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:foo_test, createTime:1334256131, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334256131})	
      
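      To make the "look at its storage descriptor" point concrete, a metastore client can read that schema programmatically. A minimal sketch, assuming the HiveConf on the classpath points at the metastore used in the session above:

      import java.util.Arrays;

      import org.apache.hadoop.hive.conf.HiveConf;
      import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
      import org.apache.hadoop.hive.metastore.api.FieldSchema;
      import org.apache.hadoop.hive.metastore.api.Partition;

      public class ShowPartitionSchema {
        public static void main(String[] args) throws Exception {
          HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
          Partition part = client.getPartition(
              "travis_test", "foo_test", Arrays.asList("20120331T000000Z"));
          // For the table above this prints name/string, age/int and part_dt/string.
          for (FieldSchema field : part.getSd().getCols()) {
            System.out.println(field.getName() + "\t" + field.getType());
          }
          client.close();
        }
      }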

      CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN

      Now let's examine what happens when creating a table when the serde reports the schema. Notice the following:

      • The table storage descriptor contains an empty list of columns. However, the table schema is still available because the serde reflects on the serialization class.
      • The partition storage descriptor does contain a single "part_dt" column that was copied from the table partition keys. The actual data columns are not present.
      hive> create external table travis_test.person_test partitioned by (part_dt string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
      OK
      Time taken: 0.08 seconds
      hive> describe extended person_test;
      OK
      name	struct<first_name:string,last_name:string>	from deserializer
      id	int	from deserializer
      email	string	from deserializer
      phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
      part_dt	string	
      	 	 
      Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
      Time taken: 0.147 seconds
      hive> alter table person_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00'; 
      OK
      Time taken: 0.149 seconds
      hive> describe extended person_test partition (part_dt = '20120331T000000Z');
      OK
      part_dt	string	
      	 	 
      Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:person_test, createTime:1334257029, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257029})	
      Time taken: 0.106 seconds
      hive> 
      
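      With the behavior above, the same storage-descriptor read returns only part_dt, so a consumer that wants the data columns has to fall back to the table's serde itself. A hedged sketch of that fallback, assuming the MetaStoreUtils helpers behave as described and the elephant-bird serde and serialization class are on the classpath:

      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;

      import org.apache.hadoop.hive.conf.HiveConf;
      import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
      import org.apache.hadoop.hive.metastore.MetaStoreUtils;
      import org.apache.hadoop.hive.metastore.api.FieldSchema;
      import org.apache.hadoop.hive.metastore.api.Partition;
      import org.apache.hadoop.hive.metastore.api.Table;
      import org.apache.hadoop.hive.serde2.Deserializer;

      public class PartitionSchemaFallback {
        public static void main(String[] args) throws Exception {
          HiveConf conf = new HiveConf();
          HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
          Table table = client.getTable("travis_test", "person_test");
          Partition part = client.getPartition(
              "travis_test", "person_test", Arrays.asList("20120331T000000Z"));

          // Today this prints only the part_dt column.
          System.out.println("partition sd cols: " + part.getSd().getCols());

          // Fallback: reconstruct the data columns by asking the table's serde,
          // which reflects on com.twitter.elephantbird.examples.thrift.Person.
          Deserializer serde = MetaStoreUtils.getDeserializer(conf, table);
          List<FieldSchema> fullSchema = new ArrayList<FieldSchema>(
              MetaStoreUtils.getFieldsFromDeserializer(table.getTableName(), serde));
          fullSchema.addAll(table.getPartitionKeys());
          for (FieldSchema field : fullSchema) {
            System.out.println(field.getName() + "\t" + field.getType());
          }
          client.close();
        }
      }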

      PROPOSED BEHAVIOR

      I believe the correct thing to do is copy the full table schema (serde-reported columns + partition keys) into the partition storage descriptor. Notice the following:

      • The table storage descriptor does not contain any columns, because they are reported by the serde.
      • The partition storage descriptor now contains the full table schema: the serde-reported columns plus the partition key.
      hive> create external table travis_test.person_test partitioned by (part_dt string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
      OK
      Time taken: 0.076 seconds
      hive> describe extended person_test;
      OK
      name	struct<first_name:string,last_name:string>	from deserializer
      id	int	from deserializer
      email	string	from deserializer
      phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
      part_dt	string	
      	 	 
      Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
      Time taken: 0.155 seconds
      hive> alter table person_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
      OK
      Time taken: 0.296 seconds
      hive> describe extended person_test partition (part_dt = '20120331T000000Z');
      OK
      name	struct<first_name:string,last_name:string>	from deserializer
      id	int	from deserializer
      email	string	from deserializer
      phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
      part_dt	string	
      	 	 
      Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:person_test, createTime:1334257504, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:struct<first_name:string,last_name:string>, comment:from deserializer), FieldSchema(name:id, type:int, comment:from deserializer), FieldSchema(name:email, type:string, comment:from deserializer), FieldSchema(name:phones, type:array<struct<number:string,type:struct<value:int>>>, comment:from deserializer), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257504})	
      Time taken: 0.133 seconds
      hive> 
      
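      With the full schema copied into the partition storage descriptor, the fallback above is no longer needed: the plain storage-descriptor read is enough. Again just a sketch against the assumed metastore from the session:

      import java.util.Arrays;

      import org.apache.hadoop.hive.conf.HiveConf;
      import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
      import org.apache.hadoop.hive.metastore.api.FieldSchema;

      public class ReadFullPartitionSchema {
        public static void main(String[] args) throws Exception {
          HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
          // With this change the cols list holds name, id, email, phones and part_dt.
          for (FieldSchema field : client.getPartition(
              "travis_test", "person_test",
              Arrays.asList("20120331T000000Z")).getSd().getCols()) {
            System.out.println(field.getName() + "\t" + field.getType());
          }
          client.close();
        }
      }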

            People

              Assignee: Travis Crawford
              Reporter: Travis Crawford
              Votes: 0
              Watchers: 2
