Details
- Type: Bug
- Priority: Major
- Status: Resolved
- Resolution: Fixed
- Fix Version/s: 4.0.0-beta-1
- None
Description
Query: select * from <table_name>
Explanation:
When the above query runs on a Hive proto table, multiple Tez containers are spawned to process the data. If a container reads multiple HDFS splits and the combined size of the decompressed data exceeds 2GB, the query fails with the following error:
"While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length."
This happens inside CodedInputStream, in the statement byteLimit += totalBytesRetired + pos;
byteLimit hits an integer overflow because totalBytesRetired keeps a running count of every byte read: the CodedInputStream is initialized once per container, see https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96 .
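The overflow can be reproduced with plain int arithmetic. The values below are hypothetical, chosen only so that the sum exceeds Integer.MAX_VALUE the way a container that has already consumed roughly 2.2GB would:

```java
public class ByteLimitOverflow {
    public static void main(String[] args) {
        // Hypothetical state after many splits in the same container:
        int totalBytesRetired = 2_000_000_000; // bytes consumed in earlier buffer refills
        int pos = 200_000_000;                 // position within the current buffer
        int byteLimit = 100;                   // limit requested for the next message

        // Mirrors the CodedInputStream statement: byteLimit += totalBytesRetired + pos;
        byteLimit += totalBytesRetired + pos;  // wraps past Integer.MAX_VALUE

        // The limit is now negative, so subsequent reads think the stream
        // ended early, producing the "input ended unexpectedly" error.
        System.out.println(byteLimit);
    }
}
```

Because Java int arithmetic wraps silently, the corrupted limit is only noticed later, when the parser compares positions against it.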
This differs from the issue reproduced in https://github.com/zabetak/protobuf-large-message: there, a single proto data file exceeds 2GB, whereas in this case multiple files together exceed 2GB.
CC zabetak
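For illustration, here is a plain-Java sketch of why resetting the retired-byte counter at a split boundary avoids the overflow. CodedInputStream does expose a resetSizeCounter() method for this purpose, but the CounterModel class below is a simplified stand-in for its bookkeeping, not the actual protobuf implementation:

```java
// Simplified model of CodedInputStream's limit bookkeeping.
class CounterModel {
    int totalBytesRetired; // bytes consumed in earlier buffer refills
    int pos;               // position within the current buffer

    // Mirrors pushLimit(): the stored limit is absolute from stream start.
    int pushLimit(int byteLimit) {
        byteLimit += totalBytesRetired + pos; // overflows once ~2GB is consumed
        return byteLimit;
    }

    // Mirrors resetSizeCounter(): forget how many bytes were already read,
    // so future limits are computed relative to the current position.
    void resetSizeCounter() {
        totalBytesRetired = -pos;
    }
}

public class ResetDemo {
    public static void main(String[] args) {
        CounterModel s = new CounterModel();
        s.totalBytesRetired = 2_000_000_000; // hypothetical: ~2GB already consumed
        s.pos = 200_000_000;

        System.out.println(s.pushLimit(100)); // negative: the overflow from the bug

        s.resetSizeCounter();                 // e.g. done at a split boundary
        System.out.println(s.pushLimit(100)); // back to a small positive limit
    }
}
```

Resetting per split keeps the running count bounded by the size of a single split, which is why the single-file >2GB case (protobuf issue 11729) remains out of reach of this approach.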
Limitation:
This fix still does not resolve the issue mentioned in https://github.com/protocolbuffers/protobuf/issues/11729
Here is the query used to reproduce the issue:
beeline -u 'jdbc:hive2://hostnames/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;thrift.client.max.message.size=2147483647' --showHeader=false --outputformat=tsv2 -e "select * from raaggarw.proto_hive_query_data where executionmode='MR' and otherinfo['CONF'] != 'NULL'" >> ./output
Issue Links
- is related to: TEZ-4540 Reading proto data more than 2GB from multiple splits fails (Resolved)