Details
- Type: Bug
- Priority: Major
- Status: Resolved
- Resolution: Fixed
- Fix Version/s: 4.0.0-beta-1
- None
Description
Query: select * from <table_name>
Explanation:
When the above query runs on a Hive proto table, multiple Tez containers are spawned to process the data. If a container reads multiple HDFS splits and the combined size of the decompressed data exceeds 2GB, the query fails with the following error:
"While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length."
This happens inside CodedInputStream, in the statement byteLimit += totalBytesRetired + pos;
byteLimit hits an integer overflow because totalBytesRetired keeps a running count of every byte read: the CodedInputStream is initialized once per container, see https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96 .
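The overflow can be reproduced with plain int arithmetic. The values below are hypothetical, chosen only so that the sum exceeds Integer.MAX_VALUE the way a container that has already consumed roughly 2.2GB would:

```java
public class ByteLimitOverflow {
    public static void main(String[] args) {
        // Hypothetical state after many splits in the same container:
        int totalBytesRetired = 2_000_000_000; // bytes consumed in earlier buffer refills
        int pos = 200_000_000;                 // position within the current buffer
        int byteLimit = 100;                   // limit requested for the next message

        // Mirrors the CodedInputStream statement: byteLimit += totalBytesRetired + pos;
        byteLimit += totalBytesRetired + pos;  // wraps past Integer.MAX_VALUE

        // The limit is now negative, so subsequent reads think the stream
        // ended early, producing the "input ended unexpectedly" error.
        System.out.println(byteLimit);
    }
}
```

Because Java int arithmetic wraps silently, the corrupted limit is only noticed later, when the parser compares positions against it.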
This differs from the issue reproduced in https://github.com/zabetak/protobuf-large-message: there, a single proto data file exceeds 2GB, whereas in this case multiple files together exceed 2GB.
CC zabetak
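For illustration, here is a plain-Java sketch of why resetting the retired-byte counter at a split boundary avoids the overflow. CodedInputStream does expose a resetSizeCounter() method for this purpose, but the CounterModel class below is a simplified stand-in for its bookkeeping, not the actual protobuf implementation:

```java
// Simplified model of CodedInputStream's limit bookkeeping.
class CounterModel {
    int totalBytesRetired; // bytes consumed in earlier buffer refills
    int pos;               // position within the current buffer

    // Mirrors pushLimit(): the stored limit is absolute from stream start.
    int pushLimit(int byteLimit) {
        byteLimit += totalBytesRetired + pos; // overflows once ~2GB is consumed
        return byteLimit;
    }

    // Mirrors resetSizeCounter(): forget how many bytes were already read,
    // so future limits are computed relative to the current position.
    void resetSizeCounter() {
        totalBytesRetired = -pos;
    }
}

public class ResetDemo {
    public static void main(String[] args) {
        CounterModel s = new CounterModel();
        s.totalBytesRetired = 2_000_000_000; // hypothetical: ~2GB already consumed
        s.pos = 200_000_000;

        System.out.println(s.pushLimit(100)); // negative: the overflow from the bug

        s.resetSizeCounter();                 // e.g. done at a split boundary
        System.out.println(s.pushLimit(100)); // back to a small positive limit
    }
}
```

Resetting per split keeps the running count bounded by the size of a single split, which is why the single-file >2GB case (protobuf issue 11729) remains out of reach of this approach.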
Limitation:
This fix still does not resolve the issue mentioned in https://github.com/protocolbuffers/protobuf/issues/11729
Here is the query used to reproduce the issue:
beeline -u 'jdbc:hive2://hostnames/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;thrift.client.max.message.size=2147483647' --showHeader=false --outputformat=tsv2 -e "select * from raaggarw.proto_hive_query_data where executionmode='MR' and otherinfo['CONF'] != 'NULL'" >> ./output
Issue Links
- is related to: TEZ-4540 Reading proto data more than 2GB from multiple splits fails (Resolved)