Description
If there are too many small stripes and with many columns, the overhead for storing metadata (column stats) can exceed the default protobuf message size of 64MB. Reading such files will throw the following exception
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110)
at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755)
at com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811)
at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4887)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4803)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985)
at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12925)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12872)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956)
at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13599)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13546)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.<init>(ReaderImpl.java:468)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:314)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The only solution for this is to programmatically increase the CodeInputStream size limit. We should make this configurable via hive config so that the orc file is readable. Alternatively, we can keep increasing the size until it parsing succeeds.
Attachments
Attachments
Issue Links
- is related to
-
HIVE-26987 InvalidProtocolBufferException when reading column statistics from ORC files
- Open
-
ORC-1361 InvalidProtocolBufferException when reading large stripe statistics
- Resolved
-
SPARK-19109 ORC metadata section can sometimes exceed protobuf message size limit
- Resolved