Description
Any attempt to read the stripe statistics from an ORC file whose metadata section exceeds the hardcoded protobuf limit of 1 GB (https://github.com/apache/orc/blob/2ff9001ddef082eaa30e21cbb034f266e0721664/java/core/src/java/org/apache/orc/impl/InStream.java#L41) fails with the following exception:
com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
    at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:154)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readRawBytesSlowPathOneChunk(CodedInputStream.java:2954)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readBytesSlowPath(CodedInputStream.java:3035)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readBytes(CodedInputStream.java:2446)
    at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:2118)
    at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:2070)
    at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:3285)
    at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:3279)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2423)
    at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:8172)
    at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:8093)
    at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:10494)
    at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:10488)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2423)
    at org.apache.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:23549)
    at org.apache.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:23499)
    at org.apache.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:24247)
    at org.apache.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:24241)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2423)
    at org.apache.orc.OrcProto$Metadata.<init>(OrcProto.java:24352)
    at org.apache.orc.OrcProto$Metadata.<init>(OrcProto.java:24302)
    at org.apache.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:25048)
    at org.apache.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:25042)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:91)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
    at com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:357)
    at org.apache.orc.OrcProto$Metadata.parseFrom(OrcProto.java:24557)
    at org.apache.orc.impl.ReaderImpl.deserializeStripeStats(ReaderImpl.java:1040)
    at org.apache.orc.impl.ReaderImpl.getVariantStripeStatistics(ReaderImpl.java:325)
    at org.apache.orc.impl.ReaderImpl.getStripeStatistics(ReaderImpl.java:1074)
    at org.apache.orc.impl.ReaderImpl.getStripeStatistics(ReaderImpl.java:1061)
There are various ways to end up with an ORC file that has a large metadata section, because the write never fails.
Once the file is created, it is no longer possible to read back all the information correctly.
In versions without ORC-520 (before 1.6.0), the file cannot be read at all, because stripe statistics are read eagerly in the ReaderImpl constructor.
In versions with ORC-520 (1.6.0 onwards), the exception is raised only when the stripe statistics are explicitly requested.
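The difference between the two behaviours can be illustrated with a toy model. This is only a sketch under stated assumptions, not ORC's actual code: the class `StripeStatsLimitSketch`, its nested `EagerReader` and `LazyReader`, and the 16-byte `SIZE_LIMIT` are all hypothetical stand-ins (the real limit is the 1 GB protobuf limit, and the real reader is ReaderImpl). The writer side has no size check, so oversized metadata is only detected when it is parsed.

```java
import java.io.IOException;

public class StripeStatsLimitSketch {
    // Hypothetical stand-in for the 1 GB protobuf limit, shrunk so the demo runs instantly.
    static final int SIZE_LIMIT = 16;

    // Mimics a size-limited parser: rejects any serialized message above SIZE_LIMIT.
    static byte[] parseWithLimit(byte[] serialized) throws IOException {
        if (serialized.length > SIZE_LIMIT) {
            throw new IOException("message too large: " + serialized.length + " > " + SIZE_LIMIT);
        }
        return serialized.clone();
    }

    // Models the behaviour before ORC-520: statistics are parsed in the constructor.
    static final class EagerReader {
        final byte[] stats;

        EagerReader(byte[] metadata) throws IOException {
            this.stats = parseWithLimit(metadata);
        }
    }

    // Models the behaviour with ORC-520: statistics are parsed on first request.
    static final class LazyReader {
        private final byte[] metadata;
        private byte[] stats;

        LazyReader(byte[] metadata) {
            this.metadata = metadata;
        }

        byte[] getStripeStatistics() throws IOException {
            if (stats == null) {
                stats = parseWithLimit(metadata);
            }
            return stats;
        }
    }

    // True if the eager reader can be constructed for the given metadata.
    static boolean canOpenEagerly(byte[] metadata) {
        try {
            new EagerReader(metadata);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // True if the lazy reader can return statistics for the given metadata.
    static boolean canReadStatsLazily(byte[] metadata) {
        try {
            new LazyReader(metadata).getStripeStatistics();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] oversized = new byte[SIZE_LIMIT + 1]; // "written" without any check
        System.out.println("eager open succeeds: " + canOpenEagerly(oversized));
        System.out.println("lazy open succeeds: " + (new LazyReader(oversized) != null));
        System.out.println("lazy stats read succeeds: " + canReadStatsLazily(oversized));
    }
}
```

In this model the lazy reader can still be opened for an oversized file, which mirrors why readers with ORC-520 fail only on the explicit statistics call.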
A test case (TestOrcWithLargeStripeStatistics.java) that reproduces the problem on the current main branch (2ff9001ddef082eaa30e21cbb034f266e0721664) is attached.
Attachments
Issue Links
- causes
  - HIVE-26987 InvalidProtocolBufferException when reading column statistics from ORC files (Open)
- relates to
  - ORC-520 Fix file merging for column encryption. (Closed)
  - HIVE-11268 java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. (Open)
  - HIVE-11592 ORC metadata section can sometimes exceed protobuf message size limit (Closed)
- links to