Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 4.0.0-alpha-2
- Fix Version/s: None
- Component/s: None
Description
Any attempt to read an ORC file (e.g., by querying an ORC table) whose metadata section contains column statistics exceeding the hardcoded 1GB limit (https://github.com/apache/orc/blob/2ff9001ddef082eaa30e21cbb034f266e0721664/java/core/src/java/org/apache/orc/impl/InStream.java#L41) fails with the following exception; a sketch of where this limit is applied follows the stack trace.
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit.
    at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:162)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readRawBytesSlowPathOneChunk(CodedInputStream.java:2940)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readBytesSlowPath(CodedInputStream.java:3021)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readBytes(CodedInputStream.java:2432)
    at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1718)
    at org.apache.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1663)
    at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1766)
    at org.apache.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1761)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2409)
    at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:6552)
    at org.apache.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:6468)
    at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:6678)
    at org.apache.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:6673)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2409)
    at org.apache.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:19586)
    at org.apache.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:19533)
    at org.apache.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:19622)
    at org.apache.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:19617)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2409)
    at org.apache.orc.OrcProto$Metadata.<init>(OrcProto.java:20270)
    at org.apache.orc.OrcProto$Metadata.<init>(OrcProto.java:20217)
    at org.apache.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:20306)
    at org.apache.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:20301)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:86)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:91)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:48)
    at org.apache.orc.OrcProto$Metadata.parseFrom(OrcProto.java:20438)
    at org.apache.orc.impl.ReaderImpl.deserializeStripeStats(ReaderImpl.java:1013)
    at org.apache.orc.impl.ReaderImpl.getVariantStripeStatistics(ReaderImpl.java:317)
    at org.apache.orc.impl.ReaderImpl.getStripeStatistics(ReaderImpl.java:1047)
    at org.apache.orc.impl.ReaderImpl.getStripeStatistics(ReaderImpl.java:1034)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:1679)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.callInternal(OrcInputFormat.java:1557)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.access$2900(OrcInputFormat.java:1342)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1529)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator$1.run(OrcInputFormat.java:1526)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1526)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:1342)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
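For context, the failure comes from the size cap that ORC applies to protobuf's CodedInputStream when deserializing the metadata section. Below is a minimal sketch of that pattern, not the actual reader code: the constant value mirrors the linked InStream.java line, while the class and the parseMetadata helper are hypothetical. The real call path goes through ReaderImpl.deserializeStripeStats, as in the stack trace above.

import com.google.protobuf.CodedInputStream;
import org.apache.orc.OrcProto;

import java.io.IOException;
import java.io.InputStream;

public class MetadataParseSketch {
  // Hardcoded cap taken from org.apache.orc.impl.InStream (1 GiB). Any protobuf
  // message larger than this makes CodedInputStream throw
  // InvalidProtocolBufferException ("Protocol message was too large ...").
  private static final int PROTOBUF_MESSAGE_MAX_LIMIT = 1024 << 20;

  // Hypothetical helper illustrating the parse pattern for the metadata section.
  static OrcProto.Metadata parseMetadata(InputStream in) throws IOException {
    CodedInputStream coded = CodedInputStream.newInstance(in);
    // This is the limit the exception message refers to; statistics sections
    // above 1 GiB cannot be read back unless this value is raised.
    coded.setSizeLimit(PROTOBUF_MESSAGE_MAX_LIMIT);
    return OrcProto.Metadata.parseFrom(coded);
  }
}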
There are various ways of creating such a file, and once it exists it can no longer be read back. A complete reproducer of the problem using Hive is attached as the orc_large_column_metadata.q file.
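For illustration only, here is a hedged sketch of one way oversized string statistics could be produced with the ORC core writer API. The class name, output path, and value sizes are arbitrary, and whether the resulting statistics actually cross the 1GB threshold depends on the ORC version and writer configuration; the attached .q file remains the authoritative reproducer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class LargeColumnStatsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    TypeDescription schema = TypeDescription.fromString("struct<s:string>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/large_column_stats.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));

    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector col = (BytesColumnVector) batch.cols[0];
    byte[] huge = new byte[64 * 1024 * 1024];   // 64 MB per string value
    for (int row = 0; row < 32; row++) {
      huge[0] = (byte) row;                     // keep values distinct so min/max differ
      col.setRef(batch.size, huge, 0, huge.length);
      batch.size++;
      writer.addRowBatch(batch);                // very long values inflate the string
      batch.reset();                            // min/max kept in the column statistics
    }
    writer.close();
  }
}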
Reproducible on current master (2031af314e70f3b8e07add13cb65416c29956181) by running:
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=orc_large_column_metadata.q
Increase the Java heap accordingly (e.g., -Xmx8g) when running the test to avoid hitting an OOM before the actual error.
Attachments
Issue Links
- is caused by
  - ORC-1361 InvalidProtocolBufferException when reading large stripe statistics (Open)
- relates to
  - HIVE-11268 java.io.IOException: com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. (Open)
  - HIVE-11592 ORC metadata section can sometimes exceed protobuf message size limit (Closed)