HIVE-28450: Follow the JVM's array size limit in Hive transferable objects


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.3, 4.0.0
    • Fix Version/s: 4.1.0
    • Component/s: Metastore
    • Labels: None

    Description

      We are experiencing an issue with a partitioned table in Hive. Querying the table via the Hive CLI works as expected, but querying the same table through Spark produces the following error in the HMS logs:

      2024-01-30 23:03:59,052 main DEBUG org.apache.logging.log4j.core.util.SystemClock does not support precise timestamps.
      Exception in thread "pool-7-thread-4" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
      	at java.util.Arrays.copyOf(Arrays.java:3236)
      	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
      	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
      	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
      	at org.apache.thrift.transport.TSaslTransport.write(TSaslTransport.java:473)
      	at org.apache.thrift.transport.TSaslServerTransport.write(TSaslServerTransport.java:42)
      	at org.apache.thrift.protocol.TBinaryProtocol.writeString(TBinaryProtocol.java:227)
      	at org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.write(FieldSchema.java:517)
      	at org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.write(FieldSchema.java:456)
      	at org.apache.hadoop.hive.metastore.api.FieldSchema.write(FieldSchema.java:394)
      	at org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.write(StorageDescriptor.java:1423)
      	at org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.write(StorageDescriptor.java:1250)
      	at org.apache.hadoop.hive.metastore.api.StorageDescriptor.write(StorageDescriptor.java:1116)
      	at org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.write(Partition.java:1033)
      	at org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.write(Partition.java:890)
      	at org.apache.hadoop.hive.metastore.api.Partition.write(Partition.java:786)
      	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.write(ThriftHiveMetastore.java)
      	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.write(ThriftHiveMetastore.java)
      	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result.write(ThriftHiveMetastore.java)
      	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:58)
      	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
      	at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:603)
      	at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:600)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
      	at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:600)
      	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:313)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:750)
      Exception in thread "pool-7-thread-6" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
      Exception in thread "pool-7-thread-9" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
      

      This error appears to be caused by the JVM's limit on array sizes: some VMs reserve header words in an array, so attempts to allocate arrays with close to Integer.MAX_VALUE elements fail with "OutOfMemoryError: Requested array size exceeds VM limit". The JDK's own buffering code therefore caps its internal buffers slightly below Integer.MAX_VALUE; for reference, see java.io.InputStream: https://github.com/openjdk/jdk/blob/0e0dfca21f64ecfcb3e5ed7cdc2a173834faa509/src/java.base/share/classes/java/io/InputStream.java#L307-L313
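
      The limit can be reproduced in isolation. A minimal sketch, assuming a 64-bit HotSpot JVM (with a very small heap you may see "Java heap space" instead, since heap exhaustion can trip first):

      public class ArrayLimitDemo {
          public static void main(String[] args) {
              // HotSpot caps array lengths slightly below Integer.MAX_VALUE
              // (roughly MAX_VALUE - 2), so this allocation is rejected with
              // "Requested array size exceeds VM limit" even when enough heap
              // is available.
              byte[] buffer = new byte[Integer.MAX_VALUE];
              System.out.println(buffer.length);
          }
      }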

      Spark has already implemented a similar limit on its side; it would be good to add the same guard to Hive - https://github.com/apache/spark/blob/e5a5921968c84601ce005a7785bdd08c41a2d862/common/utils/src/main/scala/org/apache/spark/unsafe/array/ByteArrayUtils.java
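
      For illustration, a minimal sketch of what such a guard could look like on the Hive side (the class and method names are hypothetical; the constant mirrors the JDK's Integer.MAX_VALUE - 8 soft cap):

      public final class JvmArrayLimits {
          // Some VMs reserve header words in an array, so the practical maximum
          // array length is slightly below Integer.MAX_VALUE; the JDK caps its
          // buffers at Integer.MAX_VALUE - 8.
          public static final int MAX_JVM_ARRAY_SIZE = Integer.MAX_VALUE - 8;

          private JvmArrayLimits() {}

          // Grow a buffer capacity toward minRequired, clamping at the VM limit
          // so callers can fail (or split the payload) gracefully instead of
          // dying with "Requested array size exceeds VM limit" deep inside Thrift.
          public static int grownCapacity(int currentCapacity, int minRequired) {
              if (minRequired < 0 || minRequired > MAX_JVM_ARRAY_SIZE) {
                  throw new IllegalArgumentException(
                      "Required buffer size exceeds the JVM array limit: " + minRequired);
              }
              long preferred = Math.max((long) currentCapacity * 2, (long) minRequired);
              return (int) Math.min(preferred, (long) MAX_JVM_ARRAY_SIZE);
          }
      }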

      Workaround:
      As a temporary mitigation, I have been able to avoid the error by setting the hive.metastore.batch.retrieve.table.partition.max configuration to a lower value, so that fewer partitions are serialized into each metastore response.
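
      For example, in hive-site.xml (the value below is illustrative; choose a batch size small enough that each response stays well below the array limit):

      <property>
        <name>hive.metastore.batch.retrieve.table.partition.max</name>
        <value>300</value>
      </property>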

    Attachments

      1. image-2024-11-08-12-49-49-844.png (145 kB, Sercan Tekin)

    People

      Assignee: Unassigned
      Reporter: Sercan Tekin
      Votes: 0
      Watchers: 2
