HUDI-4205

Reading metadata table on S3 using Spark throws NullPointerException during createHFileReader



    Description

      Environment: EMR 6.6.0, OSS Spark 3.2.1, Hudi master

      Storage: S3
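
      For reference, a minimal sketch of one way such a table could have been written (table name, schema, and path are placeholders; in Hudi 0.11.x the metadata table is enabled by default via hoodie.metadata.enable=true and uses HFile base files):

      import spark.implicits._

      // Placeholders: table name, schema, and <base_path> are illustrative only.
      val df = Seq((1, "a", "2022-06-01"), (2, "b", "2022-06-01")).toDF("id", "name", "dt")
      df.write.format("hudi").
        option("hoodie.table.name", "test_table").
        option("hoodie.datasource.write.recordkey.field", "id").
        option("hoodie.datasource.write.precombine.field", "name").
        option("hoodie.datasource.write.partitionpath.field", "dt").
        mode("overwrite").
        save("s3a://<base_path>")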

      When loading the metadata table in the Spark shell with the following code, the query throws a NullPointerException. In this case, the metadata table's base files are in HFile format.

      This also happens with the following combinations: (1) Spark 3.1.3, Hudi 0.11.0; (2) Spark 3.2.1, Hudi 0.11.0.

      spark.read.format("hudi").load("s3a://<base_path>/.hoodie/metadata/").show 

       

      Caused by: java.lang.NullPointerException
        at org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:178)
        at org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:167)
        at org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:163)
        at org.apache.hudi.HoodieBaseRelation$.$anonfun$createHFileReader$1(HoodieBaseRelation.scala:531)
        at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$1(HoodieBaseRelation.scala:482)
        at org.apache.hudi.HoodieMergeOnReadRDD.readBaseFile(HoodieMergeOnReadRDD.scala:130)
        at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:100)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750) 
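
      The trace ends in the shaded HBase CacheConfig constructor chain, which is consistent with a null Hadoop Configuration reaching the HFile reader on the executor. A hypothetical illustration of that failure mode (hbase.block.data.cacheonread is one of the cache flags CacheConfig reads; the null conf is an assumption, not a confirmed root cause):

      import org.apache.hadoop.conf.Configuration

      // Hypothetical illustration only: a null Configuration fails on the first
      // conf.getBoolean(...) call, matching the CacheConfig.<init> frames above.
      val conf: Configuration = null
      conf.getBoolean("hbase.block.data.cacheonread", true)  // throws NullPointerException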

      Spark shell:

      ./bin/spark-shell  \
           --master yarn \
           --deploy-mode client \
           --driver-memory 20g \
           --executor-memory 20g \
           --num-executors 2 \
           --executor-cores 8 \
           --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
           --conf spark.kryoserializer.buffer=256m \
           --conf spark.kryoserializer.buffer.max=1024m \
           --jars /home/hadoop/hudi-spark3.2-bundle_2.12-0.12.0-SNAPSHOT.jar \
           --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
           --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
           --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' 
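
      A quick way to confirm that the metadata table's base files are HFiles before running the failing read (same placeholder path as above; metadata table base files carry the ".hfile" extension):

      import org.apache.hadoop.fs.Path

      // List the metadata table's "files" partition; base files should end in ".hfile".
      val metadataFiles = new Path("s3a://<base_path>/.hoodie/metadata/files")
      val fs = metadataFiles.getFileSystem(spark.sparkContext.hadoopConfiguration)
      fs.listStatus(metadataFiles).foreach(st => println(st.getPath.getName))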



    People

      Assignee: guoyihua Ethan Guo (this is the old account; please use "yihua")
      Reporter: guoyihua Ethan Guo (this is the old account; please use "yihua")
