HUDI-4205

Reading metadata table on S3 using Spark throws NullPointerException during createHFileReader



    Description

      Environment: EMR 6.6.0, OSS Spark 3.2.1, Hudi master

      Storage: S3
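
      For reference, a minimal sketch of one way such a table could have been written (table name, schema, and path are placeholders; in Hudi 0.11.x the metadata table is enabled by default via hoodie.metadata.enable=true and uses HFile base files):

      import spark.implicits._

      // Placeholders: table name, schema, and <base_path> are illustrative only.
      val df = Seq((1, "a", "2022-06-01"), (2, "b", "2022-06-01")).toDF("id", "name", "dt")
      df.write.format("hudi").
        option("hoodie.table.name", "test_table").
        option("hoodie.datasource.write.recordkey.field", "id").
        option("hoodie.datasource.write.precombine.field", "name").
        option("hoodie.datasource.write.partitionpath.field", "dt").
        mode("overwrite").
        save("s3a://<base_path>")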

      When loading the metadata table in the Spark shell with the following code, the query throws a NullPointerException. In this case, the metadata table's base files are in HFile format.

      This also happens with the following combinations: (1) Spark 3.1.3, Hudi 0.11.0; (2) Spark 3.2.1, Hudi 0.11.0.

      spark.read.format("hudi").load("s3a://<base_path>/.hoodie/metadata/").show 

       

      Caused by: java.lang.NullPointerException
        at org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:178)
        at org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:167)
        at org.apache.hudi.org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:163)
        at org.apache.hudi.HoodieBaseRelation$.$anonfun$createHFileReader$1(HoodieBaseRelation.scala:531)
        at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$1(HoodieBaseRelation.scala:482)
        at org.apache.hudi.HoodieMergeOnReadRDD.readBaseFile(HoodieMergeOnReadRDD.scala:130)
        at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:100)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750) 
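
      The trace ends in the shaded HBase CacheConfig constructor chain, which is consistent with a null Hadoop Configuration reaching the HFile reader on the executor. A hypothetical illustration of that failure mode (hbase.block.data.cacheonread is one of the cache flags CacheConfig reads; the null conf is an assumption, not a confirmed root cause):

      import org.apache.hadoop.conf.Configuration

      // Hypothetical illustration only: a null Configuration fails on the first
      // conf.getBoolean(...) call, matching the CacheConfig.<init> frames above.
      val conf: Configuration = null
      conf.getBoolean("hbase.block.data.cacheonread", true)  // throws NullPointerException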

      Spark shell:

      ./bin/spark-shell  \
           --master yarn \
           --deploy-mode client \
           --driver-memory 20g \
           --executor-memory 20g \
           --num-executors 2 \
           --executor-cores 8 \
           --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
           --conf spark.kryoserializer.buffer=256m \
           --conf spark.kryoserializer.buffer.max=1024m \
           --jars /home/hadoop/hudi-spark3.2-bundle_2.12-0.12.0-SNAPSHOT.jar \
           --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps' \
           --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
           --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' 
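
      A quick way to confirm that the metadata table's base files are HFiles before running the failing read (same placeholder path as above; metadata table base files carry the ".hfile" extension):

      import org.apache.hadoop.fs.Path

      // List the metadata table's "files" partition; base files should end in ".hfile".
      val metadataFiles = new Path("s3a://<base_path>/.hoodie/metadata/files")
      val fs = metadataFiles.getFileSystem(spark.sparkContext.hadoopConfiguration)
      fs.listStatus(metadataFiles).foreach(st => println(st.getPath.getName))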



    People

      Assignee: guoyihua Ethan Guo (this is the old account; please use "yihua")
      Reporter: guoyihua Ethan Guo (this is the old account; please use "yihua")
