PARQUET-328

ParquetReader not using FileSystem cache effectively?


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

We've seen a Spark job get stuck with the following trace:

      java.util.HashMap.put(HashMap.java:494)
      org.apache.hadoop.conf.Configuration.set(Configuration.java:1065)
      org.apache.hadoop.conf.Configuration.set(Configuration.java:1035)
      org.apache.hadoop.fs.viewfs.HDFSCompatibleViewFileSystem.mergeViewFsHdfsMountPoints(HDFSCompatibleViewFileSystem.java:491)
      org.apache.hadoop.fs.viewfs.HDFSCompatibleViewFileSystem.mergeConfFromDirectory(HDFSCompatibleViewFileSystem.java:413)
      org.apache.hadoop.fs.viewfs.HDFSCompatibleViewFileSystem.mergeViewFsAndHdfs(HDFSCompatibleViewFileSystem.java:273)
      org.apache.hadoop.fs.viewfs.HDFSCompatibleViewFileSystem.initialize(HDFSCompatibleViewFileSystem.java:190)
      org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2438)
      org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
      org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2472)
      org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2454)
      org.apache.hadoop.fs.FileSystem.get(FileSystem.java:384)
      org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
      parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
      parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
      parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
      org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
      org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
      org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      org.apache.spark.scheduler.Task.run(Task.scala:64)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      java.lang.Thread.run(Thread.java:745)
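
      The trace shows the task inside FileSystem$Cache.getInternal constructing and initializing a brand-new FileSystem rather than returning a cached instance, i.e. the cache lookup missed or was bypassed. ParquetFileReader.readFooter resolves the FileSystem from the file's Path on every call, so any miss (for example a scheme with fs.<scheme>.impl.disable.cache set, or tasks running under distinct UserGroupInformation instances) makes every task pay for the expensive initialize(). Below is a minimal Java sketch of the Hadoop-side behaviour, not Parquet code; the readFooterLike helper and the local file path are illustrative assumptions:

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class FsCacheDemo {

          // Mirrors the shape of ParquetFileReader.readFooter's first step:
          // resolve the FileSystem from the file's Path on every invocation.
          // (Illustrative helper, not Parquet's actual code.)
          static void readFooterLike(Configuration conf, Path file) throws IOException {
              // FileSystem.get consults FileSystem.Cache, keyed roughly by
              // (scheme, authority, UserGroupInformation). A hit returns the
              // cached instance; a miss runs the scheme's expensive
              // initialize() under the cache's lock, which is where the
              // reported trace is sitting.
              FileSystem fs = file.getFileSystem(conf);
              // ... open the file and read the footer with fs ...
          }

          public static void main(String[] args) throws IOException {
              Configuration conf = new Configuration();
              // Local scheme so the sketch runs without a cluster; the issue
              // itself involves a viewfs/hdfs FileSystem.
              Path file = new Path("file:///tmp/part-00000.parquet");

              // Cached path: repeated calls with an equivalent conf return the
              // same FileSystem object after the first initialize().
              FileSystem first = file.getFileSystem(conf);
              FileSystem second = file.getFileSystem(conf);
              System.out.println("cache hit: " + (first == second)); // expected: true

              // Setting fs.<scheme>.impl.disable.cache=true bypasses the cache
              // and forces a fresh createFileSystem()/initialize() on every
              // call -- the cost this issue suspects each task is paying.
              Configuration noCache = new Configuration(conf);
              noCache.setBoolean("fs.file.impl.disable.cache", true);
              FileSystem third = file.getFileSystem(noCache);
              System.out.println("cache hit: " + (first == third)); // expected: false
          }
      }

      If this reading is right, the usual mitigations are to leave fs.<scheme>.impl.disable.cache unset so tasks in the same executor JVM share one cached FileSystem, or to hoist the FileSystem lookup out of the per-split footer-read path.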


People

    Assignee: Unassigned
    Reporter: tianshuo (Tim)
    Votes: 0
    Watchers: 5
