Spark / SPARK-2119

Reading Parquet InputSplits dominates query execution time when reading off S3


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1.0
    • Component/s: SQL
    • Labels: None

    Description

      Here's the relevant stack trace where things are hanging:

      	at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
      	at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370)
      	at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
      	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
      	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
      	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
      

      We should parallelize these getFileStatus calls or cache the resulting file metadata here.
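
      A minimal sketch of the parallelization idea (not the actual patch that went into 1.1.0); the helper name listStatusInParallel and the parallelism default are made up for illustration:

        import java.util.concurrent.Executors

        import scala.concurrent.{Await, ExecutionContext, Future}
        import scala.concurrent.duration.Duration

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileStatus, Path}

        // Hypothetical helper: issue the per-file getFileStatus calls concurrently
        // instead of serially, since each one is a blocking S3 HEAD request.
        def listStatusInParallel(
            paths: Seq[Path],
            conf: Configuration,
            parallelism: Int = 16): Seq[FileStatus] = {
          val pool = Executors.newFixedThreadPool(parallelism)
          implicit val ec = ExecutionContext.fromExecutorService(pool)
          try {
            val futures = paths.map { path =>
              Future(path.getFileSystem(conf).getFileStatus(path))
            }
            // Block until every status has been fetched; wall-clock time becomes
            // roughly (number of files / parallelism) * per-request latency.
            Await.result(Future.sequence(futures), Duration.Inf)
          } finally {
            pool.shutdown()
          }
        }

      Caching the resulting FileStatus objects (or the Parquet footers) across queries would avoid repeating these requests entirely.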

            People

              lian cheng Cheng Lian
              marmbrus Michael Armbrust
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

    Dates

      Created:
      Updated:
      Resolved: