Description
Here's the relevant stack trace where things are hanging:
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370)
at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
We should either parallelize these getFileStatus calls or cache their results.
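To make the suggestion concrete, here is a minimal sketch of running the per-file lookups on a thread pool and looking up each distinct path only once. It is not the Parquet or Spark code: the class `ParallelStatusLookup`, the method `fetchAll`, and the `lookup` function are hypothetical names, and `lookup` stands in for a call such as `FileSystem.getFileStatus` so the example runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of Parquet, Spark, or Hadoop.
final class ParallelStatusLookup {

    // Runs `lookup` once per distinct path on a fixed-size thread pool and
    // returns the results keyed by path, in first-seen order.
    // `lookup` is an assumption standing in for a slow remote status call.
    static <V> Map<String, V> fetchAll(List<String> paths,
                                       Function<String, V> lookup,
                                       int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per distinct path so duplicates cost nothing extra.
            Map<String, Future<V>> pending = new LinkedHashMap<>();
            for (String path : new LinkedHashSet<>(paths)) {
                Callable<V> task = () -> lookup.apply(path);
                pending.put(path, pool.submit(task));
            }
            // Collect results; each get() blocks until that lookup finishes.
            Map<String, V> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<V>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching statuses", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("status lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real fix the map returned here could be kept for the duration of split computation so later stages reuse the statuses instead of asking the file system again. The pool size would need tuning against the store's request limits.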
Issue Links
- is related to: SPARK-2551 Cleanup FilteringParquetRowInputFormat (Resolved)
- relates to: PARQUET-16 Unnecessary getFileStatus() calls on all part-files in ParquetInputFormat.getSplits (Open)