[SPARK-2119] Reading Parquet InputSplits dominates query execution time when reading off S3 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.1.0
Component/s: SQL
Labels: None
Target Version/s: 1.1.0

Description

Here's the relevant stack trace where things are hanging:

	at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
	at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:370)
	at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)

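ParquetInputFormat.getSplits is blocked in NativeS3FileSystem.getFileStatus, which is slow against S3. We should parallelize these lookups, cache their results, or otherwise avoid repeating them.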
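
To make the suggestion concrete, here is a rough sketch of what "parallelize and cache" could look like. It is not the actual fix and not code from Spark, Parquet, or Hadoop: ParallelStatusLookup, its fetch function, the thread count, and the example paths are all hypothetical. The fetch function stands in for whatever slow per-file call is being made (for example, a wrapper around FileSystem.getFileStatus).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Spark, Parquet, or Hadoop.
final class ParallelStatusLookup<S> {
    // Stand-in for the slow per-file metadata call (assumed to return non-null).
    private final Function<String, S> fetch;
    // Memoizes results so a path is fetched at most once per lookup instance.
    private final Map<String, S> cache = new ConcurrentHashMap<>();
    private final int threads;

    ParallelStatusLookup(Function<String, S> fetch, int threads) {
        this.fetch = fetch;
        this.threads = threads;
    }

    // Returns one status per input path, in input order.
    List<S> statuses(List<String> paths) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every lookup first so the slow calls overlap.
            List<Future<S>> pending = new ArrayList<>();
            for (String path : paths) {
                Callable<S> task = () -> cache.computeIfAbsent(path, fetch);
                pending.add(pool.submit(task));
            }
            // Then collect results in the original order.
            List<S> results = new ArrayList<>(paths.size());
            for (Future<S> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching statuses", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("status lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // String::length is a cheap placeholder for a real metadata call.
        ParallelStatusLookup<Integer> lookup = new ParallelStatusLookup<>(String::length, 4);
        System.out.println(lookup.statuses(
            Arrays.asList("s3n://bucket/a.parquet", "s3n://bucket/b.parquet")));
    }
}
```

Two caveats on the sketch: ConcurrentHashMap.computeIfAbsent can block other updates to the same map bin while the fetch runs, so a real implementation might cache futures instead, and a cache like this has no invalidation, so it would only be safe if the underlying files do not change between lookups.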