Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.4.0
Fix Version/s: None
Description
It seems that .limit() is much less efficient than it could be (or than one would expect) when reading a large dataset from parquet:

val sample = spark.read.parquet("/Some/Large/Data.parquet").limit(1000)
// Do something with sample ...
This might take hours, depending on the size of the data.
By comparison,
spark.read.parquet("/Some/Large/Data.parquet").show(1000)
is essentially instant.
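
A possible workaround, shown below as a minimal sketch (assuming the slowdown comes from the limit being evaluated over the full dataset rather than incrementally), is to collect the sample with take(), which returns quickly much like show(), and then rebuild a small DataFrame from the collected rows:

// Illustrative workaround sketch; path and sample size taken from the example above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read.parquet("/Some/Large/Data.parquet")

// take(1000) fetches rows incrementally, so it should return quickly
// even on a very large dataset, similar to show(1000).
val rows = df.take(1000)

// Rebuild a small DataFrame from the collected rows for further processing.
val sample = spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), df.schema)

// Do something with sample ...

This avoids the slow path at the cost of collecting the 1000 sample rows to the driver first.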