Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0
Description
It can take up to half a day to explode a modest-sized nested collection (~0.5M) on recent Xeon processors.
See the attached PySpark script that reproduces the problem.
# Explode the nested collection and force evaluation with a count.
cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
print cached_df.count()
The script generates a number of tables that all contain the same total number of nested records (see the `scaling` variable in the loops). The `scaling` variable scales up how many nested elements each record holds, and scales down the number of records in the table by the same factor, so the total number of nested records stays the same.
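For illustration only, here is a minimal sketch of how such tables could be generated. This is not the attached script: the SparkSession setup, the 500,000-element budget, the scaling values, and the nested_<scaling> view names are all assumptions.

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName('explode_repro_sketch').getOrCreate()

TOTAL_ELEMENTS = 500000  # placeholder element budget, kept constant across all tables
for scaling in [1, 10, 100, 1000, 10000, 50000]:
    num_records = TOTAL_ELEMENTS // scaling  # fewer records as each nested list grows
    rows = [Row(individ=i, hholdid=i, amft=list(range(scaling)))
            for i in range(num_records)]
    df = spark.createDataFrame(rows)
    df.createOrReplaceTempView('nested_%d' % scaling)  # hypothetical view name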
Time grows exponentially (note the log-10 scale on the vertical axis of the attached chart).
At a scaling factor of 50,000 (see the attached PySpark script), it took 7 hours (!) to explode the nested collections of the 8k-record table.
Beyond about 1,000 elements per nested collection, the time grows exponentially.
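A rough sketch of how the per-table explode timings could be collected follows; it again reuses the hypothetical nested_<scaling> views and the SparkSession from the sketch above, and is not the attached script.

import time

for scaling in [1, 10, 100, 1000, 10000, 50000]:
    table_name = 'nested_%d' % scaling
    start = time.time()
    exploded = spark.sql('select individ, hholdid, explode(amft) from ' + table_name)
    n = exploded.count()  # forces the explode to actually run
    print('scaling=%d exploded_rows=%d elapsed=%.1fs' % (scaling, n, time.time() - start))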
Attachments
Issue Links
- is related to
  SPARK-4502 Spark SQL reads unneccesary nested fields from Parquet (Resolved)
- relates to
  SPARK-16998 select($"column1", explode($"column2")) is extremely slow (Resolved)
  SPARK-22330 Linear containsKey operation for serialized maps (Resolved)
  SPARK-15214 Implement code generation for Generate (Resolved)
  SPARK-22385 MapObjects should not access list element by index (Resolved)
- links to