This started as a question on Stack Overflow, but it seems like a bug.
I am testing Spark pipelines using a simple dataset (attached) with 312 (mostly numeric) columns but only 421 rows. It is small, yet applying my ML pipeline to it takes 3 minutes on a 24-core server with 60G of memory. This seems much too long for such a tiny dataset. Similar pipelines run quickly on datasets that have fewer columns and more rows, so the slow performance appears to be tied to the number of columns.
Here is a list of the stages in my pipeline:
There are 2 string columns that are converted to ints with StringIndexerModel. Then there are bucketizers that bin all the numeric columns into 2 or 3 bins each. Is there a way to bin many columns at once with a single stage? I did not see one. Next there is a VectorAssembler to combine all the columns into one for the NaiveBayes classifier. Lastly, there is a simple SQLTransformer to cast the prediction column to an int.
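To make the stage layout concrete, here is a minimal plain-Python sketch of it. It makes no Spark calls; the `Stage` tuple, the `plan_stages` helper, the `_idx`/`_bin` suffixes, and all column names are hypothetical and only illustrate how the number of stages follows from the number of columns.

```python
from typing import List, NamedTuple


class Stage(NamedTuple):
    kind: str          # Spark ML class the stage would use
    inputs: List[str]  # input column(s)
    output: str        # output column


def plan_stages(string_cols: List[str], numeric_cols: List[str]) -> List[Stage]:
    """List the stages described above, in pipeline order."""
    stages: List[Stage] = []
    # One StringIndexerModel per string column.
    for col in string_cols:
        stages.append(Stage("StringIndexerModel", [col], col + "_idx"))
    # One Bucketizer per numeric column (2 or 3 bins each).
    for col in numeric_cols:
        stages.append(Stage("Bucketizer", [col], col + "_bin"))
    # One VectorAssembler over every derived column.
    derived = [stage.output for stage in stages]
    stages.append(Stage("VectorAssembler", derived, "features"))
    # The classifier, then the SQLTransformer that casts the prediction.
    stages.append(Stage("NaiveBayes", ["features"], "prediction"))
    stages.append(Stage("SQLTransformer", ["prediction"], "prediction_int"))
    return stages


if __name__ == "__main__":
    # Hypothetical column names, not the ones in the attached dataset.
    demo = plan_stages(["s1", "s2"], ["n%d" % i for i in range(1, 6)])
    print(len(demo), "stages")
    for stage in demo:
        print(stage.kind, len(stage.inputs), "->", stage.output)
```

With this layout the stage count is the number of indexed and bucketized columns plus three, so it grows linearly with the width of the dataset.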
Here is what the metadata for the two StringIndexerModels looks like:
The bucketizers all look very similar. Here is what the metadata for a few of them looks like:
Here is the metadata for the NaiveBayes model:
And for the final SQLTransformer:
Why does the pipeline become extremely slow with more than a couple hundred columns (and only a few rows), while millions of rows (with fewer columns) perform fine? The pipeline is slow not only to apply but also to create: the fit and evaluate steps take a few minutes each. Is there anything that can be done to make it faster?
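For anyone reproducing the timings, each step could be measured separately with something like the sketch below. The `timed` helper is hypothetical, and the Spark calls appear only in comments with assumed names (`pipeline`, `model`, `train_df`, `df`). Because DataFrame transformations are lazy, the step that applies the pipeline should end in an action such as `count()`, otherwise only plan construction is timed.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(step: Callable[[], T]) -> Tuple[T, float]:
    """Run step() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = step()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without Spark. With Spark, the
    # steps would be something like (all names are assumptions):
    #   model, fit_s = timed(lambda: pipeline.fit(train_df))
    #   rows, apply_s = timed(lambda: model.transform(df).count())
    total, seconds = timed(lambda: sum(range(1000000)))
    print("result=%d elapsed=%.3fs" % (total, seconds))
```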
I get similar results using 2.1.1RC, 2.1.2(tip) and 2.2.0(tip). Spark 2.1.0 gives a Janino 64k limit error when trying to build this pipeline (see https://issues.apache.org/jira/browse/SPARK-16845).
I stepped through pipeline.fit in the debugger and noticed that the queryPlan is a huge nested structure. I don't know how to interpret this plan, but it is likely related to the performance problem. The plan is attached.