Description
When pyspark.ml.Pipeline objects are constructed concurrently in separate Python threads, the stages used to construct a pipeline get corrupted, i.e., the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in another thread.
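The symptom described above can be checked with a small thread harness. The following is only a sketch: `find_cross_contamination` and its `build` callable are hypothetical names introduced for illustration, and the placeholder strings stand in for real stages. With PySpark, `build` would presumably wrap something like `Pipeline(stages=stages).getStages()`, with real Transformer/Estimator objects as the stages.

```python
import threading


def find_cross_contamination(build, n_threads=8, rounds=50):
    """Call build(stages) from several threads at once.

    Returns a list of (expected, actual) pairs for every call whose
    result did not match the stages that the calling thread passed in.
    `build` is a hypothetical callable: it takes a list of stages and
    returns the stages held by the object it constructed.
    """
    mismatches = []
    mismatches_lock = threading.Lock()  # protects the shared result list

    def worker(thread_id):
        for round_no in range(rounds):
            # Each call gets a stage list unique to this thread and round.
            expected = ["stage-%d-%d" % (thread_id, round_no)]
            actual = build(list(expected))
            if actual != expected:
                with mismatches_lock:
                    mismatches.append((expected, actual))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches
```

An empty result does not prove thread safety, since a race may simply not have triggered during the run; a non-empty result shows stages leaking between threads.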
Things work fine if construction of pyspark.ml.Pipeline objects is serialized, so this looks like a thread-safety problem in pyspark.ml.Pipeline object construction.
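One way to serialize construction is to funnel every constructor call through a single shared lock. This is a minimal sketch of that workaround, not a fix for the underlying cause; `build_serialized` and `_construction_lock` are hypothetical names, and the PySpark call in the comment is an assumed usage.

```python
import threading

# One process-wide lock shared by every thread that builds pipelines.
_construction_lock = threading.Lock()


def build_serialized(factory, *args, **kwargs):
    """Call factory(*args, **kwargs) while holding the shared lock,
    so that no two constructions overlap in time."""
    with _construction_lock:
        return factory(*args, **kwargs)


# Assumed usage with PySpark (not executed here):
#   pipeline = build_serialized(Pipeline, stages=[tokenizer, hashing_tf])
```

The lock only covers construction, and it only helps if every thread goes through the same helper. Whether calls such as fit() need similar protection is not settled by this workaround.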
The problem has been confirmed on Spark 1.6.x as well as 2.x.
While corruption of the Pipeline stages is easily caught, we need to know whether other pipeline operations, such as pyspark.ml.Pipeline.fit(), are also affected by the underlying cause of this problem. That is, can operations like Pipeline.fit() be performed concurrently in separate threads (on distinct Pipeline objects) without any cross-contamination between them?