SPARK-19348

pyspark.ml.Pipeline gets corrupted under multi-threaded use

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 2.0.0, 2.1.0, 2.2.0
    • Fix Version/s: 2.0.3, 2.1.1, 2.2.0
    • Component/s: ML, PySpark
    • Labels: None

    Description

      When pyspark.ml.Pipeline objects are constructed concurrently in separate Python threads, the stages used to construct a Pipeline object get corrupted, i.e., the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in a different thread.
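
The cross-contamination described above can be checked for after construction. The following is a minimal sketch, not taken from the attached script: the helper name `find_cross_contamination` is hypothetical, and it assumes each object exposes `getStages()` (as `pyspark.ml.Pipeline` does) and returns the same stage objects that were passed in.

```python
def find_cross_contamination(pipelines, expected_stage_lists):
    """Return the indices of pipelines whose stages differ from those supplied.

    Hypothetical helper: `pipelines[i]` is assumed to have been built from
    `expected_stage_lists[i]`, and to expose a `getStages()` method.
    """
    corrupted = []
    for index, (pipeline, expected) in enumerate(zip(pipelines, expected_stage_lists)):
        # Treat an unset stages param (None) as an empty list.
        actual = list(pipeline.getStages() or [])
        # Compare by identity: a stage belonging to another thread's pipeline
        # is a different object even if it is configured identically.
        same = len(actual) == len(expected) and all(
            a is e for a, e in zip(actual, expected)
        )
        if not same:
            corrupted.append(index)
    return corrupted
```

An empty result means every pipeline holds exactly the stages it was given. A non-empty result lists the positions of the pipelines that ended up with another thread's stages.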

      Things work fine if construction of pyspark.ml.Pipeline objects is serialized, so this looks like a thread safety problem with pyspark.ml.Pipeline object construction.
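
Serializing construction, as mentioned above, can be sketched with a single lock shared by all threads. This is an illustrative workaround under stated assumptions, not the fix applied in Spark: `build_serialized`, `build_in_threads`, and `FakePipeline` are hypothetical names, and `FakePipeline` is only a stand-in so the sketch runs without a Spark installation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level lock; any one lock shared by all threads would do.
_construction_lock = threading.Lock()


def build_serialized(factory, *args, **kwargs):
    """Call factory(*args, **kwargs) while holding the shared lock,
    so that only one construction runs at a time."""
    with _construction_lock:
        return factory(*args, **kwargs)


class FakePipeline:
    """Stand-in for pyspark.ml.Pipeline (assumption: keyword-only `stages`)."""

    def __init__(self, stages):
        self._stages = list(stages)

    def getStages(self):
        return self._stages


def build_in_threads(factory, stage_lists, max_workers=4):
    """Build one object per stage list, each from a worker thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(build_serialized, factory, stages=stages)
            for stages in stage_lists
        ]
        # Results are collected in submission order.
        return [future.result() for future in futures]
```

With PySpark available, passing `Pipeline` in place of `FakePipeline` would route every constructor call through the same lock. Only construction is serialized here; whether later operations on the resulting objects are safe to run concurrently is a separate question.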

      Confirmed that the problem exists with Spark 1.6.x as well as 2.x.

      While the corruption of the Pipeline stages is easily caught, we need to know whether other pipeline operations, such as Pipeline.fit(), are affected by the same underlying cause. That is, can operations like Pipeline.fit() be performed concurrently in separate threads (on distinct Pipeline objects) without any cross-contamination between them?

      Attachments

        1. pyspark_pipeline_threads.py
          3 kB
          Vinayak Joshi

          People

            Assignee: Bryan Cutler (bryanc)
            Reporter: Vinayak Joshi (vijoshi)
            Shepherd: Joseph K. Bradley
            Votes: 1
            Watchers: 5