  1. Spark
  2. SPARK-19348

pyspark.ml.Pipeline gets corrupted under multi threaded use

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 2.0.0, 2.1.0, 2.2.0
    • Fix Version/s: 2.0.3, 2.1.1, 2.2.0
    • Component/s: ML, PySpark
    • Labels:
      None

      Description

      When pyspark.ml.Pipeline objects are constructed concurrently in separate Python threads, the stages used to construct a Pipeline object get corrupted: the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in another thread.
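      The check described above can be sketched as a small harness that builds one object per thread and compares what each object holds with what it was given. This is only an illustration, not the attached pyspark_pipeline_threads.py: `find_contaminated` and `FakePipeline` are hypothetical names, and `FakePipeline` is a thread-safe stand-in so that the sketch runs without Spark.

```python
import threading


def find_contaminated(factory, read_back, inputs):
    """Build one object per input, each in its own thread, and return the
    indices whose read-back value differs from the input that was supplied."""
    results = [None] * len(inputs)

    def worker(i):
        obj = factory(inputs[i])
        results[i] = read_back(obj)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(inputs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return [i for i, (want, got) in enumerate(zip(inputs, results))
            if got != want]


class FakePipeline:
    """Hypothetical stand-in for pyspark.ml.Pipeline (assumption: it only
    stores the stages it is given and hands them back)."""

    def __init__(self, stages):
        self._stages = list(stages)

    def getStages(self):
        return self._stages


# Each thread gets its own distinct list of (placeholder) stages.
inputs = [["tokenizer_%d" % i, "model_%d" % i] for i in range(8)]
bad = find_contaminated(lambda s: FakePipeline(stages=s),
                        lambda p: p.getStages(),
                        inputs)
print(bad)  # the stand-in is thread-safe, so no index is reported
```

      Against an affected Spark version, the same harness could presumably be pointed at the real class by passing `lambda s: Pipeline(stages=s)` as the factory, with each input being a list of real stage objects; a non-empty result would then indicate the cross-thread contamination reported here. Because the problem is a race, an empty result on a single run does not prove its absence.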

      Things work fine if construction of pyspark.ml.Pipeline objects is serialized, so this looks like a thread safety problem with pyspark.ml.Pipeline object construction.
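      As a stopgap on affected versions, the serialization mentioned above can be approximated by guarding construction with a process-wide lock. The following is a minimal sketch under that assumption; `build_serialized` and `FakePipeline` are hypothetical names, and the stand-in class only records how many constructors ran at once so the effect of the lock can be observed without Spark.

```python
import threading
import time

# One lock shared by every thread that constructs objects.
_construction_lock = threading.Lock()


def build_serialized(factory, *args, **kwargs):
    """Call factory(*args, **kwargs) while holding the shared lock, so that
    no two constructions overlap in time."""
    with _construction_lock:
        return factory(*args, **kwargs)


class FakePipeline:
    """Hypothetical stand-in that tracks the peak number of constructors
    running at the same time."""
    active = 0
    peak = 0

    def __init__(self, stages):
        FakePipeline.active += 1
        FakePipeline.peak = max(FakePipeline.peak, FakePipeline.active)
        time.sleep(0.001)  # widen the window in which overlap could occur
        self.stages = list(stages)
        FakePipeline.active -= 1


built = [None] * 8


def worker(i):
    built[i] = build_serialized(FakePipeline, stages=["stage_%d" % i])


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(FakePipeline.peak)  # 1: constructions never overlapped
```

      With PySpark the factory would be `Pipeline` itself, e.g. `build_serialized(Pipeline, stages=my_stages)`. The lock covers only the call it wraps; whether stage objects built inside worker threads need the same treatment is not established by this report.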

      Confirmed that the problem exists with Spark 1.6.x as well as 2.x.

      While the corruption of the Pipeline stages is easily caught, we need to know whether other pipeline operations, such as pyspark.ml.Pipeline.fit(), are also affected by the underlying cause of this problem. That is, can operations like fit() be performed concurrently in separate threads (on distinct Pipeline objects) without any cross-contamination between them?

        Attachments

        1. pyspark_pipeline_threads.py
          3 kB
          Vinayak Joshi

            People

            • Assignee:
              bryanc Bryan Cutler
              Reporter:
              vijoshi Vinayak Joshi
              Shepherd:
              Joseph K. Bradley
            • Votes:
              1
              Watchers:
              5