Description
When a PySpark model is created by fitting an estimator, its UID is set to the parent estimator's UID. Before this assignment happens, Params._copy_params() has already copied every param defined on the model class to the model instance, and each copy keeps the model's original UID as its parent. As a result, PySpark concludes that the params are not owned by the model, which can lead to a ValueError raised from Params._shouldOwn(), such as:
ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', name='outputCol', doc='output column name.') does not belong to CountVectorizer_4c8e9fd539542d783e66.
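The mismatch can be illustrated without Spark. What follows is a minimal pure-Python sketch, not PySpark's actual code: ToyParam, ToyParams, ToyModel and should_own are hypothetical stand-ins for Param, Params, a fitted model and Params._shouldOwn(), and the UIDs are made up.

```python
class ToyParam:
    """Hypothetical stand-in for a PySpark Param: a name plus the owner's UID."""

    def __init__(self, parent, name):
        self.parent = parent  # UID of the instance that owns this param
        self.name = name

    def __str__(self):
        return f"{self.parent}__{self.name}"


class ToyParams:
    """Hypothetical stand-in for Params: binds params to the UID at construction."""

    param_names = ()

    def __init__(self, uid):
        self.uid = uid
        # Each param records the UID the instance has right now.
        self.params = [ToyParam(self.uid, n) for n in self.param_names]

    def should_own(self, param):
        # Ownership check: the param's parent must equal the current UID.
        if param.parent != self.uid:
            raise ValueError(f"Param {param} does not belong to {self.uid}.")


class ToyModel(ToyParams):
    param_names = ("inputCol", "outputCol")


model = ToyModel(uid="ToyModel_1234")
# Simulate what happens after fitting: the UID is overwritten with the
# estimator's UID, but the params still point at the old one.
model.uid = "ToyEstimator_abcd"

try:
    model.should_own(model.params[0])
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

The check fails only because the UID changed after the params were bound; nothing about the params themselves is wrong.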
I encountered this problem while working on SPARK-13967 where I tried to add the shared params HasInputCol and HasOutputCol to CountVectorizerModel. See the attached file feature.py for the WIP.
Using the modified 'feature.py', this sample code shows the mixup in UIDs and produces the error above.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import CountVectorizer

sc = SparkContext(appName="count_vec_test")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", "raw"])
cv = CountVectorizer(inputCol="raw", outputCol="vectors")
model = cv.fit(df)
print(model.uid)
for p in model.params:
    print(str(p))
model.transform(df).show(truncate=False)
output (the UIDs should match):
CountVectorizer_4c8e9fd539542d783e66
CountVectorizerModel_4336a81ba742b2593fef__binary
CountVectorizerModel_4336a81ba742b2593fef__inputCol
CountVectorizerModel_4336a81ba742b2593fef__outputCol
In the Scala implementation of this, the model overrides the UID value, which the Params use when they are constructed, so they all end up with the parent estimator UID.
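The ordering used on the Scala side can be sketched in plain Python as follows. This is an illustration under assumptions, not a proposed patch: ToyParam, ToyModel and ToyEstimator are hypothetical names, and the only point is that the UID is fixed before any param is constructed.

```python
class ToyParam:
    """Hypothetical stand-in for a Param: a name plus the owner's UID."""

    def __init__(self, parent, name):
        self.parent = parent
        self.name = name


class ToyModel:
    param_names = ("binary", "inputCol", "outputCol")

    def __init__(self, uid):
        # The UID is set first, so every param is created with it as parent.
        self.uid = uid
        self.params = [ToyParam(uid, n) for n in self.param_names]


class ToyEstimator:
    def __init__(self, uid):
        self.uid = uid

    def fit(self):
        # Hand the estimator's UID to the model constructor instead of
        # overwriting model.uid after the params already exist.
        return ToyModel(uid=self.uid)


model = ToyEstimator("ToyEstimator_abcd").fit()
parents = {p.parent for p in model.params}
print(model.uid, parents)
```

With this ordering, the model UID and every param parent agree, so an ownership check like Params._shouldOwn() would have nothing to reject.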
Attachments
Issue Links
- is contained by: SPARK-14771 Python ML Param and UID issues (Resolved)
- is related to: SPARK-10931 PySpark ML Models should contain Param values (Resolved)
- is superseded by: SPARK-14392 CountVectorizer Estimator should include binary toggle Param (Resolved)
- relates to: SPARK-15009 PySpark CountVectorizerModel should be able to construct from vocabulary list (Resolved)
- links to