Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.3.0
None
None
Description
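In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with
df = dataset.select("*", rand(seed).alias(randCol))
the result should be checkpointed (checkpoint() returns the checkpointed DataFrame, so the result has to be reassigned):
df = df.checkpoint()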
If df is not checkpointed, it is recomputed each time a train or validation DataFrame is created. The values produced by rand(seed) depend on the order of rows in df, and that order is not deterministic. As a result, a given row can receive a different random value on each recomputation, even with a fixed seed. Note that checkpoint() cannot be replaced with cache(): when a node fails, the cached data has to be recomputed, so the random numbers could again differ.
This may be a particular problem when the input 'dataset' DataFrame is the result of a query that includes a 'where' clause. See the link below.
https://dzone.com/articles/non-deterministic-order-for-select-with-limit
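The concern described above can be illustrated with a small pure-Python simulation. This is only a sketch and not Spark's actual rand() implementation: it assumes a seeded generator that is consumed in row-arrival order, and the names add_rand_column, first_pass, second_pass and the 0.5 split threshold are hypothetical, chosen for illustration.

```python
import random


def add_rand_column(rows, seed):
    """Attach one pseudo-random value per row, drawn in arrival order.

    Simplified model (an assumption): the i-th row to arrive gets the
    i-th draw from a generator initialised with `seed`.
    """
    rng = random.Random(seed)
    return {row: rng.random() for row in rows}


rows = ["a", "b", "c", "d"]
seed = 42

# First computation of the "DataFrame with a random column".
first_pass = add_rand_column(rows, seed)
# A recomputation that yields the same rows in a different order.
second_pass = add_rand_column(list(reversed(rows)), seed)

# Same seed and same rows, yet the value attached to a row can change.
changed = [r for r in rows if first_pass[r] != second_pass[r]]

# Without materialisation: train is filtered from one computation and
# validation from another, so the two sets need not partition the rows.
train_1 = {r for r, v in first_pass.items() if v >= 0.5}
validation_2 = {r for r, v in second_pass.items() if v < 0.5}

# With materialisation (the role checkpoint() plays in the proposal):
# both filters read the same stored values, so the split is a partition.
stored = dict(first_pass)
train = {r for r, v in stored.items() if v >= 0.5}
validation = {r for r, v in stored.items() if v < 0.5}

print("rows whose random value changed:", changed)
print("unmaterialised split:", train_1, validation_2)
print("materialised split:", train, validation)
```

In this model, filtering both splits from one stored copy always gives disjoint sets that cover every row, while filtering from two separate computations can place a row in both sets or in neither.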