Details
Question
Status: Resolved
Critical
Resolution: Not A Problem
2.1.1
None
None
Description
In theory, a bagging method such as RandomForest trains each tree on its own copy of the source dataset, built by resampling its rows.
To avoid excessive memory usage, Spark uses the BaggedPoint concept instead: each row is stored once and is associated with one weight per resampled dataset, i.e. one weight for each tree requested from the RandomForest.
RandomForest requires that the dataset for each tree be a random draw with replacement from the source data, with the same size as the source data.
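To make the requirement concrete, here is a minimal sketch in plain Python (not Spark code) of an exact bootstrap expressed as per-row counts. The function name `exact_bootstrap_counts` and its parameters are hypothetical, chosen only for illustration. Because each tree draws exactly `n_rows` indices, the counts for a tree always sum to the source row count.

```python
import random
from collections import Counter


def exact_bootstrap_counts(n_rows, n_trees, seed=0):
    """For each tree, draw n_rows indices with replacement and count the picks.

    Hypothetical helper for illustration; this is not how Spark does it.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trees):
        # One bootstrap sample: n_rows draws, each uniform over the rows.
        picks = Counter(rng.randrange(n_rows) for _ in range(n_rows))
        # counts[t][i] is how many times row i appears in tree t's sample.
        counts.append([picks[i] for i in range(n_rows)])
    return counts


counts = exact_bootstrap_counts(n_rows=1000, n_trees=3)
totals = [sum(tree_counts) for tree_counts in counts]
print(totals)  # each total is exactly n_rows by construction
```

The per-row counts play the same role as the per-tree weights of a bagged point, but producing them this way needs the total row count up front and couples all rows of a tree together.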
However, during our investigation, we found that the count value used to compute the variance is not always equal to the source data count: it is sometimes lower and sometimes higher.
I went digging in the source and found the BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a Poisson distribution to assign a weight to each row. This distribution does not guarantee that the weights for a given tree sum to the source dataset count.
From what I could find, this seems to be done for performance reasons, because the approximation is good enough, especially on very large datasets.
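The effect can be reproduced outside Spark with a small simulation. The sketch below is plain Python, not the Spark implementation: it assumes a Poisson mean of 1.0 per row (a subsampling rate of 1), and `poisson_draw` and `poisson_weight_totals` are hypothetical names. Each row gets an independent weight per tree, so a tree's total weight is itself Poisson-distributed with mean `n_rows` and standard deviation `sqrt(n_rows)`.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method (small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def poisson_weight_totals(n_rows, n_trees, mean=1.0, seed=0):
    """Give every row an independent Poisson weight per tree; return per-tree sums."""
    rng = random.Random(seed)
    totals = [0] * n_trees
    for _ in range(n_rows):
        for t in range(n_trees):
            # Assumption: one independent draw per (row, tree) pair.
            totals[t] += poisson_draw(rng, mean)
    return totals


n_rows = 10_000
totals = poisson_weight_totals(n_rows, n_trees=5)
for t, total in enumerate(totals):
    # Relative gap between the tree's total weight and the source row count.
    print(t, total, f"{(total - n_rows) / n_rows:+.2%}")
```

Since the standard deviation of the total is `sqrt(n_rows)`, the relative gap shrinks like `1 / sqrt(n_rows)`: about 1% at 10,000 rows and about 0.1% at 1,000,000 rows. Each weight is also drawn independently of every other row, so no global count or coordination across partitions is required.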
However, I could not find any documentation that clearly explains this. Would you have any link on the subject?