Details
- Type: Bug
- Status: Resolved
- Priority: P1
- Resolution: Fixed
Description
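Python's built-in hash() function is not deterministic across processes: with hash randomization (on by default in Python 3), str and bytes hashes are salted per interpreter process. As a result, different workers map identical values to different hashes. In a distributed processing model this leads to overestimating the number of unique values by several orders of magnitude (roughly 1000x in my experience).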
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218
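The effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the Beam code or the fix that was applied: `salted_hash`, `stable_hash`, the worker salts, and the sample values are all hypothetical names introduced here. `salted_hash` stands in for the built-in hash() under hash randomization by mixing a per-worker salt into the digest, while `stable_hash` derives the hash from the value's bytes alone, so every worker agrees on it.

```python
import hashlib


def salted_hash(value: str, worker_salt: bytes) -> int:
    """Hypothetical stand-in for hash() under hash randomization.

    The result depends on a per-worker salt, so two workers disagree
    on the hash of the same value.
    """
    digest = hashlib.blake2b(
        value.encode("utf-8"), key=worker_salt, digest_size=8
    ).digest()
    return int.from_bytes(digest, "big")


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash that depends only on the value's bytes."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Assumed setup: 100 distinct values, each seen by all 10 workers.
values = [f"user-{i}" for i in range(100)]
worker_salts = [f"worker-{w}".encode("utf-8") for w in range(10)]

# Merging per-worker hashes: with salted hashes, each worker contributes
# its own hash for the same value, inflating the distinct count.
salted = {salted_hash(v, s) for s in worker_salts for v in values}
stable = {stable_hash(v) for _ in worker_salts for v in values}

distinct_salted = len(salted)  # grows with the number of workers
distinct_stable = len(stable)  # equals the true number of unique values

print(distinct_salted, distinct_stable)
```

In this simulation the inflation factor tracks the number of workers that see each value, which is consistent with the overestimate growing as the job is spread across more workers.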