Beam / BEAM-10824

Hash in stats.ApproximateUniqueCombineFn is non-deterministic

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P1
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: sdk-py-core
    • Labels:

      Description

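      Python's built-in hash() function is not deterministic across processes: for str and bytes values the result is salted with a per-process random seed unless PYTHONHASHSEED is set. As a result, different workers map identical values to different hashes. In a distributed processing model this leads to overestimating the number of unique values by several orders of magnitude (about 1000x in my experience).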

      https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L218
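A minimal sketch of the behaviour described above, not the code in stats.py. It starts two fresh interpreters with different PYTHONHASHSEED values to stand in for two workers and compares their hash() results for the same string. The helper names `builtin_hash_in_fresh_process` and `stable_hash` are hypothetical, and the SHA-256-based hash is only one possible deterministic alternative, not necessarily the fix adopted in Beam.

```python
import hashlib
import os
import subprocess
import sys


def builtin_hash_in_fresh_process(value, seed):
    """Return hash(value) as computed by a new interpreter.

    `seed` is passed as PYTHONHASHSEED, so each call imitates a worker
    process that received its own hash salt.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({value!r}))"],
        env=dict(os.environ, PYTHONHASHSEED=seed),
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def stable_hash(value):
    """Hypothetical deterministic alternative (an assumption, not Beam's API).

    Takes the first 8 bytes of the SHA-256 digest of the UTF-8 encoded
    string form, so every process computes the same 64-bit integer.
    """
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def demo():
    worker_a = builtin_hash_in_fresh_process("beam", "1")
    worker_b = builtin_hash_in_fresh_process("beam", "2")
    print("built-in hash agrees across workers:", worker_a == worker_b)
    print("stable_hash('beam'):", stable_hash("beam"))
```

Calling `demo()` prints whether the two simulated workers agree on the built-in hash. Because a hash-based estimator merges per-worker hash samples, two workers that disagree on the hash of the same value count it as two distinct values, which is the inflation described in the report.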

       

       


              People

              • Assignee:
                monicadsong Monica Song
                Reporter:
                monicadsong Monica Song
              • Votes:
                0
                Watchers:
                2

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  24h
                  Remaining:
                  3h
                  Logged:
                  21h