Uploaded image for project: 'Apache Hudi'
  1. Apache Hudi
  2. HUDI-1082

Bug in deciding the upsert/insert buckets

    XMLWordPrintableJSON

Details

    Description

      In [UpsertPartitioner|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java], when getPartition(Object key) is called, the logic to determine where the record to be inserted is relying on globalInsertCounts where as this should be perPartitionInsertCount.

       

      Bcoz, the weights for all targetInsert buckets are determined based on total Inserts going into the partition of interest. // check like 200. Whereas when getPartition(key) is called, we use global insert count to determine the right bucket. 

       

      For instance,

      P1: 3 insert buckets with weights 0.2, 0.5 and 0.3 with total records to be inserted is 100.

      P2: 4 bucket with weights 0.1, 0.8, 0.05, 0.05 with total records to be inserted is 10025. 

      So, ideal allocation is

      P1: B0 -> 20, B1 -> 50, B2 -> 30

      P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503

       

      getPartition() for a key is determined based on following.

      mod (hash value, inserts)/ inserts.

      Instead of considering inserts for the partition of interest, currently we take global insert counts.

      Lets say, these are the hash values for insert records in P1.

      5, 14, 20, 25, 90, 500, 1001, 5180.

      record hash | expected bucket in P1 | actual bucket in P1 |

      5     | B0 | B0

      14   | B0 | B0

      21   | B1  | B0

      30  | B1 | B0

      90 | B2 | B0

      500 | B0 | B0

      1490 | B2 | B1

      10019 | B0 | B3

       

       

       

       

       

       

       

       

       

       

      Attachments

        Issue Links

          Activity

            People

              shenhong shenh062326
              shivnarayan sivabalan narayanan
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: