  1. Apache Hudi
  2. HUDI-2299

The log format DELETE block loses the orderingVal info

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Duplicate
    • None
    • 0.11.0
    • Common Core
    • None
    • 4

    Description

      The append handle now always writes the data block first and then the delete block, and the delete block keeps only the hoodie keys. When reading, the scanner reads the DELETE block without any ordering value information. Thus, if we write two records:

      insert:

      {id: 0, ts: 2}

      delete:

      {id: 0, ts: 1}

      In the end, the insert message is deleted, even though its ts (2) is larger than the delete's (1). This is a critical bug for streaming writes, and we should fix it as soon as possible.
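
To make the failure concrete, here is a minimal Python sketch of the merge described above. It is a toy model, not Hudi's scanner code: the function names are hypothetical, `ts` stands in for the ordering value, and the tie-breaking rule (a delete wins when its `ts` is greater than or equal to the record's) is an assumption made only for illustration.

```python
# Toy model of the merge described above; this is NOT Hudi's scanner code.
# All names here (merge_keys_only, merge_with_ordering) are hypothetical.

def merge_keys_only(data_records, delete_keys):
    """Apply the data block first, then a DELETE block that holds keys only."""
    merged = {rec["id"]: rec for rec in data_records}
    for key in delete_keys:
        # No ordering value is available, so the delete always wins.
        merged.pop(key, None)
    return merged


def merge_with_ordering(data_records, delete_records):
    """Same flow, but each delete carries its ordering value (here: ts)."""
    merged = {rec["id"]: rec for rec in data_records}
    for d in delete_records:
        current = merged.get(d["id"])
        # Assumed rule: drop the record only if the delete is at least as new.
        if current is not None and d["ts"] >= current["ts"]:
            del merged[d["id"]]
    return merged


inserts = [{"id": 0, "ts": 2}]
deletes = [{"id": 0, "ts": 1}]

lost = merge_keys_only(inserts, [d["id"] for d in deletes])
kept = merge_with_ordering(inserts, deletes)
print(lost)  # {}
print(kept)  # {0: {'id': 0, 'ts': 2}}
```

With keys only, the stale delete removes the newer insert; once the delete carries its ordering value, the insert survives.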

      Here is the discussion on slack:

      Danny Chan 12:42 PM
      https://issues.apache.org/jira/browse/HUDI-2299
      12:43
      Hi, @vc, our user found a critical bug in the MOR log format: if there are out-of-order DELETEs in the streaming messages, the event time of the DELETEs is totally ignored.
      12:44
      I guess this should be a blocker for 0.9 because it affects the correctness of the data set.

      vc 12:44 PM
      if we can fix it by end of day friday PST
      12:44
      we can add it
      12:44
      Just want to cut a release this week.
      12:45
      Do you have a sense for the fix? bandwidth to take it up?

      Danny Chan 12:46 PM
      I tried to fix it but cannot figure out a good way: if the DELETE block records the orderingVal, the format breaks compatibility.

      vc 1:05 PM
      We can version the format. That's doable. Should we precombine before even logging the deletes?

      Danny Chan 1:11 PM
      Yes, we should

      vc 1:26 PM
      I think that's how it's working today. Deletes don't have an ordering val per se, right?
      1:28
      Delete block at t1 :
      delete key k
      Data block at t2 :
      ins key k with ordering val 2
      We can just fix it so that the insert shows up, since t2 > t1.
      For the kind of functionality you need, we need to do soft deletes, i.e. updates with an ordering value instead of hard deletes
      1:28
      makes sense?

      Danny Chan 1:32 PM
      we can but that’s not the perfect solution, especially if the dataset comes from a CDC source, for example the MySQL binlog. There is no extra flag in schema for soft delete though.
      1:37
      In my opinion, it is not about soft DELETE or hard DELETE; even if we do a soft DELETE, the event time (orderingVal) is still important for consumers for versioning.

      vc 1:57 PM
      tbh, I don't see us fixing this in two days
      1:58
      let's do a 0.9.1 after this?
      1:58
      shortly after with a bunch of bug fixes and the large pending PRs
      1:58
      we can even make it 0.10.0

      Danny Chan 1:58 PM
      Yes, the cut time is very soon. We can move the fix to next version.

      vc 1:59 PM
      We have some inconsistent semantics in places
      1:59
      some are commit time (arrival time) based and some are orderingVal (event time) based
      2:00
      In the meantime, see HoodieDeleteBlockVersion: you can just define a new version for the delete block alone, for example,
      2:00
      and add more information
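
The versioning idea from the discussion can be sketched as follows. This is an illustration only: the byte layout (a 4-byte version followed by a JSON payload) and the function names are hypothetical and do not mirror `HoodieDeleteBlockVersion` or Hudi's actual on-disk format. The point is that a reader can dispatch on the version, so old keys-only blocks stay readable while new blocks carry the ordering value.

```python
import json
import struct

# Hypothetical layout, not Hudi's actual on-disk format:
#   4-byte big-endian version, then a UTF-8 JSON payload.
#   version 1 payload: list of keys
#   version 2 payload: list of [key, ordering_val] pairs


def write_delete_block(entries, version):
    """Serialize (key, ordering_val) entries in the requested block version."""
    if version == 1:
        payload = [key for key, _ in entries]  # ordering value is dropped
    elif version == 2:
        payload = [[key, ordering_val] for key, ordering_val in entries]
    else:
        raise ValueError(f"unsupported delete block version: {version}")
    return struct.pack(">i", version) + json.dumps(payload).encode("utf-8")


def read_delete_block(block):
    """Return (key, ordering_val) pairs; ordering_val is None for version 1."""
    (version,) = struct.unpack(">i", block[:4])
    payload = json.loads(block[4:].decode("utf-8"))
    if version == 1:
        # Old blocks have no ordering value; None marks it as unknown.
        return [(key, None) for key in payload]
    if version == 2:
        return [(key, ordering_val) for key, ordering_val in payload]
    raise ValueError(f"unsupported delete block version: {version}")


old_block = write_delete_block([("k", 1)], version=1)
new_block = write_delete_block([("k", 1)], version=2)
print(read_delete_block(old_block))  # [('k', None)]
print(read_delete_block(new_block))  # [('k', 1)]
```

How a reader should treat a delete whose ordering value is unknown (for example, falling back to commit-time ordering) is a separate design decision that this sketch leaves open.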

            People

              alexey.kudinkin Alexey Kudinkin
              danny0405 Danny Chen