Apache Hudi / HUDI-5707

Support offset reset strategy w/ spark streaming read from hudi table


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Component/s: reader-core

    Description

      For users reading a Hudi table in a streaming manner, we need to support an offset reset strategy for the case where the commit of interest is archived or cleaned up.

       

      Notes from the issue:

      In streaming read, the user might want to get all incremental changes. From what I see, this is nothing but an incremental query on a Hudi table. With incremental queries, we do have a fallback mechanism via hoodie.datasource.read.incr.fallback.fulltablescan.enable.

      But in streaming read, the amount of data read might spike up (if we do the same fallback), and the user may not have provisioned higher resources for the job.
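For reference, the existing batch incremental-query fallback looks roughly like this (a PySpark-style sketch; the begin instant time and table path are placeholders, not values from this issue):

```python
# Option keys for a Hudi incremental read via the Spark datasource; the
# fallback flag is the one mentioned above. The begin instant time is a
# placeholder.
incremental_read_opts = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20230101000000",  # placeholder
    "hoodie.datasource.read.incr.fallback.fulltablescan.enable": "true",
}

# With a live SparkSession this would be used as (sketch, not run here):
# df = spark.read.format("hudi").options(**incremental_read_opts).load("/path/to/table")
```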

      I am thinking whether we should add something like the auto.offset.reset we have in Kafka. If Spark streaming itself already has something similar, we can leverage the same; otherwise we can add a new config in Hoodie.

      So, users can configure what they want to do in such cases:

      1. Resume reading from the earliest valid commit in Hudi.
        // impl might be involved, since we need to detect the earliest commit which hasn't been cleaned up by the cleaner yet.
      2. Or do snapshot query w/ latest table state.
      3. Fail the streaming read.
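The three options above could be sketched as a small resolution policy, by analogy with Kafka's auto.offset.reset. The function name, strategy strings, and instant-time arguments below are hypothetical illustrations, not an existing Hudi API:

```python
# Hypothetical sketch of an auto.offset.reset-style policy for Hudi streaming
# reads. Strategy names mirror Kafka's earliest/latest/none semantics.
EARLIEST, LATEST, FAIL = "earliest", "latest", "fail"

def resolve_start_instant(requested, earliest_retained, latest, strategy):
    """Pick the instant to resume from when `requested` may have been
    archived or cleaned (i.e., it is older than `earliest_retained`).
    Instant times compare lexicographically, as Hudi timestamps do."""
    if requested >= earliest_retained:
        return requested          # still on the active timeline; read as asked
    if strategy == EARLIEST:
        return earliest_retained  # option 1: earliest valid commit
    if strategy == LATEST:
        return latest             # option 2: snapshot from latest table state
    # option 3: fail the streaming read
    raise ValueError(f"requested instant {requested} is no longer available")
```

For example, with a requested instant of "20230101", an earliest retained instant of "20230201", and the EARLIEST strategy, the read would resume from "20230201".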

            People

              Assignee: kazdy
              Reporter: sivabalan narayanan (shivnarayan)
              Votes: 0
              Watchers: 2
