  1. Kudu
  2. KUDU-2243

CFile Reader improvements

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: cfile
    • Labels: None

    Description

      I've done a pretty thorough review of all the CFile reader code over the last few days in order to make a targeted bug fix, and I've got some ideas for how we can simplify it. I'd like to get others' thoughts.

      • To reduce confusion between CFile data blocks and FS manager blocks, I think we should change all references in code and docs of CFile data blocks to 'cblock'.
      • Much of the complexity of CFileIterator comes from its public API, which requires separate Seek(idx) -> Prepare(nrows) -> Scan(output buf, predicates) calls. Additionally, the Prepare step can materialize many blocks, which then need to be put in a queue. I think all of this could be simplified by changing the API to Seek(idx) -> Scan(nrows, output buf, predicates) and having the CFile iterator cache only the most-recently-materialized block (instead of the queue). For really big scan batches, this will change the internal scan/materialize pattern from materializing all cblocks up front and then copying, to interleaving the materializing and copying of cblocks. Since cblocks are usually much bigger (256 KiB) than scan batches (100 cells), I think it won't lead to measurably different behavior.
      • QueueCurrentDataBlock and ReadCurrentDataBlock should drop 'Current' from their names (i.e. become QueueDataBlock and ReadDataBlock).
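
      To make the proposed Seek(idx) -> Scan(nrows, output buf, predicates) shape concrete, here is a rough, self-contained sketch. It is not Kudu code: ToyCFile, ToyCFileIterator, CBlock, MaterializeCBlockFor and materialize_count are hypothetical names invented for illustration, the "file" is just an in-memory vector of int32 cells, and predicates are left out entirely. It only shows the iterator keeping a single most-recently-materialized cblock and interleaving materialization with copying.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decoded cblock: a run of consecutive cells.
struct CBlock {
  size_t first_row = 0;        // ordinal index of the first cell in the block
  std::vector<int32_t> cells;  // decoded values
};

// Hypothetical in-memory "file": one column split into fixed-size cblocks.
class ToyCFile {
 public:
  ToyCFile(std::vector<int32_t> column, size_t rows_per_cblock)
      : column_(std::move(column)), rows_per_cblock_(rows_per_cblock) {
    assert(rows_per_cblock_ > 0);
  }

  size_t num_rows() const { return column_.size(); }

  // "Materializes" the cblock containing `row` (here, simply a copy).
  CBlock MaterializeCBlockFor(size_t row) const {
    assert(row < column_.size());
    CBlock b;
    b.first_row = (row / rows_per_cblock_) * rows_per_cblock_;
    const size_t end = std::min(b.first_row + rows_per_cblock_, column_.size());
    b.cells.assign(column_.begin() + b.first_row, column_.begin() + end);
    return b;
  }

 private:
  std::vector<int32_t> column_;
  size_t rows_per_cblock_;
};

// Sketch of the simplified iterator: Seek(idx) then Scan(nrows, out).
// There is no Prepare step and no queue of prepared blocks.
class ToyCFileIterator {
 public:
  explicit ToyCFileIterator(const ToyCFile* file) : file_(file) {}

  void Seek(size_t idx) {
    assert(idx <= file_->num_rows());
    next_row_ = idx;
  }

  // Appends up to `nrows` cells to `out`; returns the number appended.
  size_t Scan(size_t nrows, std::vector<int32_t>* out) {
    size_t copied = 0;
    while (copied < nrows && next_row_ < file_->num_rows()) {
      // Materialize a new cblock only when the cached one does not cover
      // the next row; the previously cached block is simply replaced.
      if (!has_cached_ || !Covers(cached_, next_row_)) {
        cached_ = file_->MaterializeCBlockFor(next_row_);
        has_cached_ = true;
        ++materialize_count_;
      }
      // Copy as much as possible from the cached cblock before moving on.
      const size_t offset = next_row_ - cached_.first_row;
      const size_t n = std::min(nrows - copied, cached_.cells.size() - offset);
      out->insert(out->end(), cached_.cells.begin() + offset,
                  cached_.cells.begin() + offset + n);
      copied += n;
      next_row_ += n;
    }
    return copied;
  }

  // Exposed only so the interleaving can be observed in the sketch.
  int materialize_count() const { return materialize_count_; }

 private:
  static bool Covers(const CBlock& b, size_t row) {
    return row >= b.first_row && row < b.first_row + b.cells.size();
  }

  const ToyCFile* file_;
  size_t next_row_ = 0;
  bool has_cached_ = false;
  CBlock cached_;
  int materialize_count_ = 0;
};
```

      With this shape, a scan batch that fits inside one cblock touches only the cached block, and a batch that spans several cblocks materializes them one at a time as it copies, instead of materializing them all up front.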

          People

            Assignee: Unassigned
            Reporter: Dan Burkert (danburkert)