Apache Drill / DRILL-5282

Rationalize record batch sizes in all readers and operators


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Drill uses record batches to process data. A record batch consists of a "bundle" of vectors that, combined, hold the data for some number of records.
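
      As a mental model (simplified, not Drill's actual classes), a batch can be pictured as a row count plus one value vector per column:

        // Simplified picture of a record batch: per-column vectors that
        // together hold the data for rowCount records.
        public class RecordBatchSketch {
          int rowCount;      // every vector holds this many values
          long[] custKey;    // vector for a fixed-width, 8-byte column
          byte[][] custName; // vector for a variable-width column
        }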

      The key consideration for a record batch is the memory it consumes. Various operators and readers have vastly different ideas of the size of a batch: the text reader can produce batches of a few hundred KB, while the flatten operator produces batches of half a GB. Other operators fall somewhere in between. Some readers produce batches of effectively unlimited size, sized by average row width with no memory cap.

      Another key consideration is record count. Batches have a hard physical limit of 64K records, the largest count addressable by a two-byte selection vector. Some operators produce batches of this size; others produce far fewer records. In one case, we saw a reader that produced 64K+1 records, exceeding the limit.
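
      The sketch below illustrates, in Java, where the 64K figure comes from and how a batch producer might guard against exceeding it; the class and method names are hypothetical, not Drill APIs.

        // Hypothetical guard illustrating the hard row-count limit.
        public class BatchLimits {
          // A two-byte selection vector indexes rows with an unsigned
          // 16-bit value, so a batch can hold at most 2^16 = 65,536 rows.
          public static final int MAX_BATCH_ROWS = 1 << 16;

          public static void checkRowCount(int rowCount) {
            if (rowCount > MAX_BATCH_ROWS) {
              throw new IllegalStateException("Batch row count " + rowCount
                  + " exceeds the 64K limit");
            }
          }
        }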

      A final consideration is the size of individual vectors. Drill incurs severe memory fragmentation when vectors grow above 16 MB.
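
      As a rough sizing sketch (hypothetical names, not Drill APIs), the 16 MB vector ceiling and the 64K row limit together bound how many rows a batch can safely hold:

        // Cap the row count so the widest fixed-width column's vector
        // stays at or below 16 MB, avoiding allocator fragmentation.
        public class VectorSizing {
          public static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024;
          public static final int MAX_BATCH_ROWS = 1 << 16;

          /** Rows that fit before the widest column exceeds 16 MB. */
          public static int maxRowsForWidth(int widestColumnBytes) {
            int rows = MAX_VECTOR_BYTES / widestColumnBytes;
            return Math.min(rows, MAX_BATCH_ROWS);
          }
        }
        // Example: an 8-byte column allows 16 MB / 8 B = 2,097,152 rows,
        // so the 64K row limit binds; a 1 KB-wide column allows 16,384.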

      In some cases, readers and operators (such as the Parquet reader) allocate large batches but only partially fill them, creating a large amount of wasted space. That waste adds up when the batches must be buffered during a sort.
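
      The arithmetic below (illustrative only, not Drill code) shows how quickly that waste accumulates:

        // Wasted space when a vector allocated for the 64K maximum is
        // only partially filled.
        public class WastedSpace {
          public static void main(String[] args) {
            int allocatedRows = 65_536; // vector sized for the max row count
            int filledRows = 4_096;     // rows actually written
            int colWidth = 8;           // e.g. an 8-byte BIGINT column
            long allocated = (long) allocatedRows * colWidth; // 512 KiB
            long used = (long) filledRows * colWidth;         //  32 KiB
            System.out.printf("utilization %.1f%%, wasted %d KiB%n",
                100.0 * used / allocated, (allocated - used) / 1024);
            // Prints "utilization 6.3%, wasted 480 KiB" -- per vector, per
            // batch; buffering many such batches in a sort multiplies it.
          }
        }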

      This ticket asks us to research an optimal batch size, create a framework for building batches of that size, and retrofit all operators and readers that produce batches to use that framework, so that batch sizes are uniform across the engine.
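
      A minimal sketch of what such a framework might look like, assuming a budget object that tracks rows and bytes and signals when a batch is full (all names here are hypothetical, not Drill APIs):

        // Tracks a batch's row count and memory as rows are added and
        // reports when either budget is exhausted.
        public class BatchBudget {
          private final int maxRows;   // e.g. 65,536 or lower
          private final long maxBytes; // e.g. a few MB to tens of MB
          private int rows;
          private long bytes;

          public BatchBudget(int maxRows, long maxBytes) {
            this.maxRows = maxRows;
            this.maxBytes = maxBytes;
          }

          // Accounts for one row of the given encoded width; returns false
          // when the batch is full and the caller should flush it.
          public boolean addRow(long rowBytes) {
            if (rows + 1 > maxRows || bytes + rowBytes > maxBytes) {
              return false;
            }
            rows++;
            bytes += rowBytes;
            return true;
          }
        }

      If every reader and operator consulted such a budget, batch sizes would be uniform across the engine regardless of row width.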

          People

            Assignee: Paul Rogers
            Reporter: Paul Rogers
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated: