  1. Parquet
  2. PARQUET-33

Benchmark the assembly of thrift objects, and possibly create a more efficient ReplayingTProtocol

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-mr
    • Labels: None

    Description

      The current implementation of parquet thrift creates an instance of TProtocol for each value of each record and builds a stack of these events, which are then replayed back to the TBase.
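      To make the cost concrete, here is a minimal sketch of the object-per-value pattern described above. It is not the parquet-mr code: `ValueSink`, `RecordedEvent`, and `EventReplayQueue` are hypothetical names, and the sink stands in for the role the TBase plays when it reads. The point is only that every recorded value becomes its own heap object before it is replayed.

```java
import java.util.ArrayDeque;

// Hypothetical stand-in for the consumer of field values (the role the TBase plays).
interface ValueSink {
    void onInt(int v);
    void onDouble(double v);
}

// One object per recorded value: this per-value allocation is the cost in question.
interface RecordedEvent {
    void replayTo(ValueSink sink);
}

// Hypothetical recorder: collects events during record assembly, then replays them in order.
final class EventReplayQueue {
    private final ArrayDeque<RecordedEvent> events = new ArrayDeque<>();

    void recordInt(final int v) {
        // The capturing lambda allocates a new object holding v.
        events.addLast(sink -> sink.onInt(v));
    }

    void recordDouble(final double v) {
        events.addLast(sink -> sink.onDouble(v));
    }

    // Replays every recorded event to the sink and returns how many were replayed.
    int replayAll(ValueSink sink) {
        int replayed = 0;
        while (!events.isEmpty()) {
            events.pollFirst().replayTo(sink);
            replayed++;
        }
        return replayed;
    }
}
```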

      I'd be curious to benchmark this and, if it's slow, try building a "ReplayingTProtocol" that, instead of holding a stack of TProtocol instances, contains a primitive array for each type. As events are fed into this replaying TProtocol, it would just append the primitives to its buffers, and the TBase would then drain them. This would effectively let us stream the values into the TBase without allocating an object for each value.
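      A rough sketch of what the primitive-buffer idea could look like. `PrimitiveReplayBuffer` is a hypothetical class that does not extend the real TProtocol; it only borrows the `writeI32` / `readI32` style of method naming. The extra `kinds` array is an assumption of this sketch: it records the order in which types were written so that a mismatched read fails loudly. Only three types are shown, and the buffers here grow on demand rather than being fixed-size.

```java
import java.util.Arrays;

// Hypothetical sketch: values are stored in per-type primitive arrays, so writes
// allocate nothing per value (arrays are only reallocated when they need to grow).
final class PrimitiveReplayBuffer {
    private static final byte INT = 0, LONG = 1, DOUBLE = 2;

    private byte[] kinds = new byte[16];      // type tag of each written value, in write order
    private int[] ints = new int[16];
    private long[] longs = new long[16];
    private double[] doubles = new double[16];

    private int kindCount, intCount, longCount, doubleCount; // write cursors
    private int kindPos, intPos, longPos, doublePos;         // read cursors

    void writeI32(int v) {
        if (intCount == ints.length) ints = Arrays.copyOf(ints, intCount * 2);
        ints[intCount++] = v;
        pushKind(INT);
    }

    void writeI64(long v) {
        if (longCount == longs.length) longs = Arrays.copyOf(longs, longCount * 2);
        longs[longCount++] = v;
        pushKind(LONG);
    }

    void writeDouble(double v) {
        if (doubleCount == doubles.length) doubles = Arrays.copyOf(doubles, doubleCount * 2);
        doubles[doubleCount++] = v;
        pushKind(DOUBLE);
    }

    int readI32() { expect(INT); return ints[intPos++]; }

    long readI64() { expect(LONG); return longs[longPos++]; }

    double readDouble() { expect(DOUBLE); return doubles[doublePos++]; }

    boolean hasRemaining() { return kindPos < kindCount; }

    // Rewinds all cursors so the same arrays can be reused for the next record.
    void reset() {
        kindCount = intCount = longCount = doubleCount = 0;
        kindPos = intPos = longPos = doublePos = 0;
    }

    private void pushKind(byte kind) {
        if (kindCount == kinds.length) kinds = Arrays.copyOf(kinds, kindCount * 2);
        kinds[kindCount++] = kind;
    }

    // Verifies that the next buffered value has the requested type, then consumes its tag.
    private void expect(byte kind) {
        if (kindPos >= kindCount || kinds[kindPos] != kind) {
            throw new IllegalStateException("unexpected read at position " + kindPos);
        }
        kindPos++;
    }
}
```

      Calling `reset()` between records is what would let the arrays be reused across records instead of being reallocated.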

      The buffers could be set to a certain size, and if they fill up (which they shouldn't in most cases), the TBase could begin draining the protocol until it is empty again, at which point the TProtocol can block the TBase from draining further while the parquet record assembly feeds it more events.
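      The fill-then-drain handoff can be sketched with a fixed-capacity ring buffer. This is a single-threaded illustration under stated assumptions: `BoundedLongRing` and `FillDrainDriver` are hypothetical names, only `long` values are modeled, and the driver loop simply alternates between a producer phase and a consumer phase. A real version would have to suspend the TBase mid-read somehow (for example with a second thread), which this sketch does not attempt.

```java
// Hypothetical fixed-capacity buffer: offer() refuses new values once it is full.
final class BoundedLongRing {
    private final long[] slots;
    private int head; // index of the oldest buffered value
    private int size; // number of buffered values

    BoundedLongRing(int capacity) {
        slots = new long[capacity];
    }

    // Returns false (and stores nothing) when the buffer is full.
    boolean offer(long v) {
        if (size == slots.length) return false;
        slots[(head + size) % slots.length] = v;
        size++;
        return true;
    }

    long poll() {
        if (size == 0) throw new IllegalStateException("buffer is empty");
        long v = slots[head];
        head = (head + 1) % slots.length;
        size--;
        return v;
    }

    boolean isEmpty() {
        return size == 0;
    }
}

final class FillDrainDriver {
    // Pushes every source value through a ring of the given capacity.
    // Returns { sum of values seen by the consumer, number of fill/drain rounds }.
    static long[] pump(long[] source, int capacity) {
        BoundedLongRing ring = new BoundedLongRing(capacity);
        long sum = 0;
        long rounds = 0;
        int next = 0;
        while (next < source.length || !ring.isEmpty()) {
            // Producer phase: record assembly feeds values until the buffer is full.
            while (next < source.length && ring.offer(source[next])) {
                next++;
            }
            // Consumer phase: the reader drains until the buffer is empty again.
            while (!ring.isEmpty()) {
                sum += ring.poll();
            }
            rounds++;
        }
        return new long[] { sum, rounds };
    }
}
```

      With ten values and a capacity of four, the driver needs three rounds (4 + 4 + 2 values), which shows how the buffer size controls how often control passes between the two sides.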

      This is all moot if it turns out not to be a bottleneck, though.

          People

            Assignee: Unassigned
            Reporter: Alex Levenson (alexlevenson)
            Votes: 0
            Watchers: 3