Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Impala 1.4.1, Impala 2.2, Impala 2.3.0
Description
Whenever we serialize a row batch, even one with 0 materialized slots, we allocate a tuple_offsets array with one entry per tuple in each row. This adds a serialization overhead of 4 bytes per tuple, per row.
Currently we do not consider this overhead when we calculate TupleDescriptor::avgSerializedSize_, so it is also missing from avgRowSize_, which is used, for example, when we decide which input to broadcast or distribute.
We should take this overhead into account. Such a change may affect the plans of queries with a small avgRowSize_ or with multiple tuples per row (joins).
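
To illustrate how much the overhead can matter for small rows, here is a minimal sketch of the arithmetic. It is not Impala's planner code: the class RowSizeEstimate, its methods, and the sample tuple sizes are hypothetical names and values chosen for illustration. The only figure taken from this issue is the 4-byte tuple offset per tuple, per row.

```java
// Hypothetical illustration; not the actual Impala TupleDescriptor logic.
final class RowSizeEstimate {
    // From the description above: one 4-byte tuple offset per tuple, per row.
    static final int TUPLE_OFFSET_BYTES = 4;

    // Row size estimate that ignores the offsets: the sum of the average
    // serialized sizes of the tuples that make up a row.
    static double avgRowSizeWithoutOverhead(double[] avgTupleSizes) {
        double sum = 0.0;
        for (double size : avgTupleSizes) {
            sum += size;
        }
        return sum;
    }

    // Row size estimate that adds one tuple offset for every tuple in the row.
    static double avgRowSizeWithOverhead(double[] avgTupleSizes) {
        return avgRowSizeWithoutOverhead(avgTupleSizes)
                + (double) TUPLE_OFFSET_BYTES * avgTupleSizes.length;
    }

    public static void main(String[] args) {
        // Assumed example: a join output row made of two tuples, one averaging
        // 2 bytes and one with no materialized slots (0 bytes).
        double[] joinRow = {2.0, 0.0};
        System.out.println("without overhead: " + avgRowSizeWithoutOverhead(joinRow));
        System.out.println("with overhead:    " + avgRowSizeWithOverhead(joinRow));
    }
}
```

In this assumed example the estimate grows from 2 bytes to 10 bytes per row, which is the kind of shift that could change a broadcast versus distribute decision for queries with narrow rows or several tuples per row.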