Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.17.1
-
None
-
None
Description
Summary
The Reserve() and Resize() methods on BinaryArray.Builder are broken, insofar as the current implementation does not perform the intended function and may corrupt the contents of any array built so far:
public TBuilder Reserve(int capacity) { ValueOffsets.Reserve(capacity + 1); ValueBuffer.Reserve(capacity); // Meaningless ValidityBuffer.Reserve(capacity + 1); // The +1 is not needed return Instance; } public TBuilder Resize(int length) { ValueOffsets.Resize(length + 1); ValueBuffer.Resize(length); // Dangerous! ValidityBuffer.Resize(length + 1); // The +1 is not needed return Instance; }
In the case of Reserve(), the capacity parameter is expected to refer to the desired number of items for which the array builder must have capacity. However, it is almost certainly meaningless to call this for ValueBuffer, as this holds variable-length items and each item is very likely to be longer than one byte.
In the case of Resize(), the current implementation is dangerous, because:
- Resizing down will very likely truncate a variable-length value item somewhere in the middle.
- Resizing up will very likely truncate a variable-length value item somewhere too, as most value items are expected to be multiple bytes long, and hence an index equal to the parameter length is likely to be "back" in some earlier value instead.
Suggested Fix
There are two broad approaches:
- Make existing functions safe:
- Reserve() may assume an average length of each value in order to make a sensible memory reservation in the ValueBuffer.
- Resizing down must not truncate values in the middle but find the appropriate offset for the new smaller number of items and truncate there instead.
- Resizing up must add valid value offsets, either with default empty or null values added to pad the array to the desired size.
- Remove the functions from the interface:
- One could decide that Reserve() and Resize() only make sense for fixed-size values and change interfaces accordingly.
I have a mild personal preference for the former in the short term, mainly because it allows the existing interface to behave in a sensible way without entailing a larger-scale refactoring.