Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-9366

[C#] BinaryArray.Builder Reserve/Resize methods are broken

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 0.17.1
    • None
    • C#
    • None

    Description

      Summary

      The Reserve() and Resize() methods on BinaryArray.Builder are broken, insofar as the current implementation does not perform the intended function and may corrupt the contents of any array built so far:

                  public TBuilder Reserve(int capacity)
                  {
                      ValueOffsets.Reserve(capacity + 1);
                      ValueBuffer.Reserve(capacity); // Meaningless
                      ValidityBuffer.Reserve(capacity + 1); // The +1 is not needed
                      return Instance;
                  }
      
                  public TBuilder Resize(int length)
                  {
                      ValueOffsets.Resize(length + 1);
                      ValueBuffer.Resize(length); // Dangerous!
                      ValidityBuffer.Resize(length + 1); // The +1 is not needed
                      return Instance;
                  }
      

      In the case of Reserve(), the capacity parameter is expected to refer to the desired number of items for which the array builder must have capacity. However, it is almost certainly meaningless to call this for ValueBuffer, as this holds variable-length items and each item is very likely to be longer than one byte.

      In the case of Resize(), the current implementation is dangerous, because:

      1. Resizing down will very likely truncate a variable-length value item somewhere in the middle.
      2. Resizing up will very likely truncate a variable-length value item somewhere too, as most value items are expected to be multiple bytes long, and hence an index equal to the parameter length is likely to be "back" in some earlier value instead.

      Suggested Fix

      There are two broad approaches:

      1. Make existing functions safe:
        • Reserve() may assume an average length of each value in order to make a sensible memory reservation in the ValueBuffer.
        • Resizing down must not truncate values in the middle but find the appropriate offset for the new smaller number of items and truncate there instead.
        • Resizing up must add valid value offsets, either with default empty or null values added to pad the array to the desired size.
      2. Remove the functions from the interface:
        • One could decide that Reserve() and Resize() only make sense for fixed-size values and change interfaces accordingly.

      I have a mild personal preference for the former in the short term, mainly because it allows the existing interface to behave in a sensible way without entailing a larger-scale refactoring.

      Attachments

        Activity

          People

            Unassigned Unassigned
            mr_smidge Adam Szmigin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: