Apache Arrow / ARROW-1676

[C++] Correctly truncate oversized validity bitmaps when writing Feather format

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.1
    • Fix Version/s: 0.8.0
    • Component/s: C++

    Description

      When an array with more than 128 values and at least one NULL value is serialized to Feather and then deserialized, an extra 0 appears near the beginning of the data. The inserted 0 shifts the values that follow it by one position, so the last value is trimmed off the end.
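
      For context on the "oversized validity bitmap" in the title: an Arrow validity bitmap stores one bit per value, least-significant bit first, so 129 values need only 17 bytes. The sketch below is an illustration of that arithmetic, not the actual fix in Arrow. The helper names and the 64-byte buffer capacity are hypothetical, chosen only to show a buffer larger than required being cut down to the required size.

```python
def bytes_for_bits(num_values):
    """Minimum number of bytes that can hold one validity bit per value."""
    return (num_values + 7) // 8


def make_validity_bitmap(valid, capacity_bytes):
    """Build a zero-filled buffer of capacity_bytes and set bit i when valid[i] is true.

    Bits are numbered least-significant first within each byte.
    capacity_bytes is a hypothetical allocation size and may exceed what is needed.
    """
    if capacity_bytes < bytes_for_bits(len(valid)):
        raise ValueError("capacity too small for the number of values")
    bitmap = bytearray(capacity_bytes)
    for i, is_valid in enumerate(valid):
        if is_valid:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def truncate_bitmap(bitmap, num_values):
    """Keep only the bytes that are needed to describe num_values values."""
    needed = bytes_for_bits(num_values)
    if len(bitmap) < needed:
        raise ValueError("bitmap is shorter than required")
    return bitmap[:needed]


if __name__ == "__main__":
    # Same shape as the array in this report: a null at index 0, then 128 valid values.
    valid = [False] + [True] * 128
    oversized = make_validity_bitmap(valid, 64)  # 64 is an arbitrary, assumed capacity
    trimmed = truncate_bitmap(oversized, len(valid))
    print(len(oversized), len(trimmed))       # 64 17
    print(bin(trimmed[0]), bin(trimmed[16]))  # 0b11111110 0b1
```

      With 128 values the required size is exactly 16 bytes. The 129th value is the first one that needs a 17th byte.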

      Here is the C++ code to write such an array:

      #include <iostream>
      #include <arrow/api.h>
      #include <arrow/io/file.h>
      #include <arrow/ipc/feather.h>
      #include <arrow/pretty_print.h>
      
      int main() {
        // 1. Build Array
        arrow::DoubleBuilder builder;
        for (int i = 0; i < 129; i++)
            if (i == 0)
                builder.AppendNull();
            else
                builder.Append(i);
      
        std::shared_ptr<arrow::Array> array;
        builder.Finish(&array);
      
        arrow::PrettyPrint(*array, 0, &std::cout);
        std::cout << std::endl;
      
        // 2. Write to Feather file
        std::shared_ptr<arrow::io::FileOutputStream> stream;
        arrow::io::FileOutputStream::Open("out.f", false, &stream);
      
        std::unique_ptr<arrow::ipc::feather::TableWriter> writer;
        arrow::ipc::feather::TableWriter::Open(stream, &writer);
      
        writer->SetNumRows(129);
        writer->Append("id", *array);
      
        writer->Finalize();
        stream->Close();
      
        return 0;
      }
      

      The output of running this code is:

      # g++-4.9 -std=c++11 example.cpp -larrow && ./a.out
      [null, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]
      

      The array is deserialized in Python and looks like this:

      >>> pandas.read_feather('out.f')
              id
      0      NaN
      1      0.0
      2      1.0
      3      2.0
      4      3.0
      5      4.0
      6      5.0
      7      6.0
      8      7.0
      9      8.0
      10     9.0
      11    10.0
      12    11.0
      13    12.0
      14    13.0
      15    14.0
      16    15.0
      17    16.0
      18    17.0
      19    18.0
      20    19.0
      21    20.0
      22    21.0
      23    22.0
      24    23.0
      25    24.0
      26    25.0
      27    26.0
      28    27.0
      29    28.0
      ..     ...
      99    98.0
      100   99.0
      101  100.0
      102  101.0
      103  102.0
      104  103.0
      105  104.0
      106  105.0
      107  106.0
      108  107.0
      109  108.0
      110  109.0
      111  110.0
      112  111.0
      113  112.0
      114  113.0
      115  114.0
      116  115.0
      117  116.0
      118  117.0
      119  118.0
      120  119.0
      121  120.0
      122  121.0
      123  122.0
      124  123.0
      125  124.0
      126  125.0
      127  126.0
      128  127.0
      
      [129 rows x 1 columns]
      

      Note the 0.0 at index 1, which should have been 1.0. Every later value is shifted by one position in the same way, so the last value is 127.0 instead of 128.0.
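
      To check the round trip programmatically rather than by eye, something like the following could be used. It is a minimal sketch: `expected_column`, `same_value`, and `first_mismatch` are hypothetical helpers, and the `shifted` list only simulates the symptom described above instead of reading the actual file.

```python
import math


def expected_column(n):
    """Values the C++ reproduction writes: a null at index 0, then 1.0 .. n-1."""
    return [math.nan] + [float(i) for i in range(1, n)]


def same_value(a, b):
    """Equality that treats NaN (how pandas shows a null double) as equal to NaN."""
    return (math.isnan(a) and math.isnan(b)) or a == b


def first_mismatch(expected, actual):
    """Return (index, expected, actual) for the first differing position, or None."""
    if len(expected) != len(actual):
        raise ValueError("length mismatch")
    for i, (e, a) in enumerate(zip(expected, actual)):
        if not same_value(e, a):
            return (i, e, a)
    return None


if __name__ == "__main__":
    expected = expected_column(129)
    # Simulated symptom: a 0.0 inserted after the null, last value dropped.
    shifted = [expected[0], 0.0] + expected[1:-1]
    print(first_mismatch(expected, shifted))  # (1, 1.0, 0.0)
```

      Against the real file, the second argument would be `pandas.read_feather('out.f')['id'].tolist()`. A result of `None` means the column survived the round trip intact.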

            People

              Assignee: Wes McKinney (wesm)
              Reporter: Rares Vernica (rvernica)