  1. Apache Arrow
  2. ARROW-6377

[C++] Extending STL API to support row-wise conversion



    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++
    • Labels:


      Using array builders is currently the way the documentation recommends for converting row-wise data to Arrow tables. However, array builders expose a low-level interface designed to support many use cases in the library. They require additional boilerplate because of type erasure, although some of this boilerplate could be avoided at compile time when the schema is already known and fixed (also discussed in ARROW-4067).

      Elsewhere in the library, the STL API provides a nice abstraction over builders by inferring data types and builders from the values provided, which reduces the boilerplate significantly. It currently converts tuples automatically for a limited set of native types: numeric types, strings and vectors (plus nullable variations of these if ARROW-6326 is merged). It also allows passing references in tuple values (implemented recently in ARROW-6284).

      As a more concrete example, this code can be used to convert the row data provided in the examples:

      arrow::Status VectorToColumnarTableSTL(const std::vector<struct data_row>& rows,
                                             std::shared_ptr<arrow::Table>* table) {
          // Lazily view each row as a tuple; the vector is passed by reference to avoid a copy.
          auto rng = rows | ranges::views::transform([](const data_row& row) {
                         return std::tuple<int, double, const std::vector<double>&>(
                             row.id, row.cost, row.cost_components);
                     });
          // Column types and builders are inferred from the tuple element types.
          return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
                                                 {"id", "cost", "cost_components"},
                                                 table);
      }

      So, it allows more concise code for consumers of the API compared to using builders directly.

      The library has no direct support for other types (binary, struct, union, etc., or for converting iterable objects other than vectors to lists). Users are given a way to add specializations for their own data structures. One limitation of implicit inference is that in some cases it is hard, or even impossible, to infer the exact type to use. For example, should a std::string_view value be inferred as string, binary, large binary or list? This ambiguity can be avoided by giving the user a way to state explicitly which type a column should be stored as. For example, a user could return values wrapped in a so-called BinaryCell class to have them stored as binary.
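
      The ambiguity described above can be illustrated with a small, library-independent sketch. It does not use Arrow at all: ColumnKind, ColumnKindOf, KindOf and the two cell structs are hypothetical names made up for illustration, and the real proposal would map cells to Arrow data types and builders instead. It only shows that the wrapper type, rather than the payload type, can carry the decision.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical tags standing in for Arrow data types; not part of Arrow's API.
enum class ColumnKind { String, Binary };

// Hypothetical non-owning wrappers: the wrapper type, not the payload type,
// selects how the column is stored.
struct StringCell { std::string_view value; };
struct BinaryCell { std::string_view value; };

// Trait mapping a value type to the column kind it should be stored as.
// Left undefined for unknown types so that unsupported values fail to compile.
template <typename T>
struct ColumnKindOf;

template <> struct ColumnKindOf<StringCell> {
  static constexpr ColumnKind value = ColumnKind::String;
};
template <> struct ColumnKindOf<BinaryCell> {
  static constexpr ColumnKind value = ColumnKind::Binary;
};
// std::string keeps an implicit mapping, as in the existing inference.
template <> struct ColumnKindOf<std::string> {
  static constexpr ColumnKind value = ColumnKind::String;
};

template <typename T>
constexpr ColumnKind KindOf(const T&) { return ColumnKindOf<T>::value; }

// The same bytes end up as two different column kinds, chosen by the caller.
inline void ExplicitKindExample() {
  std::string_view payload("\x00\x01\x02", 3);
  assert(KindOf(BinaryCell{payload}) == ColumnKind::Binary);
  assert(KindOf(StringCell{payload}) == ColumnKind::String);
}
```

      A bare std::string_view has no ColumnKindOf specialization in this sketch, so passing one would be a compile-time error rather than a silent guess.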

      Proposed changes:

      • Implementing cell "adapters": Cells are non-owning references for each type. It is the user's responsibility to keep the referenced values alive. (Can scalars be used in this context?)
        • BinaryCell
        • StringCell
        • ListCell (for adapting any Range)
        • StructCell
        • ...
      • Primitive types don't need such adapters since their values are trivial to cast (e.g. just use int8_t(value) to get Int8Type).
      • Adding benchmarks to compare against builder performance. There is likely to be some performance penalty from hindering compiler optimizations. Yet, this is acceptable in exchange for more concise code IMHO. For fine-grained control over performance, it will still be possible to use builders directly.
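
      To make the idea of a non-owning cell more concrete, here is a rough, library-independent sketch of what a ListCell might look like. Everything in it is an assumption for illustration: ListCell as written here, MakeListCell, and ToyListColumn are hypothetical, and ToyListColumn merely imitates the offsets-plus-flat-values layout that Arrow uses for variable-size lists instead of calling a real builder.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <vector>

// Hypothetical non-owning adapter: refers to any iterable range without copying.
// The caller must keep the referenced range alive while the cell is in use,
// so it must not be created from a temporary.
template <typename Range>
struct ListCell {
  const Range& range;
};

// Helper so that the Range type is deduced at the call site.
template <typename Range>
ListCell<Range> MakeListCell(const Range& range) {
  return ListCell<Range>{range};
}

// Toy stand-in for a list<double> column: flat values plus offsets.
struct ToyListColumn {
  std::vector<double> values;
  std::vector<std::int32_t> offsets{0};

  template <typename Range>
  void Append(const ListCell<Range>& cell) {
    // Copy the elements into the flat child array, then close the slot.
    for (const auto& v : cell.range) values.push_back(static_cast<double>(v));
    offsets.push_back(static_cast<std::int32_t>(values.size()));
  }
};

inline ToyListColumn ListCellExample() {
  std::vector<double> a{1.0, 2.0};
  std::list<float> b{3.0f};  // not a vector, but still adaptable
  ToyListColumn column;
  column.Append(MakeListCell(a));
  column.Append(MakeListCell(b));
  assert(column.offsets.size() == 3);  // two list slots plus the leading zero
  return column;
}
```

      The cell only borrows the range, so the copy into the column happens once, at append time. That matches the lifetime caveat in the first bullet: the adapter is cheap, but the referenced data has to outlive it.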

      I have implemented something similar to BinaryCell for my use case. If the above changes sound reasonable, I will go ahead and implement the other cells and submit them.







            • Assignee:
              ozars Omer Ozarslan