  1. Apache Arrow
  2. ARROW-10030

[Rust] Support fromIter and toIter

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: Rust

    Description

      Proposal for comments: https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing

      (dump of the document above)

      Rust Arrow supports two main computational models:

      1. Batch operations, which leverage some form of vectorization
      2. Element-by-element operations, which arise in more complex operations

      This document concerns element-by-element operations, which are common outside of the library (and sometimes inside it).

      Element-by-element operations

      These operations are typically written as follows:

      1. Downcast the array to its specific type
      2. Initialize buffers
      3. Iterate over indices and perform the operation, appending to the buffers accordingly
      4. Create ArrayData with the required null bitmap, buffers, child data, etc.
      5. Return an ArrayRef built from the ArrayData

      We can split this process into three parts:

      1. Initialization (1 and 2)
      2. Iteration (3)
      3. Finalization (4 and 5)

      Currently, the API that we offer to our users is:

      1. as_any() to downcast the array based on its DataType
      2. Builders for all types, which users can initialize to match the downcast array
      3. Iterate
        1. Loop with for i in 0..array.len()
        2. Use Array::value and Array::is_valid/is_null
        3. Use builder.append_value(new_value) or builder.append_null()
      4. Finish the builder and wrap the result in an Arc
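
      As a rough illustration of the workflow above, the following sketch adds 1 to every non-null element using an index loop and a builder. It uses toy, std-only stand-ins: ToyInt32Array, ToyInt32Builder and add_one are hypothetical names invented for this example and are not the arrow crate's types, although the method names mirror the ones listed above.

```rust
// Toy stand-ins for an Arrow array and its builder (hypothetical, std-only).
#[derive(Debug, PartialEq)]
struct ToyInt32Array {
    values: Vec<i32>,
    validity: Vec<bool>, // true = valid, false = null
}

impl ToyInt32Array {
    fn len(&self) -> usize {
        self.values.len()
    }
    fn is_null(&self, i: usize) -> bool {
        !self.validity[i]
    }
    fn value(&self, i: usize) -> i32 {
        self.values[i]
    }
}

#[derive(Default)]
struct ToyInt32Builder {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ToyInt32Builder {
    fn append_value(&mut self, v: i32) {
        self.values.push(v);
        self.validity.push(true);
    }
    fn append_null(&mut self) {
        // A null slot still occupies a position in the values buffer.
        self.values.push(0);
        self.validity.push(false);
    }
    fn finish(self) -> ToyInt32Array {
        ToyInt32Array { values: self.values, validity: self.validity }
    }
}

// Initialization, iteration and finalization, written by hand.
fn add_one(array: &ToyInt32Array) -> ToyInt32Array {
    let mut builder = ToyInt32Builder::default();
    for i in 0..array.len() {
        if array.is_null(i) {
            builder.append_null();
        } else {
            builder.append_value(array.value(i) + 1);
        }
    }
    builder.finish()
}
```

      Even in this toy form, the caller has to keep the index loop, the null check and the builder calls in sync by hand.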

      This API has some issues:

      1. Array::value is unsafe (for example, it does not check bounds), even though it is not marked as such
      2. Builders are usually slow due to the checks that they need to perform
      3. The API is not intuitive

      Proposal

      This proposal aims to improve this API in two specific ways:

      • Implement IntoIterator for arrays, yielding Item=T and Item=Option<T>
      • Implement FromIterator<T> and FromIterator<Option<T>> for arrays

      so that users can write:

      use std::sync::Arc;
      use arrow::array::{Array, ArrayRef, Int32Array};

      // incoming array
      let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
      let array = Arc::new(array) as ArrayRef;
      let array = array.as_any().downcast_ref::<Int32Array>().unwrap();

      // to and from iter, with a +1; nulls stay null
      let result: Int32Array = array
          .iter()
          .map(|e| e.map(|r| r + 1))
          .collect();

      let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);

      assert_eq!(result, expected);
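
      The example above relies on the array type implementing Rust's standard iterator traits. The following is a minimal sketch of what such implementations might look like, written against a toy std-only array rather than the arrow crate: ToyInt32Array and its fields are hypothetical names used only for illustration, and the real implementation may differ.

```rust
use std::iter::FromIterator;

// Hypothetical toy array: one value slot and one validity flag per element.
#[derive(Debug, PartialEq)]
struct ToyInt32Array {
    values: Vec<i32>,
    validity: Vec<bool>, // true = valid, false = null
}

// Collect from an iterator of optional values (None becomes a null slot).
impl FromIterator<Option<i32>> for ToyInt32Array {
    fn from_iter<I: IntoIterator<Item = Option<i32>>>(iter: I) -> Self {
        let iter = iter.into_iter();
        let (lower, _) = iter.size_hint();
        let mut values = Vec::with_capacity(lower);
        let mut validity = Vec::with_capacity(lower);
        for item in iter {
            values.push(item.unwrap_or_default());
            validity.push(item.is_some());
        }
        ToyInt32Array { values, validity }
    }
}

// Collect from an iterator of plain values (no nulls).
impl FromIterator<i32> for ToyInt32Array {
    fn from_iter<I: IntoIterator<Item = i32>>(iter: I) -> Self {
        let values: Vec<i32> = iter.into_iter().collect();
        let validity = vec![true; values.len()];
        ToyInt32Array { values, validity }
    }
}

impl ToyInt32Array {
    // Yields Some(value) for valid slots and None for null slots.
    fn iter(&self) -> impl Iterator<Item = Option<i32>> + '_ {
        self.values
            .iter()
            .zip(self.validity.iter())
            .map(|(v, ok)| if *ok { Some(*v) } else { None })
    }
}

// Allows `for e in &array { ... }`.
impl<'a> IntoIterator for &'a ToyInt32Array {
    type Item = Option<i32>;
    type IntoIter = Box<dyn Iterator<Item = Option<i32>> + 'a>;
    fn into_iter(self) -> Self::IntoIter {
        Box::new(self.iter())
    }
}
```

      With these in place, array.iter().map(|e| e.map(|v| v + 1)).collect::<ToyInt32Array>() reads the same way as the Int32Array example. The boxed iterator in IntoIterator is chosen here only to keep the sketch short.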

      This results in an API that is:

      1. Efficient, as it is our responsibility to provide FromIterator implementations that populate the buffers, child data, etc. efficiently from an iterator
      2. Safe, as it does not allow segfaults
      3. Simple, as users do not need to worry about builders, buffers, etc., only native Rust
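
      To make the efficiency point more concrete, here is a hedged sketch of how a FromIterator implementation could fill a values buffer and a bit-packed validity bitmap in a single pass, pre-allocating from the iterator's size hint. collect_buffers is a hypothetical helper written for this note and is not part of the arrow crate. The bitmap layout (least-significant bit first, 1 meaning valid) follows the Arrow format's convention.

```rust
/// Hypothetical helper: turns an iterator of optional values into a values
/// buffer plus a bit-packed validity bitmap (LSB-first, 1 = valid).
fn collect_buffers<I>(iter: I) -> (Vec<i32>, Vec<u8>)
where
    I: IntoIterator<Item = Option<i32>>,
{
    let iter = iter.into_iter();
    // Pre-allocate from the lower bound so pushes rarely reallocate.
    let (lower, _) = iter.size_hint();
    let mut values = Vec::with_capacity(lower);
    let mut bitmap: Vec<u8> = Vec::with_capacity((lower + 7) / 8);

    for (i, item) in iter.enumerate() {
        // Start a new bitmap byte every 8 elements.
        if i % 8 == 0 {
            bitmap.push(0u8);
        }
        match item {
            Some(v) => {
                values.push(v);
                bitmap[i / 8] |= 1 << (i % 8);
            }
            // Null slots keep a placeholder value and a 0 validity bit.
            None => values.push(0),
        }
    }
    (values, bitmap)
}
```

      For the input Some(0), None, Some(2), None, Some(4), bits 0, 2 and 4 are set, so the bitmap is the single byte 0b0001_0101.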

            People

              jorgecarleitao Jorge Leitão
              jorgecarleitao Jorge Leitão
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h