Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
None
Description
Proposal for comments: https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing
(dump of the document above)
Rust Arrow supports two main computational models:
- Batch operations, which leverage some form of vectorization
- Element-by-element operations, which arise in more complex operations
This document concerns element-by-element operations, which are common outside of the library (and sometimes occur inside it).
Element-by-element operations
These operations are typically written as the following steps:
1. Downcast the array to its specific type
2. Initialize buffers
3. Iterate over indices and perform the operation, appending to the buffers accordingly
4. Create ArrayData with the required null bitmap, buffers, children, etc.
5. Return ArrayRef from ArrayData
We can split this process into three parts:
- Initialization (1 and 2)
- Iteration (3)
- Finalization (4 and 5)
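As a rough illustration of these three parts, the sketch below walks through the five steps using only the standard library. `ToyArray`, `ToyInt32Array` and `add_one` are hypothetical stand-ins invented for this example; they are not the arrow crate's types, and the validity mask is a plain `Vec<bool>` rather than a packed bitmap.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow array: values plus a validity mask.
#[derive(Debug, PartialEq)]
pub struct ToyInt32Array {
    pub values: Vec<i32>,
    pub validity: Vec<bool>, // true = valid, false = null
}

/// Hypothetical stand-in for a dynamically typed array trait.
pub trait ToyArray {
    fn as_any(&self) -> &dyn Any;
    fn len(&self) -> usize;
}

impl ToyArray for ToyInt32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

/// Adds 1 to every valid element; returns None if the array is not a ToyInt32Array.
pub fn add_one(array: &dyn ToyArray) -> Option<Arc<dyn ToyArray>> {
    // Initialization: (1) downcast, (2) allocate the output buffers.
    let array = array.as_any().downcast_ref::<ToyInt32Array>()?;
    let mut values = Vec::with_capacity(array.len());
    let mut validity = Vec::with_capacity(array.len());

    // Iteration: (3) visit every index and append to the buffers.
    for i in 0..array.len() {
        if array.validity[i] {
            values.push(array.values[i] + 1);
            validity.push(true);
        } else {
            values.push(0); // placeholder stored in a null slot
            validity.push(false);
        }
    }

    // Finalization: (4) assemble the array, (5) return it behind an Arc.
    Some(Arc::new(ToyInt32Array { values, validity }))
}
```

Even in this simplified form, the caller has to keep the value and validity buffers in sync by hand, which is the bookkeeping the rest of this document is concerned with.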
Currently, the API that we offer to our users is:
- as_any() to downcast the array based on its DataType
- Builders for all types, which users initialize to match the downcast array
- Iterate:
  - Use for i in (0..array.len())
  - Use Array::value and Array::is_valid/is_null
  - Use builder.append_value(new_value) or builder.append_null()
- Finish the builder and wrap the result in an Arc
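The shape of this builder-based loop can be sketched as follows. `MiniInt32Array` and `MiniInt32Builder` are hypothetical, simplified types written for this example with the standard library only; they mirror the method names listed above but are not the arrow crate's implementations (for instance, this toy `value` is bounds-checked by `Vec` indexing).

```rust
use std::sync::Arc;

/// Hypothetical, simplified array: one slot per element, `None` marks a null.
#[derive(Debug, PartialEq)]
pub struct MiniInt32Array {
    pub slots: Vec<Option<i32>>,
}

impl MiniInt32Array {
    pub fn len(&self) -> usize {
        self.slots.len()
    }
    pub fn is_valid(&self, i: usize) -> bool {
        self.slots[i].is_some()
    }
    /// Only meaningful when `is_valid(i)` is true; returns 0 for null slots.
    pub fn value(&self, i: usize) -> i32 {
        self.slots[i].unwrap_or(0)
    }
}

/// Hypothetical builder with the append/finish shape described above.
pub struct MiniInt32Builder {
    slots: Vec<Option<i32>>,
}

impl MiniInt32Builder {
    pub fn new(capacity: usize) -> Self {
        Self { slots: Vec::with_capacity(capacity) }
    }
    pub fn append_value(&mut self, v: i32) {
        self.slots.push(Some(v));
    }
    pub fn append_null(&mut self) {
        self.slots.push(None);
    }
    pub fn finish(self) -> MiniInt32Array {
        MiniInt32Array { slots: self.slots }
    }
}

/// Index loop + validity check + builder, then wrap the result in an Arc.
pub fn add_one_with_builder(array: &MiniInt32Array) -> Arc<MiniInt32Array> {
    let mut builder = MiniInt32Builder::new(array.len());
    for i in 0..array.len() {
        if array.is_valid(i) {
            builder.append_value(array.value(i) + 1);
        } else {
            builder.append_null();
        }
    }
    Arc::new(builder.finish())
}
```

The user must remember to pair every `value(i)` call with an `is_valid(i)` check and to call exactly one `append_*` per index; nothing in the types enforces either.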
This API has some issues:
- Array::value is unsafe, even though it is not marked as such
- Builders are usually slow due to the checks that they need to perform
- The API is not intuitive
Proposal
This proposal aims to improve this API in two specific ways:
- Implement IntoIterator for arrays, yielding Iterator<Item=T> and Iterator<Item=Option<T>>
- Implement FromIterator for items of type T and Option<T>
so that users can write:
    // imports needed by this snippet
    use std::sync::Arc;
    use arrow::array::{Array, ArrayRef, Int32Array};

    // incoming array
    let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
    let array = Arc::new(array) as ArrayRef;
    let array = array.as_any().downcast_ref::<Int32Array>().unwrap();

    // to and from iter, with a +1
    let result: Int32Array = array
        .iter()
        .map(|e| if let Some(r) = e { Some(r + 1) } else { None })
        .collect();

    let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
    assert_eq!(result, expected);
This results in an API that is:
- Efficient, as it is our responsibility to provide `FromIterator` implementations that efficiently populate the buffers, children, etc. from an iterator
- Safe, as it does not allow segfaults
- Simple, as users do not need to worry about builders, buffers, etc., only native Rust
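To make the idea concrete, here is a minimal sketch of how such iterator support could look on a toy type. `NullableInt32` is a hypothetical struct written for this example using only the standard library; it is an assumption about the general shape, not the implementation adopted in the arrow crate. It exposes an `iter()` method rather than a full `IntoIterator` impl to keep the example short.

```rust
/// Hypothetical nullable i32 column: values plus a validity mask.
#[derive(Debug, PartialEq)]
pub struct NullableInt32 {
    pub values: Vec<i32>,
    pub validity: Vec<bool>, // true = valid, false = null
}

impl NullableInt32 {
    /// Yields `Some(value)` for valid slots and `None` for nulls.
    pub fn iter(&self) -> impl Iterator<Item = Option<i32>> + '_ {
        self.values
            .iter()
            .zip(self.validity.iter())
            .map(|(v, valid)| if *valid { Some(*v) } else { None })
    }
}

impl FromIterator<Option<i32>> for NullableInt32 {
    fn from_iter<I: IntoIterator<Item = Option<i32>>>(iter: I) -> Self {
        let iter = iter.into_iter();
        // Reserve once using the iterator's lower size bound.
        let (lower, _) = iter.size_hint();
        let mut values = Vec::with_capacity(lower);
        let mut validity = Vec::with_capacity(lower);
        for item in iter {
            values.push(item.unwrap_or_default()); // 0 is stored in null slots
            validity.push(item.is_some());
        }
        Self { values, validity }
    }
}

impl FromIterator<i32> for NullableInt32 {
    fn from_iter<I: IntoIterator<Item = i32>>(iter: I) -> Self {
        // No nulls: every slot is valid.
        let values: Vec<i32> = iter.into_iter().collect();
        let validity = vec![true; values.len()];
        Self { values, validity }
    }
}
```

With these two pieces, `array.iter().map(|e| e.map(|v| v + 1)).collect::<NullableInt32>()` replaces the index loop, the validity checks and the builder, and the buffer bookkeeping lives in one place inside `from_iter`.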