Apache Arrow / ARROW-5153

[Rust] Use IntoIter trait for write_batch/write_mini_batch


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Rust
    • Labels: None

      Description

      Writing data to a parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:

      struct MyData {
          name: String,
          address: Option<String>,
      }

      Over the course of working with sets of this data, you'll have the bulk data in a Vec<MyData>, the names column in a Vec<&String>, and the address column in a Vec<Option<String>>. This puts extra memory pressure on the system: at a minimum we have to allocate a Vec the same size as the bulk data, even if we are only holding references.
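
      For concreteness, here is a minimal sketch of that status quo (using the MyData struct from above; the real writer call is left commented out, since it needs a live ColumnWriterImpl and physical-typed values): each column written means collecting a fresh Vec as long as the bulk data.

        fn main() {
            // MyData as defined above.
            let bulk = vec![
                MyData { name: "alice".into(), address: Some("1 Main St".into()) },
                MyData { name: "bob".into(), address: None },
            ];

            // Intermediate per-column allocations, each bulk.len() long:
            let names: Vec<&String> = bulk.iter().map(|x| &x.name).collect();
            let defs: Vec<i16> = bulk.iter().map(|x| x.address.is_some() as i16).collect();

            // column_writer.write_batch(&values, Some(&defs), None)?;  // today's slice-based API
            assert_eq!(names.len(), bulk.len());
            assert_eq!(defs.len(), bulk.len());
        }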

      What I'm proposing is an IntoIterator style. This maintains backward compatibility, since a slice automatically implements IntoIterator. ColumnWriterImpl#write_batch would go from values: &[T::T] to values: impl IntoIterator<Item = T::T>. Then you can do things like

        write_batch(bulk.iter().map(|x| &x.name), None, None)
        write_batch(
            bulk.iter().map(|x| &x.address),
            Some(bulk.iter().map(|x| x.address.is_some() as i16)),
            None,
        )

      and you can see there's no need for an intermediate Vec, so there are no short-term allocations just to write out the data.
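
      To make the backward-compatibility claim concrete, here is a self-contained sketch (a toy signature, not the parquet crate's actual API) of an IntoIterator-based write_batch. Slice callers keep compiling because &[T] and &Vec<T> implement IntoIterator, while iterator callers stream values with no intermediate Vec.

        fn write_batch<I, T>(values: I)
        where
            I: IntoIterator<Item = T>,
            T: std::fmt::Debug,
        {
            for v in values {
                // A real column writer would encode `v` into the current data page here.
                println!("writing {:?}", v);
            }
        }

        fn main() {
            let names = vec!["alice".to_string(), "bob".to_string()];

            // Existing slice-style call sites keep working unchanged...
            write_batch(&names);

            // ...and new call sites can stream straight from the bulk data.
            write_batch(names.iter().map(|s| s.as_str()));
        }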

      I am writing data with many columns and I think this would really help to speed things up.


            People

            • Assignee: Unassigned
            • Reporter: Xavier Lange (xrl)
            • Votes: 0
            • Watchers: 2
