  1. Apache Arrow
  2. ARROW-10303

[Rust] Parallel type transformation in CSV reader

Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Feedback Received
    • None
    • None
    • Component: Rust

    Description

      Currently, when a CSV file is read, a single thread is responsible both for reading the file and for transforming the returned string values into the correct data types.

      In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 seconds. Only ~10% of that time is spent reading the file, while ~68% is spent transforming the string values into the correct data types.

      My proposal is to parallelize the part responsible for the data type transformation.

      This seems fairly simple to achieve: after the CSV reader reads a batch, the projected columns are transformed one by one by iterating over a vector and applying a map function. I believe that with the rayon crate, the only changes needed would be replacing "iter()" with "par_iter()" and changing

      impl<R: Read> Reader<R>

      into:

      impl<R: Read + std::marker::Sync> Reader<R>
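
A minimal sketch of the idea, not the actual Arrow reader code: each projected column of an already-read batch is converted on its own thread. To stay self-contained it uses `std::thread::scope` from the standard library instead of rayon; with rayon, the manual spawning would be replaced by `par_iter()`. The names `Rows`, `parse_f64_column`, and `parse_columns_parallel` are hypothetical, and every column is assumed to be a float column.

```rust
use std::thread;

/// Hypothetical stand-in for one batch of already-split CSV rows.
type Rows = Vec<Vec<String>>;

/// Parse one column of the batch into f64 values.
/// Cells that are missing or cannot be parsed become `None`.
fn parse_f64_column(rows: &Rows, col: usize) -> Vec<Option<f64>> {
    rows.iter()
        .map(|row| row.get(col).and_then(|s| s.trim().parse::<f64>().ok()))
        .collect()
}

/// Convert every projected column on its own scoped thread.
/// The batch is only borrowed, so the rows are shared read-only across threads.
fn parse_columns_parallel(rows: &Rows, projection: &[usize]) -> Vec<Vec<Option<f64>>> {
    thread::scope(|scope| {
        // Spawn one worker per projected column.
        let handles: Vec<_> = projection
            .iter()
            .map(|&col| scope.spawn(move || parse_f64_column(rows, col)))
            .collect();

        // Join in spawn order so the output keeps the projection order.
        handles
            .into_iter()
            .map(|h| h.join().expect("column parser panicked"))
            .collect::<Vec<_>>()
    })
}

fn main() {
    let rows: Rows = vec![
        vec!["1.5".to_string(), "2.0".to_string()],
        vec!["x".to_string(), "4.25".to_string()],
    ];
    let columns = parse_columns_parallel(&rows, &[0, 1]);
    println!("{:?}", columns);
}
```

In this sketch only the parsed rows are shared between threads, not the reader itself, so which trait bounds the real `Reader<R>` would need depends on what the parallel closure captures there.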

       

      But I may be overlooking something crucial, as I am quite new to Rust and Arrow. Any advice from someone experienced is therefore very welcome!

      Attachments

        1. tracing.png
          609 kB
          Sergej Fries

            People

              Assignee: Unassigned
              Reporter: Sergej Fries (seryj)
              Votes: 0
              Watchers: 3