Apache Arrow / ARROW-5123

[Rust] derive RecordWriter from struct definitions


    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Rust

      Description

      Migrated from previous github issue (which saw a lot of comments but at a rough transition time in the project): https://github.com/sunchao/parquet-rs/pull/197

       

      Goal
      ===

      Writing many columns to a file is a chore. If you can put your values into a struct that mirrors the schema of your file, `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row group.

      How to Use
      ===

      ```
      extern crate parquet;
      #[macro_use] extern crate parquet_derive;

      #[derive(ParquetRecordWriter)]
      struct ACompleteRecord<'a> {
          pub a_bool: bool,
          pub a_str: &'a str,
      }

      ```

      RecordWriter trait
      ===

      This is the new trait which `parquet_derive` will implement for your structs.

      ```
      use super::RowGroupWriter;

      pub trait RecordWriter<T> {
          fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
      }

      ```
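For comparison, here is a minimal sketch of what implementing such a trait by hand could look like. It does not use the parquet crate: `MockRowGroupWriter`, `Column`, and `ExampleRecord` are hypothetical stand-ins invented for this illustration, so the trait signature differs from the real one in its writer type.

```rust
// Hypothetical stand-in for a Parquet column: just the values it received.
#[derive(Debug, PartialEq)]
pub enum Column {
    Bool(Vec<bool>),
    Int32(Vec<i32>),
}

// Hypothetical stand-in for parquet's RowGroupWriter. It only records the
// columns it is handed, so the sketch runs without the parquet crate.
#[derive(Debug, Default)]
pub struct MockRowGroupWriter {
    pub columns: Vec<Column>,
}

// Same shape as the trait above, with the mock writer swapped in.
pub trait RecordWriter<T> {
    fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter);
}

// Hypothetical record type used only for this example.
pub struct ExampleRecord {
    pub a_bool: bool,
    pub a_count: i32,
}

// The boilerplate a derive is meant to remove: one block per field,
// written in the order the fields are declared.
impl RecordWriter<ExampleRecord> for &[ExampleRecord] {
    fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter) {
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            row_group_writer.columns.push(Column::Bool(vals));
        }
        {
            let vals: Vec<i32> = self.iter().map(|x| x.a_count).collect();
            row_group_writer.columns.push(Column::Int32(vals));
        }
    }
}
```

Every new field means another hand-written block like these, which is the chore the derive is meant to remove.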

      How does it work?
      ===

      The `parquet_derive` crate adds code-generating functionality to the Rust compiler. The code generation takes Rust syntax and emits additional syntax. This macro expansion works on stable Rust 1.15+. The crate is a dynamic plugin, loaded by the machinery in cargo. Users don't have to do any special `build.rs` steps or anything like that: including `parquet_derive` in their project is enough. The `parquet_derive/Cargo.toml` has a section saying as much:

      ```
      [lib]
      proc-macro = true
      ```

      The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string representation into an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:

       - the name of the struct
       - the lifetime variables of the struct
       - the fields of the struct

      The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
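As a rough illustration of that flattening step, the sketch below builds a `FieldInfo`-like value from a field's type written as a plain string. This is an assumption-heavy stand-in: the real code walks the `syn` AST rather than parsing strings, the field types chosen for `FieldInfo` are guesses, and the type-to-variant mapping beyond `BoolColumnWriter` is illustrative only.

```rust
// Hypothetical flattened view of one struct field. The field names follow
// the list above; their types here are assumptions made for this sketch.
#[derive(Debug, PartialEq)]
pub struct FieldInfo {
    pub field_name: String,
    pub field_lifetime: Option<String>,
    pub field_type: String,
    pub is_option: bool,
    pub column_writer_variant: String,
}

// Build a FieldInfo from a name and a type string such as "Option<&'a str>".
// Returns None for types this sketch does not know how to write.
pub fn field_info(name: &str, ty: &str) -> Option<FieldInfo> {
    let ty = ty.trim();
    // Peel off a single `Option<...>` wrapper if present.
    let (is_option, inner) = match ty
        .strip_prefix("Option<")
        .and_then(|s| s.strip_suffix('>'))
    {
        Some(inner) => (true, inner.trim()),
        None => (false, ty),
    };
    // Peel off a reference and its lifetime: `&'a str` -> ("'a", "str").
    let (field_lifetime, base) = match inner.strip_prefix('&') {
        Some(rest) => {
            let rest = rest.trim_start();
            if rest.starts_with('\'') {
                let mut parts = rest.splitn(2, ' ');
                let lifetime = parts.next()?.to_string();
                (Some(lifetime), parts.next()?.trim())
            } else {
                (None, rest)
            }
        }
        None => (None, inner),
    };
    // Illustrative mapping from Rust type to column writer variant name.
    let column_writer_variant = match base {
        "bool" => "BoolColumnWriter",
        "i32" => "Int32ColumnWriter",
        "i64" => "Int64ColumnWriter",
        "f32" => "FloatColumnWriter",
        "f64" => "DoubleColumnWriter",
        "str" | "String" => "ByteArrayColumnWriter",
        _ => return None,
    };
    Some(FieldInfo {
        field_name: name.to_string(),
        field_lifetime,
        field_type: base.to_string(),
        is_option,
        column_writer_variant: column_writer_variant.to_string(),
    })
}
```

Keeping this information in a flat struct means the later templating step only has to look at a handful of plain values per field instead of re-inspecting the AST.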

      The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high-level the template for `RecordWriter` looks like:

      ```
      impl RecordWriter for $struct_name {
        fn write_to_row_group(..) {
          $(
            { $column_writer_snippet }
          )
        }
      }
      ```

      This template is then added under the struct definition, ending up as something like:

      ```
      struct MyStruct {
      }
      impl RecordWriter for MyStruct {
        fn write_to_row_group(..) {
          {
            write_col_1();
          };
          {
            write_col_2();
          }
        }
      }
      ```

      And finally THIS is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyStruct` definition, the `RecordWriter` implementation will be regenerated. There are no intermediate values to version control or worry about.
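The per-field repetition can be tried out without writing a procedural macro at all. The sketch below uses a declarative `macro_rules!` macro as a stand-in for the `quote`-based template; it is not how `parquet_derive` is implemented. `record_with_writer!` and `MockRowGroupWriter` are hypothetical names, and the mock writer only records column names and value counts.

```rust
// Hypothetical stand-in for a row group writer: it records which column
// was written and how many values it received.
#[derive(Debug, Default)]
pub struct MockRowGroupWriter {
    pub written: Vec<(String, usize)>,
}

pub trait RecordWriter<T> {
    fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter);
}

// Declares the struct and, right under it, an impl with one block per
// field. `$( ... )*` plays the role of the `$( ... )` repetition in the
// template: the block is stamped out once for every field.
macro_rules! record_with_writer {
    (struct $name:ident { $( $field:ident : $ty:ty ),* $(,)? }) => {
        pub struct $name {
            $( pub $field: $ty ),*
        }

        impl RecordWriter<$name> for &[$name] {
            fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter) {
                $(
                    {
                        // Gather this field from every record into one column.
                        let vals: Vec<$ty> =
                            self.iter().map(|x| x.$field.clone()).collect();
                        row_group_writer
                            .written
                            .push((stringify!($field).to_string(), vals.len()));
                    }
                )*
            }
        }
    };
}

record_with_writer! {
    struct MyStruct {
        col_1: bool,
        col_2: i64,
    }
}
```

Adding a third field to `MyStruct` inside the macro invocation adds a third block to the impl on the next compile, which mirrors the regeneration behaviour described above.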

      Viewing the Derived Code
      ===

      To see the generated code before it's compiled, it is very useful to install `cargo expand` ([more info on GitHub](https://github.com/dtolnay/cargo-expand)). Then you can run:

      ```
      cd $WORK_DIR/parquet-rs/parquet_derive_test
      cargo expand --lib > ../temp.rs
      ```

      Then you can dump the contents:

      ```
      struct DumbRecord {
          pub a_bool: bool,
          pub a2_bool: bool,
      }

      impl RecordWriter<DumbRecord> for &[DumbRecord] {
          fn write_to_row_group(
              &self,
              row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
          ) {
              let mut row_group_writer = row_group_writer;
              {
                  let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
                  let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
                  if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                      column_writer
                  {
                      typed.write_batch(&vals[..], None, None).unwrap();
                  }
                  row_group_writer.close_column(column_writer).unwrap();
              };
              {
                  let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
                  let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
                  if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                      column_writer
                  {
                      typed.write_batch(&vals[..], None, None).unwrap();
                  }
                  row_group_writer.close_column(column_writer).unwrap();
              }
          }
      }
      ```

      Now I need to write out all the combinations of types we support and make sure each one actually writes out data.
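Optional fields are one such combination, since the expanded code above passes `None` for the definition levels. A minimal sketch of the extra work an `Option<T>` field would need, assuming a flat optional column where a definition level of 1 means "present" and 0 means "null": `split_optional` is a hypothetical helper, not a function from parquet-rs.

```rust
// Split optional values into the two slices a Parquet-style batch write
// would take for a flat optional column: the non-null values, and one
// definition level per row (1 = present, 0 = null).
pub fn split_optional<T: Clone>(rows: &[Option<T>]) -> (Vec<T>, Vec<i16>) {
    let mut values = Vec::with_capacity(rows.len());
    let mut def_levels = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Some(v) => {
                values.push(v.clone());
                def_levels.push(1);
            }
            // Nulls contribute a level but no value.
            None => def_levels.push(0),
        }
    }
    (values, def_levels)
}
```

The generated block for an optional field would then pass the values and `Some(&def_levels[..])` where the non-optional blocks pass `None`.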

      Procedural Macros
      ===

      The `parquet_derive` crate can ONLY export the derivation functionality: no traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is used only by the compiler, never by the code itself.

      The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, `parquet_derive_test`.

      I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

      Potentials For Better Design
      ===

       - [X] Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each column's writing code to another loop.
       - [X] ~It would be nicer if I didn't have to be so picky about data going into the `write_batch` function. Is it possible we could make a version of the function which accepts `Into<DataType>` or similar? This would greatly simplify this derivation code as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~ (not tackling in this generation of the plugin)
       - [X] ~Another idea for improving column writing: could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`; if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. That should have some significant memory advantages.~ (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
       - [X] ~It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.~ (moved to #203)
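To make the iterator idea from the list above concrete, here is a minimal sketch of writing from an iterator in bounded batches. Nothing here exists in parquet-rs: `write_from_iter` is a hypothetical helper, and the `write_batch` closure stands in for whatever the column writer would expose.

```rust
// Hypothetical helper: drain an iterator in fixed-size batches and hand
// each batch to `write_batch`, so at most `batch_size` values are buffered
// at a time instead of one Vec holding the whole column.
// Returns the total number of values written.
pub fn write_from_iter<T, I, F>(values: I, batch_size: usize, mut write_batch: F) -> usize
where
    I: IntoIterator<Item = T>,
    F: FnMut(&[T]),
{
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut buf = Vec::with_capacity(batch_size);
    let mut total = 0;
    for v in values {
        buf.push(v);
        if buf.len() == batch_size {
            write_batch(&buf);
            total += buf.len();
            buf.clear();
        }
    }
    // Flush the final, possibly short, batch.
    if !buf.is_empty() {
        write_batch(&buf);
        total += buf.len();
    }
    total
}
```

With records in a `Vec`, a call could look like `write_from_iter(records.iter().map(|r| r.a_bool), 1024, |batch| { /* hand batch to the column writer */ })`, where the batch size of 1024 is an arbitrary choice.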

      Status
      ===

      I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

      I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it, we can add code generation for more complex, nested structs.

              People

              • Assignee: Unassigned
              • Reporter: xrl Xavier Lange

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 8h 10m