Apache Arrow / ARROW-11745

[C++] Improve configurability of random data generation


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 4.0.0
    • Component/s: C++

    Description

      arrow::random::RandomArrayGenerator is useful for stress testing and benchmarking. Arrays of primitives can be generated with little boilerplate; however, it is cumbersome to specify the creation of nested arrays or record batches, which are necessary for testing multi-column operations such as group_by.

      My ideal API for random generation takes only a FieldVector, a number of rows, and a seed as arguments. Other options (such as minimum, maximum, unique count, and null probability) are specified using field metadata so that they can be provided uniformly or granularly as necessary for a given test case:

      auto random_batch = Generate({
        field("i32", int32()), // i32 may take any value between INT_MIN and INT_MAX
                               // and will be null with default probability 0.01
        field("f32", float32(), false), // f32 will be entirely valid
        field("probability", float64(), true, key_value_metadata({
          // custom random generation properties:
          {"min", "0.0"},
          {"max", "1.0"},
          {"null_probability", "0.0001"},
        })),
        field("list_i32", list(
          field("item", int32(), true, key_value_metadata({
            // custom random generation properties can also be specified for nested fields:
            {"min", "0"},
            {"max", "1"},
          }))
        )),
      }, num_rows, 0xdeadbeef);
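
To illustrate how string-valued field metadata could drive generation, here is a minimal, Arrow-free sketch using only the C++ standard library. It is not the implementation that resolved this issue. The names Metadata, GenOptions, Column, ParseOptions, and GenerateColumn are hypothetical, as are the default bounds of 0.0 and 1.0 for a floating-point column; only the metadata keys ("min", "max", "null_probability") and the default null probability of 0.01 come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <random>
#include <string>
#include <vector>

// Hypothetical stand-in for a field's key/value metadata (not an Arrow type).
using Metadata = std::map<std::string, std::string>;

// Hypothetical per-field options; defaults apply when a key is absent.
struct GenOptions {
  double min = 0.0;                // assumed default lower bound
  double max = 1.0;                // assumed default upper bound
  double null_probability = 0.01;  // default stated in the issue description
};

// Hypothetical column: values plus a validity mask (true means non-null).
struct Column {
  std::vector<double> values;
  std::vector<bool> valid;
};

// Read options from the metadata; keys other than the three below are ignored.
GenOptions ParseOptions(const Metadata& md) {
  GenOptions opts;
  Metadata::const_iterator it = md.find("min");
  if (it != md.end()) opts.min = std::stod(it->second);
  it = md.find("max");
  if (it != md.end()) opts.max = std::stod(it->second);
  it = md.find("null_probability");
  if (it != md.end()) opts.null_probability = std::stod(it->second);
  return opts;
}

// Generate num_rows doubles drawn uniformly between min and max.
// The same metadata, row count, and seed always produce the same column.
Column GenerateColumn(const Metadata& md, int64_t num_rows, uint32_t seed) {
  const GenOptions opts = ParseOptions(md);
  assert(opts.min <= opts.max);
  assert(opts.null_probability >= 0.0 && opts.null_probability <= 1.0);

  std::mt19937 rng(seed);
  std::bernoulli_distribution is_null(opts.null_probability);
  std::uniform_real_distribution<double> value(opts.min, opts.max);

  Column out;
  out.values.reserve(static_cast<size_t>(num_rows));
  out.valid.reserve(static_cast<size_t>(num_rows));
  for (int64_t i = 0; i < num_rows; ++i) {
    const bool null_slot = is_null(rng);
    out.valid.push_back(!null_slot);
    // Null slots still get a placeholder value so both vectors stay aligned.
    out.values.push_back(null_slot ? 0.0 : value(rng));
  }
  return out;
}
```

A call such as GenerateColumn({{"min", "0.0"}, {"max", "1.0"}}, num_rows, 0xdeadbeef) would correspond to the "probability" field in the example above. Keeping the options as strings in metadata means the same parsing step could in principle be applied recursively to child fields of nested types, which is the uniformity the proposal is after.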
      

            People

              Assignee: Ben Kietzman (bkietz)
              Reporter: Ben Kietzman (bkietz)


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 10m