Apache Arrow / ARROW-11745

[C++] Improve configurability of random data generation

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 4.0.0
    • Component/s: C++

      Description

      arrow::random::RandomArrayGenerator is useful for stress testing and benchmarking. Arrays of primitives can be generated with little boilerplate; however, it is cumbersome to specify the creation of nested arrays or record batches, which are needed to test operations over n columns, such as group_by.

      My ideal API for random generation takes only a FieldVector, a number of rows, and a seed as arguments. Other options (such as minimum, maximum, unique count, and null probability) are specified using field metadata, so that they can be set uniformly for all fields or individually per field, as a given test case requires:

      auto random_batch = Generate({
        field("i32", int32()), // i32 may take any value between INT_MIN and INT_MAX
                               // and will be null with default probability 0.01
        field("f32", float32(), false), // f32 is non-nullable, so it will be entirely valid
        field("probability", float64(), true, key_value_metadata({
          // custom random generation properties:
          {"min", "0.0"},
          {"max", "1.0"},
          {"null_probability", "0.0001"},
        })),
        field("list_i32", list(
          field("item", int32(), true, key_value_metadata({
            // custom random generation properties can also be specified for nested fields:
            {"min", "0"},
            {"max", "1"},
          }))
        )),
      }, num_rows, 0xdeadbeef);
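
      To illustrate how metadata-driven options like these might be resolved, here is a minimal standalone sketch. It is not Arrow's implementation: it uses only the standard library and models field metadata as a plain string map. The names Metadata, DoubleOptions, ParseOptions, Column, and GenerateDoubles are hypothetical. The 0.01 default null probability follows the example above; the default range of [0, 1) is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a field's key/value metadata (not an Arrow type).
using Metadata = std::unordered_map<std::string, std::string>;

// Generation options; each default applies when its key is absent.
struct DoubleOptions {
  double min = 0.0;                // assumed default, for illustration only
  double max = 1.0;                // assumed default, for illustration only
  double null_probability = 0.01;  // default taken from the example above
};

// Read "min", "max", and "null_probability" from the metadata, keeping the
// default for any key that is missing.
DoubleOptions ParseOptions(const Metadata& md) {
  DoubleOptions opts;
  auto read = [&md](const char* key, double* out) {
    auto it = md.find(key);
    if (it != md.end()) *out = std::stod(it->second);
  };
  read("min", &opts.min);
  read("max", &opts.max);
  read("null_probability", &opts.null_probability);
  return opts;
}

// A generated column: values plus a validity flag per slot.
struct Column {
  std::vector<double> values;
  std::vector<bool> valid;
};

// Generate num_rows doubles drawn uniformly from [min, max). Each slot is
// null with probability null_probability; null slots hold a 0.0 placeholder.
Column GenerateDoubles(const Metadata& md, int64_t num_rows, uint32_t seed) {
  const DoubleOptions opts = ParseOptions(md);
  std::mt19937 rng(seed);
  std::bernoulli_distribution is_null(opts.null_probability);
  std::uniform_real_distribution<double> value(opts.min, opts.max);

  Column out;
  out.values.reserve(static_cast<std::size_t>(num_rows));
  out.valid.reserve(static_cast<std::size_t>(num_rows));
  for (int64_t i = 0; i < num_rows; ++i) {
    const bool null_slot = is_null(rng);
    out.valid.push_back(!null_slot);
    out.values.push_back(null_slot ? 0.0 : value(rng));
  }
  return out;
}
```

      Because every option has a default, a test can override only the keys it cares about, and a fixed seed reproduces the same column on every run.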
      

              People

              • Assignee: Ben Kietzman (bkietz)
              • Reporter: Ben Kietzman (bkietz)
              • Votes: 0
              • Watchers: 1

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 10m