Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)
If n is provided, compute nrow(.data) and, if that is not NA, convert n to a prop (n / nrow(.data)). (We might want to use prop + .01 or similar and then call head(n) afterwards, i.e. sample more than needed and then take n, so that random variation doesn't leave us with fewer than n rows.)
With prop, turn this into filter(arrow_random() < prop). See ARROW-17572.
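The approach above can be sketched outside of Arrow to check the logic. This is a rough illustration in plain Python, not the proposed R implementation: `slice_sample_sketch`, `slack`, and `rng` are hypothetical names, `len(rows)` stands in for nrow(.data), and `rng.random()` stands in for arrow_random().

```python
import random


def slice_sample_sketch(rows, n, slack=0.01, rng=None):
    """Approximate 'take n random rows' with a Bernoulli filter plus head(n).

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    total = len(rows)  # stands in for nrow(.data)
    if total == 0:
        return []
    # Convert n to a proportion, padded slightly so we rarely undershoot n.
    prop = min(1.0, n / total + slack)
    # Equivalent of filter(arrow_random() < prop): keep each row independently.
    kept = [row for row in rows if rng.random() < prop]
    # Equivalent of head(n): trim the oversample back down to at most n rows.
    return kept[:n]


if __name__ == "__main__":
    sample = slice_sample_sketch(list(range(1000)), 50, rng=random.Random(0))
    print(len(sample))
```

Two caveats with this sketch: the result can still contain fewer than n rows if the random draw comes up short, and trimming with head drops rows from the end of the filtered result, so rows late in the data are slightly less likely to be returned than under an exact sample.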
Defer weight_by to a follow-up. It should be doable but might be expensive (we would need to scan everything to compute the sum and ensure that all values are positive).
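To make the cost concrete, the validation step alone is a full pass over the weight column before any sampling can start. A minimal sketch in plain Python, assuming the weights are available as an iterable of numbers; `total_weight` is a hypothetical name, and rejecting zero weights follows the "positive" wording above rather than any confirmed behavior:

```python
import math


def total_weight(weights):
    """Scan every weight once, validating it and accumulating the sum.

    Hypothetical helper for illustration only.
    """
    total = 0.0
    for w in weights:
        # Reject missing, NaN, zero, and negative weights (assumption: strictly positive).
        if w is None or math.isnan(w) or w <= 0:
            raise ValueError(f"invalid weight: {w!r}")
        total += w
    return total
```

On a large dataset this pass touches every row, which is the expense the note above refers to.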
Defer replace = TRUE.
Also, we can probably only do this if .data is ungrouped; I think the dplyr methods sample within groups.