Users can use RDD methods on DataFrames, but the result loses the schema, which then has to be reapplied. For RDD methods that preserve the schema (such as randomSplit), DataFrame should provide versions that keep the schema automatically.
Here are a few I'd prioritize (for my use cases!):
- sampleByKey + sampleByKeyExact
- Q: Should "key" be a single column, or should we support using a set of columns as a key?
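To make the key question concrete, here is a plain-Python sketch (no Spark involved) of approximate per-key sampling where the key is either one column or a tuple of columns. The name `sample_by_key`, its parameters, and the dict-per-row representation are all hypothetical and only illustrate the two options; this is not a proposed DataFrame signature.

```python
import random
from typing import Dict, Hashable, List, Sequence, Union

Row = Dict[str, object]  # assumption: a row is a dict, so column names travel with it


def sample_by_key(
    rows: List[Row],
    key_cols: Union[str, Sequence[str]],
    fractions: Dict[Hashable, float],
    seed: int = 0,
) -> List[Row]:
    """Hypothetical helper: keep each row with the probability given for its key.

    A single column name yields scalar keys; a sequence of names yields tuple keys.
    Keys missing from `fractions` are treated as fraction 0.0 (an assumption).
    """
    single = isinstance(key_cols, str)
    cols = (key_cols,) if single else tuple(key_cols)
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        values = tuple(row[c] for c in cols)
        key = values[0] if single else values
        # Independent coin flip per row, so stratum sizes are only approximate.
        if rng.random() < fractions.get(key, 0.0):
            sampled.append(row)  # the row keeps all of its columns
    return sampled


rows = [{"label": i % 2, "group": "a" if i < 50 else "b", "x": i} for i in range(100)]

# Option 1: the key is a single column.
by_label = sample_by_key(rows, "label", {0: 1.0, 1: 0.0})

# Option 2: the key is a set of columns, addressed as a tuple.
by_pair = sample_by_key(rows, ["label", "group"], {(0, "a"): 1.0})
```

Under this sketch the single-column case is just the multi-column case with a one-element key, so supporting a set of columns would not need a separate code path; the cost is that callers have to write tuple keys in the fractions map.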