
    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:
    • Sprint:
      Spark 1.5 doc/QA sprint

      Description

      Users can call RDD methods on a DataFrame, but the result is a plain RDD, so the schema is lost and has to be reapplied by hand. For RDD methods that do not change the row structure (such as randomSplit), DataFrame should provide its own versions that preserve the schema automatically.

      Here are a few I'd prioritize (for my use cases!):

      • randomSplit
      • sampleByKey + sampleByKeyExact
        • Q: Should "key" be a single column, or should we support using a set of columns as a key?
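
      For context, the manual workaround described above can be sketched in PySpark roughly as follows. Both helper names are hypothetical and not part of Spark. The sketch assumes that `df.schema` is a property (some older PySpark releases expose it as a method, `df.schema()`), and that `fractions` maps each key tuple to a sampling fraction. The second helper illustrates one possible answer to the question above: a key built from a set of columns, of which a single column is the one-element case.

```python
def random_split_keep_schema(sql_context, df, weights, seed=None):
    """Randomly split `df`, then rebuild each part with df's schema.

    Hypothetical helper, not a Spark API. Assumes `df.schema` is a property.
    """
    # df.rdd is an RDD of Row objects; the split parts are plain RDDs.
    parts = df.rdd.randomSplit(weights, seed)
    # Reapply the original schema to each part by hand.
    return [sql_context.createDataFrame(part, df.schema) for part in parts]


def sample_by_columns_keep_schema(sql_context, df, key_cols, fractions, seed=None):
    """Stratified sample keyed on one or more columns, keeping df's schema.

    Hypothetical helper, not a Spark API. `fractions` maps a tuple of key
    column values to the sampling fraction for that stratum.
    """
    # Key each Row by a tuple of column values so the key can span several columns.
    keyed = df.rdd.keyBy(lambda row: tuple(getattr(row, c) for c in key_cols))
    # Sample without replacement per key, then drop the keys again.
    sampled = keyed.sampleByKey(False, fractions, seed).values()
    return sql_context.createDataFrame(sampled, df.schema)
```

      A call such as `train, test = random_split_keep_schema(sqlContext, df, [0.7, 0.3], seed=42)` would then return two DataFrames with the same schema as `df`. The proposal in this issue is for DataFrame to offer this behaviour directly, so that users do not need such wrappers.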

              People

              • Assignee:
                josephkb Joseph K. Bradley
                Reporter:
                josephkb Joseph K. Bradley
              • Votes:
                2
                Watchers:
                4

                Dates

                • Created:
                  Updated:
                  Resolved: