Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Sprint: Spark 1.5 doc/QA sprint

    Description

      Users can call RDD methods on a DataFrame, but the result is a plain RDD, so the schema is lost and has to be reapplied. For RDD methods whose output rows keep the same structure as the input (such as randomSplit), DataFrame should provide versions that preserve the schema automatically.

      Here are a few I'd prioritize (for my use cases!):

      • randomSplit
      • sampleByKey + sampleByKeyExact
        • Q: Should "key" be a single column, or should we support using a set of columns as a key?
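
      A minimal sketch of the requested behavior, written in plain Python rather than against Spark. `Frame`, `random_split`, and `sample_by_key` are hypothetical names used only for illustration and are not part of any Spark API. The sketch assumes the key is given as a list of column names, which covers both the single-column and the multi-column case raised in the question above.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for a DataFrame: column names plus a list of row tuples.
Frame = namedtuple("Frame", ["schema", "rows"])


def random_split(frame, weights, seed=None):
    """Randomly assign each row to one of len(weights) frames.

    Every returned frame reuses frame.schema, so nothing has to be reapplied.
    """
    rng = random.Random(seed)
    total = float(sum(weights))
    # Normalized cumulative upper bounds, one per output frame.
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for row in frame.rows:
        x = rng.random()
        # First bucket whose bound exceeds x; the last bucket absorbs rounding error.
        idx = next((i for i, b in enumerate(bounds) if x < b), len(bounds) - 1)
        parts[idx].append(row)
    return [Frame(frame.schema, part) for part in parts]


def sample_by_key(frame, key_cols, fractions, seed=None):
    """Keep each row with the probability assigned to its key (stratified sampling).

    key_cols is a list of one or more column names; fractions maps key tuples
    to sampling probabilities. Keys missing from fractions are dropped.
    """
    rng = random.Random(seed)
    positions = [frame.schema.index(c) for c in key_cols]
    kept = []
    for row in frame.rows:
        key = tuple(row[i] for i in positions)
        if rng.random() < fractions.get(key, 0.0):
            kept.append(row)
    return Frame(frame.schema, kept)


if __name__ == "__main__":
    df = Frame(("label", "group", "x"),
               [(i % 2, "a" if i < 50 else "b", i) for i in range(100)])
    train, test = random_split(df, [0.8, 0.2], seed=42)
    print(train.schema == df.schema, len(train.rows) + len(test.rows))
    sampled = sample_by_key(df, ["label", "group"], {(0, "a"): 0.5}, seed=1)
    print(sampled.schema, len(sampled.rows))
```

      Treating the key as a tuple of column values makes a single-column key just the one-element case, so the same function signature could serve both answers to the question.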

            People

              josephkb Joseph K. Bradley
              josephkb Joseph K. Bradley
              Votes: 2
              Watchers: 4