Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:

      Description

      SchemaRDD, through its DSL, already provides common data frame functionality. However, the DSL was originally created for constructing test cases, without much consideration for end-user usability or API stability. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL more usable and stable.
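      To illustrate the flavor of the change, a sketch (illustrative only: the old symbol-based DSL syntax is approximate, and the proposed method names come from the design-doc era and may differ from what shipped):

      ```scala
      // Old test-oriented DSL (approximate): columns referenced via Scala symbols.
      //   schemaRDD.select('key, 'value).where('value > 1)

      // Proposed user-facing DSL: columns resolved through the data frame itself.
      //   df.select(df("key"), df("value")).where(df("value") > 1)
      ```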


          Activity

          Reynold Xin added a comment -

          Tomer Kaftan UDFs can be used with a callUDF function, defined in the dsl package object.
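          A hedged sketch of what that might look like (the import path, argument order, and .as method here are assumptions based on the comment above, not verified against a release):

          ```scala
          // Sketch only: callUDF is described above as living in the dsl
          // package object; this import path is an assumption.
          import org.apache.spark.sql.dsl._

          val twice = (x: Int) => x * 2
          // Wrap the Scala function as a UDF, apply it to a column, and name the result:
          df.select(callUDF(twice, df("value")).as("value_x2"))
          ```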

          Apache Spark added a comment -

          User 'rxin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/4241

          Apache Spark added a comment -

          User 'rxin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/4235

          Tomer Kaftan added a comment -

          What are the current plans for how UDFs will fit into this data frames API?

          Sandy Ryza added a comment -

          Ah, yeah, I hadn't considered that aspect. I definitely agree that making them column-mutable would be very ugly. I'd push for add_column over addColumn, but maybe the ship with those naming conventions on it has already sailed.

          Reynold Xin added a comment -

          I've been debating that myself for a while. The main question is whether we want to make SchemaRDD/DataFrame column-mutable. I think it can make certain uses more concise. However, it can also be confusing when the following happens ...

          val df1 = df.map(...)
          df["newCol"] = df["oldCol"] + 1
          df1.map(...)
          

          or

          val df1 = df.as("a")
          df["newCol"] = df["oldCol"] + 1
          df1.join(df) ...
          
          Sandy Ryza added a comment -

          Would it be possible to keep the Python versions of setColumn and addColumn consistent with Pandas? The naming sticks out to me as unpythonic, and it seems like the migration path we'd like to make appealing is Pandas -> Spark, not Spark Scala -> Spark Python.

          Apache Spark added a comment -

          User 'rxin' has created a pull request for this issue:
          https://github.com/apache/spark/pull/4173

          Hamel Ajay Kothari added a comment -

          Thanks for the response Reynold Xin. One more question: how are we planning to support, in this new API, the breadth of operations that expressions enabled? For example, if I want to do a join where rdd1.colA == rdd2.colB, but I want to cast rdd2.colB to String first, how would I do that?

          In the expressions API I could do new EqualTo(colAExpression, Cast(colBExpression, DataType.StringType)) where colAExpression and colBExpression are resolved NamedExpressions. How would this look in the new API?

          I'm happy to take these questions elsewhere if there is a better place to ask. Thanks for your help!
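          For illustration, one way such a cast inside a join could look under a Column-based API (the cast and === methods here are assumptions, not confirmed in-thread):

          ```scala
          // Sketch only: assumes Column exposes cast(...) and === for equality.
          val joined = rdd1.join(rdd2, rdd1("colA") === rdd2("colB").cast("string"))
          ```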

          Reynold Xin added a comment -

          Hamel Ajay Kothari that is correct. It will be trivially doable to select columns at runtime.

          For the second question, not yet. That's a very good point. You can always add an extra projection as a workaround. We will try to add it, if not in the first iteration, then in the second.
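          A sketch of the extra-projection workaround mentioned above (column and method names are assumptions):

          ```scala
          // Project colB to a string column first, then join on the pre-cast value.
          val rdd2Str = rdd2.select(rdd2("colB").cast("string").as("colB"))
          val joined  = rdd1.join(rdd2Str, rdd1("colA") === rdd2Str("colB"))
          ```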

          Hamel Ajay Kothari added a comment -

          Am I correct in interpreting that this would allow us to trivially select columns at runtime since we'd just use SchemaRDD(stringColumnName)? In the world of catalyst selecting columns known only at runtime was a real pain because the only defined way to do it in the docs was to use quasiquotes or use SchemaRDD.baseLogicalPlan.resolve(). The first couldn't be defined at runtime (as far as I know) and the second required you to depend on expressions.

          Also, is there any way to control the name of the resulting columns from groupby+aggregate (or similar methods that add columns) in this plan?
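          Sketches of both points, assuming the proposed API (the df(name) resolution, sum, and .as(...) calls here are assumptions):

          ```scala
          // Selecting a column whose name is known only at runtime:
          val colName: String = sys.props.getOrElse("col", "value")
          val projected = df.select(df(colName))

          // Controlling the name of an aggregate output column via .as(...):
          val totals = df.groupBy(df("key")).agg(sum(df("value")).as("total"))
          ```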

          Mohit Jaggi added a comment -

          Minor comment: to mutate an existing column, one can do
          df("x") = df("x") ....

          Reynold Xin added a comment -

          Mohit Jaggi thanks for commenting. The implementation is actually pretty minor (it is mostly about finalizing the API). It would be great if you can review the design doc and chime in, and later on also review my initial pull request. Once the first pull request is in, I'm sure we will have more splittable tasks.

          Mohit Jaggi added a comment -

          Hi,
          This is Mohit Jaggi, author of https://github.com/AyasdiOpenSource/bigdf
          Matei had suggested integrating bigdf with SchemaRDD and I was planning on doing that soon.
          I would love to contribute to this item. Most of the constructs mentioned in the design document already exist in bigdf.

          Mohit.


            People

            • Assignee: Reynold Xin
            • Reporter: Reynold Xin
            • Votes: 4
            • Watchers: 27
