  1. Spark
  2. SPARK-6116

DataFrame API improvement umbrella ticket (Spark 1.5)

    Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Labels:
    • Target Version/s:
    • Sprint: Spark 1.5 doc/QA sprint

      Description

      An umbrella ticket for DataFrame API improvements for Spark 1.5.

      SPARK-9576 is the ticket for Spark 1.6.

          Issue Links

          #. Summary | Type | Status | Assignee
          1. describe function for summary statistics | Sub-task | Resolved | Andrey Zagrebin
          2. DataFrame.dropna support | Sub-task | Resolved | Reynold Xin
          3. DataFrame.fillna | Sub-task | Resolved | Reynold Xin
          4. Add RDD methods to DataFrame to preserve schema | Sub-task | Resolved | Joseph K. Bradley
          5. SQLContext.implicits should provide automatic conversion for RDD[Row] | Sub-task | Closed | Unassigned
          6. DataFrame.na.replace value support in Scala/Java | Sub-task | Resolved | Reynold Xin
          7. DataFrame.na.replace value support for Python | Sub-task | Resolved | Adrian Wang
          8. SQLContext.emptyDataFrame should contain 0 rows, not 1 row | Sub-task | Resolved | Reynold Xin
          9. SQLContext.registerFunction -> SQLContext.udf.register | Sub-task | Resolved | Davies Liu
          10. Alias DataFrame.na.fill/drop in Python | Sub-task | Resolved | Reynold Xin
          11. Make DataFrame.rdd a lazy val | Sub-task | Resolved | Cheng Lian
          12. Decide on semantics for string identifiers in DataFrame API | Sub-task | Resolved | Reynold Xin
          13. not able to resolve dot('.') in field name | Sub-task | Resolved | Wenchen Fan
          14. Join on two tables (generated from same one) is broken | Sub-task | Resolved | Reynold Xin
          15. Create a DataFrame join API to facilitate equijoin and self join | Sub-task | Resolved | Reynold Xin
          16. Missing alias function on Python DataFrame | Sub-task | Resolved | Yin Huai
          17. Drop __getattr__ on pyspark.sql.DataFrame | Sub-task | Closed | Unassigned
          18. Stabilize Spark SQL data type API followup | Sub-task | Resolved | Reynold Xin
          19. Stabilize data types | Sub-task | Resolved | Reynold Xin
          20. UDF clean up | Sub-task | Resolved | Reynold Xin
          21. Remove PrimitiveType | Sub-task | Resolved | Reynold Xin
          22. Rename NativeType -> AtomicType | Sub-task | Resolved | Reynold Xin
          23. Clean up Python data type hierarchy | Sub-task | Resolved | Davies Liu
          24. Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python | Sub-task | Resolved | Wenchen Fan
          25. Expression for monotonically increasing IDs | Sub-task | Resolved | Reynold Xin
          26. Document what the options: Map[String, String] does on DataFrame.save and DataFrame.saveAsTable | Sub-task | Closed | Unassigned (0% progress; Original Estimate - 2h; Remaining Estimate - 2h)
          27. Support math functions in DataFrames | Sub-task | Resolved | Burak Yavuz
          28. Support math functions in DataFrames in Python | Sub-task | Resolved | Burak Yavuz
          29. SQLContext.range() | Sub-task | Resolved | Adrian Wang
          30. Correlation methods for DataFrame | Sub-task | Closed | Burak Yavuz
          31. Add a Column expression for partition ID | Sub-task | Resolved | Reynold Xin
          32. Add randomSplit method to DataFrame | Sub-task | Resolved | Burak Yavuz
          33. Add approximate stratified sampling to DataFrame | Sub-task | Resolved | Xiangrui Meng
          34. collect and take return different results | Sub-task | Resolved | Cheng Hao
          35. Make repartition and coalesce a part of the query plan | Sub-task | Resolved | Burak Yavuz
          36. Support math functions in R DataFrame | Sub-task | Resolved | Qian Huang
          37. Support fillna / dropna in R DataFrame | Sub-task | Resolved | Sun Rui
          38. Create Column expression for array/struct creation | Sub-task | Resolved | Reynold Xin
          39. Add a method for dropping a column in Java/Scala | Sub-task | Resolved | Rakesh Chalasani
          40. withColumn is very slow on dataframe with large number of columns | Sub-task | Resolved | Wenchen Fan
          41. Audit missing Hive functions | Sub-task | Resolved | Reynold Xin
          42. Add Pandas' shift method to the Dataframe API | Sub-task | Closed | Unassigned
          43. Random number generators for DataFrames | Sub-task | Resolved | Burak Yavuz
          44. Add a between function in Column | Sub-task | Resolved | Chen Song
          45. Add bitwise operations to DataFrame DSL | Sub-task | Resolved | Shiti Saxena
          46. Improve the output from DataFrame.show() | Sub-task | Resolved | Chen Song
          47. Add rollup and cube support to DataFrame Java/Scala DSL | Sub-task | Resolved | Cheng Hao
          48. Add Column expression for conditional statements (if, case) | Sub-task | Resolved | Chen Song
          49. Window function support in Scala/Java DataFrame DSL | Sub-task | Resolved | Cheng Hao
          50. Add DataFrame.dropDuplicates | Sub-task | Resolved | Reynold Xin
          51. Move mathfunctions into functions | Sub-task | Resolved | Burak Yavuz
          52. Add coalesce Spark SQL function to PySpark API | Sub-task | Resolved | Olivier Girardot
          53. By default retain group by columns in aggregate | Sub-task | Resolved | Reynold Xin
          54. Provide DataFrame.zip (analog of RDD.zip) to merge two data frames | Sub-task | Closed | Ram Sriharsha
          55. pyspark.sql.types.StructType and Row should implement __iter__() | Sub-task | Closed | Unassigned
          56. pyspark.sql.types.StructType.fromJson() is incorrectly named | Sub-task | Closed | Unassigned
          57. Add drop column to Python DataFrame API | Sub-task | Resolved | Reynold Xin
          58. Break dataframe.py into multiple files | Sub-task | Resolved | Davies Liu
          59. Add explode expression | Sub-task | Resolved | Michael Armbrust
          60. Don't split by dot if within backticks for DataFrame attribute resolution | Sub-task | Resolved | Wenchen Fan
          61. Document all SQL/DataFrame public methods with @since tag | Sub-task | Resolved | Reynold Xin
          62. Document all PySpark SQL/DataFrame public methods with @since tag | Sub-task | Resolved | Davies Liu
          63. DataFrameReader and DataFrameWriter for input/output API | Sub-task | Resolved | Reynold Xin
          64. make explode support struct type | Sub-task | Closed | Unassigned
          65. DataFrame reader/writer API in Python | Sub-task | Resolved | Davies Liu
          66. Figure out what to do with insertInto w.r.t. DataFrameWriter API | Sub-task | Closed | Yin Huai
          67. Add standard deviation aggregate expression | Sub-task | Closed | Unassigned
          68. Add rollup and cube support to DataFrame Python DSL | Sub-task | Resolved | Davies Liu
          69. Window function support in Python DataFrame DSL | Sub-task | Resolved | Davies Liu
          70. Better error for unresolved window functions. | Sub-task | Resolved | Michael Armbrust
          71. DataFrame.ntile() should only accept Int as parameter | Sub-task | Resolved | Davies Liu
          72. Add JavaDoc style deprecation for deprecated DataFrame methods | Sub-task | Resolved | Reynold Xin
          73. Support SQLContext.range(end) | Sub-task | Resolved | Animesh Baranawal
          74. Improve DataFrame Python exception | Sub-task | Closed | Davies Liu
          75. crosstab should use 0 instead of null for pairs that don't appear | Sub-task | Resolved | Reynold Xin
          76. Add methods to facilitate equi-join on multiple join keys | Sub-task | Resolved | Liang-Chi Hsieh
          77. Python DataFrame: support passing a list into describe | Sub-task | Resolved | Amey Chaugule
          78. Improve DataFrame.show() output | Sub-task | Resolved | Akhil Thatipamula
          79. Improve frequent items documentation | Sub-task | Resolved | Burak Yavuz
          80. DataFrameReader/Writer in Python does not match Scala | Sub-task | Resolved | Davies Liu
          81. Add Column.alias to Scala/Java API | Sub-task | Resolved | Reynold Xin
          82. Design an easier way to construct schema for both Scala and Python | Sub-task | Resolved | Ilya Ganelin
          83. Improve Python reader/writer interface doc and testing | Sub-task | Resolved | Reynold Xin
          84. Better AnalysisException for writing DataFrame with identically named columns | Sub-task | Resolved | Animesh Baranawal
          85. DataFrame Python API: Alias replace in DataFrameNaFunctions | Sub-task | Resolved | Reynold Xin
          86. Improve error message reporting for DataFrame and SQL | Sub-task | Resolved | Michael Armbrust
          87. DataFrame hint for broadcast join | Sub-task | Resolved | Reynold Xin
          88. Python DataFrameReader/Writer should mirror scala | Sub-task | Resolved | Cheolsoo Park
          89. Reconcile callUDF and callUdf | Sub-task | Resolved | Benjamin Fradet
          90. Prevent accidental use of "and" and "or" to build invalid expressions in Python | Sub-task | Resolved | Davies Liu
          91. For PySpark's DataFrame API, we need to throw exceptions when users try to use and/or/not | Sub-task | Resolved | Davies Liu
          92. expr function to convert SQL expression into a Column | Sub-task | Resolved | Joseph Batchik
          93. dataframe left joins are not working as expected in pyspark | Sub-task | Resolved | Davies Liu
          94. partitionBy in Python DataFrame reader/writer interface should not default to empty tuple | Sub-task | Resolved | Reynold Xin
          95. Add a "pretty" parameter to show | Sub-task | Resolved | Shixiong Zhu
          96. DataFrame Python API should work with column which has non-ascii character in it | Sub-task | Resolved | Davies Liu
          97. In should not take Any not Column | Sub-task | Resolved | Unassigned
          98. Maintain binary compatibility for in function | Sub-task | Closed | Unassigned
          99. Good errors for invalid input to ExpectsInput expressions | Sub-task | Resolved | Michael Armbrust
          100. Hide JVM stack trace for IllegalArgumentException in Python | Sub-task | Resolved | Liang-Chi Hsieh
          101. Rename inSet to isin to match Pandas function | Sub-task | Resolved | Reynold Xin
          Rename inSet to isin to match Pandas function Sub-task Resolved Reynold Xin  

              People

              • Assignee: Reynold Xin (rxin)
              • Reporter: Reynold Xin (rxin)
              • Votes: 0
              • Watchers: 14

                  Time Tracking

                  Estimated: 2h
                  Remaining: 2h
                  Logged: Not Specified