SPARK-18541

Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.2
    • Fix Version/s: 2.2.0
    • Component/s: PySpark, SQL
    • Environment: all

    Description

      In the Scala SQL API, you can pass new metadata when you alias a column. That functionality is not available in the Python API. Right now, the only way to change the metadata of even a single column is to painfully rebuild the whole DataFrame with SparkSession.createDataFrame (a sketch of that workaround appears at the end of this description). I propose adding the following method to pyspark.sql.Column:

      def aliasWithMetadata(self, name, metadata):
          """
          Return a new Column with the provided alias and metadata.
          The metadata dict is serialized with json.dumps().
          """
          # Assumes `json` and `pyspark` are imported at module level,
          # as they would be inside pyspark/sql/column.py.
          _context = pyspark.SparkContext._active_spark_context
          _metadata_str = json.dumps(metadata)
          _metadata_jvm = _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str)
          # `as` is a Python keyword, so the JVM Column.as(alias, metadata)
          # overload has to be reached via getattr.
          _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm)
          return Column(_new_java_column)
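
      For illustration, here is one way the proposed method could be exercised today by monkey-patching it onto Column; this assumes the aliasWithMetadata function above has been defined at module level, and the column name and metadata values are illustrative, not part of any released API:

      import json
      import pyspark
      from pyspark.sql import Column, SparkSession

      # Hypothetical: attach the proposed method for experimentation only.
      Column.aliasWithMetadata = aliasWithMetadata

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 'a')], ['id', 'label'])

      # Alias a column and attach metadata in a single chainable call.
      df2 = df.select(df['label'].aliasWithMetadata('category', {'comment': 'raw category'}))
      print(df2.schema.fields[0].metadata)  # {'comment': 'raw category'}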
      

      I can likely complete this request myself if there is any interest in it; I just need to dust off my knowledge of doctest and track down the location of the Python tests.
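
      For reference, a minimal sketch of the SparkSession.createDataFrame workaround mentioned above, with an illustrative column name and metadata value; the whole schema has to be rebuilt just to touch one column:

      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructField, StructType

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1, 'a')], ['id', 'label'])

      # Swap in new metadata for the one target column...
      new_fields = [
          StructField(f.name, f.dataType, f.nullable, {'comment': 'raw category'})
          if f.name == 'label' else f
          for f in df.schema.fields
      ]
      # ...then rebuild the entire DataFrame from its underlying RDD.
      df2 = spark.createDataFrame(df.rdd, StructType(new_fields))
      print(df2.schema.fields[1].metadata)  # {'comment': 'raw category'}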

          People

            Assignee: Shea Parkes
            Reporter: Shea Parkes
            Votes: 0
            Watchers: 4


              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified