Spark / SPARK-32562

Pyspark drop duplicate columns

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: Patch

    Description

      Hi all,

      This is a suggestion: can we have a feature in PySpark to remove duplicate columns?

      I have come up with a small piece of code for that:

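      def drop_duplicate_columns(_rdd_df):
          # Column names exactly as the DataFrame reports them, duplicates included.
          column_names = _rdd_df.columns
          # Names that occur more than once.
          duplicate_columns = {x for x in column_names if column_names.count(x) > 1}
          # drop() matches by name, so every column carrying a duplicated name is removed.
          _rdd_df = _rdd_df.drop(*duplicate_columns)
          return _rdd_df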
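
      Since the function above passes names to drop(), it looks like it removes every copy of a duplicated column rather than keeping one of them. A minimal sketch of a keep-first variant follows, assuming the input is a PySpark DataFrame and that names are compared exactly (case-sensitively). The helper names first_occurrence_indices and drop_duplicate_columns_keep_first and the "__tmp_" prefix are hypothetical and not part of PySpark.

```python
def first_occurrence_indices(column_names):
    """Return the positions of the first occurrence of each column name."""
    seen = set()
    keep = []
    for i, name in enumerate(column_names):
        if name not in seen:
            seen.add(name)
            keep.append(i)
    return keep


def drop_duplicate_columns_keep_first(df):
    """Keep only the first column for each name (assumes a PySpark DataFrame)."""
    names = df.columns
    # Rename positionally to temporary unique names so that every column
    # can be referenced without ambiguity ("__tmp_" is an arbitrary prefix).
    temp_names = [f"__tmp_{i}" for i in range(len(names))]
    renamed = df.toDF(*temp_names)
    # Select the first occurrence of each name and restore the original name.
    keep = first_occurrence_indices(names)
    return renamed.select([renamed[temp_names[i]].alias(names[i]) for i in keep])
```

      For a DataFrame with columns id, name, id, this sketch would be expected to keep the first id and name, whereas dropping by name would leave only name.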
      
      

      Your suggestions are appreciated. I can work on a PR for this; it would be my first contribution (PR) to PySpark if you agree with it.

People

    • Assignee: Unassigned
    • Reporter: abhijeet dada mote (amotex)
    • Votes: 0
    • Watchers: 2

Time Tracking

    • Original Estimate: 1h
    • Remaining Estimate: 1h
    • Time Spent: Not Specified