[SPARK-32562] Pyspark drop duplicate columns - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: PySpark
Labels:
- newbie
- starter

Flags: Patch

Description

Hi All,

This is a suggestion: could we have a feature in PySpark to remove duplicate columns?

I have come up with a small piece of code for that:

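from collections import Counter

def drop_duplicate_columns(df):
    # Count how many times each column name appears in the DataFrame.
    name_counts = Counter(df.columns)
    # Collect the names that appear more than once.
    duplicate_columns = [name for name, count in name_counts.items() if count > 1]
    # Drop the columns carrying those names and return the resulting DataFrame.
    return df.drop(*duplicate_columns)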
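
One caveat, as far as I understand `DataFrame.drop`: when called with a name, it removes every column with that name, so the function above would likely discard all copies of a duplicated column rather than keep one. Below is a minimal sketch of a variant that keeps the first occurrence. It assumes the DataFrame can be renamed positionally with `DataFrame.toDF`. The helper names `first_occurrence_plan` and `drop_duplicate_columns_keep_first`, and the `__dup_` placeholder prefix, are hypothetical and not part of PySpark.

```python
def first_occurrence_plan(column_names):
    """Return (unique_names, names_to_drop) for a list of column names.

    The first occurrence of each name is kept as-is. Later occurrences get
    a placeholder name (hypothetical "__dup_<position>_<name>" scheme) and
    are listed in names_to_drop.
    """
    seen = set()
    unique_names = []
    names_to_drop = []
    for position, name in enumerate(column_names):
        if name in seen:
            # Repeated name: give it a position-based placeholder so it can
            # be dropped without touching the first occurrence.
            placeholder = f"__dup_{position}_{name}"
            unique_names.append(placeholder)
            names_to_drop.append(placeholder)
        else:
            seen.add(name)
            unique_names.append(name)
    return unique_names, names_to_drop


def drop_duplicate_columns_keep_first(df):
    """Sketch: keep the first column of each name, drop the later copies.

    Assumes `df` is a PySpark DataFrame and that no existing column already
    uses one of the generated placeholder names.
    """
    unique_names, names_to_drop = first_occurrence_plan(df.columns)
    # toDF renames columns by position; drop then removes the placeholders.
    return df.toDF(*unique_names).drop(*names_to_drop)
```

The name-planning step is kept separate from the DataFrame calls so it can be checked on plain lists of column names without a Spark session.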

Your suggestions are appreciated. I can work on a PR for this; it would be my first contribution (PR) to PySpark, if you agree with it.

People

Assignee: Unassigned

Reporter: abhijeet dada mote

Votes: 0

Watchers: 2

Dates

Created: 07/Aug/20 04:11

Updated: 12/Dec/22 18:10

Resolved: 11/Aug/20 09:09
