Details
Description
Hi all,
This is a suggestion: could we add a feature to PySpark for removing duplicate columns from a DataFrame?
I have come up with a small function for this:
    def drop_duplicate_columns(_rdd_df):
        column_names = _rdd_df.columns
        duplicate_columns = set([x for x in column_names if column_names.count(x) > 1])
        _rdd_df = _rdd_df.drop(*duplicate_columns)
        return _rdd_df
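One caveat worth noting: `DataFrame.drop(name)` removes every column matching that name, so the function above discards all copies of a duplicated column rather than keeping one. A minimal sketch of a variant that keeps the first occurrence of each name (the `__col{i}` placeholder names are an assumption, used only so selection is unambiguous when two columns share a name):

```python
def first_occurrence_indices(names):
    """Return the indices of the first occurrence of each name, in order."""
    seen = set()
    keep = []
    for i, name in enumerate(names):
        if name not in seen:
            seen.add(name)
            keep.append(i)
    return keep

def drop_duplicate_columns(df):
    """Keep only the first occurrence of each column name in a DataFrame."""
    cols = df.columns
    keep = first_occurrence_indices(cols)
    # Rename every column to a unique placeholder so select() is unambiguous,
    # select the kept positions, then restore the original names.
    tmp = ["__col{}".format(i) for i in range(len(cols))]
    return (df.toDF(*tmp)
              .select(*[tmp[i] for i in keep])
              .toDF(*[cols[i] for i in keep]))
```

For example, a DataFrame with columns `["a", "b", "a"]` would come back with columns `["a", "b"]`, retaining the data from the first `a`.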
Your suggestions are appreciated. I am happy to work on a PR for this; it would be my first contribution to PySpark if you agree with the idea.