Spark / SPARK-38004

read_excel's mangle_dupe_cols parameter is meant to handle duplicate columns, but fails when the duplicate column names differ only in case.

Details

    • Type: Documentation
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      mangle_dupe_cols defaults to True, so read_excel should handle duplicate columns.
      However, when the column names differ only in case, it fails with the error below.

      AnalysisException: Reference 'Sheet.col' is ambiguous, could be Sheet.col, Sheet.col.

      Here the two columns are named Col and cOL.

      The best practices guide advises against using case sensitive column names: https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names

      Either the read_excel/mangle_dupe_cols documentation should be updated to mention this limitation, or the case should be handled.
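
For illustration, here is a minimal sketch of what handling this case could look like. It assumes the failure comes from Spark SQL resolving column names case-insensitively by default (`spark.sql.caseSensitive` is false), so that Col and cOL collide. The helper name `mangle_case_insensitive` is hypothetical and not part of pandas or PySpark; the `.1` suffix style is chosen only to mirror how pandas mangles exact duplicates.

```python
from collections import defaultdict


def mangle_case_insensitive(names):
    """Rename columns so no two names are equal when compared case-insensitively.

    Hypothetical helper for illustration only. The first occurrence keeps its
    name; later collisions get a numeric suffix, e.g. 'Col', 'cOL' -> 'Col', 'cOL.1'.
    """
    taken = set()              # lowercased names already used in the result
    counts = defaultdict(int)  # lowercased original name -> last suffix tried
    result = []
    for name in names:
        key = name.lower()
        candidate = name
        # Keep bumping the suffix until the candidate is unique ignoring case.
        while candidate.lower() in taken:
            counts[key] += 1
            candidate = f"{name}.{counts[key]}"
        taken.add(candidate.lower())
        result.append(candidate)
    return result


print(mangle_case_insensitive(["Col", "cOL", "other"]))
# prints ['Col', 'cOL.1', 'other']
```

As a workaround under the same assumption, the renamed list could be assigned to a plain pandas DataFrame's `columns` before the data is handed to Spark. Names containing a dot may need backtick quoting in Spark SQL expressions, so a different separator might be preferable in practice.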

          People

            Assignee: Unassigned
            Reporter: Saikrishna Pujari
            Votes: 0
            Watchers: 4
