Spark / SPARK-12874

ML StringIndexer does not protect itself from column name duplication


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.2, 1.6.0
    • Fix Version/s: 1.6.1, 2.0.0
    • Component/s: ML
    • Labels: None

    Description

      StringIndexerModel does not check the schema of the input DataFrame when performing transform(). As a result, transform() can produce a DataFrame that contains columns with duplicate names.

      This issue is similar to SPARK-12711. StringIndexer could use transformSchema to ensure that the input DataFrame's schema is consistent with the parameter values.
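
      The proposed check can be illustrated with a minimal, Spark-free sketch. This is a model of the idea only, not Spark's implementation: `Field`, `transform_schema`, and `transform` below are hypothetical names, and the schema is assumed to be a plain list of named fields. The point is that `transform` validates the schema against the parameter values before it appends the output column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (name plus type label)."""
    name: str
    dtype: str


def transform_schema(schema, input_col, output_col):
    """Validate the schema against the parameters and return the output schema.

    Raises ValueError if the input column is missing or if the output
    column would duplicate an existing column name.
    """
    names = [f.name for f in schema]
    if input_col not in names:
        raise ValueError(f"Input column {input_col!r} does not exist.")
    if output_col in names:
        raise ValueError(f"Output column {output_col!r} already exists.")
    return schema + [Field(output_col, "double")]


def transform(rows, schema, input_col, output_col, labels):
    """Append a label index column, validating the schema first."""
    # The schema check runs before any data is touched, so a bad
    # parameter combination fails early instead of duplicating a column.
    out_schema = transform_schema(schema, input_col, output_col)
    idx = [f.name for f in schema].index(input_col)
    label_to_index = {label: float(i) for i, label in enumerate(labels)}
    out_rows = [row + (label_to_index[row[idx]],) for row in rows]
    return out_rows, out_schema
```

      With a schema of `id` and `category`, calling `transform` with `output_col="categoryIndex"` succeeds, while `output_col="id"` raises a `ValueError` rather than yielding two columns named `id`.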

      Please confirm. Then, I'll prepare a PR to resolve the bug.

      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147


          People

            Assignee: Yu Ishikawa
            Reporter: Wojciech Jurczyk
            Votes: 0
            Watchers: 4
