[SPARK-12874] ML StringIndexer does not protect itself from column name duplication - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.2, 1.6.0
Fix Version/s: 1.6.1, 2.0.0
Component/s: ML
Labels: None

Description

StringIndexerModel does not check the schema of the input DataFrame when performing transform(). As a result, it is possible to create a DataFrame that contains columns with duplicated names.

This issue is similar to SPARK-12711. StringIndexer could use transformSchema to ensure that the input DataFrame schema is valid with respect to the parameter values.
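The kind of check suggested above can be sketched without Spark. This is only a minimal illustration, not the actual fix: the schema is modeled as a plain list of column names, and `SchemaCheckSketch` and `validateAndTransformSchema` are hypothetical names chosen for this example. The assumption is that the check should reject an output column name that already exists in the input schema.

```scala
// Hypothetical, Spark-free model of the proposed schema check.
// A schema is reduced to an ordered list of column names.
object SchemaCheckSketch {

  /** Returns the output column names, or throws IllegalArgumentException
    * if the input column is missing or the output column already exists.
    */
  def validateAndTransformSchema(
      fieldNames: Seq[String],
      inputCol: String,
      outputCol: String): Seq[String] = {
    // The column to be indexed must be present in the input schema.
    require(fieldNames.contains(inputCol),
      s"Input column $inputCol does not exist.")
    // Appending a column whose name is already taken would create a duplicate.
    require(!fieldNames.contains(outputCol),
      s"Output column $outputCol already exists.")
    fieldNames :+ outputCol
  }
}
```

With this sketch, `validateAndTransformSchema(Seq("id", "category"), "category", "categoryIndex")` returns the three column names, while passing `"id"` as the output column fails fast instead of producing a duplicated name. In Spark itself, a check like this would presumably be invoked from transform() via transformSchema, as the description proposes.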

Please confirm, and I'll prepare a PR to resolve the bug.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147

Issue Links

links to

[Github] Pull Request #11370 (yu-iskw)

People

Assignee: Yu Ishikawa

Reporter: Wojciech Jurczyk

Votes: 0

Watchers: 4

Dates

Created: 18/Jan/16 07:35

Updated: 27/Feb/16 04:04

Resolved: 25/Feb/16 21:21