SPARK-17043

Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

Description

I have a method that adds a row index column to a DataFrame. It only works correctly if the DataFrame has fewer than 200 columns. With more than 200 columns, nearly all of the data becomes empty ("" for every value).

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Appends a 0-based long row index column to the end of the schema.
    def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
      val nullable = false
      df.sparkSession.createDataFrame(
        df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
        StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable))
      )
    }
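
A minimal reproduction sketch using the helper above (the column count, names, and row values here are illustrative, not from the original report):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // 201 columns: one past the apparent 200-column threshold.
    val numCols = 201
    val schema = StructType((1 to numCols).map(i => StructField(s"c$i", StringType, nullable = false)))
    val rows = spark.sparkContext.parallelize(
      (1 to 5).map(r => Row.fromSeq((1 to numCols).map(c => s"r${r}c$c")))
    )
    val df = spark.createDataFrame(rows, schema)

    // Works for numCols <= 200; with 201 the values reportedly come back as "".
    zipWithIndex(df, "rowIdx").show()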
      

This might be related to https://issues.apache.org/jira/browse/SPARK-16664, but I'm not sure; the 200-column threshold there is what made me suspect a connection. I saw this problem in Spark 1.6.2 and 2.0.0. It may be fixed in 2.0.1 (I have not tried yet). I have no idea why the 200-column threshold is significant.
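
If contiguous indices are not required, one possible workaround (a sketch, not verified against this bug) is the built-in monotonically_increasing_id function in Spark 2.0, which skips the RDD round-trip entirely; its values are unique and increasing but not consecutive:

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Unique, increasing (but non-consecutive) 64-bit ids, no RDD conversion.
    val withId = df.withColumn("rowIdx", monotonically_increasing_id())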

Attachments

Issue Links

Activity

People

    Assignee: Unassigned
    Reporter: Barry Becker (barrybecker4)
    Votes: 0
    Watchers: 1

Dates

    Created:
    Updated:
    Resolved: