SPARK-17043: Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

      Description

      I have a method that adds a row index column to a DataFrame. It only works correctly if the DataFrame has fewer than 200 columns. With more than 200 columns, nearly all the data becomes empty (empty strings for values).

      import org.apache.spark.sql.{DataFrame, Row}
      import org.apache.spark.sql.types.{LongType, StructField, StructType}

      // Appends a 0-based row index column (named rowIdxColName) to df.
      def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
        val nullable = false
        df.sparkSession.createDataFrame(
          // Pair each row with its index and append the index to the row's values.
          df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
          StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable))
        )
      }
      

      This might be related to https://issues.apache.org/jira/browse/SPARK-16664, but I'm not sure; the 200-column threshold there is what made me think the two could be connected. I saw this problem in Spark 1.6.2 and 2.0.0. It may be fixed in 2.0.1 (I have not tried yet). I have no idea why the 200-column threshold is significant.
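
      For context, here is a minimal reproduction sketch. It is not part of the original report: the object name, column count, and sample values are arbitrary choices, it assumes Spark 2.0's SparkSession API, and it assumes the zipWithIndex method above is in scope. It builds a toy DataFrame with 201 string columns, one past the threshold, and appends a row index to it.

      import org.apache.spark.sql.{Row, SparkSession}
      import org.apache.spark.sql.types.{StringType, StructField, StructType}

      // Hypothetical reproduction harness (not from the report itself).
      object Spark17043Repro {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .master("local[*]")
            .appName("SPARK-17043 repro")
            .getOrCreate()

          val numCols = 201  // one past the 200-column threshold described above
          val schema = StructType((1 to numCols).map(i =>
            StructField(s"c$i", StringType, nullable = false)))
          val rows = spark.sparkContext.parallelize(
            (1 to 10).map(r => Row.fromSeq((1 to numCols).map(c => s"r${r}c$c"))))
          val df = spark.createDataFrame(rows, schema)

          // Per the report, on the affected versions (1.6.2, 2.0.0) nearly all
          // values come back as empty strings once the column count exceeds 200.
          zipWithIndex(df, "rowIdx").show(5)
        }
      }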

        • Assignee: Unassigned
        • Reporter: Barry Becker (barrybecker4)
        • Votes: 0
        • Watchers: 1
