Spark / SPARK-15376

DataFrame write.jdbc() inserts more rows than actual


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.4.1, 1.5.0, 1.6.1
    • Fix Version/s: None
    • Component/s: None
    • Environment: CentOS 6, cluster mode
      Cores: 300 (300 granted, 0 left)
      Executor Memory: 45.0 GB
      Submit Date: Wed May 18 10:26:40 CST 2016

    Description

      It's an odd bug that occurs in the following situation:

      Bar.scala
          val rddRaw = sc.textFile("xxx").map(xxx).sample(false, 0.15)
          // The number of rows actually inserted into MySQL is larger than the RDD's record count.
          // In my case: 239994 records in the RDD vs. ~241300 rows inserted into the database.
          println(rddRaw.count())

          // Iterate over the rows in another way; if the Range for-loop is dropped, the bug does not occur.
          for (some_id <- Range(some_ids_all_range)) {
            rddRaw.filter(_._2 == some_id)
              .randomSplit(Array(x, x, x), 1)
              .foreach { rd =>
                // val curCnt = rd.count()  // if count() is invoked on rd before the write, the counts come out right
                rd.map(x => new TestRow(null, xxx)).toDF().write.mode(SaveMode.Append).jdbc(xxx)
              }
          }
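
      If the extra rows come from the sampled data being recomputed, one workaround is to persist the sampled RDD before counting and writing, so that count() and the JDBC write see the same rows. The snippet below is only a hedged sketch of that idea, not a confirmed fix for this issue: the input path, tuple schema, column names, JDBC URL, and table name are hypothetical placeholders rather than values from the job above, and it uses the Spark 1.x SQLContext API to match the affected versions.

          import java.util.Properties

          import org.apache.spark.sql.{SQLContext, SaveMode}
          import org.apache.spark.storage.StorageLevel

          val sqlContext = new SQLContext(sc)   // sc: the existing SparkContext, as in the report
          import sqlContext.implicits._

          // Persist the sampled RDD so that count() and the later write act on the same rows.
          // Without persistence (and without a fixed seed), sample() may be re-evaluated on
          // each action and select a different subset of records every time.
          val sampled = sc.textFile("hdfs:///input/path")   // hypothetical input path
            .map(line => (line, line.length))
            .sample(withReplacement = false, fraction = 0.15, seed = 42L)
            .persist(StorageLevel.MEMORY_AND_DISK)

          println(sampled.count())                          // stable count after persist

          val props = new Properties()                      // JDBC credentials would go here
          sampled.toDF("value", "len")
            .write
            .mode(SaveMode.Append)
            .jdbc("jdbc:mysql://host:3306/db", "test_table", props)   // hypothetical URL and table

      If persisting (or fixing the sample seed) makes the counts match, that would point at the unpersisted sample()/randomSplit() lineage being re-evaluated between count() and the JDBC write, rather than at the JDBC writer itself.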
      

          People

            Assignee: Unassigned
            Reporter: xiaoyu chen (chenxiaoyu3)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: