[SPARK-2079] Support batching when serializing SchemaRDD to Python

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.1, 1.1.0
    • Component/s: PySpark, SQL
    • Labels: None

Description

      Finish the TODO in SchemaRDD.javaToPython:

        private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
          val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
          this.mapPartitions { iter =>
            val pickle = new Pickler
            iter.map { row =>
              val map: JMap[String, Any] = new java.util.HashMap
              // TODO: We place the map in an ArrayList so that the object is pickled to a List[Dict].
              // Ideally we should be able to pickle an object directly into a Python collection so we
              // don't have to create an ArrayList every time.
              val arr: java.util.ArrayList[Any] = new java.util.ArrayList
              row.zip(fieldNames).foreach { case (obj, name) =>
                map.put(name, obj)
              }
              arr.add(map)
              pickle.dumps(arr)
            }
          }
        }
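
      One way to address this is to batch rows before pickling, so each dumps() call produces a list of dicts rather than a single-element list per row. Below is a minimal sketch of that idea, reusing the Pyrolite Pickler from the snippet above; the fixed batch size of 100 and the exact structure are illustrative assumptions, not the committed fix:

        import scala.collection.JavaConverters._

        private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
          val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
          this.mapPartitions { iter =>
            val pickle = new Pickler
            iter.map { row =>
              // Build one dict per row, keyed by column name.
              val map: JMap[String, Any] = new java.util.HashMap
              row.zip(fieldNames).foreach { case (obj, name) => map.put(name, obj) }
              map
            }.grouped(100).map { batch =>
              // Pickle the whole batch in one dumps() call, so Python
              // unpickles a List[Dict] per element instead of a
              // single-element list per row.
              pickle.dumps(new java.util.ArrayList[Any](batch.asJava))
            }
          }
        }

      On the Python side, each unpickled element would then be a list of row dicts, which the SchemaRDD wrapper would need to flatten (e.g. with a flatMap over the deserialized batches) before handing rows to user code.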
      

People

    Assignee: kzhang Kan Zhang
    Reporter: kzhang Kan Zhang
