Spark / SPARK-2079

Support batching when serializing SchemaRDD to Python


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.1, 1.1.0
    • Component/s: PySpark, SQL
    • Labels: None

      Description

      Finish the TODO in `javaToPython`:

        private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
          val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
          this.mapPartitions { iter =>
            val pickle = new Pickler
            iter.map { row =>
              val map: JMap[String, Any] = new java.util.HashMap
              // TODO: We place the map in an ArrayList so that the object is pickled to a List[Dict].
              // Ideally we should be able to pickle an object directly into a Python collection so we
              // don't have to create an ArrayList every time.
              val arr: java.util.ArrayList[Any] = new java.util.ArrayList
              row.zip(fieldNames).foreach { case (obj, name) =>
                map.put(name, obj)
              }
              arr.add(map)
              pickle.dumps(arr)
            }
          }
        }
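The batching idea behind this issue can be sketched in Python (illustrative only; the actual fix serializes rows on the JVM side using the `Pickler` shown above, and the batch size is an assumption here): instead of one `dumps` call per row, rows are pickled in fixed-size batches, so each byte chunk unpickles to a list of rows and the per-call pickling overhead is amortized.

```python
import pickle
from itertools import islice

def batched_serialize(rows, batch_size=100):
    """Pickle rows in fixed-size batches instead of one call per row.

    Each yielded byte string unpickles to a list of rows, which is the
    batching scheme this issue proposes for the SchemaRDD -> Python path.
    Sketch only: Spark pickles on the JVM, not with Python's pickle.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield pickle.dumps(batch)

# Dict-per-row records, mirroring the List[Dict] shape built in javaToPython
records = [{"name": "row%d" % i, "value": i} for i in range(250)]
chunks = list(batched_serialize(records, batch_size=100))
# 250 rows in batches of 100 -> 3 chunks; flattening restores the rows
restored = [row for chunk in chunks for row in pickle.loads(chunk)]
```

On the Python side, the receiver simply unpickles each chunk and flattens, so batching is transparent to downstream consumers of the RDD.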
      

            People

            • Assignee: Kan Zhang (kzhang)
            • Reporter: Kan Zhang (kzhang)
            • Votes: 0
            • Watchers: 2
