Spark / SPARK-2079

Support batching when serializing SchemaRDD to Python


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.1, 1.1.0
    • Component/s: PySpark, SQL
    • Labels: None

      Description

      Finish the TODO in `javaToPython`:

        private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
          val fieldNames: Seq[String] = this.queryExecution.analyzed.output.map(_.name)
          this.mapPartitions { iter =>
            val pickle = new Pickler
            iter.map { row =>
              val map: JMap[String, Any] = new java.util.HashMap
              // TODO: We place the map in an ArrayList so that the object is pickled to a List[Dict].
              // Ideally we should be able to pickle an object directly into a Python collection so we
              // don't have to create an ArrayList every time.
              val arr: java.util.ArrayList[Any] = new java.util.ArrayList
              row.zip(fieldNames).foreach { case (obj, name) =>
                map.put(name, obj)
              }
              arr.add(map)
              pickle.dumps(arr)
            }
          }
        }
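The batching idea behind this issue can be sketched in Python (illustrative only; the actual fix serializes rows on the JVM side using the `Pickler` shown above, and the batch size is an assumption here): instead of one `dumps` call per row, rows are pickled in fixed-size batches, so each byte chunk unpickles to a list of rows and the per-call pickling overhead is amortized.

```python
import pickle
from itertools import islice

def batched_serialize(rows, batch_size=100):
    """Pickle rows in fixed-size batches instead of one call per row.

    Each yielded byte string unpickles to a list of rows, which is the
    batching scheme this issue proposes for the SchemaRDD -> Python path.
    Sketch only: Spark pickles on the JVM, not with Python's pickle.
    """
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield pickle.dumps(batch)

# Dict-per-row records, mirroring the List[Dict] shape built in javaToPython
records = [{"name": "row%d" % i, "value": i} for i in range(250)]
chunks = list(batched_serialize(records, batch_size=100))
# 250 rows in batches of 100 -> 3 chunks; flattening restores the rows
restored = [row for chunk in chunks for row in pickle.loads(chunk)]
```

On the Python side, the receiver simply unpickles each chunk and flattens, so batching is transparent to downstream consumers of the RDD.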
      

            People

            • Assignee: Kan Zhang (kzhang)
            • Reporter: Kan Zhang (kzhang)
            • Votes: 0
            • Watchers: 2
