Spark / SPARK-2580

Broken pipe collecting SchemaRDD results


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.3, 1.1.0
    • Component/s: PySpark, SQL
    • Environment: Fedora 21 local and RHEL 7 clustered (standalone)

    Description

      # run in the pyspark shell, where sc is the existing SparkContext
      from pyspark.sql import SQLContext
      sqlCtx = SQLContext(sc)
      # size of cluster impacts where this breaks (i.e. 2**15 vs 2**2)
      data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20)
      sdata = sqlCtx.inferSchema(data)
      sdata.first()
      

      Result (note: the row is returned as well as the error):

      >>> sdata.first()
      14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290
      14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true)
      14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290)
      14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List()
      14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List()
      14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally
      14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2
      14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s
      {u'name': u'index', u'value': 0}
      >>> PySpark worker failed with exception:
      Traceback (most recent call last):
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py", line 77, in main
          serializer.dump_stream(func(split_index, iterator), outfile)
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 191, in dump_stream
          self.serializer.dump_stream(self._batched(iterator), stream)
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 124, in dump_stream
          self._write_with_length(obj, stream)
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 139, in _write_with_length
          stream.write(serialized)
      IOError: [Errno 32] Broken pipe
      
      Traceback (most recent call last):
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 130, in launch_worker
          worker(listen_sock)
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 119, in worker
          outfile.flush()
      IOError: [Errno 32] Broken pipe
      
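      For context, errno 32 (EPIPE) is raised when a process writes to a pipe or
      socket whose read end has already been closed. The traceback above suggests
      the JVM side stops reading once it has the single row needed for first(),
      while the Python worker is still flushing its serialized batch, so the
      worker's write lands on a closed stream. A minimal sketch in plain Python
      (not Spark code) that reproduces the same IOError:

      import os

      # Create a pipe and close the read end, standing in for the consumer
      # that goes away before the producer has finished writing.
      read_fd, write_fd = os.pipe()
      os.close(read_fd)

      try:
          # Python ignores SIGPIPE by default, so the failed write surfaces
          # as an exception instead of killing the process.
          os.write(write_fd, b"serialized batch")
      except OSError as e:
          print(e)  # [Errno 32] Broken pipe
      finally:
          os.close(write_fd)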


          People

            Assignee: Davies Liu (davies)
            Reporter: Matthew Farrellee (farrellee)
            Votes: 0
            Watchers: 2
