[SPARK-2580] broken pipe collecting SchemaRDD results

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.3, 1.1.0
    • Component/s: PySpark, SQL
    • Environment: Fedora 21 (local) and RHEL 7 (standalone cluster)

      Description

      # Run in bin/pyspark, where sc is the predefined SparkContext.
      from pyspark.sql import SQLContext
      sqlCtx = SQLContext(sc)
      # size of cluster impacts where this breaks (i.e. 2**15 vs 2**2)
      data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20)
      sdata = sqlCtx.inferSchema(data)
      sdata.first()
      

      Result (note: the first row is returned, and then the worker error appears):

      >>> sdata.first()
      14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290
      14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true)
      14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290)
      14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List()
      14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List()
      14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally
      14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2
      14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s
      {u'name': u'index', u'value': 0}
      >>> PySpark worker failed with exception:
      Traceback (most recent call last):
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py", line 77, in main
          serializer.dump_stream(func(split_index, iterator), outfile)
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 191, in dump_stream
          self.serializer.dump_stream(self._batched(iterator), stream)
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 124, in dump_stream
          self._write_with_length(obj, stream)
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 139, in _write_with_length
          stream.write(serialized)
      IOError: [Errno 32] Broken pipe
      
      Traceback (most recent call last):
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 130, in launch_worker
          worker(listen_sock)
        File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 119, in worker
          outfile.flush()
      IOError: [Errno 32] Broken pipe
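
      The sketch below is editorial context, not part of the original report: the symptom (the row comes back, then the worker dies writing) is consistent with the Python worker continuing to stream the rest of its batched partition after the JVM side has read the one row needed for first() and closed the connection. The following standalone Python sketch is independent of PySpark, and every name in it is illustrative; it only demonstrates how a writer of length-prefixed records hits errno 32 once the reader takes a single record and closes its end.

      # Illustrative only -- not the Spark code path. A writer keeps sending
      # length-prefixed records while the reader takes one record and closes,
      # which surfaces as IOError: [Errno 32] Broken pipe on the writer side.
      import socket
      import struct
      import threading

      def stream_records(sock):
          # Keep writing length-prefixed records, loosely mirroring the
          # _write_with_length framing seen in the traceback above.
          payload = b"x" * 65536
          try:
              while True:
                  sock.sendall(struct.pack(">I", len(payload)) + payload)
          except (IOError, OSError) as e:
              # Expected once the peer closes: [Errno 32] Broken pipe (EPIPE).
              print("writer failed: %s" % e)
          finally:
              sock.close()

      def take_first_record(sock):
          # Read exactly one record, then stop consuming and close, like first().
          (length,) = struct.unpack(">I", sock.recv(4, socket.MSG_WAITALL))
          remaining = length
          while remaining > 0:
              remaining -= len(sock.recv(remaining))
          sock.close()

      if __name__ == "__main__":
          reader_end, writer_end = socket.socketpair()
          writer = threading.Thread(target=stream_records, args=(writer_end,))
          writer.start()
          take_first_record(reader_end)
          writer.join()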
      

            People

            • Assignee: davies (Davies Liu)
            • Reporter: farrellee (Matthew Farrellee)
            • Votes: 0
            • Watchers: 2
