PIG-4228 (sub-task of PIG-4059: Pig on Spark)

SchemaTupleBackend error when working on a Spark 1.1.0 cluster


    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.14.0
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels: None
    • Environment: spark-1.1.0

    Description

      Whenever I try to run a script on a Spark cluster, I get the following error:

      ERROR 0: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (...): java.lang.RuntimeException: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at [1-2[-1,-1]]
      org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.readNext(POOutputConsumerIterator.java:62)
      org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.hasNext(POOutputConsumerIterator.java:68)
      scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
      scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
      scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
      org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.readNext(POOutputConsumerIterator.java:34)
      org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.hasNext(POOutputConsumerIterator.java:68)
      scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
      scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      scala.collection.Iterator$class.foreach(Iterator.scala:727)
      scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      org.apache.spark.scheduler.Task.run(Task.scala:54)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      java.lang.Thread.run(Thread.java:745)

      After debugging, I have found that the problem lies in SchemaTupleBackend. SparkLauncher initializes this class in the driver JVM, but that initialization is static state that does not travel with the job to the executors. When POOutputConsumerIterator then tries to fetch the results on an executor, SchemaTupleBackend.newSchemaTupleFactory(...) is called on an uninitialized backend and throws a RuntimeException.
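
      For illustration only, here is a minimal sketch of one possible workaround: re-run the initialization once per executor JVM before any tuples are consumed. It assumes the SchemaTupleBackend.initialize(Configuration, PigContext) entry point from the 0.14 codebase; the guard class, the hook point, and the way the Configuration and PigContext would reach the executors are assumptions of this sketch, not part of any patch:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.pig.data.SchemaTupleBackend;
      import org.apache.pig.impl.PigContext;

      // Hypothetical executor-side guard, not a committed fix. SparkLauncher
      // initializes SchemaTupleBackend in the driver JVM only, so the static
      // state it sets up never reaches the executor JVMs.
      public final class SchemaTupleBackendGuard {
          private static volatile boolean initialized = false;

          private SchemaTupleBackendGuard() {}

          // Would be called from the Spark-side converters (for example at the
          // top of POOutputConsumerIterator.readNext) with a Configuration and
          // PigContext shipped to the executors along with the job.
          public static void ensureInitialized(Configuration conf, PigContext pigContext) {
              if (!initialized) {
                  synchronized (SchemaTupleBackendGuard.class) {
                      if (!initialized) {
                          try {
                              // Mirrors what SparkLauncher already does on the
                              // driver; without it, newSchemaTupleFactory(...)
                              // fails on the executors as in the trace above.
                              SchemaTupleBackend.initialize(conf, pigContext);
                          } catch (Exception e) {
                              throw new RuntimeException(
                                  "SchemaTupleBackend initialization failed on executor", e);
                          }
                          initialized = true;
                      }
                  }
              }
          }
      }

      The double-checked guard keeps the initialization idempotent, since every task running on the same executor would otherwise repeat it.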

        Attachments

        1. movies_data.csv (0.3 kB, liyunzhang)
        2. groupby.pig (0.3 kB, liyunzhang)


               People

               • Assignee: Unassigned
               • Reporter: Carlos Balduz
               • Votes: 0
               • Watchers: 4

                Dates

                • Created:
                • Updated: