Pig / PIG-4059 Pig on Spark / PIG-4228

SchemaTupleBackend error when working on a Spark 1.1.0 cluster


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.14.0
    • Fix Version/s: spark-branch
    • Component/s: spark
    • Labels: spark-1.1.0

    Description

      Whenever I try to run a script on a Spark cluster, I get the following error:

      ERROR 0: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (...): java.lang.RuntimeException: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at [1-2[-1,-1]]
      org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.readNext(POOutputConsumerIterator.java:62)
      org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.hasNext(POOutputConsumerIterator.java:68)
      scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
      scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
      scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
      org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.readNext(POOutputConsumerIterator.java:34)
      org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.hasNext(POOutputConsumerIterator.java:68)
      scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
      scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
      scala.collection.Iterator$class.foreach(Iterator.scala:727)
      scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
      org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
      org.apache.spark.scheduler.Task.run(Task.scala:54)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      java.lang.Thread.run(Thread.java:745)

      After debugging I have found that the problem is inside SchemaTupleBackend. Although SparkLauncher initializes this class on the driver, the initialization is per-JVM state and is lost when the job is sent to the executors; when POOutputConsumerIterator tries to fetch the results, the call to SchemaTupleBackend.newSchemaTupleFactory(...) throws a RuntimeException because the backend has never been initialized in the executor JVM.
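
      As an illustration only, here is a minimal sketch of the kind of executor-side guard that would avoid this. The class name and its placement are hypothetical, and the initialize(Configuration, PigContext) overload and the "pig.pigContext" configuration key are the ones used by the MR task setup in the 0.14 code base, so they may need adjusting on the spark branch:

      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.pig.data.SchemaTupleBackend;
      import org.apache.pig.impl.PigContext;
      import org.apache.pig.impl.util.ObjectSerializer;

      // Hypothetical helper: SchemaTupleBackend keeps its state per JVM,
      // so the initialization done by SparkLauncher on the driver is not
      // visible inside the executors. Running this once per executor JVM
      // before any tuple is consumed mirrors what the MR backend does in
      // its task setup.
      public final class SchemaTupleBackendInitializer {

          private static volatile boolean initialized = false;

          private SchemaTupleBackendInitializer() {}

          public static synchronized void ensureInitialized(Configuration jConf)
                  throws IOException {
              if (initialized) {
                  return;
              }
              // The MR task setup recovers the PigContext the same way;
              // "pig.pigContext" is the conf key it uses there.
              PigContext pigContext = (PigContext) ObjectSerializer.deserialize(
                      jConf.get("pig.pigContext"));
              SchemaTupleBackend.initialize(jConf, pigContext);
              initialized = true;
          }
      }

      The Spark converters would then call ensureInitialized(jConf) with the Configuration shipped to the executor before the first call to POOutputConsumerIterator.hasNext()/readNext().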

      Attachments

        1. movies_data.csv (0.3 kB, liyunzhang)
        2. groupby.pig (0.3 kB, liyunzhang)


People

    Assignee: Unassigned
    Reporter: Carlos Balduz
    Votes: 0
    Watchers: 4
