Spark / SPARK-16515

[SPARK][SQL] Transformation script fails for Python script


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Running the SQL below fails with a transformation script error raised by the Python script; the error message follows the query.
      Query SQL:

      CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS
      SELECT DISTINCT
        sessionid,
        wcs_item_sk
      FROM
      (
        FROM
        (
          SELECT
            wcs_user_sk,
            wcs_item_sk,
            (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
          FROM web_clickstreams
          WHERE wcs_item_sk IS NOT NULL
          AND   wcs_user_sk IS NOT NULL
          DISTRIBUTE BY wcs_user_sk
          SORT BY
            wcs_user_sk,
            tstamp_inSec -- "sessionize" reducer script requires the cluster by uid and sort by tstamp
        ) clicksAnWebPageType
        REDUCE
          wcs_user_sk,
          tstamp_inSec,
          wcs_item_sk
        USING 'python q2-sessionize.py 3600'
        AS (
          wcs_item_sk BIGINT,
          sessionid STRING)
      ) q02_tmp_sessionize
      CLUSTER BY sessionid
      

      Error Message:

      16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with status 1. Error: Traceback (most recent call last):
        File "q2-sessionize.py", line 49, in <module>
          user_sk, tstamp_str, item_sk  = line.strip().split("\t")
      ValueError: too many values to unpack
      	at org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
      	at org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
      	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
      	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
      	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
      	at org.apache.spark.scheduler.Task.run(Task.scala:85)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. Error: Traceback (most recent call last):
        File "q2-sessionize.py", line 49, in <module>
          user_sk, tstamp_str, item_sk  = line.strip().split("\t")
      ValueError: too many values to unpack
      	at org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144)
      	at org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181)
      	... 14 more
      
      16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess exited with status 1. Error: Traceback (most recent call last):
        File "q2-sessionize.py", line 49, in <module>
          user_sk, tstamp_str, item_sk  = line.strip().split("\t")
      ValueError: too many values to unpack
      ) [duplicate 1]
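
The traceback points at a three-way tuple unpacking of a tab-split input line. The following is a minimal sketch of how that statement behaves, not the actual contents of q2-sessionize.py (only its line 49 appears in the log). The function names `unpack_like_line_49` and `parse_click` and the sample rows are hypothetical, and the sketch is written for Python 3, where the message carries an extra "(expected 3)" suffix. It only illustrates that the unpacking succeeds when a row has exactly three tab-separated fields and raises ValueError when the script receives more.

```python
def unpack_like_line_49(line):
    """Mirror the statement shown in the traceback (hypothetical wrapper)."""
    user_sk, tstamp_str, item_sk = line.strip().split("\t")
    return user_sk, tstamp_str, item_sk


def parse_click(line):
    """Defensive variant (an assumption, not part of the original script):
    report how many fields actually arrived instead of failing on unpack."""
    fields = line.strip().split("\t")
    if len(fields) != 3:
        raise ValueError(
            "expected 3 tab-separated fields, got %d: %r" % (len(fields), fields)
        )
    return tuple(fields)


if __name__ == "__main__":
    good_row = "1\t100\t7\n"         # three fields, as the script expects
    bad_row = "1\t100\t7\textra\n"   # four fields, made-up sample data

    print(unpack_like_line_49(good_row))
    for parser in (unpack_like_line_49, parse_click):
        try:
            parser(bad_row)
        except ValueError as err:
            print("%s failed: %s" % (parser.__name__, err))
```

Logging the field count and the raw fields, as `parse_click` does, makes it easier to see what the subprocess actually received on stdin when the row layout differs from what the script assumes.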
      

            People

            • Assignee: Adrian Wang (adrian-wang)
            • Reporter: Yi Zhou (jameszhouyi)
