  1. Spark
  2. SPARK-10310

[Spark SQL] All result records are populated into ONE line during script transform because the correct line/field delimiters are missing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.1, 1.6.0
    • Component/s: SQL
    • Labels:
      None
    • Target Version/s:

      Description

      We hit this in a real workload that uses a Python streaming script in a Spark SQL query. All result records from the SELECT were written as ONE line to the script's input, so the script could not tell the records apart. In addition, the field separator in Spark SQL is '^A' ('\001'), which is inconsistent and incompatible with the '\t' used by the Hive implementation.

      Key query:

      CREATE VIEW temp1 AS
      SELECT *
      FROM
      (
        FROM
        (
          SELECT
            c.wcs_user_sk,
            w.wp_type,
            (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec
          FROM web_clickstreams c, web_page w
          WHERE c.wcs_web_page_sk = w.wp_web_page_sk
          AND   c.wcs_web_page_sk IS NOT NULL
          AND   c.wcs_user_sk     IS NOT NULL
          AND   c.wcs_sales_sk    IS NULL --abandoned implies: no sale
          DISTRIBUTE BY wcs_user_sk SORT BY wcs_user_sk, tstamp_inSec
        ) clicksAnWebPageType
        REDUCE
          wcs_user_sk,
          tstamp_inSec,
          wp_type
        USING 'python sessionize.py 3600'
        AS (
          wp_type STRING,
          tstamp BIGINT, 
          sessionid STRING)
      ) sessionized
      

      Key Python script:

      import sys

      for line in sys.stdin:
          user_sk, tstamp_str, value = line.strip().split("\t")
      

      Sample SELECT result:

      ^V31^A3237764860^Afeedback^U31^A3237769106^Adynamic^T31^A3237779027^Areview
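
The failure can be reproduced outside Spark. The sketch below is only an illustration: it assumes the caret notation in the sample stands for raw control bytes (^A = 0x01, ^T = 0x14, ^U = 0x15, ^V = 0x16) and that the script receives the whole result as a single line. `parse_record` and `try_parse` are hypothetical helpers that mirror the split done in the script.

```python
# Assumption: caret notation in the sample maps to raw control bytes
# (^A = 0x01, ^T = 0x14, ^U = 0x15, ^V = 0x16).
SAMPLE = (
    "\x1631\x013237764860\x01feedback"
    "\x1531\x013237769106\x01dynamic"
    "\x1431\x013237779027\x01review"
)


def parse_record(line):
    """Hypothetical helper: parse one line the way the script does."""
    user_sk, tstamp_str, value = line.strip().split("\t")
    return user_sk, tstamp_str, value


def try_parse(line):
    """Return the parsed tuple, or the ValueError message on failure."""
    try:
        return parse_record(line)
    except ValueError as exc:
        return str(exc)


# The input contains no tab and no newline, so split("\t") yields a
# single field and unpacking into three names raises ValueError.
fields = SAMPLE.strip().split("\t")
result = try_parse(SAMPLE)
```

With this input `fields` holds one element, so the script never sees three separate values per record.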
      

      Expected result:

      31   3237764860   feedback
      31   3237769106   dynamic
      31   3237779027   review
      

            People

            • Assignee: zhichao-li
            • Reporter: Yi Zhou (jameszhouyi)
            • Votes: 0
            • Watchers: 5
