Spark / SPARK-41790

Set TRANSFORM reader and writer's format correctly


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.1
    • Fix Version/s: 3.4.0
    • Component/s: SQL
    • Labels: None

    Description

      TRANSFORM returns wrong data when ROW FORMAT DELIMITED is specified for only the reader or only the writer. The cause is that the wrong format is used to feed data to, and fetch data from, the running script. In theory, the writer uses inFormat to feed input data into the running script and the reader uses outFormat to read the script's output, but inFormat and outFormat are currently set to the wrong values in the following code:

      val (inFormat, inSerdeClass, inSerdeProps, reader) =
        format(
          inRowFormat, "hive.script.recordreader",
          "org.apache.hadoop.hive.ql.exec.TextRecordReader")
      
      val (outFormat, outSerdeClass, outSerdeProps, writer) =
        format(
          outRowFormat, "hive.script.recordwriter",
          "org.apache.hadoop.hive.ql.exec.TextRecordWriter") 

       

      Example SQL:

      spark-sql> CREATE TABLE t1 (a string, b string); 
      
      spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4");
      
      spark-sql> SELECT TRANSFORM(a, b)
               >   ROW FORMAT DELIMITED
               >   FIELDS TERMINATED BY ','
               >   USING 'cat'
               >   AS (c)
               > FROM t1;
      c
      
      spark-sql> SELECT TRANSFORM(a, b)
               >   USING 'cat'
               >   AS (c)
               >   ROW FORMAT DELIMITED
               >   FIELDS TERMINATED BY ','
               > FROM t1;
      c
      1    23    4

       

      The same SQL in Hive:

      hive> SELECT TRANSFORM(a, b)
          >   ROW FORMAT DELIMITED
          >   FIELDS TERMINATED BY ','
          >   USING 'cat'
          >   AS (c)
          > FROM t1;
      c
      1,2
      3,4
      
      hive> SELECT TRANSFORM(a, b)
          >   USING 'cat'
          >   AS (c)
          >   ROW FORMAT DELIMITED
          >   FIELDS TERMINATED BY ','
          > FROM t1;
      c
      1    2
      3    4 
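
The Hive results above are consistent with a simple model in which the delimiter given before USING applies only to the writer (data fed to the script) and the delimiter given after AS applies only to the reader (data fetched from the script). The following is a minimal, self-contained sketch of that model, not Spark's or Hive's implementation. Every name in it (TransformFormatSketch, feed, fetch, cat, transform) is hypothetical, and it assumes a tab default delimiter and that surplus output fields are dropped to match the AS column count.

```scala
// Hypothetical model of TRANSFORM's script I/O. None of these names exist in Spark or Hive.
object TransformFormatSketch {
  // Assumption: fields are tab-separated when no ROW FORMAT clause is given.
  val DefaultDelimiter = "\t"

  // Writer side: serialize one input row using the input delimiter (clause before USING).
  def feed(row: Seq[String], inDelimiter: String): String =
    row.mkString(inDelimiter)

  // Stand-in for the external script 'cat', which echoes each line unchanged.
  def cat(line: String): String = line

  // Reader side: split one script output line using the output delimiter (clause after AS).
  // Assumption: fields beyond the declared AS columns are dropped.
  def fetch(line: String, outDelimiter: String, numOutputCols: Int): Seq[String] =
    line.split(java.util.regex.Pattern.quote(outDelimiter), -1).toList.take(numOutputCols)

  // Full round trip: writer -> script -> reader, one row at a time.
  def transform(
      rows: Seq[Seq[String]],
      inDelimiter: String,
      outDelimiter: String,
      numOutputCols: Int): Seq[Seq[String]] =
    rows.map(row => fetch(cat(feed(row, inDelimiter)), outDelimiter, numOutputCols)).toList

  def main(args: Array[String]): Unit = {
    val t1 = Seq(Seq("1", "2"), Seq("3", "4"))

    // ROW FORMAT DELIMITED before USING: ',' applies to the writer only.
    transform(t1, ",", DefaultDelimiter, 1).foreach(r => println(r.mkString("|")))

    // ROW FORMAT DELIMITED after AS: ',' applies to the reader only.
    transform(t1, DefaultDelimiter, ",", 1).foreach(r => println(r.mkString("|")))
  }
}
```

In this model the first query yields c values "1,2" and "3,4", and the second yields the two input fields joined by a tab, which matches the Hive transcripts. If the two delimiters are applied to the wrong side, the reader no longer sees the lines the writer produced in the expected shape.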

       


          People

            Assignee: jimmyma mattshma
            Reporter: jimmyma mattshma