SPARK-34050

Parquet 2 CSV conversion wrong quoting

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      Hi Experts,

      I work for GE Corporate. We have a Backup+Restore+extras project on AWS.

      I ran into incompatibility issues when trying to convert Parquet files back to CSV.
      Our original sources (Greenplum first) cannot process the converted files because of improper quoting.

      We work with several kinds of ERPs and technical databases, and their data contains:

      1. multiline (CR, CRLF, LF) text fields
      2. mixed quoting inside fields, or just a single double quote in a text field
      3. text fields where both an empty string and a null value can occur, with different meanings
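
      To make these three cases concrete, here is a minimal plain-Python sketch of the encoding we would need on the CSV side. It is not Spark's CSV writer. The helper names encode_field and encode_row are hypothetical, and the vertical-tab separator and the \N null marker are assumptions for illustration.

```python
SEP = "\x0b"   # vertical tab, the same character as '\013' (assumed separator)
NULL = "\\N"   # assumed null marker: a backslash followed by N


def encode_field(value, sep=SEP, null=NULL):
    """Encode one field so that None and '' stay distinguishable."""
    if value is None:
        return null                 # null: bare marker, never quoted
    text = str(value)
    if text == "":
        return '""'                 # empty string: a plain pair of quotes
    if any(ch in text for ch in ('"', "\\", "\r", "\n", sep)):
        # Escape backslashes first, then embedded double quotes.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return text                     # nothing special: leave unquoted


def encode_row(values, sep=SEP, null=NULL):
    """Join the encoded fields of one record with the separator."""
    return sep.join(encode_field(v, sep, null) for v in values)


row = ["line one\r\nline two", 'one " quote', "", None]
print(repr(encode_row(row)))
```

      The point of the sketch is only the distinction in the last two fields: the empty string is written as a bare pair of quotes, while null is written as the unquoted marker.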

      The latest combination of options we tried is:
      df.write.format("com.databricks.spark.csv").options(header='false', sep='\013', multiLine='true', escapeQuotes='true', quote='"', nullValue='\\N', encoding='UTF-8').option("quoteAll", 'false').option("compression", "gzip").mode('overwrite').save(s3_csv)

      If I do not set escapeQuotes='true', fields that contain mixed quotes or a single double quote are not quoted.
      If I do set it, the double quotes of an empty string are escaped as well, producing [sep]\"\"[sep]. Our Greenplum reader cannot read this representation of an empty string during restoration.
      It should be [sep]""[sep] or [sep][sep].
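
      As a possible stopgap, the escaped empty strings could be rewritten after the file has been produced. The sketch below is plain Python, not a Spark option. The function name unescape_empty_fields is hypothetical, and it assumes that a field consisting of exactly \"\" between separators (or at a line boundary) is always an escaped empty string. A multiline text field in which a line happens to begin or end with that sequence would be altered wrongly.

```python
import re


def unescape_empty_fields(text, sep="\x0b"):
    # Rewrite every field that is exactly backslash-quote-backslash-quote
    # as a plain pair of double quotes.
    s = re.escape(sep)
    # Field start: beginning of a line or a separator.
    # Field end: a separator or the end of a line ('\n' line endings assumed).
    pattern = re.compile(r'(^|' + s + r')\\"\\"(?=' + s + r'|$)', re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + '""', text)


line = 'a\x0b\\"\\"\x0b\\"\\"\x0bb'
print(repr(unescape_empty_fields(line)))
```

      Escaped quotes inside a longer quoted field are left alone, because the pattern only matches when the sequence makes up the whole field.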

      Can you help us find a proper combination of quote and escape options for data that looks like this:

      "2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA\" S.A. | 13-APR-20 "

      "290208407

      INT. RIEL DIN 2X32A 230/400V "

      ""

      I found an earlier option that has since been removed: quoteMode.Non_Numeric.
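
      For comparison, Python's standard csv module has a quoting mode built on the same idea: csv.QUOTE_NONNUMERIC quotes every non-numeric field, so an empty string comes out as a plain pair of quotes while embedded quotes are still escaped. This is only an analogue in plain Python to show the output we are after, not a Spark option, and the sample values are made up for illustration.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(
    buf,
    delimiter="\x0b",              # vertical tab, assumed separator
    quoting=csv.QUOTE_NONNUMERIC,  # quote every non-numeric field
    escapechar="\\",
    doublequote=False,             # backslash-escape embedded quotes instead of doubling them
    lineterminator="\n",
)
writer.writerow(["", 'one " quote', 42])
print(repr(buf.getvalue()))
```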

      Thank you in advance!

       

      Regards,

      Laszlo Torok

      Attachments

        1. parquet2csv
          2 kB
          Laszlo Torok
        2. csv_2_parq
          4 kB
          Laszlo Torok
        3. Before_60e291fb0.csv.gz
          0.4 kB
          Laszlo Torok
        4. Before_142d26f3e0.csv.gz
          10 kB
          Laszlo Torok
        5. Before_10e526e33.csv.gz
          0.6 kB
          Laszlo Torok
        6. Before_0c43b1dc7.csv.gz
          1 kB
          Laszlo Torok
        7. After_PRQ2CSVConverion_part-00001-5610129a-bf88-4dda-86f2-878857e9ec54-c000.csv.gz
          2 kB
          Laszlo Torok

        People

            Assignee: Unassigned
            Reporter: Laszlo Torok (laszlo.torok)
            Votes: 0
            Watchers: 1