Details
Bug
Status: Open
Minor
Resolution: Unresolved
2.4.0
None
None
Description
Hi Experts,
I work for GE Corporate. We have a Backup+Restore+extras project with AWS.
I ran into incompatibility issues when I tried to convert Parquet files back to CSV.
Our original sources (Greenplum first) cannot process the back-converted files because of improper quoting.
We work with several kinds of ERPs and tech databases, and their data contains:
- multiline (CR, CRLF, LF) text fields
- mixed quoting inside fields, or just a single double quote in a text field
- text fields where both empty strings and null values can occur, with different meanings
Our latest option combination is:
df.write.format("com.databricks.spark.csv").options(header='false', sep='\013', multiLine='true', escapeQuotes='true', quote='"', nullValue='\\N', encoding='UTF-8').option("quoteAll", 'false').option("compression", "gzip").mode('overwrite').save(s3_csv)
If I do not use escapeQuotes='true', Spark does not quote fields that contain mixed quotes or a single double quote.
If I do use it, the double quotes that represent an empty string are escaped as well: [sep]\"\"[sep]. Our Greenplum reader cannot read this representation of an empty string during restoration.
It should be [sep]""[sep] or [sep][sep].
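To make the expected output concrete, here is a minimal pure-Python sketch of the encoding I am after. It is not Spark code and only illustrates the target format. The names `format_field`, `format_row`, and `NULL_MARKER` are hypothetical, and the backslash escape character is an assumption based on Spark's default `escape` setting; a Greenplum load may need a different escape convention.

```python
NULL_MARKER = "\\N"  # assumed null marker, i.e. the two characters \N


def format_field(value, sep="\013", quote='"', escape="\\"):
    """Render one field so that None and "" stay distinguishable."""
    if value is None:
        return NULL_MARKER        # null: unquoted marker
    if value == "":
        return quote + quote      # empty string: bare "" that is never escaped
    # Quote only fields containing a character the reader could misinterpret.
    special = (sep, quote, escape, "\r", "\n")
    if not any(ch in value for ch in special):
        return value
    # Escape the escape character first, then the embedded quotes.
    body = value.replace(escape, escape + escape).replace(quote, escape + quote)
    return quote + body + quote


def format_row(values, sep="\013"):
    """Join the formatted fields of one record with the separator."""
    return sep.join(format_field(v, sep=sep) for v in values)


# Example with '|' as a visible separator:
print(format_row(["x", "", None], sep="|"))  # x|""|\N
```

With this scheme a multiline field or a field with a lone double quote is quoted and escaped, while an empty string is always written as a plain pair of quotes.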
Can you help our project find the proper quote and escape combination for data that looks like this:
"2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA\" S.A. | 13-APR-20 "
"290208407
INT. RIEL DIN 2X32A 230/400V " |
""
I found an earlier option that you have since removed: quoteMode.Non_Numeric.
Thank you in advance!
Regards,
Laszlo Torok