[SPARK-34050] Parquet 2 CSV conversion wrong quoting - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.4.0
Fix Version/s: None
Component/s: PySpark
Labels: None

Description

Hi Experts,

I work for GE Corporate. We have a Backup+Restore+extras project with AWS.

I ran into incompatibility issues when converting Parquet files back to CSV.
Our original sources (Greenplum first of all) cannot process the converted files because of improper quoting.

We work with several kinds of ERPs and tech databases, and the data includes:

multiline (CR, CRLF, LF) text fields
text fields with mixed quoting inside, or with just a single double quote
text fields where an empty string and NULL can both occur and have different meanings

Our last option combination is:
df.write.format("com.databricks.spark.csv").options(header='false', sep='\013', multiLine='true', escapeQuotes='true', quote='"', nullValue='\\N', encoding='UTF-8').option("quoteAll", 'false').option("compression", "gzip").mode('overwrite').save(s3_csv)

If I do not set escapeQuotes='true', fields that contain mixed quotes or a single double quote are not quoted.
If I do set it, the double quotes that represent an empty string are escaped as well: [sep]\"\"[sep]. Our Greenplum reader cannot read this form of an empty string during restoration.
It should be [sep]""[sep] or [sep][sep].
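
The field layout asked for above can be sketched in plain Python, independent of Spark. This is only an illustration of the expected output, not a Spark option combination. The helper names encode_field and encode_row are hypothetical, the null marker \N is an assumption, and embedded quotes are doubled (CSV style) on the assumption that the Greenplum loader accepts that.

```python
SEP = "\x0b"         # vertical tab, the same character as the octal escape '\013'
QUOTE = '"'
NULL_MARKER = "\\N"  # assumption: the marker the loader treats as NULL


def encode_field(value):
    """Encode one field: None -> null marker, '' -> "", quote only when needed."""
    if value is None:
        return NULL_MARKER
    if value == "":
        # An empty string stays a bare pair of quotes, never \"\".
        return QUOTE + QUOTE
    needs_quotes = any(ch in value for ch in (SEP, QUOTE, "\r", "\n"))
    if needs_quotes:
        # Double every embedded quote instead of backslash-escaping it.
        return QUOTE + value.replace(QUOTE, QUOTE + QUOTE) + QUOTE
    return value


def encode_row(values):
    """Join the encoded fields of one record with the separator."""
    return SEP.join(encode_field(v) for v in values)


# One record mixing a plain value, an empty string, a NULL,
# a lone double quote and a CRLF inside a text field.
row = ["a", "", None, 'one " quote', "line1\r\nline2"]
print(repr(encode_row(row)))
```

With this layout an empty string and a NULL stay distinguishable, and a field is quoted only when it contains the separator, a quote, or a line break.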

Can you help us find the proper quote and escape combination for data that looks like this:

"2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA\" S.A. | 13-APR-20 "

"290208407

INT. RIEL DIN 2X32A 230/400V "
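
As a rough check of what a lossless round trip of these two records would look like, the sketch below writes them with Python's standard csv module using doubled quotes and reads them back. It makes several assumptions: the unescaped values are guessed from the sample above (a single embedded double quote in the first record, a CRLF inside the second), and Python's csv reader is only a stand-in for the Greenplum loader, not a model of it.

```python
import csv
import io

SEP = "\x0b"  # vertical tab, the same character as '\013'

# Assumption: these are the unescaped values behind the two sample records.
records = [
    ["2607 - CREDIT MEMO - SOCIETATEA NATIONALA 'NUCLEARELECTRICA\" S.A. | 13-APR-20 "],
    ["290208407\r\nINT. RIEL DIN 2X32A 230/400V "],
]

# newline="" keeps CR and LF inside fields exactly as written.
buffer = io.StringIO(newline="")
writer = csv.writer(
    buffer,
    delimiter=SEP,
    quotechar='"',
    doublequote=True,           # embedded " becomes "" instead of \"
    quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    lineterminator="\n",
)
writer.writerows(records)

# Read the text back with the same dialect and compare with the input.
buffer.seek(0)
restored = list(csv.reader(buffer, delimiter=SEP, quotechar='"', doublequote=True))
print(restored == records)
```

The written text contains no backslash escapes; the embedded quote appears as a doubled quote, and the multiline field survives because it is enclosed in quotes.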

I found an earlier option that has since been removed: quoteMode.Non_Numeric.

Thank you in advance!

Regards,

Laszlo Torok

Attachments

parquet2csv (11/Jan/21 08:38, 2 kB, Laszlo Torok)
csv_2_parq (11/Jan/21 08:38, 4 kB, Laszlo Torok)
Before_60e291fb0.csv.gz (11/Jan/21 08:39, 0.4 kB, Laszlo Torok)
Before_142d26f3e0.csv.gz (11/Jan/21 08:39, 10 kB, Laszlo Torok)
Before_10e526e33.csv.gz (11/Jan/21 08:39, 0.6 kB, Laszlo Torok)
Before_0c43b1dc7.csv.gz (11/Jan/21 08:39, 1 kB, Laszlo Torok)
After_PRQ2CSVConverion_part-00001-5610129a-bf88-4dda-86f2-878857e9ec54-c000.csv.gz (11/Jan/21 08:44, 2 kB, Laszlo Torok)

People

Assignee: Unassigned

Reporter: Laszlo Torok

Votes: 0

Watchers: 1

Dates

Created: 08/Jan/21 12:38

Updated: 12/Dec/22 18:10