PIG-4572

CSVExcelStorage treats newlines within fields as record separator when input file is split

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 0.12.0, 0.14.0
    • Fix Version/s: 0.17.0
    • Component/s: piggybank
    • Environment: Amazon ElasticMapReduce AMI 3.6.0
      Apache Pig version 0.14.0 and 0.12.0
      Hadoop 2.4.0

    Description

      It seems that when a field enclosed in double quotes contains a carriage return or line feed, and the input file is larger than the DFS block size, the input split does not honor CSVExcelStorage's treatment of newlines within fields.

      It seems that the input is split at the linefeed closest to the byte range defined for the split, which causes fields to become skewed.

      For example, a 3190-byte text file containing 21 identical records such as the one below:

      "John Doe""025719e8244c7c400b811ea349f2c18e""This is a multiline message:
      This is the second line.
      Thank you for listening."~"2012-08-24 09:16:02"

      Each line termination here is a CRLF.

      Run it through a Pig script:
      SET mapred.min.split.size 1024;
      SET mapred.max.split.size 1024;
      SET pig.noSplitCombination true;
      SET mapred.max.jobs.per.node 1;
      myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE','WINDOWS')
      AS(
      name:chararray,
      sysid:chararray,
      message:chararray,
      messagedate:chararray
      );
      myinput_tuples = FOREACH myinput_file GENERATE name;
      STORE myinput_tuples INTO '/output052/' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
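
      (A rough back-of-the-envelope check, not part of the original report: with mapred.max.split.size set to 1024 bytes and records of roughly 3190 / 21 ≈ 152 bytes each, every split boundary falls somewhere inside a record, so each mapper after the first resynchronizes at the nearest linefeed, which may be an embedded newline rather than the CRLF record terminator.)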

      Results in 4 output files:

      -rw-r--r-- 1 hadoop supergroup 0 2015-05-26 07:19 /output052/_SUCCESS
      -rw-r--r-- 1 hadoop supergroup 63 2015-05-26 07:19 /output052/part-m-00000
      -rw-r--r-- 1 hadoop supergroup 54 2015-05-26 07:19 /output052/part-m-00001
      -rw-r--r-- 1 hadoop supergroup 769 2015-05-26 07:19 /output052/part-m-00002
      -rw-r--r-- 1 hadoop supergroup 25 2015-05-26 07:19 /output052/part-m-00003
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
      This is the second line.
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
      This is the second line.

      Skewing occurs in the third part file, part-m-00002.

      Attachments

        1. script.pig
          0.5 kB
          Le Clue
        2. SmallTest.txt
          3 kB
          Le Clue

        Activity

          leclue Le Clue added a comment -

          Sample Input Data
          3190 Bytes

          leclue Le Clue added a comment -

          Sample Pig Script

          Absolutesantaja Shawn Weeks added a comment -

          I've loaded several large 10 GB+ files with embedded newlines and had them work when split, but I'm starting to think it was blind luck that they didn't split on one of the embedded newlines. I'm now facing this issue with a file where every line has an embedded newline in the same column, and as luck would have it, every split falls on the embedded newline instead of the row-delimiter newline.

          szita Ádám Szita added a comment -

          Hi, I've taken a deep look into this. Beware, long story ahead (TL;DR at bottom)

          The problem is rooted in the way Hadoop loads text files line by line and creates splits from them.
          It doesn't matter that we tell CSVExcelStorage which field (~) and record (\r\n) delimiters and which embedded line breaks are used in the data; Hadoop has no notion of CSV records or embedded line breaks when it reads the text file into splits.

          If no custom delimiter is specified (and by default it isn't), Hadoop assumes that a normal line ending is the record delimiter and uses the readDefaultLine method here: https://github.com/apache/hadoop/blob/release-2.4.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L169

          In our case we want to set the property textinputformat.record.delimiter to something like "\r\n" so that readCustomLine is used and the splitting is done correctly. Setting this isn't easy in Pig, for reasons described here: http://aaron.blog.archive.org/2013/05/27/customizing-pig-for-sort-order-and-line-termination
          The easiest way I found is to use a property file which we supply to Pig at startup with the -P option.
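
          For example, the invocation would look roughly like this (a sketch, assuming the attached script.pig and the myprops.properties file described below):

          pig -P myprops.properties script.pig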

          Also, once we have it set to "\r\n" (quotes included in the value), we'll see that although this separates the records in the data perfectly, it also strips the quote characters (") from record beginnings and ends, which CSVExcelStorage heavily depends on.

          So what I came up with is to set it to "\r\n (a single leading quote, no trailing one) instead: this keeps the " character intact at the record beginning and only loses the one at the record end.
          That is not a problem if we specify CSVExcelStorage('~', 'NO_MULTILINE','WINDOWS'): the fact that the record's buffer contains no more characters and NO_MULTILINE is defined causes CSVExcelStorage to save the current buffer without needing the missing closing ". Yes, this is hacky in a way; we can think of it as the multiline handling being done by Hadoop instead.
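
          To illustrate with the sample record from the description (a sketch of what the loader would see under this setting, not output from an actual run): with the delimiter "\r\n, Hadoop hands CSVExcelStorage the whole record as one line, with the final quote consumed as part of the delimiter:

          "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
          This is the second line.
          Thank you for listening."~"2012-08-24 09:16:02

          The embedded CRLFs inside the message field are not preceded by a " character, so they do not match the delimiter and stay inside the field; only the record-terminating "\r\n matches.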

          To summarize, try this:
          - create a property file with the following content and pass it to Pig with the -P option:

          myprops.properties
          textinputformat.record.delimiter="\r\n

          - use the NO_MULTILINE option in CSVExcelStorage instead:
          CSVExcelStorage('~', 'NO_MULTILINE','WINDOWS')
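
          Putting it together, only the multiline option in the original LOAD changes (a sketch based on the statements above; path and schema are the reporter's):

          myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'WINDOWS')
          AS (name:chararray, sysid:chararray, message:chararray, messagedate:chararray);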

          szita Ádám Szita added a comment -

          Absolutesantaja, leclue please let me know if the above works for you

          szita Ádám Szita added a comment -

          Resolving this now - feel free to reopen if you don't find this conclusive


          People

            Assignee: szita Ádám Szita
            Reporter: leclue Le Clue
            Votes: 0
            Watchers: 3
