PIG-4572

CSVExcelStorage treats newlines within fields as record separator when input file is split

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 0.12.0, 0.14.0
    • Fix Version/s: 0.17.0
    • Component/s: piggybank
    • Environment: Amazon ElasticMapReduce AMI 3.6.0
      Apache Pig version 0.14.0 and 0.12.0
      Hadoop 2.4.0

    Description

      It seems that when a field enclosed in double quotes contains a carriage return or line feed, and the input file is larger than the DFS block size, the input split does not honor CSVExcelStorage's treatment of newlines within fields.

      It seems that the input is split at the linefeed closest to the byte range defined for the split, which causes fields to become skewed.

      For example, a 3190-byte text file containing 21 identical records such as the one below:

      "John Doe""025719e8244c7c400b811ea349f2c18e""This is a multiline message:
      This is the second line.
      Thank you for listening."~"2012-08-24 09:16:02"

      Each line termination here is a CRLF.

      Run it through a Pig script:
      SET mapred.min.split.size 1024;
      SET mapred.max.split.size 1024;
      SET pig.noSplitCombination true;
      SET mapred.max.jobs.per.node 1;
      myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE','WINDOWS')
      AS(
      name:chararray,
      sysid:chararray,
      message:chararray,
      messagedate:chararray
      );
      myinput_tuples = FOREACH myinput_file GENERATE name;
      STORE myinput_tuples INTO '/output052/' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
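
      (A rough back-of-the-envelope check, not part of the original report: with mapred.max.split.size set to 1024 bytes and records of roughly 3190 / 21 ≈ 152 bytes each, every split boundary falls somewhere inside a record, so each mapper after the first resynchronizes at the nearest linefeed, which may be an embedded newline rather than the CRLF record terminator.)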

      Results in 4 output files:

      -rw-r--r-- 1 hadoop supergroup 0 2015-05-26 07:19 /output052/_SUCCESS
      -rw-r--r-- 1 hadoop supergroup 63 2015-05-26 07:19 /output052/part-m-00000
      -rw-r--r-- 1 hadoop supergroup 54 2015-05-26 07:19 /output052/part-m-00001
      -rw-r--r-- 1 hadoop supergroup 769 2015-05-26 07:19 /output052/part-m-00002
      -rw-r--r-- 1 hadoop supergroup 25 2015-05-26 07:19 /output052/part-m-00003
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
      This is the second line.
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
      This is the second line.

      Skewing occurs in the third part file, part-m-00002.

      Attachments

        1. script.pig
          0.5 kB
          Le Clue
        2. SmallTest.txt
          3 kB
          Le Clue

        Activity

          leclue Le Clue added a comment -

          Sample Input Data
          3190 Bytes

          leclue Le Clue added a comment -

          Sample Pig Script

          Absolutesantaja Shawn Weeks added a comment -

          I've loaded several large 10 GB+ files with embedded newlines and had them work when split, but I'm starting to think it was blind luck that they didn't split on one of the embedded newlines. I'm now facing this issue with a file where every line has an embedded newline in the same column, and as luck would have it, every split falls on the embedded newline instead of the row-delimiter newline.

          szita Ádám Szita added a comment -

          Hi, I've taken a deep look into this. Beware, long story ahead (TL;DR at bottom)

          The problem is rooted in the way Hadoop loads text files line by line and creates splits from them.
          It doesn't matter that we tell CSVExcelStorage which field (~) and record (\r\n) delimiters and which embedded line breaks are used in the data; Hadoop has no notion of CSV records or embedded line breaks when it reads the text file into splits.

          If no custom delimiter is specified (and by default it isn't), Hadoop assumes that a normal line ending is the record delimiter and uses the readDefaultLine method here: https://github.com/apache/hadoop/blob/release-2.4.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L169

          In our case we want to set the property textinputformat.record.delimiter to something like "\r\n" so that readCustomLine is used and the splitting is done correctly. Setting this isn't easy in Pig, for reasons described here: http://aaron.blog.archive.org/2013/05/27/customizing-pig-for-sort-order-and-line-termination
          The easiest way I found is to use a property file which we supply to Pig at startup with the -P option.
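
          For example, the invocation would look roughly like this (a sketch, assuming the attached script.pig and the myprops.properties file described below):

          pig -P myprops.properties script.pig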

          Also, once we have it set to "\r\n" (quotes included in the value), we'll see that although this separates the records in the data perfectly, it also strips the quote characters (") from record beginnings and ends, which CSVExcelStorage heavily depends on.

          So what I came up with is to set it to "\r\n (a single leading quote, no trailing one) instead: this keeps the " character intact at the record beginning and only loses the one at the record end.
          That is not a problem if we specify CSVExcelStorage('~', 'NO_MULTILINE','WINDOWS'): the fact that the record's buffer contains no more characters and NO_MULTILINE is defined causes CSVExcelStorage to save the current buffer without needing the missing closing ". Yes, this is hacky in a way; we can think of it as the multiline handling being done by Hadoop instead.
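
          To illustrate with the sample record from the description (a sketch of what the loader would see under this setting, not output from an actual run): with the delimiter "\r\n, Hadoop hands CSVExcelStorage the whole record as one line, with the final quote consumed as part of the delimiter:

          "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
          This is the second line.
          Thank you for listening."~"2012-08-24 09:16:02

          The embedded CRLFs inside the message field are not preceded by a " character, so they do not match the delimiter and stay inside the field; only the record-terminating "\r\n matches.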

          To summarize, try this:
          - create a property file with the following content and pass it to Pig with the -P option:

          myprops.properties
          textinputformat.record.delimiter="\r\n

          - use the NO_MULTILINE option in CSVExcelStorage instead:
          CSVExcelStorage('~', 'NO_MULTILINE','WINDOWS')
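
          Putting it together, only the multiline option in the original LOAD changes (a sketch based on the statements above; path and schema are the reporter's):

          myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'WINDOWS')
          AS (name:chararray, sysid:chararray, message:chararray, messagedate:chararray);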

          szita Ádám Szita added a comment -

          Absolutesantaja, leclue please let me know if the above works for you

          szita Ádám Szita added a comment -

          Resolving this now - feel free to reopen if you don't find this conclusive


          People

            Assignee: szita Ádám Szita
            Reporter: leclue Le Clue
            Votes: 0
            Watchers: 3
