  1. Pig
  2. PIG-4572

CSVExcelStorage treats newlines within fields as record separator when input file is split


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Resolved
    • Affects Version/s: 0.12.0, 0.14.0
    • Fix Version/s: 0.17.0
    • Component/s: piggybank
    • Environment: Amazon ElasticMapReduce AMI 3.6.0
      Apache Pig version 0.14.0 and 0.12.0
      Hadoop 2.4.0

    Description

      It seems that when a field enclosed in double quotes contains a carriage return or linefeed, and the input file is larger than the DFS block size, the input split does not honor CSVExcelStorage's treatment of newlines within fields.

      The input appears to be split at the linefeed closest to the byte range defined for the split, even when that linefeed lies inside a quoted field, which causes the fields to become skewed.

      For example, take a 3190-byte text file containing 21 identical records such as the one below:

      "John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:
      This is the second line.
      Thank you for listening."~"2012-08-24 09:16:02"

      Each line is terminated by a CRLF.
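
As a reference point for what a quote-aware parser would be expected to produce for this input, here is a minimal Python sketch using the standard csv module. It is independent of CSVExcelStorage. It assumes that '~' separates all four fields (the delimiter passed in the LOAD statement) and that the last record has no trailing CRLF; `parse_records` is a name made up for this illustration.

```python
import csv
import io

# One sample record, assuming '~' between every pair of fields.
RECORD = (
    '"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:\r\n'
    "This is the second line.\r\n"
    'Thank you for listening."~"2012-08-24 09:16:02"'
)
# 21 identical records joined by CRLF, as described in the report.
TEXT = "\r\n".join([RECORD] * 21)


def parse_records(text):
    """Parse '~'-delimited text, keeping newlines that sit inside quoted fields."""
    # newline="" stops StringIO from translating the CRLFs embedded in fields.
    stream = io.StringIO(text, newline="")
    return list(csv.reader(stream, delimiter="~", quotechar='"'))


rows = parse_records(TEXT)
print(len(rows), "records,", len(rows[0]), "fields each")
print(sorted({row[0] for row in rows}))
```

If the whole file is parsed in one pass like this, every record yields four fields and the name column contains only John Doe, which is what the script in this report is expected to store.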

      The file is run through the following Pig script:
      SET mapred.min.split.size 1024;
      SET mapred.max.split.size 1024;
      SET pig.noSplitCombination true;
      SET mapred.max.jobs.per.node 1;
      myinput_file = LOAD 's3://sourcebucket/inputfile.txt' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'YES_MULTILINE','WINDOWS')
      AS(
      name:chararray,
      sysid:chararray,
      message:chararray,
      messagedate:chararray
      );
      myinput_tuples = FOREACH myinput_file GENERATE name;
      STORE myinput_tuples INTO '/output052/' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');

      This results in 4 output part files:

      -rw-r--r-- 1 hadoop supergroup 0 2015-05-26 07:19 /output052/_SUCCESS
      -rw-r--r-- 1 hadoop supergroup 63 2015-05-26 07:19 /output052/part-m-00000
      -rw-r--r-- 1 hadoop supergroup 54 2015-05-26 07:19 /output052/part-m-00001
      -rw-r--r-- 1 hadoop supergroup 769 2015-05-26 07:19 /output052/part-m-00002
      -rw-r--r-- 1 hadoop supergroup 25 2015-05-26 07:19 /output052/part-m-00003
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00000
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00001
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      John Doe
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00002
      This is the second line.
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      "Thank you for listening.~2012-08-24 09:16:02""
      John Doe""~025719e8244c7c400b811ea349f2c18e""~This is a multiline message:"
      [hadoop@master~]$ hadoop fs -cat /output052/part-m-00003
      This is the second line.

      Skewing occurs in the third part file (part-m-00002).
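
The following Python sketch is a simplified model of how a line-oriented reader picks its starting point inside a byte-range split. It is not the Hadoop or piggybank code, and the function names are hypothetical. It assumes 1024-byte splits (as set in the script), '~' between all four fields, and that a split with a non-zero start discards everything up to and including the first linefeed it finds.

```python
# Simplified model of line-oriented split reading; not Hadoop or piggybank code.
RECORD = (
    b'"John Doe"~"025719e8244c7c400b811ea349f2c18e"~"This is a multiline message:\r\n'
    b"This is the second line.\r\n"
    b'Thank you for listening."~"2012-08-24 09:16:02"'
)
DATA = b"\r\n".join([RECORD] * 21)  # 21 records, no trailing CRLF


def first_line_of_split(data, start):
    """Return (offset, line) for the first physical line a split at `start` reads."""
    if start == 0:
        pos = 0
    else:
        # Assumed rule: skip the (possibly partial) line that contains `start`.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    end = data.find(b"\n", pos)
    end = len(data) if end == -1 else end
    return pos, data[pos:end].rstrip(b"\r")


def inside_quoted_field(data, pos):
    """True if an odd number of double quotes precedes `pos` (quote-parity heuristic)."""
    return data.count(b'"', 0, pos) % 2 == 1


def describe_splits(data, split_size):
    """List (split start, first line read, starts inside a quoted field?) per split."""
    result = []
    for start in range(0, len(data), split_size):
        pos, line = first_line_of_split(data, start)
        result.append((start, line.decode("ascii"), inside_quoted_field(data, pos)))
    return result


for start, line, broken in describe_splits(DATA, 1024):
    print(start, "starts inside quoted field:", broken, "|", line)
```

Under these assumptions, the splits starting at bytes 2048 and 3072 begin reading at "This is the second line.", inside a quoted field, which matches the first line of part-m-00002 and part-m-00003. The model does not try to reproduce the record counts per part file.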

      Attachments

        1. SmallTest.txt
          3 kB
          Le Clue
        2. script.pig
          0.5 kB
          Le Clue

          People

            Assignee: Ádám Szita (szita)
            Reporter: Le Clue (leclue)
            Votes: 0
            Watchers: 3
