Pig
  1. Pig
  2. PIG-2556

CSVExcelStorage load: quoted field with newline as first character sees newline as record end

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.11
    • Component/s: piggybank
    • Labels:
    • Patch Info:
      Patch Available

      Description

      Loading a record that contains a newline as the first character in a quoted field is broken. The loader interprets the quoted newline as the record delimiter. I've identified and fixed the bug and added a new testcase to expose it. I'll post a patch soon.

      1. csv3.patch
        6 kB
        Vitalii Tymchyshyn
      2. patch_PIG2556_CSVExcelStorage.patch
        3 kB
        Peter Welch

        Activity

        Hide
        Dmitriy V. Ryaboy added a comment -

        Committed to trunk. Thanks Vitalii and Peter!

        D

        Show
        Dmitriy V. Ryaboy added a comment - Committed to trunk. Thanks Vitalii and Peter! D
        Hide
        Vitalii Tymchyshyn added a comment -

        OK, I've recreated patch with better spacing (csv3.patch). I've set up my IDE for tab to be 8 chars, don't use tabs, use 4-chars indent. There still can be problems because original file uses tabs. I can reformat it without tabs, but did not do it for now.

        Show
        Vitalii Tymchyshyn added a comment - OK, I've recreated patch with better spacing (csv3.patch). I've set up my IDE for tab to be 8 chars, don't use tabs, use 4-chars indent. There still can be problems because original file uses tabs. I can reformat it without tabs, but did not do it for now.
        Hide
        Vitalii Tymchyshyn added a comment -

        Patch with better padding

        Show
        Vitalii Tymchyshyn added a comment - Patch with better padding
        Hide
        Vitalii Tymchyshyn added a comment -

        I can see spacing problems in my patch (CSVExcelProblems-2.patch). I will fix settings im my IDE and resubmit.

        Show
        Vitalii Tymchyshyn added a comment - I can see spacing problems in my patch (CSVExcelProblems-2.patch). I will fix settings im my IDE and resubmit.
        Hide
        Dmitriy V. Ryaboy added a comment -

        Assigning to Vitalii

        Show
        Dmitriy V. Ryaboy added a comment - Assigning to Vitalii
        Hide
        Dmitriy V. Ryaboy added a comment -

        Let me try that again..

        +                if (i == recordLen - 1) {
        +                    fieldBuffer.put(b);
        +					sawEmbeddedRecordDelimiter = true;
        +                }
        
        Show
        Dmitriy V. Ryaboy added a comment - Let me try that again.. + if (i == recordLen - 1) { + fieldBuffer.put(b); + sawEmbeddedRecordDelimiter = true ; + }
        Hide
        Dmitriy V. Ryaboy added a comment -

        Looks good, but the spacing is kind of crazy – did tabs get in or something?

        + if (i == recordLen - 1)

        { + fieldBuffer.put(b); + sawEmbeddedRecordDelimiter = true; + }
        Show
        Dmitriy V. Ryaboy added a comment - Looks good, but the spacing is kind of crazy – did tabs get in or something? + if (i == recordLen - 1) { + fieldBuffer.put(b); + sawEmbeddedRecordDelimiter = true; + }
        Hide
        Vitalii Tymchyshyn added a comment -
        Show
        Vitalii Tymchyshyn added a comment - Patch from https://github.com/apache/pig/pull/6/
        Hide
        Vitalii Tymchyshyn added a comment -

        Actually there is one more problem: "Test ""quoted""\ndata" also does not work OK. Fixed in second commit. Will attach updated patch to this ticket. Also code is in pull request

        Show
        Vitalii Tymchyshyn added a comment - Actually there is one more problem: "Test ""quoted""\ndata" also does not work OK. Fixed in second commit. Will attach updated patch to this ticket. Also code is in pull request
        Hide
        Vitalii Tymchyshyn added a comment -

        I've also got into this bug.
        Should have check jira for patch before fixing , but here is my pull request with patch in case this would make fix progress faster: https://github.com/apache/pig/pull/6

        Show
        Vitalii Tymchyshyn added a comment - I've also got into this bug. Should have check jira for patch before fixing , but here is my pull request with patch in case this would make fix progress faster: https://github.com/apache/pig/pull/6
        Hide
        Peter Welch added a comment -

        Patch file attached

        Show
        Peter Welch added a comment - Patch file attached
        Hide
        Peter Welch added a comment -

        Index: src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java
        ===================================================================
        — src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java (revision 1294285)
        +++ src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java (working copy)
        @@ -49,6 +49,7 @@

        String testFileCommaName = "testFileComma.csv";
        String testFileTabName = "testFileTab.csv";
        + String testFileNewlines = "testFileNewlines.csv";

        String testStrComma =
        "John,Doe,10\n" +
        @@ -124,7 +125,36 @@
        add(Util.createTuple(new String[]

        {"Frank","Clean","70"}

        ));
        }
        };

        • +
          + String[] testFileNewlinesArray = new String[]

          { + "One,Two,Three", + "123,\"\nSecond line\nThird line\", \"456\"" // notice that the space after the comma but before the quote + // is considered to be part of the 3rd field. TBD if that's correct. + }

          ;
          +
          + @SuppressWarnings("serial")
          + ArrayList<Tuple> testStrNewlinesResultTuples =
          + new ArrayList<Tuple>() {
          +

          Unknown macro: {+ add(Util.createTuple(new String[] {"One","Two","Three"}));+ add(Util.createTuple(new String[] {"123", "\nSecond line\nThird line"," 456"}));+ }

          + };
          +
          +
          + @Test
          + public void testNewline() throws IOException

          { + + // Read the test file: + String script = + "a = LOAD '" + testFileNewlines + "' " + + "USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE');"; + Util.registerMultiLineQuery(pigServer, script); + compareExpectedActual(testStrNewlinesResultTuples, "a"); + + + }

          +
          public TestCSVExcelStorage() throws ExecException, IOException

          { pigServer = new PigServer(ExecType.LOCAL); @@ -135,6 +165,7 @@ Util.createLocalInputFile(testFileCommaName, testStrCommaArray); Util.createLocalInputFile(testFileTabName, testStrTabArray); + Util.createLocalInputFile(testFileNewlines, testFileNewlinesArray); }

        @Test
        @@ -148,7 +179,7 @@
        assertEquals(Util.createTuple(new String[]

        {"foo", "bar", "baz"}

        ), it.next());
        }

        • @Test
          + @Test
          public void testQuotedCommas() throws IOException {
          String inputFileName = "TestCSVExcelStorage-quotedcommas.txt";
          Util.createLocalInputFile(inputFileName, new String[] {"\"foo,bar,baz\"", "fee,foe,fum"}

          );
          Index: src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
          ===================================================================

            • src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java (revision 1294285)
              +++ src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java (working copy)
              @@ -622,6 +622,10 @@
              // that entire field is quoted:
              getNextInQuotedField = true;
              evenQuotesSeen = true;
              + if (i == recordLen - 1) { + fieldBuffer.put(b); + sawEmbeddedRecordDelimiter = true; + }

              } else if (b == FIELD_DEL)

              { readField(fieldBuffer, getNextFieldID++); // end of the field }

              else {

        Show
        Peter Welch added a comment - Index: src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java =================================================================== — src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java (revision 1294285) +++ src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java (working copy) @@ -49,6 +49,7 @@ String testFileCommaName = "testFileComma.csv"; String testFileTabName = "testFileTab.csv"; + String testFileNewlines = "testFileNewlines.csv"; String testStrComma = "John,Doe,10\n" + @@ -124,7 +125,36 @@ add(Util.createTuple(new String[] {"Frank","Clean","70"} )); } }; + + String[] testFileNewlinesArray = new String[] { + "One,Two,Three", + "123,\"\nSecond line\nThird line\", \"456\"" // notice that the space after the comma but before the quote + // is considered to be part of the 3rd field. TBD if that's correct. + } ; + + @SuppressWarnings("serial") + ArrayList<Tuple> testStrNewlinesResultTuples = + new ArrayList<Tuple>() { + Unknown macro: {+ add(Util.createTuple(new String[] {"One","Two","Three"}));+ add(Util.createTuple(new String[] {"123", "\nSecond line\nThird line"," 456"}));+ } + }; + + + @Test + public void testNewline() throws IOException { + + // Read the test file: + String script = + "a = LOAD '" + testFileNewlines + "' " + + "USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE');"; + Util.registerMultiLineQuery(pigServer, script); + compareExpectedActual(testStrNewlinesResultTuples, "a"); + + + } + public TestCSVExcelStorage() throws ExecException, IOException { pigServer = new PigServer(ExecType.LOCAL); @@ -135,6 +165,7 @@ Util.createLocalInputFile(testFileCommaName, testStrCommaArray); Util.createLocalInputFile(testFileTabName, testStrTabArray); + Util.createLocalInputFile(testFileNewlines, testFileNewlinesArray); } @Test @@ -148,7 +179,7 @@ assertEquals(Util.createTuple(new String[] {"foo", "bar", "baz"} ), it.next()); } @Test + @Test public void testQuotedCommas() throws IOException { String inputFileName = "TestCSVExcelStorage-quotedcommas.txt"; Util.createLocalInputFile(inputFileName, new String[] {"\"foo,bar,baz\"", "fee,foe,fum"} ); Index: src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java =================================================================== src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java (revision 1294285) +++ src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java (working copy) @@ -622,6 +622,10 @@ // that entire field is quoted: getNextInQuotedField = true; evenQuotesSeen = true; + if (i == recordLen - 1) { + fieldBuffer.put(b); + sawEmbeddedRecordDelimiter = true; + } } else if (b == FIELD_DEL) { readField(fieldBuffer, getNextFieldID++); // end of the field } else {

          People

          • Assignee:
            Vitalii Tymchyshyn
            Reporter:
            Peter Welch
          • Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development