Pig
  1. Pig
  2. PIG-3141

Giving CSVExcelStorage an option to handle header rows

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.11
    • Fix Version/s: 0.12.0
    • Component/s: piggybank
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Adds an argument to CSVExcelStorage to skip the header row when loading. This works properly with multiple small files each with a header being combined into one split, or a large file with a single header being split into multiple splits.

      Also fixes a few bugs with CSVExcelStorage, including PIG-2470 and a bug involving quoted fields at the end of a line not escaping properly.

      1. csv.patch
        88 kB
        Jonathan Packer
      2. csv_updated.patch
        89 kB
        Jonathan Packer
      3. PIG-3141_update_3.diff
        75 kB
        Jonathan Packer
      4. PIG-3141_update_4.diff
        76 kB
        Jonathan Packer

        Issue Links

          Activity

          Hide
          Jonathan Packer added a comment -

          See issue description

          Show
          Jonathan Packer added a comment - See issue description
          Hide
          Jonathan Coveney added a comment -

          Can you put this into RB?

          Show
          Jonathan Coveney added a comment - Can you put this into RB?
          Hide
          Kai Londenberg added a comment -

          Why remove the choice of delimiter ?

          While the name correctly suggests that commas are used to delimit fields, many datasets use another delimiter for very good reasons.

          Show
          Kai Londenberg added a comment - Why remove the choice of delimiter ? While the name correctly suggests that commas are used to delimit fields, many datasets use another delimiter for very good reasons.
          Hide
          Jonathan Packer added a comment -

          Re-adds the choice of delimiter, and makes the "header treatment" argument the last one to maintain backwards compatibility. All unit-tests from the old CSVExcelStorage pass.

          Show
          Jonathan Packer added a comment - Re-adds the choice of delimiter, and makes the "header treatment" argument the last one to maintain backwards compatibility. All unit-tests from the old CSVExcelStorage pass.
          Hide
          Jonathan Packer added a comment -

          Hi, I re-added the choice of delimiter. I'm bothered by the fact that you have to explicitly state a ',' delimiter if you ever want to use the other parameters, but I agree in retrospect that it's not worth removing the feature entirely. The lastest patch should be completely backwards compatible.

          Show
          Jonathan Packer added a comment - Hi, I re-added the choice of delimiter. I'm bothered by the fact that you have to explicitly state a ',' delimiter if you ever want to use the other parameters, but I agree in retrospect that it's not worth removing the feature entirely. The lastest patch should be completely backwards compatible.
          Hide
          Jonathan Coveney added a comment -

          Can you toss this into reviewboard?

          Show
          Jonathan Coveney added a comment - Can you toss this into reviewboard?
          Hide
          Jonathan Packer added a comment -

          fixed diff so it's from trunk instead of branch-0.11

          Show
          Jonathan Packer added a comment - fixed diff so it's from trunk instead of branch-0.11
          Hide
          Jonathan Packer added a comment -

          Hi, I put it on reviewboard here: https://reviews.apache.org/r/9697/ . Sorry, I hadn't known what that was earlier.

          Show
          Jonathan Packer added a comment - Hi, I put it on reviewboard here: https://reviews.apache.org/r/9697/ . Sorry, I hadn't known what that was earlier.
          Hide
          Cheolsoo Park added a comment -

          Thank you Jonathan P. for the patch! I made some comments to the RB.

          Show
          Cheolsoo Park added a comment - Thank you Jonathan P. for the patch! I made some comments to the RB.
          Hide
          Jonathan Packer added a comment -

          Updated diff with code review changes (also updated on ReviewBoard). Thanks for taking a look at this and the fixed-width patch Cheolsoo.

          Show
          Jonathan Packer added a comment - Updated diff with code review changes (also updated on ReviewBoard). Thanks for taking a look at this and the fixed-width patch Cheolsoo.
          Hide
          Cheolsoo Park added a comment -

          +1. Committed to trunk. Thanks Jonathan P!

          Note that I got rid of all the ^M's in the following files while committing them:

          • contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
          • contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java
          Show
          Cheolsoo Park added a comment - +1. Committed to trunk. Thanks Jonathan P! Note that I got rid of all the ^M's in the following files while committing them: contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java

            People

            • Assignee:
              Jonathan Packer
              Reporter:
              Jonathan Packer
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development