Details

    • Type: New Feature
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      CSVTextInputFormat is a configurable CSV parser tuned to most of the CSV-style datasets I've found. The Hadoop samples I've seen all use FileInputFormat and Mapper<LongWritable,Text>: they drop the LongWritable key and parse the Text value as a CSV line, but they are all custom-coded for one format.

      CSVTextInputFormat takes any CSV-encoded file and rearranges the fields into the format required by a Mapper. You can drop fields and rearrange them, and there is a random sampling option to make training/test runs easier.
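      A minimal sketch of how a job might wire this up; the "csvinput.*" configuration key names below are hypothetical, and the attached CSVTextInputFormat.java defines the actual ones.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

      public class CsvJobDriver {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.set("csvinput.separator", ",");    // assumed key: field delimiter
          conf.set("csvinput.fields", "2,0,5");   // assumed key: select/reorder output fields
          conf.setFloat("csvinput.sample", 0.1f); // assumed key: keep ~10% of records
          Job job = new Job(conf, "csv-example"); // hadoop-0.20-era constructor
          job.setInputFormatClass(CSVTextInputFormat.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }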

      Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.

      This is compiled against hadoop-0.20.

      1. CSVTextInputFormat.java
        8 kB
        Lance Norskog
      2. TestCSVTextFormat.java
        3 kB
        Lance Norskog

        Activity

        Allen Wittenauer added a comment -

        Any chance this could get changed to CombineFile/MultiFile instead?

        Lance Norskog added a comment -

        Artfully phrased. Ah, the virtues of the passive voice.

        I only learned enough file I/O to make this work, and I only work with small development datasets, not production. So, no, it never occurred to me that it would need more machinery to support multi-file directories. This is in hadoop-0.20.2. I work in Mahout, not Hadoop, and I'm not upgrading Hadoop until Mahout makes me.

        How did you envision this modification? It looks like the RecordReader would need to be public and would need a constructor matching this line:

        // org.apache.hadoop.mapred.lib.CombineFileRecordReader<K, V>, line 144
        curReader = rrConstructor.newInstance(
            new Object[] {split, jc, reporter, Integer.valueOf(idx)});
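
        For reference, a sketch of a delegate reader shaped the way that reflection call expects; the constructor parameter types are inferred from the Object[] above (old org.apache.hadoop.mapred API), and the body is stubbed.

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;
        import org.apache.hadoop.mapred.lib.CombineFileSplit;

        public class CSVCombineRecordReader implements RecordReader<LongWritable, Text> {
          // Signature inferred from {split, jc, reporter, Integer.valueOf(idx)}.
          public CSVCombineRecordReader(CombineFileSplit split, Configuration conf,
                                        Reporter reporter, Integer idx) throws IOException {
            // This reader handles file idx of the combined split:
            // open split.getPath(idx) at split.getOffset(idx)
            // for split.getLength(idx) bytes, then parse lines as CSV.
          }
          public boolean next(LongWritable key, Text value) throws IOException { return false; } // stub
          public LongWritable createKey() { return new LongWritable(); }
          public Text createValue() { return new Text(); }
          public long getPos() throws IOException { return 0L; }
          public float getProgress() throws IOException { return 0.0f; }
          public void close() throws IOException { }
        }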

        Lance Norskog added a comment -

        Another use case: one Wikipedia format is:

        1: 1664968
        2: 3 747213 1664968 1691047 4095634 5535664
        

        which would read in as:

        1: 1664968
        2: 3 
        2: 747213 
        2: 1664968
        etc.
        
        XiaoboGu added a comment -

        How do you handle CSV file headers, or are they not supported?

        Lance Norskog added a comment -

        Hadoop assumes that it will process several files of the same format. Will every CSV file have the same header? If you split a giant CSV file into many pieces, will you reproduce the header line in the 2nd through Nth files?

        Hadoop jobs are generally configured with total knowledge of the data. The mappers are hard-coded for the input formats.

        The code could include a rule for deciding that the first line is a header and skipping over it. That would be worth adding.
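
        A minimal sketch of such a rule, with an assumed configuration key; the real heuristic would live in the record reader's initialization.

        // Skip the first line only when this split starts at byte 0 of its file
        // and the job declares that the data carries a header line.
        // "csvinput.skip.header" is an assumed key, not one from the patch.
        private boolean shouldSkipHeader(org.apache.hadoop.conf.Configuration conf,
                                         long splitStart) {
          return splitStart == 0 && conf.getBoolean("csvinput.skip.header", false);
        }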

        XiaoboGu added a comment -

        There are two scenarios:
        1. A single huge CSV file with a header.
        2. Many medium-sized CSV files with the same format and header.

        Maksym Kovalenko added a comment -

        So what regex would one need to specify to parse "normal" CSV that uses a comma as the delimiter and happens to have a comma in one of the values? For example:

        value1,value2,"more,complex,with,commas,value3"

        Just providing "," as pattern1 will no longer work, as it will produce 7 columns for the above case instead of 3.

        Also consider the following use case, where a value contains a double quote. In this case, according to CSV escaping rules, it has to be escaped by another double quote, for example:

        column1,"thank you, ""User"" for the report, again, thank you",column3

        Considering the above two cases, what value for pattern1 should I provide?

        I think configuration of CSVTextInputFormat would be more natural if, instead of patterns, one had to provide a delimiter character (comma by default) and a quote character (double quote by default). Then I and other users wouldn't have to struggle with possible regex patterns (see my questions above; I'm still curious whether you can come up with one).

        Another benefit is that from the delimiter and quote characters you can create any regexes you need, if you want to stick to the current implementation. By the way, right now there is some fragility in the implementation where you prepend the user-provided regex with a "". This will break when the user-supplied pattern itself starts with "".

        Harsh J added a comment -

        I'd suggest reusing OpenCSV instead, if possible. I believe the license is compatible, and it is well maintained.
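
        For example, a minimal parse with OpenCSV (class and package names are from opencsv 2.x):

        import au.com.bytecode.opencsv.CSVParser;

        public class OpenCsvExample {
          public static void main(String[] args) throws Exception {
            CSVParser parser = new CSVParser(',', '"'); // delimiter, quote character
            String[] fields = parser.parseLine(
                "column1,\"thank you, \"\"User\"\" for the report\",column3");
            for (String f : fields) {
              System.out.println(f); // prints 3 fields with quoting resolved
            }
          }
        }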

        Marcelo Elias Del Valle added a comment -

        I created an improved version of a CSVInputFormat, able to read multiline CSVs, in case it is of interest: https://github.com/mvallebr/CSVInputFormat

        Christian Tzolov added a comment -

        Hi Marcelo, the multiline CSVInputFormat inherits the getSplits() implementation from the parent FileInputFormat, so I see a potential risk of splitting one multiline record across two (or more) different splits.
        Is this a valid concern, or might I be missing something?

        Marcelo Elias Del Valle added a comment -

        Christian, this is a valid concern. Actually, when I created the first version of this input format, I chose to use the CSV line numbers as the keys. That worked well until I tested it on a cluster (Amazon EMR with 15 instances); then I realized the line number wasn't a good key, as it wouldn't produce the right results across cluster nodes.
        I fixed that by using the file position as the input key, just as NLineInputFormat does (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html).
        I have tested it a lot and so far have found no problems. However, if you find a problem I didn't see, please tell me, as I would be very interested in fixing it.
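
        A sketch of that convention: the key is the byte offset where each record starts, which is stable no matter which node reads the split (readMultilineRecord is a hypothetical helper standing in for the custom reader's parsing).

        // Key = byte offset of the record's start, the same convention
        // LineRecordReader uses; readMultilineRecord is hypothetical.
        public boolean nextKeyValue() throws IOException {
          key.set(pos);                                // pos: current byte offset in the file
          int bytesRead = readMultilineRecord(value);  // fills 'value', returns bytes consumed
          if (bytesRead == 0) {
            return false;                              // end of this split
          }
          pos += bytesRead;
          return true;
        }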

        Marcelo Elias Del Valle added a comment -

        Oh, just to add to that: I realize you possibly meant something different in your question... You are concerned about a single CSV line being split in two across different splits, right? No, that won't happen, because I wrote a custom reader that reads N lines at a time. The getSplits() method uses the reader to read exactly N lines and compute the splits, so getSplits() will never return half of a line; you can actually configure how many lines you want in each split.
        Yes, this is also a valid concern, and I took care of it. I am sorry, I hadn't understood your question well the first time I read it.

        Christian Tzolov added a comment -

        Ah, I had only looked at CSVTextInputFormat, which doesn't override getSplits(). CSVNLineInputFormat does indeed.

        So the CSVNLineInputFormat implementation reads the entire data set twice? Once to compute the splits, and a second pass for the actual read in the map tasks.
        While the double-pass approach is unavoidable (IMO), I wonder what the performance (and perhaps scalability) impact is. Do you have any numbers comparing the standard vs. multiline implementations?
        Thanks, Chris

        Marcelo Elias Del Valle added a comment -

        CSVTextInputFormat was my first attempt at this input format; I should remove it from GitHub later... If you look at the example, you will see I am only using CSVNLineInputFormat. Please don't consider using this class (CSVTextInputFormat), as it probably doesn't work.

        Honestly, I would have the same concern you had when considering CSVTextInputFormat: looking at the getSplits() code (http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/FileInputFormat.java#FileInputFormat.getSplits%28org.apache.hadoop.mapred.JobConf%2Cint%29), I have the impression the file could be split in the middle of a line, even in the case where you have single-line text files. I could be wrong, but to the best of my knowledge, this is how it works.

        However, if you use CSVTextInputFormat and override the isSplitable() method to return false, it could be useful and would avoid two passes over the same file if you have thousands of small files instead of one huge file, as in my case. By doing that, you would ensure one split per file.
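
        A sketch of that override, assuming the new-API FileInputFormat (where the hook is spelled isSplitable):

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.JobContext;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

        // Force one split per file so a reader never sees a partial record;
        // declared abstract because createRecordReader() is out of scope here.
        public abstract class WholeFileCsvInputFormat
            extends FileInputFormat<LongWritable, Text> {
          @Override
          protected boolean isSplitable(JobContext context, Path file) {
            return false; // each file becomes exactly one split
          }
        }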


          People

          • Assignee: Unassigned
          • Reporter: Lance Norskog
          • Votes: 0
          • Watchers: 11
