  1. Hadoop Map/Reduce
  2. MAPREDUCE-2208

Flexible CSV text parser InputFormat

Details

    • Type: New Feature
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved

    Description

      CSVTextInputFormat is a configurable CSV parser tuned to most of the CSV-style datasets I've found. The Hadoop samples I've seen all use FileInputFormat and Mapper<LongWritable,Text>: they drop the LongWritable key and parse the Text value as a CSV line. But each one is custom-coded for its own format.

      CSVTextInputFormat takes any CSV-encoded file and rearranges the fields into the format required by a Mapper. You can drop fields and reorder them. There is also a random sampling option to make training/test runs easier.
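
      The field selection and sampling described above can be illustrated with a minimal plain-Java sketch. This is not the attached CSVTextInputFormat: the class name CsvFieldSelector, its constructor parameters, and the naive comma split are all assumptions made for illustration, and the real attachment may configure these options differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper for illustration only; not the attached CSVTextInputFormat. */
class CsvFieldSelector {
    private final int[] order;       // source column index for each output column (assumed >= 0)
    private final double sampleRate; // fraction of lines to keep, in [0, 1]
    private final Random random;     // seeded so sampling runs are repeatable

    CsvFieldSelector(int[] order, double sampleRate, long seed) {
        this.order = order.clone();
        this.sampleRate = sampleRate;
        this.random = new Random(seed);
    }

    /** Drops and reorders columns. Naive split on commas; quoted fields are not handled. */
    List<String> project(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        List<String> out = new ArrayList<>(order.length);
        for (int index : order) {
            // Columns missing from a short line become empty strings instead of throwing.
            out.add(index < fields.length ? fields[index].trim() : "");
        }
        return out;
    }

    /** Returns true if the current line should be kept by the random sampler. */
    boolean keep() {
        return random.nextDouble() < sampleRate;
    }

    public static void main(String[] args) {
        // Keep column 2 then column 0, drop column 1, and sample every line.
        CsvFieldSelector selector = new CsvFieldSelector(new int[] {2, 0}, 1.0, 42L);
        String line = "alice,30,london";
        if (selector.keep()) {
            System.out.println(selector.project(line)); // prints [london, alice]
        }
    }
}
```

      In an InputFormat, logic like this would presumably run inside the record reader, so that the Mapper receives only the fields it needs, already in order.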

      Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.

      This is compiled against hadoop-0.20.

      Attachments

        1. CSVTextInputFormat.java
          8 kB
          Lance Norskog
        2. TestCSVTextFormat.java
          3 kB
          Lance Norskog

          People

            Assignee: Unassigned
            Reporter: Lance Norskog (lancenorskog)
            Votes: 0
            Watchers: 11
