Details

    • Type: New Feature
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      CSVTextInputFormat is a configurable CSV parser tuned to most of the CSV-style datasets I've found. The Hadoop samples I've seen all use FileInputFormat and Mapper<LongWritable,Text>: they drop the LongWritable key and parse the Text value as a CSV line. But each one is custom-coded for its own format.

      CSVTextInputFormat takes any CSV-encoded file and rearranges the fields into the format required by a Mapper. You can drop fields and reorder them. There is also a random sampling option to make training/test runs easier.
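
To make the drop/reorder and sampling ideas concrete, here is a minimal plain-Java sketch. It is not the attached CSVTextInputFormat code and does not use the Hadoop API. The class name `CsvFieldSelector`, the methods `select` and `keep`, and the sample data are all hypothetical, and the naive comma split assumes no quoted fields or embedded commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Illustrative sketch only: hypothetical names, not the attached class.
 * Shows (1) dropping/reordering CSV columns and (2) random sampling of lines.
 */
public class CsvFieldSelector {

    /**
     * Returns the fields of one CSV line in the order given by `order`.
     * Columns not listed in `order` are dropped.
     * Assumption: simple comma-separated values, no quoting or escaping.
     * An index past the last column throws ArrayIndexOutOfBoundsException.
     */
    public static List<String> select(String line, int[] order) {
        // limit -1 keeps trailing empty fields such as "a,,"
        String[] fields = line.split(",", -1);
        List<String> out = new ArrayList<>(order.length);
        for (int idx : order) {
            out.add(fields[idx]);
        }
        return out;
    }

    /** Keeps a record with probability `rate` (0.0 keeps none, 1.0 keeps all). */
    public static boolean keep(Random rng, double rate) {
        // nextDouble() returns a value in [0.0, 1.0)
        return rng.nextDouble() < rate;
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed so sampled runs are repeatable
        String[] lines = {"1,alice,30,NY", "2,bob,25,CA", "3,carol,41,TX"};
        for (String line : lines) {
            if (keep(rng, 0.5)) {
                // Emit column 2 first, then column 0; columns 1 and 3 are dropped.
                System.out.println(String.join("\t", select(line, new int[] {2, 0})));
            }
        }
    }
}
```

In a real input format the column order and sampling rate would presumably come from the job configuration rather than being hard-coded; how the attached class exposes those settings is not shown here.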

      Attached are CSVTextInputFormat.java and a unit test for it. Both belong in the package org.apache.hadoop.mapreduce.lib.input, under src/java and test/mapred/src respectively.

      This is compiled against hadoop-0.0.20.

      1. CSVTextInputFormat.java
        8 kB
        Lance Norskog
      2. TestCSVTextFormat.java
        3 kB
        Lance Norskog

          People

          • Assignee:
            Unassigned
            Reporter:
            Lance Norskog
          • Votes:
            0
            Watchers:
            11
