Details
-
New Feature
-
Status: Open
-
Trivial
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
CSVTextInputFormat is a configurable CSV parser tuned to most of the csv-style datasets I've found. The Hadoop samples I've seen all FileInputFormat and Mapper<LongWritable,Text>. They drop the Longwritable key and parse the Text value as a CSV line. But, they are all custom-coded for the format.
CSVTextInputFormat takes any csv-encoded file and rearrange the fields into the format required by a Mapper. You can drop fields & rearrange them. There is also a random sampling option to make training/test runs easier.
Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.
This is compiled against hadoop-0.0.20.