Details
Type: New Feature
Status: Open
Priority: Minor
Resolution: Unresolved
Description
It would be useful to have a common data generator for testing Hadoop and related projects. Such a tool
should generate data in a specified format and be able to use a Hadoop cluster
to speed up data generation. It could then be used across Hadoop (e.g. GridMix3),
Pig, Hive, etc., reducing the need for each project to build its own.
We can use the data generator from PigMix2 (PIG-200) as a starting point. It is described
at http://wiki.apache.org/pig/DataGeneratorHadoop. Because it depends on the SDSU
Java library (http://www.eli.sdsu.edu/java-SDSU/), which is released under the GNU GPL, it must be
modified to remove this dependency before it can be included in Apache Hadoop.
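As a rough illustration of the idea above (not the PigMix2 generator and not a proposed design), here is a minimal sketch of how row generation could be split into independent slices so that each slice could later be produced by a separate task, for example one map task per slice. It uses only the JDK. Every name in it (DataGenSketch, ColumnType, generateSlice, the tab-separated output, the per-task seeding scheme) is a hypothetical choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: none of these names come from PigMix2 or Hadoop.
public class DataGenSketch {

    // Column types assumed for illustration only.
    enum ColumnType { INT, DOUBLE, STRING }

    // Generates the rows assigned to one task as tab-separated lines.
    // Each task derives its own seed (baseSeed + taskId), so tasks need no
    // coordination and the same arguments always yield the same rows.
    static List<String> generateSlice(ColumnType[] schema, long totalRows,
                                      int numTasks, int taskId, long baseSeed) {
        // Spread rows evenly; the first (totalRows % numTasks) tasks get one extra row.
        long rows = totalRows / numTasks + (taskId < totalRows % numTasks ? 1 : 0);
        Random rng = new Random(baseSeed + taskId);
        List<String> out = new ArrayList<>();
        for (long r = 0; r < rows; r++) {
            StringBuilder line = new StringBuilder();
            for (int c = 0; c < schema.length; c++) {
                if (c > 0) line.append('\t');
                switch (schema[c]) {
                    case INT:    line.append(rng.nextInt(1000)); break;
                    case DOUBLE: line.append(rng.nextDouble()); break;
                    case STRING: line.append(randomWord(rng, 8)); break;
                }
            }
            out.add(line.toString());
        }
        return out;
    }

    // Builds a random lowercase word of the given length.
    static String randomWord(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) sb.append((char) ('a' + rng.nextInt(26)));
        return sb.toString();
    }

    public static void main(String[] args) {
        ColumnType[] schema = { ColumnType.INT, ColumnType.STRING, ColumnType.DOUBLE };
        // Simulate three tasks locally; on a cluster each slice would run in its own task.
        for (int task = 0; task < 3; task++) {
            for (String row : generateSlice(schema, 10, 3, task, 42L)) {
                System.out.println(row);
            }
        }
    }
}
```

Because each slice depends only on its own arguments, a cluster version could assign one taskId to each task and write the slices out in parallel. How that maps onto actual Hadoop job configuration is left open here.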