Amazon hosts a large Google n-grams data set on S3. This data set is perfect for, among other things, putting together interesting and easily reproducible public demos of Spark's capabilities.
The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average spark-ec2 cluster to read input compressed in this way.
This is what one has to go through to get a Spark cluster created with spark-ec2 to read LZO-compressed files:
- Install the latest LZO release, perhaps via yum.
- Download hadoop-lzo and build it; building it requires Maven.
- Install Maven. For some reason, you cannot install Maven with yum, so install it manually.
- Update your core-site.xml and spark-env.sh with the appropriate configs.
- Make the appropriate calls to sc.newAPIHadoopFile.
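Concretely, the setup steps above look roughly like the following on each node. This is a sketch, not a tested recipe: the Maven version, download URL, and install paths are assumptions, and the exact native library locations vary by AMI.

```shell
# 1. Install the native LZO libraries (available via yum on Amazon Linux).
sudo yum install -y lzo lzo-devel

# 2. Maven is not in the default yum repos, so install a binary release
#    manually (version and mirror below are placeholders).
curl -O https://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
sudo tar -xzf apache-maven-3.2.5-bin.tar.gz -C /opt
export PATH=/opt/apache-maven-3.2.5/bin:$PATH

# 3. Build hadoop-lzo, which provides LzoTextInputFormat and the LZO codecs.
git clone https://github.com/twitter/hadoop-lzo.git
cd hadoop-lzo
mvn clean package -Dmaven.test.skip=true
```

After that, core-site.xml needs `io.compression.codecs` to include `com.hadoop.compression.lzo.LzopCodec`, and spark-env.sh needs the built hadoop-lzo jar on the classpath and the native libraries on the library path. With the configuration in place, the read itself might look like this in the Spark shell (the S3 path below points at the public n-grams dataset and may need adjusting):

```scala
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Read LZO-compressed n-gram records as (position, line) pairs,
// then keep just the line text.
val ngrams = sc.newAPIHadoopFile(
    "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, line) => line.toString }

ngrams.take(5).foreach(println)
```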
This seems like a bit too much work for what we're trying to accomplish.
If we expect this to be a common pattern – reading LZO-compressed files from a spark-ec2 cluster – it would be great if we could somehow make this less painful.