Amazon hosts a large Google n-grams data set on S3. This data set is perfect for, among other things, putting together interesting and easily reproducible public demos of Spark's capabilities.
The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average spark-ec2 cluster to read input compressed in this way.
This is what one has to go through to get a Spark cluster created with spark-ec2 to read LZO-compressed files:
- Install the latest LZO release, perhaps via yum.
- Download hadoop-lzo and build it; building it requires Maven.
- Install Maven. For some reason, you cannot install Maven with yum, so install it manually.
- Update your core-site.xml and spark-env.sh with the appropriate configs.
- Make the appropriate calls to sc.newAPIHadoopFile.
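Concretely, the setup steps above look roughly like the following on each node. This is a sketch, not a tested recipe: the Maven version, download URL, and install paths are assumptions, and the exact native library locations vary by AMI.

```shell
# 1. Install the native LZO libraries (available via yum on Amazon Linux).
sudo yum install -y lzo lzo-devel

# 2. Maven is not in the default yum repos, so install a binary release
#    manually (version and mirror below are placeholders).
curl -O https://archive.apache.org/dist/maven/maven-3/3.2.5/binaries/apache-maven-3.2.5-bin.tar.gz
sudo tar -xzf apache-maven-3.2.5-bin.tar.gz -C /opt
export PATH=/opt/apache-maven-3.2.5/bin:$PATH

# 3. Build hadoop-lzo, which provides LzoTextInputFormat and the LZO codecs.
git clone https://github.com/twitter/hadoop-lzo.git
cd hadoop-lzo
mvn clean package -Dmaven.test.skip=true
```

After that, core-site.xml needs `io.compression.codecs` to include `com.hadoop.compression.lzo.LzopCodec`, and spark-env.sh needs the built hadoop-lzo jar on the classpath and the native libraries on the library path. With the configuration in place, the read itself might look like this in the Spark shell (the S3 path below points at the public n-grams dataset and may need adjusting):

```scala
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Read LZO-compressed n-gram records as (position, line) pairs,
// then keep just the line text.
val ngrams = sc.newAPIHadoopFile(
    "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, line) => line.toString }

ngrams.take(5).foreach(println)
```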
This seems like a bit too much work for what we're trying to accomplish.
If we expect this to be a common pattern – reading LZO-compressed files from a spark-ec2 cluster – it would be great if we could somehow make this less painful.