SPARK-2394

Make it easier to read LZO-compressed files from EC2 clusters


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: EC2, Input/Output

    Description

      Amazon hosts a large Google n-grams data set on S3. This data set is perfect for, among other things, putting together interesting and easily reproducible public demos of Spark's capabilities.

      The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average spark-ec2 cluster to read input compressed in this way.

      This is what one has to go through to get a Spark cluster created with spark-ec2 to read LZO-compressed files:

      1. Install the latest LZO release, perhaps via yum.
      2. Download hadoop-lzo and build it. To build hadoop-lzo you need Maven.
      3. Install Maven. For some reason, you cannot install Maven with yum, so install it manually.
      4. Update your core-site.xml and spark-env.sh with the appropriate configs.
      5. Make the appropriate calls to sc.newAPIHadoopFile (see the sketch after this list).
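
      For reference, step 5 typically looks something like the minimal sketch below (run in spark-shell). The S3 path is a placeholder, and the sketch assumes the hadoop-lzo jar built in step 2 is on the classpath and that com.hadoop.compression.lzo.LzoCodec is registered under io.compression.codecs in core-site.xml per step 4:

          // Minimal sketch of step 5; assumes hadoop-lzo is on the classpath
          // and the LZO codec is registered in core-site.xml.
          import org.apache.hadoop.io.{LongWritable, Text}
          import com.hadoop.mapreduce.LzoTextInputFormat

          // Placeholder path; point this at the actual LZO-compressed data on S3.
          val records = sc.newAPIHadoopFile(
            "s3n://your-bucket/path/to/lzo-compressed-data",
            classOf[LzoTextInputFormat],
            classOf[LongWritable],
            classOf[Text])

          // Keep only the line contents; the keys are byte offsets into each file.
          val lines = records.map { case (_, text) => text.toString }
          lines.take(5).foreach(println)

      None of this works out of the box on a stock spark-ec2 cluster, which is why steps 1-4 are needed first.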

      This seems like a bit too much work for what we're trying to accomplish.

      If we expect this to be a common pattern – reading LZO-compressed files from a spark-ec2 cluster – it would be great if we could somehow make this less painful.

          People

            Assignee: Unassigned
            Reporter: Nicholas Chammas (nchammas)
            Votes: 0
            Watchers: 3
