SPARK-36024

Switch the datasource example due to the deprecation of the dataset

Details

    • Type: Documentation
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 3.1.2
    • Fix Version/s: None
    • Component/s: Documentation
    • Labels: None

    Description

      The S3 bucket that is used as an example in the "Integration with Cloud Infrastructures" document will be deleted on Jul 1, 2021: https://registry.opendata.aws/landsat-8/

      The dataset will move to another bucket, but that bucket requires the `--request-payer requester` option, so users would have to pay the S3 cost themselves: https://registry.opendata.aws/usgs-landsat/

      So I think it's better to change the data source, as in this commit:

      https://github.com/yoda-mon/spark/commit/cdb24acdbb57a429e5bf1729502653b91a600022

      I chose [NYC Taxi data|https://registry.opendata.aws/nyc-tlc-trip-records-pds/] as the example here.
      Unlike the Landsat data, it's not compressed, but it is just an example, and there are several tutorials that use it with Spark (e.g. https://github.com/aws-samples/amazon-eks-apache-spark-etl-sample).

      Read test result:

      scala> sc.textFile("s3a://nyc-tlc/misc/taxi _zone_lookup.csv").take(10).foreach(println)
      "LocationID","Borough","Zone","service_zone"
      1,"EWR","Newark Airport","EWR"
      2,"Queens","Jamaica Bay","Boro Zone"
      3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
      4,"Manhattan","Alphabet City","Yellow Zone"
      5,"Staten Island","Arden Heights","Boro Zone"
      6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
      7,"Queens","Astoria","Boro Zone"
      8,"Queens","Astoria Park","Boro Zone"
      9,"Queens","Auburndale","Boro Zone"
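
      For reference, the same file could presumably also be read through the DataFrame API. The following is only a minimal sketch and is not part of the linked commit. It assumes that the hadoop-aws module and its AWS SDK dependency are on the classpath and that S3A access to the bucket (credentials or anonymous access) is already configured. The application name and the `zones` variable are made up for illustration, and the snippet has not been run against the bucket.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session the shell already created.
// The application name is an arbitrary placeholder.
val spark = SparkSession.builder()
  .appName("nyc-tlc-read-check")
  .getOrCreate()

// The object key really does contain a space ("taxi _zone_lookup.csv"),
// exactly as in the textFile call above.
val zones = spark.read
  .option("header", "true") // treat the first line as column names
  .csv("s3a://nyc-tlc/misc/taxi _zone_lookup.csv")

// Print the first ten rows without truncating long zone names.
zones.show(10, truncate = false)
```

      The RDD-based `sc.textFile` call above is the form the documentation example uses; the DataFrame variant is shown only because it parses the header row into column names.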
      


            People

              Assignee: Unassigned
              Reporter: Leona Yoda (yoda-mon)
              Votes: 0
              Watchers: 3
