  1. Spark
  2. SPARK-1443

Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API


Details

    Description

      I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process the data in those GridFS collections using the Java Spark MapReduce API. Previously I successfully processed MongoDB collections with Apache Spark using the Mongo-Hadoop connector, but I am unable to process GridFS collections with the following code.

      MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/pdfbooks.fs.chunks");
      MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
      JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
          com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);
      JavaRDD<String> words = mongoRDD.flatMap(
          new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
            @Override
            public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
              System.out.println(arg._2.toString());
              ...
      Please suggest or provide better API methods for accessing MongoDB GridFS data.
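
      For context on what the code above reads: GridFS stores a file as many documents in fs.chunks, each carrying a files_id, a sequence number n, and a binary data field, so each record handed to the flatMap is one raw slice of the PDF rather than text. The sketch below is only an illustration of the regrouping step in plain Java, not a mongo-hadoop or Spark API. The Chunk class, the GridFsChunkSketch class, and the reassemble method are hypothetical names, and files_id is simplified to a String.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one fs.chunks document (fields: files_id, n, data).
final class Chunk {
    final String filesId; // files_id of the owning file (an ObjectId in MongoDB; a String here for simplicity)
    final int n;          // zero-based position of this chunk within the file
    final byte[] data;    // binary payload of the chunk

    Chunk(String filesId, int n, byte[] data) {
        this.filesId = filesId;
        this.n = n;
        this.data = data;
    }
}

// Hypothetical helper, not part of any MongoDB, mongo-hadoop, or Spark library.
final class GridFsChunkSketch {

    // Groups chunks by files_id, orders each group by n, and concatenates the payloads.
    static Map<String, byte[]> reassemble(List<Chunk> chunks) {
        Map<String, List<Chunk>> byFile = new LinkedHashMap<>();
        for (Chunk c : chunks) {
            byFile.computeIfAbsent(c.filesId, k -> new ArrayList<>()).add(c);
        }

        Map<String, byte[]> files = new LinkedHashMap<>();
        for (Map.Entry<String, List<Chunk>> entry : byFile.entrySet()) {
            List<Chunk> parts = entry.getValue();
            // Chunks may arrive in any order, so restore the order given by n.
            parts.sort(Comparator.comparingInt((Chunk c) -> c.n));

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Chunk c : parts) {
                out.write(c.data, 0, c.data.length);
            }
            files.put(entry.getKey(), out.toByteArray());
        }
        return files;
    }
}
```

      In a Spark job the same idea would mean keying the chunk records by files_id and ordering them by n before treating the bytes as a PDF. For a 2 GB file, holding all chunks of one file in memory at once may not be practical, so this is a sketch of the data layout rather than a recommended pipeline.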

          People

            Assignee: Unassigned
            Reporter: Pavan Kumar Varma
            Votes: 0
            Watchers: 2

              Time Tracking

                Estimated: 12h
                Remaining: 12h
                Logged: Not Specified