[SPARK-1443] Unable to Access MongoDB GridFS data with Spark using mongo-hadoop API - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Done
Affects Version/s: 0.9.0
Fix Version/s: None
Component/s: Input/Output, Java API, Spark Core
Labels:
- GridFS
- MongoDB
- Spark
- hadoop2
- java
Environment:

Java 1.7, Hadoop 2.2.0, Spark 0.9.0, Ubuntu 12.04

Description

I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process the data in those GridFS collections using the Java Spark MapReduce API. I have previously processed MongoDB collections successfully with Apache Spark using the Mongo-Hadoop connector, but I am unable to access the GridFS collections with the following code.

MongoConfigUtil.setInputURI(config, "mongodb://localhost:27017/pdfbooks.fs.chunks");
MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
        com.mongodb.hadoop.MongoInputFormat.class, Object.class,
        BSONObject.class);
JavaRDD<String> words = mongoRDD.flatMap(
        new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
    @Override
    public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
        System.out.println(arg._2.toString());
        ...
Please suggest/provide better API methods to access MongoDB GridFS data.
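
For context, each document in a GridFS `fs.chunks` collection carries a `files_id`, a sequence number `n`, and a binary `data` field, so the records read above are file fragments rather than whole files. The following is a minimal sketch, in plain Java with no Spark or MongoDB dependency, of how such fragments could be grouped by file and concatenated in order. The `GridFsChunkAssembler` and `Chunk` names are hypothetical, and `files_id` is simplified to a `String` here (in MongoDB it is typically an ObjectId). It is not part of the mongo-hadoop API.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GridFsChunkAssembler {

    /** Hypothetical holder for the three fields of one fs.chunks document. */
    public static final class Chunk {
        final String filesId; // "files_id" field (simplified to a String for this sketch)
        final int n;          // "n" field: zero-based position of the chunk within its file
        final byte[] data;    // "data" field: raw bytes of this chunk

        public Chunk(String filesId, int n, byte[] data) {
            this.filesId = filesId;
            this.n = n;
            this.data = data;
        }
    }

    /** Groups chunks by file id, sorts each group by n, and concatenates the bytes. */
    public static Map<String, byte[]> reassemble(List<Chunk> chunks) {
        // Step 1: group chunks by the file they belong to.
        Map<String, List<Chunk>> byFile = new LinkedHashMap<String, List<Chunk>>();
        for (Chunk c : chunks) {
            List<Chunk> parts = byFile.get(c.filesId);
            if (parts == null) {
                parts = new ArrayList<Chunk>();
                byFile.put(c.filesId, parts);
            }
            parts.add(c);
        }

        // Step 2: order each group by sequence number and join the payloads.
        Map<String, byte[]> files = new LinkedHashMap<String, byte[]>();
        for (Map.Entry<String, List<Chunk>> e : byFile.entrySet()) {
            List<Chunk> parts = e.getValue();
            Collections.sort(parts, new Comparator<Chunk>() {
                @Override
                public int compare(Chunk a, Chunk b) {
                    return Integer.compare(a.n, b.n);
                }
            });
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Chunk c : parts) {
                out.write(c.data, 0, c.data.length);
            }
            files.put(e.getKey(), out.toByteArray());
        }
        return files;
    }
}
```

In a Spark job the same grouping would presumably be done by keying each record on `files_id` before combining. For a file as large as the 2 GB PDF described here, holding the whole result in one byte array is likely impractical, so writing the ordered chunks to a stream may be the better choice.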

People

Assignee: Unassigned

Reporter: Pavan Kumar Varma

Votes: 0

Watchers: 2

Dates

Created: 08/Apr/14 13:00

Updated: 21/Sep/14 14:00

Resolved: 21/Sep/14 14:00

Time Tracking

Estimated: 12h

Remaining: 12h

Logged: Not Specified