TinkerPop / TINKERPOP-1117

InputFormatRDD.readGraphRDD requires a valid gremlin.hadoop.inputLocation, breaking InputFormats (Cassandra, HBase) that don't need one

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.0-incubating
    • Fix Version/s: 3.1.1-incubating
    • Component/s: hadoop
    • Labels: None

    Description

      On line 43 of InputFormatRDD, the call to Constants.getSearchGraphLocation returns Optional.empty() when gremlin.hadoop.inputLocation=none, the setting advised for Titan's CassandraInputFormat and HBaseInputFormat. Changing the readGraphRDD method to call .isPresent() and set the storage location in the config only when a value is present allows SparkGraphComputer from the 3.2.0-SNAPSHOT branch to work with Titan via CassandraInputFormat in a traversal source:

      // Additional import
      import java.util.Optional;

      @Override
      public JavaPairRDD<Object, VertexWritable> readGraphRDD(final Configuration configuration, final JavaSparkContext sparkContext) {
          final org.apache.hadoop.conf.Configuration hadoopConfiguration = ConfUtil.makeHadoopConfiguration(configuration);
          // Previously this Optional was unwrapped directly inside hadoopConfiguration.set(...).
          final Optional<String> searchGraph = Constants.getSearchGraphLocation(
                  configuration.getString(Constants.GREMLIN_HADOOP_INPUT_LOCATION),
                  FileSystemStorage.open(hadoopConfiguration));
          // Only set the location when one was found, so an input location of "none" no longer fails.
          if (searchGraph.isPresent()) {
              hadoopConfiguration.set(configuration.getString(Constants.GREMLIN_HADOOP_INPUT_LOCATION), searchGraph.get());
          }
          return sparkContext.newAPIHadoopRDD(hadoopConfiguration,
                  (Class<InputFormat<NullWritable, VertexWritable>>) hadoopConfiguration.getClass(Constants.GREMLIN_HADOOP_GRAPH_INPUT_FORMAT, InputFormat.class),
                  NullWritable.class,
                  VertexWritable.class)
                  .mapToPair(tuple -> new Tuple2<>(tuple._2().get().id(), new VertexWritable(tuple._2().get())));
      }

      I don't really understand the intended behaviour, so this is probably not the right thing to do. Would the addition of a configuration variable such as "gremlin.hadoop.inputLocationRequired" that defaults to true, and can be set to false for these other input formats work?
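
      To make the question concrete, here is a minimal, self-contained sketch of how such a flag might behave. It is only an illustration under stated assumptions: a plain Map stands in for the Hadoop configuration, resolveLocation is a simplified stand-in for Constants.getSearchGraphLocation (the real method also takes a storage argument), and the key gremlin.hadoop.inputLocationRequired is hypothetical and not an existing TinkerPop setting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class InputLocationSketch {
    static final String INPUT_LOCATION = "gremlin.hadoop.inputLocation";
    // Hypothetical key proposed in this issue; not an existing TinkerPop setting.
    static final String INPUT_LOCATION_REQUIRED = "gremlin.hadoop.inputLocationRequired";

    // Simplified stand-in for Constants.getSearchGraphLocation (assumption):
    // a missing value or the literal "none" yields Optional.empty().
    static Optional<String> resolveLocation(final String configured) {
        if (configured == null || configured.equals("none")) {
            return Optional.empty();
        }
        return Optional.of(configured);
    }

    // Returns a copy of the config with the resolved location set when present.
    // Fails only if no location was found and the flag (default true) demands one.
    static Map<String, String> prepare(final Map<String, String> config) {
        final Map<String, String> prepared = new HashMap<>(config);
        final boolean required =
                Boolean.parseBoolean(config.getOrDefault(INPUT_LOCATION_REQUIRED, "true"));
        final Optional<String> location = resolveLocation(config.get(INPUT_LOCATION));
        if (location.isPresent()) {
            prepared.put(INPUT_LOCATION, location.get());
        } else if (required) {
            throw new IllegalStateException(
                    "No valid " + INPUT_LOCATION + " and " + INPUT_LOCATION_REQUIRED + " is true");
        }
        return prepared;
    }

    public static void main(final String[] args) {
        // A Cassandra-style configuration that opts out of the location check.
        final Map<String, String> config = new HashMap<>();
        config.put(INPUT_LOCATION, "none");
        config.put(INPUT_LOCATION_REQUIRED, "false");
        System.out.println(prepare(config));
    }
}
```

      With the default of true, file-based input formats would keep failing fast on a missing location, while Cassandra or HBase configurations could set the flag to false and skip the check.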

          People

            Assignee: Marko A. Rodriguez (okram)
            Reporter: Dylan Bethune-Waddell (dylanht)