  1. TinkerPop
  2. TINKERPOP-1082

INPUT_RDD and INPUT_FORMAT are bad, we should just have one key.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 3.1.0-incubating
    • Fix Version/s: 3.2.0-incubating
    • Component/s: hadoop
    • Labels:
      None

      Description

      Right now we have two keys for input to a HadoopGraph.

      • gremlin.hadoop.graphInputFormat
      • gremlin.spark.graphInputRDD

      Likewise for output. Because both of these keys can be set at once, the code is full of if/else checks, so I think we should have one and only one input key.

      • gremlin.hadoop.graphInputClass

      Likewise, for output: gremlin.hadoop.graphOutputClass.

      This will make things much simpler. However, it will break current implementations (it is not backwards compatible).

        Activity

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/incubator-tinkerpop/pull/268

        githubbot ASF GitHub Bot added a comment -

        Github user PommeVerte commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/268#issuecomment-198614584

        This is a good change to have VOTE +1

        githubbot ASF GitHub Bot added a comment -

        Github user twilmes commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/268#issuecomment-198541486

        Good simplification of the config: VOTE +1

        githubbot ASF GitHub Bot added a comment -

        Github user okram commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/268#issuecomment-197990762

        Docs have been published: http://tinkerpop.apache.org/docs/3.2.0-SNAPSHOT/reference

        VOTE +1.

        githubbot ASF GitHub Bot added a comment -

        GitHub user okram opened a pull request:

        https://github.com/apache/incubator-tinkerpop/pull/268

        TINKERPOP-1082 & TINKERPOP-1222: Hadoop Configuration Updates

        https://issues.apache.org/jira/browse/TINKERPOP-1082
        https://issues.apache.org/jira/browse/TINKERPOP-1222

        The combination of `gremlin.hadoop.graphInputFormat` and `gremlin.spark.graphInputRDD` was very confusing. Not only did it cause a mess of `[WARN]` messages, it was also awkward because users had to know that one overrode the other. To make this cleaner, I created two new configuration keys, `gremlin.hadoop.graphReader` and `gremlin.hadoop.graphWriter`, each of which can take either an `XXXFormat` or an `XXXRDD`. Internally, Spark/Giraph/etc. know how to tell which is which.
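
As a rough illustration of how a single reader key can accept either kind of class, the sketch below loads the configured class and checks which family it belongs to. It is not the TinkerPop implementation: `FormatLike`, `RddLike`, `DemoFormat`, `DemoRdd`, and `kindOf` are hypothetical stand-ins introduced here so the example has no Hadoop or Spark dependency. Only the key name `gremlin.hadoop.graphReader` comes from the pull request.

```java
import java.util.Properties;

// Illustrative sketch only; none of these types are TinkerPop, Hadoop, or Spark classes.
final class GraphReaderKindSketch {
    /** Hypothetical stand-in for a Hadoop InputFormat-style reader. */
    interface FormatLike {}

    /** Hypothetical stand-in for a Spark InputRDD-style reader. */
    interface RddLike {}

    static final class DemoFormat implements FormatLike {}
    static final class DemoRdd implements RddLike {}

    enum ReaderKind { FORMAT, RDD }

    static final String GRAPH_READER = "gremlin.hadoop.graphReader";

    /** Loads the class named by the single reader key and reports which family it is in. */
    static ReaderKind kindOf(Properties conf) {
        String className = conf.getProperty(GRAPH_READER);
        if (className == null) {
            throw new IllegalStateException(GRAPH_READER + " is not set");
        }
        Class<?> readerClass;
        try {
            readerClass = Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown reader class: " + className, e);
        }
        if (FormatLike.class.isAssignableFrom(readerClass)) {
            return ReaderKind.FORMAT;
        }
        if (RddLike.class.isAssignableFrom(readerClass)) {
            return ReaderKind.RDD;
        }
        throw new IllegalArgumentException(className + " is neither a format nor an RDD");
    }
}
```

With one key there is nothing to override: the engine inspects the class it was given instead of checking which of two keys happens to be set.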

        I also added `gremlin.hadoop.defaultGraphComputer`, which lets users specify a default `GraphComputer` in their properties file. When it is set, `graph.compute()` no longer throws an exception saying to use `graph.compute(class)`.
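
The lookup order described here (an explicit class wins, otherwise the configured default, otherwise an error) might be sketched as follows. This is an assumption-laden illustration, not the `HadoopGraph` code: `DefaultComputerSketch` and `resolve` are hypothetical names, and the exception types are chosen for the example. Only the key `gremlin.hadoop.defaultGraphComputer` is taken from the pull request.

```java
import java.util.Properties;

// Illustrative sketch only; this is not how HadoopGraph is actually implemented.
final class DefaultComputerSketch {
    static final String DEFAULT_COMPUTER = "gremlin.hadoop.defaultGraphComputer";

    /**
     * Returns the explicitly requested class if there is one, otherwise the
     * class named by the default key, otherwise fails with a hint.
     */
    static Class<?> resolve(Properties conf, Class<?> explicit) {
        if (explicit != null) {
            return explicit; // corresponds to graph.compute(class)
        }
        String className = conf.getProperty(DEFAULT_COMPUTER);
        if (className == null) {
            throw new IllegalArgumentException(
                "No default GraphComputer configured; pass the class explicitly");
        }
        try {
            return Class.forName(className); // corresponds to graph.compute()
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }
}
```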

        Both of these changes are backwards compatible. Backwards compatibility is tested via `SparkHadoopGraphProvider`, which flips a coin so that sometimes the old model is used and sometimes the new model is used.

        Finally, I forgot to add docs on `GraphFilter` and they have been added to this PR.

        CHANGELOG

        ```
        • Added `gremlin.hadoop.defaultGraphComputer` so users can use `graph.compute()` with `HadoopGraph`.
        • Added `gremlin.hadoop.graphReader` and `gremlin.hadoop.graphWriter` which can handle `XXXFormats` and `XXXRDDs`.
        • Deprecated `gremlin.hadoop.graphInputFormat`, `gremlin.hadoop.graphOutputFormat`, `gremlin.spark.graphInputRDD`, and `gremlin.spark.graphOutputRDD`.
        ```

        UPDATE

        ```
        Hadoop Configurations
        ++++++++++++++++++

        Note that `gremlin.hadoop.graphInputFormat`, `gremlin.hadoop.graphOutputFormat`, `gremlin.spark.graphInputRDD`, and `gremlin.spark.graphOutputRDD` have all been deprecated. Using them still works, but moving forward, users only need to leverage `gremlin.hadoop.graphReader` and `gremlin.hadoop.graphWriter`. An example properties file snippet is provided below.

        gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
        gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
        gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
        gremlin.hadoop.jarsInDistributedCache=true
        gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
        ```

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1082

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/incubator-tinkerpop/pull/268.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #268


        commit 6411d0d4142770f93fb1a188d7e991ed1b4355f3
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-03-16T22:01:37Z

        gremlin.hadoop.graphReader and gremlin.hadoop.graphWriter are the new configurations replacing gremlin.hadoop.graphInputFormat and spark.graphInputRDD. Now HadoopGraph can handle either RDD or XXXFormats. Cleaner configurations. Backwards compatible. The older keys just map to the new keys inside HadoopConfiguration.
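
One way to picture "the older keys just map to the new keys" is a small normalization pass over the properties. The sketch below is a guess at the shape of such a mapping, not the code in `HadoopConfiguration`: `LegacyKeyMapper` and `normalize` are hypothetical names, and the precedence rules (an explicitly set new key wins, and among legacy keys the first one listed wins) are arbitrary assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Illustrative sketch only; the real mapping lives inside HadoopConfiguration
// and may differ in both mechanism and precedence.
final class LegacyKeyMapper {
    /** Deprecated key -> replacement key. Iteration order decides ties (an assumption). */
    private static final Map<String, String> LEGACY_TO_NEW = new LinkedHashMap<>();

    static {
        LEGACY_TO_NEW.put("gremlin.hadoop.graphInputFormat", "gremlin.hadoop.graphReader");
        LEGACY_TO_NEW.put("gremlin.spark.graphInputRDD", "gremlin.hadoop.graphReader");
        LEGACY_TO_NEW.put("gremlin.hadoop.graphOutputFormat", "gremlin.hadoop.graphWriter");
        LEGACY_TO_NEW.put("gremlin.spark.graphOutputRDD", "gremlin.hadoop.graphWriter");
    }

    /** Returns a copy of the input in which deprecated keys are rewritten to the new keys. */
    static Properties normalize(Properties input) {
        Properties output = new Properties();
        // Copy every non-deprecated key unchanged.
        for (String key : input.stringPropertyNames()) {
            if (!LEGACY_TO_NEW.containsKey(key)) {
                output.setProperty(key, input.getProperty(key));
            }
        }
        // Fill in a new key from a deprecated one only if nothing has set it yet.
        for (Map.Entry<String, String> entry : LEGACY_TO_NEW.entrySet()) {
            String value = input.getProperty(entry.getKey());
            if (value != null && output.getProperty(entry.getValue()) == null) {
                output.setProperty(entry.getValue(), value);
            }
        }
        return output;
    }
}
```

Downstream code then only ever reads `gremlin.hadoop.graphReader` and `gremlin.hadoop.graphWriter`, which is what removes the if/else checks mentioned in the issue description.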

        commit b7f617b383700390128fca53de48f60cda3211fe
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-03-16T22:26:22Z

        fixed up the conf/.properties to use graphReader/graphWriter. Found more areas where inputFormat/outputFormat was still being used. Tested Giraph and it's passing completely now. Need a helper utility that converts any Reader/Writer into an InputFormat or OutputFormat automagically.

        commit 13561b81aa8287c696b8d79befce42f84792f793
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-03-16T22:49:47Z

        ConfUtil does the dirty work of InputRDD or InputFormat conversion to an InputFormat.

        commit 5f53589b487ab918719315db6047233fb13971ae
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-03-17T14:42:57Z

        added gremlin.hadoop.defaultGraphComputer which allows users to specify in their properties file which GraphComputer to use by default. This allows providers that only support one Hadoop-based OLAP engine to 'hard set' the implementation so the syntax is cleaner – graph.compute() vs. graph.compute(GiraphGraphComputer.class). This is backwards compatible. The SparkHadoopGraphProvider has been updated to sometimes use compute() and sometimes use compute(class).

        commit 4a130d9092bc37dac252536280d60158fe75f74c
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-03-17T15:09:16Z

        updated docs on GraphFilter and graphReader/graphWriter.

        commit 5a9f56d53741c985982d2bb13d3d8f31ffb6dd85
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-03-17T15:32:04Z

        gremlin.hadoop.graphInputFormat.hasEdges is now gremlin.hadoop.graphReader.hasEdges. Likewise for graphOutputFormat. Backwards compatible.


        okram Marko A. Rodriguez added a comment -

        I went with gremlin.hadoop.graphReader and gremlin.hadoop.graphWriter as these terms align with the I/O package of graph readers and writers.


          People

          • Assignee:
            okram Marko A. Rodriguez
            Reporter:
            okram Marko A. Rodriguez
          • Votes:
            0
            Watchers:
            2
