Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0-incubating
    • Fix Version/s: 3.1.1-incubating
    • Component/s: hadoop
    • Labels:
      None

      Description

      I think we can completely get away from HDFS for SparkGraphComputer. We will need something like PersistedSideEffectsRDD. Once we do that, users who want to run Spark without Hadoop can do so.

      This raises the question – do we go all the way and support SparkGraph?

        Activity

        githubbot ASF GitHub Bot added a comment -

        GitHub user okram opened a pull request:

        https://github.com/apache/incubator-tinkerpop/pull/192

        TINKERPOP-1033: Store sideEffects as a persisted RDD

        https://issues.apache.org/jira/browse/TINKERPOP-1033

        This is a massive amount of work. Just making sideEffects be stored as persisted RDDs led to a swath of other updates. Here is the list of things:

        • It is now possible for Spark users to completely avoid using HDFS – they simply use `PersistedInputRDD` and `PersistedOutputRDD` for everything.
        • Added a significant amount of testing to ensure that persisted RDDs work as expected in all situations.
        • `InputRDD`s now have a `readMemoryRDD()` method which handles reading sideEffects (i.e. memory).
        • `OutputRDD`s now have a `writeMemoryRDD()` method which handles writing sideEffects (i.e. memory).
        • There is a `Storage` interface in gremlin-core which providers can implement to have "file-system semantics" for their data source. HDFS and Spark both implement it. No more Groovy meta-programming for HDFS! Sweeeeet.
        • With `Storage`, all the file management in both Spark and Giraph is much simpler, as the methods in `Storage` allowed me to gut a lot of (error-prone) code.
        • There is a general test suite which makes sure both HDFS and Spark storage behave "the same."
        • Updated documentation, upgrade docs, and added JavaDoc to `Storage`.
        • The docs for `BulkLoaderVertexProgram` and Spark/Giraph used a `data/` directory. It wasn't consistent with our other examples so I cleaned it up.
        • Fixed a minor bug in `ClusterCountMapReduce`.
        • Cleaned up how HDFS data is streamed – it's pure now, based solely on `InputFormat` behavior (I learned something new in Hadoop).
        • There are a few minor "breaking changes" around `hdfs.methods()`. They are "ok" as HDFS interaction prior to this moment has always been manual via the Gremlin Console.
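As a rough illustration of the first bullet above, the sketch below assembles a HadoopGraph-style configuration in which both the input and the output are persisted RDDs rather than HDFS files. The property keys and fully qualified class names are assumptions based on the 3.1.x naming and are not confirmed by this PR description; check them against the reference docs linked below. `SparkWithoutHdfsConfig` is a hypothetical helper name.

```java
import java.util.Properties;

// Sketch only: every key and class name below is an assumption to be verified
// against the TinkerPop reference documentation for the version in use.
final class SparkWithoutHdfsConfig {

    static Properties build(String inputRddName, String outputRddName) {
        Properties p = new Properties();
        p.setProperty("gremlin.graph",
                "org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph");
        // Read the graph from a named persisted RDD instead of an HDFS file (assumed key).
        p.setProperty("gremlin.spark.graphInputRDD",
                "org.apache.tinkerpop.gremlin.spark.structure.io.PersistedInputRDD");
        // Write the result graph (and its sideEffects) back as persisted RDDs (assumed key).
        p.setProperty("gremlin.spark.graphOutputRDD",
                "org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD");
        // With persisted RDDs, the "locations" act as RDD names rather than file paths.
        p.setProperty("gremlin.hadoop.inputLocation", inputRddName);
        p.setProperty("gremlin.hadoop.outputLocation", outputRddName);
        // Keep the SparkContext alive between jobs so the persisted RDDs survive (assumed key).
        p.setProperty("gremlin.spark.persistContext", "true");
        return p;
    }

    public static void main(String[] args) {
        build("input-graph", "job-output")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The point of the sketch is that nothing in such a configuration names an HDFS path: the input and output locations are simply names under which RDDs are kept in the running SparkContext.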

        I updated the "upgrade" docs:

        http://tinkerpop.apache.org/docs/3.1.1-SNAPSHOT/upgrade/#_storage_i_o

        I updated the "reference" docs:

        http://tinkerpop.apache.org/docs/3.1.1-SNAPSHOT/reference/#_storage_systems

        You can see the JavaDoc for the new `Storage` interface:

        http://tinkerpop.apache.org/javadocs/3.1.1-SNAPSHOT/core/org/apache/tinkerpop/gremlin/structure/io/Storage.html

        I ran integration tests and built and deployed docs successfully.

        VOTE +1.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1033

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/incubator-tinkerpop/pull/192.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #192


        commit f3ebed0bde6ac889640cb136b50b362c5cd2d2ea
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2015-12-09T17:41:09Z

        InputRDD now has readMemoryRDD(). OutputRDD now has writeMemoryRDD(). InputFormatRDD and OutputFormatRDD took the code from SparkExecutor that uses SequenceFiles for output. As such, memory reading/writing has been generalized. Graph system providers that ONLY want to provide Spark support are not required to have HDFS, as SparkServer can maintain all persisted data via graphRDD and memoryRDD. There is still more work to do. More test cases are next.

        commit 58d9240764cd6e1f3779097966c53058264e00e6
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2015-12-09T20:46:43Z

        added Storage to gremlin-core. Storage is an interface that OLAP systems can implement. It provides ls(), rmr(), rm(), etc. type methods that make it easy for users to interact (via a common interface) with the underlying persistence system. Now both HDFS and Spark provide their own Storage implementations and TADA. Really pretty.

        commit 2c0d327c04219de9fdf20444a100d3cb3dd1d221
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2015-12-09T20:48:49Z

        merged master and resolved conflicts from @spmallette's changes to SparkGremlinPlugin and HadoopGremlinPlugin.

        commit b4d8e9608d4eca3ae177b28fe588518a9d77506c
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2015-12-09T22:58:50Z

        Greatly greatly simplified Hadoop OLTP and interactions with HDFS and SparkContext. The trend – dir/~g for graphs and dir/x for memory. A consistent persistence schema makes everything so much simpler. I always assumed this would be all generalized/blah/blah. Never actually did it so, hell, stick with a consistent schema and watch the code just fall away.
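The naming convention in the commit message above (`dir/~g` for the graph, `dir/x` for a memory entry `x`) can be sketched as a pair of path helpers. `PersistenceLayout` and its method names are hypothetical and are not taken from the TinkerPop code base; only the `~g` suffix and the `dir/x` pattern come from the commit message.

```java
// Sketch of the "dir/~g for graphs, dir/x for memory" layout described above.
// Class and method names are illustrative assumptions.
final class PersistenceLayout {

    static final String GRAPH_NAME = "~g";

    // Location of the persisted graph under an output directory.
    static String graphLocation(String dir) {
        return join(dir, GRAPH_NAME);
    }

    // Location of a single memory (sideEffect) entry under the same directory.
    static String memoryLocation(String dir, String memoryKey) {
        return join(dir, memoryKey);
    }

    // Joins with exactly one separator, whether or not dir ends with "/".
    private static String join(String dir, String child) {
        return dir.endsWith("/") ? dir + child : dir + "/" + child;
    }

    public static void main(String[] args) {
        System.out.println(graphLocation("output"));                  // output/~g
        System.out.println(memoryLocation("output", "clusterCount")); // output/clusterCount
    }
}
```

Because every consumer derives locations the same way, code that lists, reads, or deletes a job's output has one rule to follow instead of per-backend special cases.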

        commit 3fff8f546501d10a4c1d34762a626a2493e758be
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2015-12-09T23:57:28Z

        lots more clean up, tests, and organization. She is a real beauty.

        commit 74b9c8ecfe787ead99d79c127fd85a4fccd926ec
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2015-12-10T01:27:29Z

        migrated GiraphGraphComputer over to the new Storage model via FileSystemStorage for HDFS.

        commit 55165a572f5d07e1ca20be13b064843da18fc8e6
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2015-12-10T02:11:33Z

        cleanup HDFS if Persist.NOTHING.

        commit dbd4a5360a75d562df64eecd91cc8c12550adb10
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-01-05T22:54:14Z

        merged master into branch. Minor tweaks given @spmallette new work on TestDirectory stuffs.

        commit 53e57a73aa5316b44d5ef4917347a6ba8934a102
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-01-06T15:02:33Z

        breaking commit. ignore.

        commit b0f3e4a96ced7f45f5e823b9060eac9dd0be1f7e
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-01-06T17:26:46Z

        Storage is complete and has a really cool TestSuite. There are two types of Storage: FileSystemStorage (HDFS) and SparkContextStorage (persisted RDDs). You can ls(), cp(), rm(), rmr(), head(), etc. There is a single abstract test suite called AbstractStorageCheck that confirms that both Spark and HDFS behave the same. Moved around and organized Hadoop test cases given the new developments.
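The idea of one shared check run against two storage backends can be sketched in plain Java. This is not the TinkerPop `Storage` interface or `AbstractStorageCheck`: `MiniStorage`, `InMemoryStorage`, `DirectoryStorage`, and `StorageContractCheck` are hypothetical stand-ins, with an in-memory map playing the role of named persisted RDDs and a temporary directory playing the role of a file system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical, simplified storage abstraction with file-system-like semantics.
interface MiniStorage {
    void put(String name, String content);
    boolean exists(String name);
    List<String> ls();        // names in sorted order
    boolean rm(String name);  // true only if something was removed
}

// Backend 1: named entries held in memory.
final class InMemoryStorage implements MiniStorage {
    private final Map<String, String> entries = new TreeMap<>();
    public void put(String name, String content) { entries.put(name, content); }
    public boolean exists(String name) { return entries.containsKey(name); }
    public List<String> ls() { return new ArrayList<>(entries.keySet()); }
    public boolean rm(String name) { return entries.remove(name) != null; }
}

// Backend 2: one file per entry inside a directory.
final class DirectoryStorage implements MiniStorage {
    private final Path root;
    DirectoryStorage(Path root) { this.root = root; }

    static DirectoryStorage inTempDir() {
        try { return new DirectoryStorage(Files.createTempDirectory("storage-check")); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
    public void put(String name, String content) {
        try { Files.write(root.resolve(name), content.getBytes(StandardCharsets.UTF_8)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
    public boolean exists(String name) { return Files.exists(root.resolve(name)); }
    public List<String> ls() {
        try (Stream<Path> files = Files.list(root)) {
            return files.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
    public boolean rm(String name) {
        try { return Files.deleteIfExists(root.resolve(name)); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
}

// The same assertions run against any backend; a failure means the backends diverge.
final class StorageContractCheck {
    static void check(MiniStorage storage) {
        require(storage.ls().isEmpty(), "new storage should be empty");
        storage.put("b", "2");
        storage.put("a", "1");
        require(storage.ls().equals(Arrays.asList("a", "b")), "ls should be sorted");
        require(storage.exists("a"), "a should exist after put");
        require(storage.rm("a"), "first rm should remove a");
        require(!storage.rm("a"), "second rm should report nothing removed");
        require(storage.ls().equals(Collections.singletonList("b")), "only b should remain");
    }

    private static void require(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    public static void main(String[] args) {
        check(new InMemoryStorage());
        check(DirectoryStorage.inTempDir());
        System.out.println("both backends satisfy the same contract");
    }
}
```

Writing the expectations once and parameterizing over the backend is what makes "behave the same" a testable claim rather than a convention.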

        commit 5c9e81b0cebd8c3841e2442a8ef13b3d23d44295
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-01-06T22:58:18Z

        added documentation, upgrade docs, JavaDoc, more test cases, and fixed up some random inconsistencies in BulkLoaderVertexProgram documentation examples.

        commit a7db52bda732810fc8d5d3a8279a4f7095285d3d
        Author: Marko A. Rodriguez <okrammarko@gmail.com>
        Date: 2016-01-06T23:03:59Z

        Merge branch 'master' into TINKERPOP-1033


        githubbot ASF GitHub Bot added a comment -

        Github user okram commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169528513

        Integration test `BUILD SUCCESSFUL`.

        ```
        [INFO] Apache TinkerPop .................................. SUCCESS [4.800s]
        [INFO] Apache TinkerPop :: Gremlin Shaded ................ SUCCESS [2.300s]
        [INFO] Apache TinkerPop :: Gremlin Core .................. SUCCESS [34.224s]
        [INFO] Apache TinkerPop :: Gremlin Test .................. SUCCESS [11.772s]
        [INFO] Apache TinkerPop :: Gremlin Groovy ................ SUCCESS [32.672s]
        [INFO] Apache TinkerPop :: Gremlin Groovy Test ........... SUCCESS [6.828s]
        [INFO] Apache TinkerPop :: TinkerGraph Gremlin ........... SUCCESS [3:22.133s]
        [INFO] Apache TinkerPop :: Hadoop Gremlin ................ SUCCESS [5:02.151s]
        [INFO] Apache TinkerPop :: Spark Gremlin ................. SUCCESS [4:03.723s]
        [INFO] Apache TinkerPop :: Giraph Gremlin ................ SUCCESS [2:01:35.469s]
        [INFO] Apache TinkerPop :: Neo4j Gremlin ................. SUCCESS [18:06.653s]
        [INFO] Apache TinkerPop :: Gremlin Driver ................ SUCCESS [10.940s]
        [INFO] Apache TinkerPop :: Gremlin Server ................ SUCCESS [11:13.160s]
        [INFO] Apache TinkerPop :: Gremlin Console ............... SUCCESS [1:10.880s]
        [INFO] ------------------------------------------------------------------------
        [INFO] BUILD SUCCESS
        [INFO] ------------------------------------------------------------------------
        [INFO] Total time: 2:46:18.192s
        [INFO] Finished at: Wed Jan 06 19:01:26 MST 2016
        [INFO] Final Memory: 103M/708M
        ```

        githubbot ASF GitHub Bot added a comment -

        Github user dkuppitz commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169694891

        • `mvn clean install`: passed
        • integration tests: passed
        • `bin/process-docs.sh`: passed

        VOTE: +1

        githubbot ASF GitHub Bot added a comment -

        Github user spmallette commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169827767

        This PR drops a few "internal" classes - folks shouldn't have been using those directly, but would it have been better to deprecate those as opposed to just removing completely?

        Seems like deprecation would have worked for:

        • `HadoopLoader`: https://github.com/apache/incubator-tinkerpop/pull/192/files#diff-* 55e3610726b342e666b34223b8270526
        • `HDFSTools`: https://github.com/apache/incubator-tinkerpop/pull/192/files#diff-88ec5bbe9a2817117d62799b9e91a20a
        • `SparkLoader`: https://github.com/apache/incubator-tinkerpop/pull/192/files#diff-0c5aa03b09c929fa6440edb9bacc7e76

        Even though these are pretty low-level classes, it would be nice to stick to the "not breaking (even if they are doing stuff they are not supposed to)" plan, imo.

        githubbot ASF GitHub Bot added a comment -

        Github user okram commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169832850

        Here is the thing. `SparkLoader` was introduced in 3.1.1-SNAPSHOT, so it is okay to drop. `HadoopLoader` is all meta-programming Groovy stuff to get ls(), rm(), etc. to work in Gremlin Console. We can keep the class, but we can't have it loaded or else it will interfere with the new `FileSystemStorage`. However, I say we just drop it. It's so low level and so meta-programmy that nobody should be using it.

        Finally, `HDFSTools`. Again, low level... I can bring that class back, but people really shouldn't be using it. This is like an internal utility and so specific to TinkerPop filesystem stuff. ??

        githubbot ASF GitHub Bot added a comment -

        Github user spmallette commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-169840450

        Yeah - they shouldn't be using them, but we know how that goes... make a class public and someone is gonna use it, to their detriment or otherwise. Anyway, I'd say just do one of the following then:

        1. bring them back and deprecate
        2. add something to upgrade docs to explain their removal - you already have the section - perhaps just a WARNING that explicitly mentions the classes.
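Option 1 above (bring the classes back and deprecate them) usually amounts to keeping the old entry point as a thin shim that delegates to its replacement. A minimal sketch, using hypothetical class names rather than the actual `HDFSTools` or `FileSystemStorage` code:

```java
// Hypothetical replacement class; stands in for whatever the new API is.
final class NewFileStorage {
    static String describe(String path) {
        return "storage:" + path;
    }
}

/**
 * Hypothetical legacy utility kept only for backward compatibility.
 *
 * @deprecated use {@link NewFileStorage} instead; this shim only delegates.
 */
@Deprecated
final class LegacyFileTools {
    private LegacyFileTools() {}

    /** @deprecated use {@link NewFileStorage#describe(String)}. */
    @Deprecated
    static String describe(String path) {
        // No logic lives here any more; existing callers keep compiling and get a warning.
        return NewFileStorage.describe(path);
    }
}

final class DeprecationDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        System.out.println(LegacyFileTools.describe("output/~g"));
    }
}
```

The trade-off against option 2 is maintenance cost: the shim has to stay loadable alongside the new code, which is the concern raised earlier in the thread about `HadoopLoader` interfering with `FileSystemStorage`.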

        githubbot ASF GitHub Bot added a comment -

        Github user spmallette commented on the pull request:

        https://github.com/apache/incubator-tinkerpop/pull/192#issuecomment-170012024

        VOTE: +1

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/incubator-tinkerpop/pull/192


          People

          • Assignee:
            okram Marko A. Rodriguez
            Reporter:
            okram Marko A. Rodriguez
          • Votes:
            0
            Watchers:
            3
