Giraph
  1. Giraph
  2. GIRAPH-36

Ensure that subclassing BasicVertex is possible by user apps

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.1.0
    • Component/s: graph
    • Labels:
      None

      Description

      Original assumptions in Giraph were that all users would subclass Vertex (which extended MutableVertex extended BasicVertex). Classes which wish to have application specific data structures (ie. not a TreeMap<I, Edge<I,E>>) may need to extend either MutableVertex or BasicVertex. Unfortunately VertexRange extends ArrayList<Vertex>, and there are other places where the assumption is that vertex classes are either Vertex, or at least MutableVertex.

      Let's make sure the internal APIs allow for BasicVertex to be the base class.

      1. GIRAPH-36.diff.warnings
        121 kB
        Avery Ching
      2. GIRAPH-36.diff
        122 kB
        Jake Mannix
      3. GIRAPH-36.diff
        124 kB
        Jake Mannix
      4. GIRAPH-36.diff
        122 kB
        Jake Mannix

        Issue Links

          Activity

          Hide
          Jakob Homan added a comment -

          Can GIRAPH-47 be a start for what you have in mind?

          Yeah, that's close to what I'm hoping for. I'll comment there.

          I can certainly imagine this being very nice and clean.

          Inheritance is going to be brittle over the long term, so the cleaner we can make this now, the easier it will be to attract a user base confident in investing in the new platform. This is particularly true given how much we're at the mercy of Java's poor generic system; users already have to grok things like DoubleDoubleFloat in type names.

          Show
          Jakob Homan added a comment - Can GIRAPH-47 be a start for what you have in mind? Yeah, that's close to what I'm hoping for. I'll comment there. I can certainly imagine this being very nice and clean. Inheritance is going to be brittle over the long term, so the cleaner we can make this now, the easier it will be to attract a user base confident in investing in the new platform. This is particularly true given how much we're at the mercy of Java's poor generic system; users already have to grok things like DoubleDoubleFloat in type names.
          Hide
          Jake Mannix added a comment -

          It's tricky, I think. The original Vertex class was exactly meant to hide (via private) some of the implementation details of state, yet leave userland APIs which the user is meant to implement. But then it turned out that the most generic way to hold onto the destination vertices and edge weights is not always the most efficient (hence: primitive-specific subclasses).

          But maybe you're just saying that we should pull out most of this "implementation specific" state into other objects, decompose Vertex a bit, and have users be able to pluggably implement the parts which need to be made special, and leave as generic those which are generic?

          Some stuff we already have done: GraphState encapsulates all global state (current superstep, number of global edges + vertices, etc). Local state could be similarly pulled off into a couple of different data structures (edges with weights, messages).

          I can certainly imagine this being very nice and clean.

          Open a ticket and describe some thoughts on how it could look in practice? Sounds like it could be another pretty significant refactoring, so let's do it sooner rather than later!

          Show
          Jake Mannix added a comment - It's tricky, I think. The original Vertex class was exactly meant to hide (via private) some of the implementation details of state, yet leave userland APIs which the user is meant to implement. But then it turned out that the most generic way to hold onto the destination vertices and edge weights is not always the most efficient (hence: primitive-specific subclasses). But maybe you're just saying that we should pull out most of this "implementation specific" state into other objects, decompose Vertex a bit, and have users be able to pluggably implement the parts which need to be made special, and leave as generic those which are generic? Some stuff we already have done: GraphState encapsulates all global state (current superstep, number of global edges + vertices, etc). Local state could be similarly pulled off into a couple of different data structures (edges with weights, messages). I can certainly imagine this being very nice and clean. Open a ticket and describe some thoughts on how it could look in practice? Sounds like it could be another pretty significant refactoring, so let's do it sooner rather than later!
          Hide
          Claudio Martella added a comment -

          Hi Jakob,

          as a matter of fact, I believe, part of your proposal is addressed by GIRAPH-47.
          Now that Jake has committed this work, I'll sync and complete that patch to the current trunk (I was out of office for conference the whole last week).

          Can GIRAPH-47 be a start for what you have in mind?

          Show
          Claudio Martella added a comment - Hi Jakob, as a matter of fact, I believe, part of your proposal is addressed by GIRAPH-47 . Now that Jake has committed this work, I'll sync and complete that patch to the current trunk (I was out of office for conference the whole last week). Can GIRAPH-47 be a start for what you have in mind?
          Hide
          Jakob Homan added a comment -

          Great work, Jake. One thing I notice while reading through Vertex is muddling of the state the Vertex is responsible for on a per-application basis and the state Giraph manages for the vertex. I think we may being ill-served by inheritance here and should instead rely on composition to hold this state. I'm thinking that messages in/out, edge state and mutation, facilities for sending messages, current superstep, etc. Would it be better in the long term to move these out to a context object (ala MR)? This would simplify Vertex significantly, make it much easier to test (by mocking out the dependency) and future proof the evolution of Vertex as there will be fewer moving parts to keep track of.

          Does changing compute and

          {pre|post} {Superstep|Application}

          took their external state as parameters with sound like a good approach to explore?

          Show
          Jakob Homan added a comment - Great work, Jake. One thing I notice while reading through Vertex is muddling of the state the Vertex is responsible for on a per-application basis and the state Giraph manages for the vertex. I think we may being ill-served by inheritance here and should instead rely on composition to hold this state. I'm thinking that messages in/out, edge state and mutation, facilities for sending messages, current superstep, etc. Would it be better in the long term to move these out to a context object (ala MR)? This would simplify Vertex significantly, make it much easier to test (by mocking out the dependency) and future proof the evolution of Vertex as there will be fewer moving parts to keep track of. Does changing compute and {pre|post} {Superstep|Application} took their external state as parameters with sound like a good approach to explore?
          Hide
          Avery Ching added a comment -

          I committed for Jake and Hudson seems happy. Resolving.

          Show
          Avery Ching added a comment - I committed for Jake and Hudson seems happy. Resolving.
          Hide
          Hudson added a comment -

          Integrated in Giraph-trunk-Commit #22 (See https://builds.apache.org/job/Giraph-trunk-Commit/22/)
          GIRAPH-36: Ensure that subclassing BasicVertex is possible by user
          apps. (jake.mannix via aching)

          aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1196639
          Files :

          • /incubator/giraph/trunk/CHANGELOG
          • /incubator/giraph/trunk/pom.xml
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedService.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/VertexList.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimplePageRankVertex.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleShortestPathsVertex.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexResolver.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Edge.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphState.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MutableVertex.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Vertex.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexInputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRange.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexReader.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexResolver.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/AdjacencyListTextVertexOutputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/AdjacencyListVertexReader.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/JsonBase64VertexInputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/LongDoubleDoubleAdjacencyListVertexInputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/SequenceFileVertexInputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/TextDoubleDoubleAdjacencyListVertexInputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/TextVertexInputFormat.java
          • /incubator/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperManager.java
          • /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestBspBasic.java
          • /incubator/giraph/trunk/src/test/java/org/apache/giraph/lib/TestAdjacencyListTextVertexOutputFormat.java
          • /incubator/giraph/trunk/src/test/java/org/apache/giraph/lib/TestLongDoubleDoubleAdjacencyListVertexInputFormat.java
          • /incubator/giraph/trunk/src/test/java/org/apache/giraph/lib/TestTextDoubleDoubleAdjacencyListVertexInputFormat.java
          Show
          Hudson added a comment - Integrated in Giraph-trunk-Commit #22 (See https://builds.apache.org/job/Giraph-trunk-Commit/22/ ) GIRAPH-36 : Ensure that subclassing BasicVertex is possible by user apps. (jake.mannix via aching) aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1196639 Files : /incubator/giraph/trunk/CHANGELOG /incubator/giraph/trunk/pom.xml /incubator/giraph/trunk/src/main/java/org/apache/giraph/benchmark/PseudoRandomVertexInputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/bsp/CentralizedService.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/VertexList.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/WorkerCommunications.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexInputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/GeneratedVertexReader.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimplePageRankVertex.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleShortestPathsVertex.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/examples/SimpleSuperstepVertex.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertex.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BasicVertexResolver.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspService.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceMaster.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspServiceWorker.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/BspUtils.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Edge.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphMapper.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/GraphState.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/LongDoubleFloatDoubleVertex.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/MutableVertex.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/Vertex.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexInputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexRange.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexReader.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/graph/VertexResolver.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/AdjacencyListTextVertexOutputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/AdjacencyListVertexReader.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/JsonBase64VertexInputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/LongDoubleDoubleAdjacencyListVertexInputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/SequenceFileVertexInputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/TextDoubleDoubleAdjacencyListVertexInputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/lib/TextVertexInputFormat.java /incubator/giraph/trunk/src/main/java/org/apache/giraph/zk/ZooKeeperManager.java /incubator/giraph/trunk/src/test/java/org/apache/giraph/TestBspBasic.java /incubator/giraph/trunk/src/test/java/org/apache/giraph/lib/TestAdjacencyListTextVertexOutputFormat.java /incubator/giraph/trunk/src/test/java/org/apache/giraph/lib/TestLongDoubleDoubleAdjacencyListVertexInputFormat.java /incubator/giraph/trunk/src/test/java/org/apache/giraph/lib/TestTextDoubleDoubleAdjacencyListVertexInputFormat.java
          Hide
          Jake Mannix added a comment -

          Excellent, looks good to me.
          +1 for committing we can iterate from here is the main point. The sooner this gets in, the fewer other patches out there get broken.

          Show
          Jake Mannix added a comment - Excellent, looks good to me. +1 for committing we can iterate from here is the main point. The sooner this gets in, the fewer other patches out there get broken.
          Hide
          Avery Ching added a comment -

          Revised version of Jake's patch, addressing the warnings in src/main. Some minor cleanup.

          Show
          Avery Ching added a comment - Revised version of Jake's patch, addressing the warnings in src/main. Some minor cleanup.
          Hide
          Avery Ching added a comment -

          +1.
          Jake, just re-ran with your new patch. Passed both local and MR unittests. I noticed some warnings and tried to address them (some aren't your fault, from an earlier patch). I'm uploading a diff of your diff. Take a look and if it's fine I'll commit on your behalf. I didn't address the warnings in src/test/ava/org/apache/giraph/lib with mockito since I don't understand it as well as I should, but we should knock out those warnings when we can.

          Show
          Avery Ching added a comment - +1. Jake, just re-ran with your new patch. Passed both local and MR unittests. I noticed some warnings and tried to address them (some aren't your fault, from an earlier patch). I'm uploading a diff of your diff. Take a look and if it's fine I'll commit on your behalf. I didn't address the warnings in src/test/ava/org/apache/giraph/lib with mockito since I don't understand it as well as I should, but we should knock out those warnings when we can.
          Hide
          Jake Mannix added a comment -

          New patch. Nicer formatting, removed some more superfluous references to GraphState

          Show
          Jake Mannix added a comment - New patch. Nicer formatting, removed some more superfluous references to GraphState
          Hide
          Avery Ching added a comment -

          Since this is a somewhat large change, can you also run the tests on a hadoop instance (local is fine). I.e. 'mvn test -Dprop.mapred.job.tracker=localhost:50300'. I'll take a look at your diff tonight.

          Show
          Avery Ching added a comment - Since this is a somewhat large change, can you also run the tests on a hadoop instance (local is fine). I.e. 'mvn test -Dprop.mapred.job.tracker=localhost:50300'. I'll take a look at your diff tonight.
          Hide
          Jake Mannix added a comment -

          updated with formatting changes and hiding GraphState

          Show
          Jake Mannix added a comment - updated with formatting changes and hiding GraphState
          Hide
          Jake Mannix added a comment -

          Ok, got the setGraphState() stuff out, tests pass, with a little messing around.

          So formatting: I tracked down all of the stray "if(" bits, but do you know of any easy way to find files which have mixed spacing? I couldn't think of a good regex?

          Also, regarding committing: I never heard back from you or Owen about getting myself actually getting commit privileges. I never heard from infra@, etc.

          Show
          Jake Mannix added a comment - Ok, got the setGraphState() stuff out, tests pass, with a little messing around. So formatting: I tracked down all of the stray "if(" bits, but do you know of any easy way to find files which have mixed spacing? I couldn't think of a good regex? Also, regarding committing: I never heard back from you or Owen about getting myself actually getting commit privileges. I never heard from infra@, etc.
          Hide
          Jake Mannix added a comment -

          Hmmm... removing access to GraphState in userland code also makes it harder to unit test, as you have to do a bunch of setAccessible() calls, or convoluted mocking... oh well.

          Show
          Jake Mannix added a comment - Hmmm... removing access to GraphState in userland code also makes it harder to unit test, as you have to do a bunch of setAccessible() calls, or convoluted mocking... oh well.
          Hide
          Jake Mannix added a comment -

          setGraphState is called in GraphMapper, yes, but there are other places vertices get instantiated:

          MutableVertex#instantiateVertex()
          VertexResolver#instantiateVertex()

          In both of these classes, we have a handle on GraphState, so I guess we can just set it right after. I'll try that and see if there are any issues.

          Show
          Jake Mannix added a comment - setGraphState is called in GraphMapper, yes, but there are other places vertices get instantiated: MutableVertex#instantiateVertex() VertexResolver#instantiateVertex() In both of these classes, we have a handle on GraphState, so I guess we can just set it right after. I'll try that and see if there are any issues.
          Hide
          Avery Ching added a comment -

          Regarding 1) I think it would be safer for use to take away setGraphState() from createVertex to avoid anyone mucking with the GraphState in the VertexInputFormat. It's kind of a nasty thing and would be preferable for users to not know about it. Actually, in our current code, we call setGraphState prior to processing each vertex anyway (see GraphMapper.java), so the graph state not being set shouldn't be an issue.

          As far as 2), I really am curious to know what the use case for it is. I personally think it's a little weird. Currently, we don't do any processing on messages for superstep -1 (input superstep). So I guess the messages would be processed in superstep 0?

          Anyway, once we figure out 1), please work on your formatting (i.e. a couple of if(cond) instead of if (cond) and at least keep spaces consistent per file. Then we can commit it. Oh, and if there are any warnings, please take care of them. I picked up a lot of warnings from a recent Giraph commit...

          Show
          Avery Ching added a comment - Regarding 1) I think it would be safer for use to take away setGraphState() from createVertex to avoid anyone mucking with the GraphState in the VertexInputFormat. It's kind of a nasty thing and would be preferable for users to not know about it. Actually, in our current code, we call setGraphState prior to processing each vertex anyway (see GraphMapper.java), so the graph state not being set shouldn't be an issue. As far as 2), I really am curious to know what the use case for it is. I personally think it's a little weird. Currently, we don't do any processing on messages for superstep -1 (input superstep). So I guess the messages would be processed in superstep 0? Anyway, once we figure out 1), please work on your formatting (i.e. a couple of if(cond) instead of if (cond) and at least keep spaces consistent per file. Then we can commit it. Oh, and if there are any warnings, please take care of them. I picked up a lot of warnings from a recent Giraph commit...
          Hide
          Jake Mannix added a comment -

          1) BspUtils.createVertex(Configuration conf, GraphState<I,V,E,M> graphState) requires access to the GraphState for instantiation, currently. We could avoid it by taking that setGraphState() away from that method and leaving it in wherever it gets first used (GraphMapper?), but why not be safe, and always set it right after instantiation, so you know that there's no other place where someone decides to do BspUtils.createVertex(), but forgets to then setGraphState() on it.

          2) I really don't know whether it makes sense to be able to instantiate "in-flight" messages with vertices. I just wanted to future-proof the API a little bit by allowing for the possibility. I'm fine either way.

          Show
          Jake Mannix added a comment - 1) BspUtils.createVertex(Configuration conf, GraphState<I,V,E,M> graphState) requires access to the GraphState for instantiation, currently. We could avoid it by taking that setGraphState() away from that method and leaving it in wherever it gets first used (GraphMapper?), but why not be safe, and always set it right after instantiation, so you know that there's no other place where someone decides to do BspUtils.createVertex(), but forgets to then setGraphState() on it. 2) I really don't know whether it makes sense to be able to instantiate "in-flight" messages with vertices. I just wanted to future-proof the API a little bit by allowing for the possibility. I'm fine either way.
          Hide
          Avery Ching added a comment -

          Jake,

          Your suggestion on 1) will work, but couldn't we also just use the infrastructure to set graph state after getCurrentVertex()? There is no need to use the GraphState state at that point of the application. It potentially will prevent users from making any mistakes and complicating GraphState, which is owned by the Worker.

          Per 2), this is a downside of changing the interface I think. I guess we will have to live with the additional M type. It's not a big deal to me. That being said, we should prevent users from initializing vertices with messages from the vertex reader since I don't think that makes any sense...or does it?

          Show
          Avery Ching added a comment - Jake, Your suggestion on 1) will work, but couldn't we also just use the infrastructure to set graph state after getCurrentVertex()? There is no need to use the GraphState state at that point of the application. It potentially will prevent users from making any mistakes and complicating GraphState, which is owned by the Worker. Per 2), this is a downside of changing the interface I think. I guess we will have to live with the additional M type. It's not a big deal to me. That being said, we should prevent users from initializing vertices with messages from the vertex reader since I don't think that makes any sense...or does it?
          Hide
          Jake Mannix added a comment -

          s/bit/big/

          Show
          Jake Mannix added a comment - s/bit/big/
          Hide
          Jake Mannix added a comment -

          Yeah, sorry it's so bit, Hyunsik - GIRAPH-28 kinda turned into a bit of a yak-shaving affair, I know!

          Show
          Jake Mannix added a comment - Yeah, sorry it's so bit, Hyunsik - GIRAPH-28 kinda turned into a bit of a yak-shaving affair, I know!
          Hide
          Hyunsik Choi added a comment -

          Looks great!

          Actually, I need more time to fully keep up with this patch.
          First of all, I have executed unit tests on real hadoop cluster running on local host.
          All tests are passed!

          Show
          Hyunsik Choi added a comment - Looks great! Actually, I need more time to fully keep up with this patch. First of all, I have executed unit tests on real hadoop cluster running on local host. All tests are passed!
          Hide
          Jake Mannix added a comment -

          Regarding 1) - I guess I can see how VertexReader could be an abstract base class which hangs onto the GraphState, and instantiates the BasicVertex instances for its subclasses, using the private GraphState reference, and passing it to an abstract initializeVertex(BasicVertex<I,V,E,M> v) method to be populated via the (newly added in this patch) BasicVertex#initialize(<lotso args>) method.

          Thinking further on 2) - am I just not seeing how we could easily make VertexReader act like an Iterator / factory for BasicVertex instances without typing it by the type of the thing it's to instantiate?

          Show
          Jake Mannix added a comment - Regarding 1) - I guess I can see how VertexReader could be an abstract base class which hangs onto the GraphState, and instantiates the BasicVertex instances for its subclasses, using the private GraphState reference, and passing it to an abstract initializeVertex(BasicVertex<I,V,E,M> v) method to be populated via the (newly added in this patch) BasicVertex#initialize(<lotso args>) method. Thinking further on 2) - am I just not seeing how we could easily make VertexReader act like an Iterator / factory for BasicVertex instances without typing it by the type of the thing it's to instantiate?
          Hide
          Jake Mannix added a comment -

          Hey Avery,

          Glad you could get it to all work (unit-test-wise) for you - did you run the unit tests talking to a real cluster, or just localhost? I haven't tried the former yet...

          Regarding the comments: I'm fine keeping GraphState hidden to the implementation details, but will it always work in practice? I mean, all of the state in there used to be global static state in Vertex, and all subclasses had full read/write access to it. We've segregated it in a non-static object which we're treating as a singleton (but aren't forcing it to be one programmaticly), but can we really hide it?

          For example, BasicVertex#sendMsg() requires the WorkerCommunications, which it gets from the GraphMapper, which it gets from the GraphState. Do you think we should implement all the things we think the user will need from the GraphMapper directly in BasicVertex (or MutableVertex), and remove the getGraphState() method entirely, or just keep it package private for now.

          In general, I like keeping impl details away from developers too, but this early in the project, we also shouldn't hamstring ourselves by thinking there is a difference from a "casual user" of the system, and someone willing to hack around a bit.

          2) If VertexReader is going to return a BasicVertex<I,V,E,M> from getCurrentVertex() to callers who really need it to be fully specified nice and type-safely, what choice do we have? From a pratical, rather than API standpoint, is it possible we'd someday want to be able to read up a collection of vertices with initial messages ready to be processed? Not sure if that matters.

          3) I tried, and can clean up as possible, but I really rely on my IDE settings for that. Not being able to set autoformat on means I'm pretty doomed, but if you see places where I can clean up manually, I can do that, sure.

          Show
          Jake Mannix added a comment - Hey Avery, Glad you could get it to all work (unit-test-wise) for you - did you run the unit tests talking to a real cluster, or just localhost? I haven't tried the former yet... Regarding the comments: I'm fine keeping GraphState hidden to the implementation details, but will it always work in practice? I mean, all of the state in there used to be global static state in Vertex, and all subclasses had full read/write access to it. We've segregated it in a non-static object which we're treating as a singleton (but aren't forcing it to be one programmaticly), but can we really hide it? For example, BasicVertex#sendMsg() requires the WorkerCommunications, which it gets from the GraphMapper, which it gets from the GraphState. Do you think we should implement all the things we think the user will need from the GraphMapper directly in BasicVertex (or MutableVertex), and remove the getGraphState() method entirely, or just keep it package private for now. In general, I like keeping impl details away from developers too, but this early in the project, we also shouldn't hamstring ourselves by thinking there is a difference from a "casual user" of the system, and someone willing to hack around a bit. 2) If VertexReader is going to return a BasicVertex<I,V,E,M> from getCurrentVertex() to callers who really need it to be fully specified nice and type-safely, what choice do we have? From a pratical, rather than API standpoint, is it possible we'd someday want to be able to read up a collection of vertices with initial messages ready to be processed? Not sure if that matters. 3) I tried, and can clean up as possible, but I really rely on my IDE settings for that. Not being able to set autoformat on means I'm pretty doomed, but if you see places where I can clean up manually, I can do that, sure.
          Hide
          Avery Ching added a comment -

          Tried out your stuff Jake. Unittests passed for me. I took a quick look at it and was happy to see the new input format style interfaces (nextVertex() and getCurrentVertex()). Also very cool to see LongDoubleFloatDoubleVertex in action as a proof-of-concept non-Vertex implementation. Couple of questions/comments.

          1) I would personally prefer that GraphState not be exposed to developers/users from VertexReader. We can always set the GraphState from the infrastructure after the vertices have been read.

          2) Does VertexReader really need the message type M? There aren't any messages at that point and complicates things a bit.

          3) I know it's not ready for primetime as you mentioned, but while we currently have two styles of formatting, hopefully we can at least keep files consistent until one of us fixes everything to the new convention.

          Show
          Avery Ching added a comment - Tried out your stuff Jake. Unittests passed for me. I took a quick look at it and was happy to see the new input format style interfaces (nextVertex() and getCurrentVertex()). Also very cool to see LongDoubleFloatDoubleVertex in action as a proof-of-concept non-Vertex implementation. Couple of questions/comments. 1) I would personally prefer that GraphState not be exposed to developers/users from VertexReader. We can always set the GraphState from the infrastructure after the vertices have been read. 2) Does VertexReader really need the message type M? There aren't any messages at that point and complicates things a bit. 3) I know it's not ready for primetime as you mentioned, but while we currently have two styles of formatting, hopefully we can at least keep files consistent until one of us fixes everything to the new convention.
          Hide
          Jake Mannix added a comment -

          Why does JIRA have a "attach file" action, which allows comments, and then a totally separate "submit patch" action, which also allows comments, but does not let you actually submit a file?

          Show
          Jake Mannix added a comment - Why does JIRA have a "attach file" action, which allows comments, and then a totally separate "submit patch" action, which also allows comments, but does not let you actually submit a file?
          Hide
          Jake Mannix added a comment -

          This patch is gigantic, sorry. It resolves this current ticket by simultaneously resolving GIRAPH-28 as a proof-of-concept:

          Introduces a primitive-only BasicVertex subclass which is not a MutableVertex, let alone a subclass of Vertex. In particular, to verify nontriviality, it actually replaces the SimplePageRankVertex with a subclass of LongDoubleFloatDoubleVertex (the aforemetioned example primitive vertex).

          This required some extensive refactoring, all over the place. Unit tests currently pass on my box, but I doubt that this patch is really ready for prime-time.

          Try it out!

          (Note: code style is probably also a mess, again, sorry. Couldn't find a checksyle set up anywhere with the new guidelines... and it looks like most of the old code still uses the old whitespace rules anyways...)

          Show
          Jake Mannix added a comment - This patch is gigantic, sorry. It resolves this current ticket by simultaneously resolving GIRAPH-28 as a proof-of-concept: Introduces a primitive-only BasicVertex subclass which is not a MutableVertex, let alone a subclass of Vertex. In particular, to verify nontriviality, it actually replaces the SimplePageRankVertex with a subclass of LongDoubleFloatDoubleVertex (the aforemetioned example primitive vertex). This required some extensive refactoring, all over the place. Unit tests currently pass on my box, but I doubt that this patch is really ready for prime-time. Try it out! (Note: code style is probably also a mess, again, sorry. Couldn't find a checksyle set up anywhere with the new guidelines... and it looks like most of the old code still uses the old whitespace rules anyways...)
          Hide
          Jake Mannix added a comment -

          Hmm... on further thought: passing getRecordReader().getCurrentValue() requires the parsing happen in the Vertex implementation. That work is the primary job of the VertexReader, so I think we're going to need something like:

          class BasicVertex<I,V,E,M> {
            abstract void initialize(I id, V val, Map<I,E> edges, Iterable<M> message);
          }
          

          Of course, we could also just use gasp a constructor?

          Show
          Jake Mannix added a comment - Hmm... on further thought: passing getRecordReader().getCurrentValue() requires the parsing happen in the Vertex implementation. That work is the primary job of the VertexReader, so I think we're going to need something like: class BasicVertex<I,V,E,M> { abstract void initialize(I id, V val, Map<I,E> edges, Iterable<M> message); } Of course, we could also just use gasp a constructor?
          Hide
          Jake Mannix added a comment -

          Ok, I'm digging into this again, finally... man patches get pretty stale if left unattended in the Incubator.

          So, the idea with keeping BasicVertex which is not mutable is not for performance, but for implementation ease / coding safety: forcing users to implement mutating methods on their graph which doesn't allow such things is unfair to the users of this library, and having them throw UnsupportedOperationException should be avoided because then lots of problems are only seen at runtime, and then when the project moves forward and adds some innocuous thing which passes all unit tests, but ends up calling some mutator method that wasn't called before (maybe as part of a "post-initialization check of correctness, or something"), the users who are throwing exceptions in these methods only find out when they run their jobs which used to succeed, and now die with mysterious UOE getting thrown in new methods nowhere near their code. Bad news, if it can be avoided.

          So I'm in favor of a) over b), in theory. It's closest to the way things are done in hadoop, and seems the cleanest, from a purity of API standpoint.

          However, having dug into the code a bit, it's going to be hairy to get a) to work easily, now that there is a proliferation of VertexInputFormat stuff that Jakob has been committing lately (yay! but boo. ) - all of this code would also need to be retrofitted to match.

          So I'm going to take a stab at b). It breaks the API least, and while it appears to allow mutation, it's pretty clear to the user that it's not really, and we can forcibly throw exceptions on multiple use, for example, to keep it from being reused.

          Show
          Jake Mannix added a comment - Ok, I'm digging into this again, finally... man patches get pretty stale if left unattended in the Incubator. So, the idea with keeping BasicVertex which is not mutable is not for performance, but for implementation ease / coding safety: forcing users to implement mutating methods on their graph which doesn't allow such things is unfair to the users of this library, and having them throw UnsupportedOperationException should be avoided because then lots of problems are only seen at runtime, and then when the project moves forward and adds some innocuous thing which passes all unit tests, but ends up calling some mutator method that wasn't called before (maybe as part of a "post-initialization check of correctness, or something"), the users who are throwing exceptions in these methods only find out when they run their jobs which used to succeed, and now die with mysterious UOE getting thrown in new methods nowhere near their code. Bad news, if it can be avoided. So I'm in favor of a) over b), in theory. It's closest to the way things are done in hadoop, and seems the cleanest, from a purity of API standpoint. However, having dug into the code a bit, it's going to be hairy to get a) to work easily, now that there is a proliferation of VertexInputFormat stuff that Jakob has been committing lately (yay! but boo. ) - all of this code would also need to be retrofitted to match. So I'm going to take a stab at b). It breaks the API least, and while it appears to allow mutation, it's pretty clear to the user that it's not really, and we can forcibly throw exceptions on multiple use, for example, to keep it from being reused.
          Hide
          Claudio Martella added a comment -

          of course i meant either (a) or (b)

          Show
          Claudio Martella added a comment - of course i meant either (a) or (b)
          Hide
          Claudio Martella added a comment -

          I understand. well, although I don't see how it could be really optimized on the fact a class doesn't present a subset of the methods (the mutating methods) on performance, I could understand on a principle.

          Anyway I think it could make sense to follow your distinction between mutation before and after instantiation, and I believe a good way of doing this could be:

          (a) initialize the object through Writable interface readField() at instantiation. This would allow us basically to re-use code we already have (readField() implementation) and not break any underlying assumption about mutation. The bad part is that people would have to provide storage on DataInputStream instead of the current VertexInputFormat.

          (b) explicitly create initialize() method to do the "pre instantiation" init code, which would be added to BasicVertex. This code would be called inside of VertexInputFormat.next() for instance by passing the content of RecordReader.getCurrentValue(), i.e. vertex.initialize(getRecordReader().getCurrentValue()). This is probably the cleanest solution as it doesn't require to change the current API and also quite matches with the Factory-based current creation of vertices.

          Show
          Claudio Martella added a comment - I understand. well, although I don't see how it could be really optimized on the fact a class doesn't present a subset of the methods (the mutating methods) on performance, I could understand on a principle. Anyway I think it could make sense to follow your distinction between mutation before and after instantiation, and I believe a good way of doing this could be: (a) initialize the object through Writable interface readField() at instantiation. This would allow us basically to re-use code we already have (readField() implementation) and not break any underlying assumption about mutation. The bad part is that people would have to provide storage on DataInputStream instead of the current VertexInputFormat. (b) explicitly create initialize() method to do the "pre instantiation" init code, which would be added to BasicVertex. This code would be called inside of VertexInputFormat.next() for instance by passing the content of RecordReader.getCurrentValue(), i.e. vertex.initialize(getRecordReader().getCurrentValue()). This is probably the cleanest solution as it doesn't require to change the current API and also quite matches with the Factory-based current creation of vertices.
          Hide
          Jake Mannix added a comment -

          Hey Claudio,

          The reason why this has been slow going is exactly your last line: MutableVertex is a bit of a beast - it's essentially assumed that all vertex implementations are a subclass of Vertex extends MutableVertex, which is the whole point of this ticket. The readField() method and the VertexInputFormats can be modified, and that's what I've been working on: they essentially are designed currently to pretend you're reading from raw Text and calling setters on the (Mutable)Vertex you've instantiated. But the more sensible thing is to have the BasicVertex implement Writable, which lets it read in data in BspServiceWorker, but also implement an interface which lets it read lines of Text, say, and initialize that way, if it's not a MutableVertex.

          Its a little ugly, maybe there are better ways.

          You were suggesting just force everyone to be MutableVertex? I think the idea of having a distinction is good - there may be optimizations possible in the case where you know your graph will never change in structure after instantiation (a very common case, for most graph algorithms). The trick is the "after instantiation" part.

          Show
          Jake Mannix added a comment - Hey Claudio, The reason why this has been slow going is exactly your last line: MutableVertex is a bit of a beast - it's essentially assumed that all vertex implementations are a subclass of Vertex extends MutableVertex, which is the whole point of this ticket. The readField() method and the VertexInputFormats can be modified, and that's what I've been working on: they essentially are designed currently to pretend you're reading from raw Text and calling setters on the (Mutable)Vertex you've instantiated. But the more sensible thing is to have the BasicVertex implement Writable, which lets it read in data in BspServiceWorker, but also implement an interface which lets it read lines of Text, say, and initialize that way, if it's not a MutableVertex. Its a little ugly, maybe there are better ways. You were suggesting just force everyone to be MutableVertex? I think the idea of having a distinction is good - there may be optimizations possible in the case where you know your graph will never change in structure after instantiation (a very common case, for most graph algorithms). The trick is the "after instantiation" part.
          Hide
          Claudio Martella added a comment -

          I've been working on contributing to this patch. I basically substituted everywhere the system expects Vertex a BasicVertex. This does work until I reach BspServiceWorker.loadVertexRange() where vertex.readField() is used to restore a checkpoint. The problem is that Writable is implemented by MutableVertex instead of BasicVertex. This problem is also present in VertexInputFormats where we would call mutating methods while creating the vertex, so we cannot really rely on BasicVertex. If we use MutableVertex on that API, then we require the user apps to extends MutableVertex.

          So I guess the idea of this refactor is either to switch to MutableVertex, create a Vertex that is not finalized or just unify the whole thing.

          What do you think?

          Show
          Claudio Martella added a comment - I've been working on contributing to this patch. I basically substituted everywhere the system expects Vertex a BasicVertex. This does work until I reach BspServiceWorker.loadVertexRange() where vertex.readField() is used to restore a checkpoint. The problem is that Writable is implemented by MutableVertex instead of BasicVertex. This problem is also present in VertexInputFormats where we would call mutating methods while creating the vertex, so we cannot really rely on BasicVertex. If we use MutableVertex on that API, then we require the user apps to extends MutableVertex. So I guess the idea of this refactor is either to switch to MutableVertex, create a Vertex that is not finalized or just unify the whole thing. What do you think?
          Hide
          Claudio Martella added a comment -

          Hey Jake, did you get any chance to fix this one lately?

          This is quite blocking for me, I'm based on BasicVertex. Can I help out somehow?

          Show
          Claudio Martella added a comment - Hey Jake, did you get any chance to fix this one lately? This is quite blocking for me, I'm based on BasicVertex. Can I help out somehow?
          Hide
          Jake Mannix added a comment -

          Looking into this a bit further, I think I do like the idea of:

          VertexReader:

          boolean nextVertex() throws IOException, InterruptedException;
          BasicVertex<I, V, E, M> getCurrentVertex() throws IOException, InterruptedException;

          The trick is that now VertexReader is the one doing the instantiation of the BasicVertex instances, which means it will need a handle on the GraphState, but that's fine - the only place where the VertexInputFormat is currently instantiated is in BspServiceMaster and BspServiceWorker, where there is a handle on the GraphState anyways.

          I'm just going to try and finish coding this up with the patch to GIRAPH-28 also applied, so I have a proven use-case already in hand to verify it makes sense / works on a cluster.

          Show
          Jake Mannix added a comment - Looking into this a bit further, I think I do like the idea of: VertexReader: boolean nextVertex() throws IOException, InterruptedException; BasicVertex<I, V, E, M> getCurrentVertex() throws IOException, InterruptedException; The trick is that now VertexReader is the one doing the instantiation of the BasicVertex instances, which means it will need a handle on the GraphState, but that's fine - the only place where the VertexInputFormat is currently instantiated is in BspServiceMaster and BspServiceWorker, where there is a handle on the GraphState anyways. I'm just going to try and finish coding this up with the patch to GIRAPH-28 also applied, so I have a proven use-case already in hand to verify it makes sense / works on a cluster.
          Hide
          Jake Mannix added a comment -

          Oh, I'm not saying we shouldn't have I,V,E,M all implement Writable - we should, and that makes it really easy to implement Writable in BasicVertex.

          Show
          Jake Mannix added a comment - Oh, I'm not saying we shouldn't have I,V,E,M all implement Writable - we should, and that makes it really easy to implement Writable in BasicVertex.
          Hide
          Avery Ching added a comment -

          There is an out-of-core part as well though (checkpointing). Forcing the types (I, V, E, M) to implement Writable seemed like a nice easy way for Hadoop users to jump into Giraph. Also, it provides us a lot of reusable objects (IntWritable, FloatWritable, DoubleWritable, etc.). I am open to other ideas of course. Let me know your thoughts.

          Show
          Avery Ching added a comment - There is an out-of-core part as well though (checkpointing). Forcing the types (I, V, E, M) to implement Writable seemed like a nice easy way for Hadoop users to jump into Giraph. Also, it provides us a lot of reusable objects (IntWritable, FloatWritable, DoubleWritable, etc.). I am open to other ideas of course. Let me know your thoughts.
          Hide
          Jake Mannix added a comment -

          Yeah, I having it return the current vertex sounds good, I guess. There's still something nagging at me about the way Writables are being used: Giraph is different from Hadoop: there's a persistent, in-memory data structure being built here, where there isn't in Hadoop. Regardless of how we read the data, or send the data over the wire, or write it to disk, we're also hanging onto it. I wonder if we need to make the abstraction around that more clear?

          Maybe simply solving the title of this JIRA ticket would do the trick, which would at a minimum require that BasicVertex implement Writable, and other than that, it could work with VertexReader API's of either flavor.

          I think I can try working on this ticket without monkeying with the VertexReader API, but I won't know until I start unravelling this ball of string a bit.

          Show
          Jake Mannix added a comment - Yeah, I having it return the current vertex sounds good, I guess. There's still something nagging at me about the way Writables are being used: Giraph is different from Hadoop: there's a persistent, in-memory data structure being built here, where there isn't in Hadoop. Regardless of how we read the data, or send the data over the wire, or write it to disk, we're also hanging onto it. I wonder if we need to make the abstraction around that more clear? Maybe simply solving the title of this JIRA ticket would do the trick, which would at a minimum require that BasicVertex implement Writable, and other than that, it could work with VertexReader API's of either flavor. I think I can try working on this ticket without monkeying with the VertexReader API, but I won't know until I start unravelling this ball of string a bit.
          Hide
          Avery Ching added a comment -

          The reason for the current VertexReader API was to match the old Hadoop RecordReader API and make it natural for folks to move to vertices instead of keys and values. The old Hadoop RecordReader API

          org.apache.hadoop.mapred.RecordReader

          boolean next(K key, V value) throws IOException;

          and the current VertexReader API is

          boolean next(MutableVertex<I, V, E, ?> vertex)
          throws IOException, InterruptedException;

          That being said, the new Hadoop RecordReader API is different:

          org.apache.hadoop.mapreduce.RecordReader

          boolean nextKeyValue() throws IOException, InterruptedException;
          KEYIN getCurrentKey() throws IOException, InterruptedException;
          VALUEIN getCurrentValue() throws IOException, InterruptedException;

          It's probably easier to follow that (especially regarding your points). Given it's a user facing API we should get a few more opinions on it though. I imagine the change would be something closer to:

          boolean nextVertex() throws IOException, InterruptedException;
          BasicVertex<I, V, E, M> getCurrentVertex() throws IOException, InterruptedException;

          As far as the questions about BasicVertex and MutableVertex, the general idea would be that BasicVertex would be a safer interface to use whenever possible. However, the Vertex class hierarchy has evolved and I wouldn't mind changing it since it's not really as useful as it should be. In general, we should only provide the interfaces necessary for each method to ensure we (or the users) can't do something stupid. So probably a (nearly) immutable interface for storage, one for the user to access their methods, etc...

          Show
          Avery Ching added a comment - The reason for the current VertexReader API was to match the old Hadoop RecordReader API and make it natural for folks to move to vertices instead of keys and values. The old Hadoop RecordReader API org.apache.hadoop.mapred.RecordReader boolean next(K key, V value) throws IOException; and the current VertexReader API is boolean next(MutableVertex<I, V, E, ?> vertex) throws IOException, InterruptedException; That being said, the new Hadoop RecordReader API is different: org.apache.hadoop.mapreduce.RecordReader boolean nextKeyValue() throws IOException, InterruptedException; KEYIN getCurrentKey() throws IOException, InterruptedException; VALUEIN getCurrentValue() throws IOException, InterruptedException; It's probably easier to follow that (especially regarding your points). Given it's a user facing API we should get a few more opinions on it though. I imagine the change would be something closer to: boolean nextVertex() throws IOException, InterruptedException; BasicVertex<I, V, E, M> getCurrentVertex() throws IOException, InterruptedException; As far as the questions about BasicVertex and MutableVertex, the general idea would be that BasicVertex would be a safer interface to use whenever possible. However, the Vertex class hierarchy has evolved and I wouldn't mind changing it since it's not really as useful as it should be. In general, we should only provide the interfaces necessary for each method to ensure we (or the users) can't do something stupid. So probably a (nearly) immutable interface for storage, one for the user to access their methods, etc...
          Hide
          Jake Mannix added a comment -

          In fact, thinking about VertexReader further, it seems its entire API is a little backwards. Why are we passing in instantiated Vertices, and filling them in? Shouldn't they effectively be "iterators" over the InputSplit?

          Show
          Jake Mannix added a comment - In fact, thinking about VertexReader further, it seems its entire API is a little backwards. Why are we passing in instantiated Vertices, and filling them in? Shouldn't they effectively be "iterators" over the InputSplit?
          Hide
          Jake Mannix added a comment -

          Initial thoughts:

          VertexReader defines a "next(MutableVertex vertex)" method, which does the sensible thing of filling in the vertex from the HDFS block, and because it takes a vertex object and messes with it, it's natural that the vertex be "required" to be a MutableVertex.

          But of course this implies that everything be a MutableVertex, because if you can't be read in by a VertexReader, where do you get instantiated at all? If BasicVertex implements Writable, we could always readFields() data in, but not allow mutation, but this seems like it would interfere with the way VertexReader allows users to read straight from Text, etc. This would allow VertexList to extend ArrayList<BasicVertex> instead of ArrayList<Vertex>, at the same time.

          Anyone have any thoughts/ideas? Are we wedded to making VertexReader implementations deal with MutableVertex, or can we swap them to handle Writable BasicVertex?

          Show
          Jake Mannix added a comment - Initial thoughts: VertexReader defines a "next(MutableVertex vertex)" method, which does the sensible thing of filling in the vertex from the HDFS block, and because it takes a vertex object and messes with it, it's natural that the vertex be "required" to be a MutableVertex. But of course this implies that everything be a MutableVertex, because if you can't be read in by a VertexReader, where do you get instantiated at all? If BasicVertex implements Writable, we could always readFields() data in, but not allow mutation, but this seems like it would interfere with the way VertexReader allows users to read straight from Text, etc. This would allow VertexList to extend ArrayList<BasicVertex> instead of ArrayList<Vertex>, at the same time. Anyone have any thoughts/ideas? Are we wedded to making VertexReader implementations deal with MutableVertex, or can we swap them to handle Writable BasicVertex?

            People

            • Assignee:
              Jake Mannix
              Reporter:
              Jake Mannix
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development