Giraph
  1. Giraph
  2. GIRAPH-28

Introduce new primitive-specific MutableVertex subclasses

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.0
    • Fix Version/s: 0.1.0
    • Component/s: graph
    • Labels:
      None

      Description

      As discussed on the list, MutableVertex<LongWritable,DoubleWritable,FloatWritable,DoubleWritable> (for example) could be highly optimized in its memory footprint if the vertex and edge data were held in a form which minimized Java object usage.

      1. GIRAPH-28.diff
        37 kB
        Jake Mannix
      2. GIRAPH-28.diff
        40 kB
        Jake Mannix
      3. GIRAPH-28.diff
        35 kB
        Jake Mannix

        Issue Links

          Activity

          Hide
          Jake Mannix added a comment -

          This was included in GIRAPH-36

          Show
          Jake Mannix added a comment - This was included in GIRAPH-36
          Hide
          Jake Mannix added a comment -

          Most likely even if unit tests are passing, running on a real cluster with primitive-specific subclasses will fail if there are vertex range balancing happening.

          Show
          Jake Mannix added a comment - Most likely even if unit tests are passing, running on a real cluster with primitive-specific subclasses will fail if there are vertex range balancing happening.
          Hide
          Jake Mannix added a comment -

          Newly regenerated against trunk.

          Show
          Jake Mannix added a comment - Newly regenerated against trunk.
          Hide
          Jake Mannix added a comment -

          I don't know what it was, I just re-patched with current trunk, after the refactorings of the most recent few patches. Memory use dropped to what it should be!

          Show
          Jake Mannix added a comment - I don't know what it was, I just re-patched with current trunk, after the refactorings of the most recent few patches. Memory use dropped to what it should be!
          Hide
          Dmitriy V. Ryaboy added a comment -

          Errrr Jake. Sorry. Clearly, I can't tell you remote search people apart :-P

          Show
          Dmitriy V. Ryaboy added a comment - Errrr Jake. Sorry. Clearly, I can't tell you remote search people apart :-P
          Hide
          Dmitriy V. Ryaboy added a comment -

          Nice job, Jimmy. Looks like you managed to get rid of something that was sucking in memory in both Object and LDFD.. what was it?

          Show
          Dmitriy V. Ryaboy added a comment - Nice job, Jimmy. Looks like you managed to get rid of something that was sucking in memory in both Object and LDFD.. what was it?
          Hide
          Jake Mannix added a comment -

          Ok another patch coming soon for this, but good news: this is the output of the object size calculator now:
          (key: Primitive is what Dmitriy put in that test code, LDFD is a trivial class which extends the new LongDoubleFloatDoubleVertex class, and shows exactly the same memory as this)

          Tiny: 0 840
          Object: 0 872
          Primitive: 0 4536
          LDFD: 0 4536
          Tiny: 1 840
          Object: 1 976
          Primitive: 1 4536
          LDFD: 1 4536
          Tiny: 10 840
          Object: 10 1912
          Primitive: 10 4536
          LDFD: 10 4536
          Tiny: 100 2640
          Object: 100 11272
          Primitive: 100 4536
          LDFD: 100 4536
          Tiny: 1000 16080
          Object: 1000 104872
          Primitive: 1000 46784
          LDFD: 1000 46784
          Tiny: 10000 123600
          Object: 10000 1040872
          Primitive: 10000 302000
          LDFD: 10000 302000

          Show
          Jake Mannix added a comment - Ok another patch coming soon for this, but good news: this is the output of the object size calculator now: (key: Primitive is what Dmitriy put in that test code, LDFD is a trivial class which extends the new LongDoubleFloatDoubleVertex class, and shows exactly the same memory as this) Tiny: 0 840 Object: 0 872 Primitive: 0 4536 LDFD: 0 4536 Tiny: 1 840 Object: 1 976 Primitive: 1 4536 LDFD: 1 4536 Tiny: 10 840 Object: 10 1912 Primitive: 10 4536 LDFD: 10 4536 Tiny: 100 2640 Object: 100 11272 Primitive: 100 4536 LDFD: 100 4536 Tiny: 1000 16080 Object: 1000 104872 Primitive: 1000 46784 LDFD: 1000 46784 Tiny: 10000 123600 Object: 10000 1040872 Primitive: 10000 302000 LDFD: 10000 302000
          Hide
          Jake Mannix added a comment -

          ah, yes. That can be a nasty pit of snakes for new developers, no matter how commonly it's found in Hadoop-land.

          So I'll put in my vote for Iterable<I>, with your offline suggestion of boolean hasSortedIterator() (defaulting in BasicVertex to "return true;", overrideable in subclasses).

          And I'll put in a patch on GIRAPH-31 so my code'll be where my mouth is (and we can continue this discussion on a ticket with a shorter thread [so far]).

          Show
          Jake Mannix added a comment - ah, yes. That can be a nasty pit of snakes for new developers, no matter how commonly it's found in Hadoop-land. So I'll put in my vote for Iterable<I>, with your offline suggestion of boolean hasSortedIterator() (defaulting in BasicVertex to "return true;", overrideable in subclasses). And I'll put in a patch on GIRAPH-31 so my code'll be where my mouth is (and we can continue this discussion on a ticket with a shorter thread [so far] ).
          Hide
          Dmitriy V. Ryaboy added a comment -

          This:

          Alternatively, Edge could act just like a typical Writable, and the Iterator<Edge<I, E>> iterates over the same Edge object setting different values on it as next() is called.

          Show
          Dmitriy V. Ryaboy added a comment - This: Alternatively, Edge could act just like a typical Writable, and the Iterator<Edge<I, E>> iterates over the same Edge object setting different values on it as next() is called.
          Hide
          Jake Mannix added a comment -

          What do you mean by the "MutatorIterator" pattern? Not being clear about whether it's sorted or not? Or forcing iterator() to always be sorted? Or something else, about the X-Men series?

          Show
          Jake Mannix added a comment - What do you mean by the "MutatorIterator" pattern? Not being clear about whether it's sorted or not? Or forcing iterator() to always be sorted? Or something else, about the X-Men series?
          Hide
          Dmitriy V. Ryaboy added a comment -

          I'd caution against the approach of using a MutatorIterator (that's my name for that pattern. Like it? ).
          It's effective, but leads to extremely confusing bugs when people try to do things like take the first three edges, etc. Presenting a familiar interface but providing a tricky unintuitive implementation is not super friendly to developers; I don't think we want people to have to study the API to such an extent they have to know these details.

          Show
          Dmitriy V. Ryaboy added a comment - I'd caution against the approach of using a MutatorIterator (that's my name for that pattern. Like it? ). It's effective, but leads to extremely confusing bugs when people try to do things like take the first three edges, etc. Presenting a familiar interface but providing a tricky unintuitive implementation is not super friendly to developers; I don't think we want people to have to study the API to such an extent they have to know these details.
          Hide
          Jake Mannix added a comment -

          As for sorting, I'd imagine that assuming it always returns a sorted iterator is fine, but yes, some implementations I can imagine might not want to do that. I'd lean against having multiple iterators until it was known that they were needed, and maybe just document the ones which return nonsorted ones so that things don't get messed up?

          Vertex subclasses are where the "algorithms" are implemented, right? So a Vertex knows whether it has a sorted iterator or not... the only question would be: are there generic methods implemented in things like BspServiceWorker, or GraphMapper, which would be expected to need to do things to a sorted iterator? Currently there are no such places that I can see. Without such cases, we could easily leave Vertex implementations to decide whether they needed to return sorted iterators or not.

          Show
          Jake Mannix added a comment - As for sorting, I'd imagine that assuming it always returns a sorted iterator is fine, but yes, some implementations I can imagine might not want to do that. I'd lean against having multiple iterators until it was known that they were needed, and maybe just document the ones which return nonsorted ones so that things don't get messed up? Vertex subclasses are where the "algorithms" are implemented, right? So a Vertex knows whether it has a sorted iterator or not... the only question would be: are there generic methods implemented in things like BspServiceWorker, or GraphMapper, which would be expected to need to do things to a sorted iterator? Currently there are no such places that I can see. Without such cases, we could easily leave Vertex implementations to decide whether they needed to return sorted iterators or not.
          Hide
          Jake Mannix added a comment -

          The alternative to Iterable<Edge<I, E>> is Iterable<I>, returning only the target vertices, and you can call getEdgeValue(targetVertexId) on any of these if you need it. Again, many algorithms will simply do something like

          for(I targetId : vertex)

          { sendMsg(targetId, someFunction(baseMsg, getEdgeValue(targetId)); }

          which is maybe a little nicer looking (or at least not uglier) than:

          for(Edge<I, E> edge : vertex)

          { sendMsg(edge.getVertexId(), someFunction(baseMsg, edge.getValue()); }

          And then there are no Edge objects hanging around.

          Alternatively, Edge could act just like a typical Writable, and the Iterator<Edge<I, E>> iterates over the same Edge object setting different values on it as next() is called.

          Show
          Jake Mannix added a comment - The alternative to Iterable<Edge<I, E>> is Iterable<I>, returning only the target vertices, and you can call getEdgeValue(targetVertexId) on any of these if you need it. Again, many algorithms will simply do something like for(I targetId : vertex) { sendMsg(targetId, someFunction(baseMsg, getEdgeValue(targetId)); } which is maybe a little nicer looking (or at least not uglier) than: for(Edge<I, E> edge : vertex) { sendMsg(edge.getVertexId(), someFunction(baseMsg, edge.getValue()); } And then there are no Edge objects hanging around. Alternatively, Edge could act just like a typical Writable, and the Iterator<Edge<I, E>> iterates over the same Edge object setting different values on it as next() is called.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Technically you shouldn't have to use hasEdge when adding and removing if you don't care. removeEdge() can return null ambiguously (value was null, or no such edge existed), and if you care, you can use hasEdge(), but if you don't, you don't. addEdge() can be an upsert.

          Show
          Dmitriy V. Ryaboy added a comment - Technically you shouldn't have to use hasEdge when adding and removing if you don't care. removeEdge() can return null ambiguously (value was null, or no such edge existed), and if you care, you can use hasEdge(), but if you don't, you don't. addEdge() can be an upsert.
          Hide
          Avery Ching added a comment -

          boolean hasEdge(I targetVertexId) is fine as an alternative. For the iterators though, we are returning an iterator of type Edge<I, E>, so we will have to create those Edge<I, E> objects on the fly for some implementations. I suppose that hasEdge() adds a bit of work though to check before adding or removing as opposed to returning the full edge. I think I'd lean a slight bit in the Edge<I, E> methods for consistency and reduced code. But I can be out-voted. =)

          Show
          Avery Ching added a comment - boolean hasEdge(I targetVertexId) is fine as an alternative. For the iterators though, we are returning an iterator of type Edge<I, E>, so we will have to create those Edge<I, E> objects on the fly for some implementations. I suppose that hasEdge() adds a bit of work though to check before adding or removing as opposed to returning the full edge. I think I'd lean a slight bit in the Edge<I, E> methods for consistency and reduced code. But I can be out-voted. =)
          Hide
          Avery Ching added a comment -

          Would we guarantee that iterator() would always be sorted? For instance, what about something like TinyVertex? A pair of arrays would be expensive to keep sorted...

          Or are you suggesting that whether iterator is sorted or not depends on the implementation?

          I agree that implementing Iterable is pretty convenient for users. Perhaps have both (implement Iterable and the two other methods)? This would allow applications based on TinyVertex to be optimized when not requiring sorting.

          Show
          Avery Ching added a comment - Would we guarantee that iterator() would always be sorted? For instance, what about something like TinyVertex? A pair of arrays would be expensive to keep sorted... Or are you suggesting that whether iterator is sorted or not depends on the implementation? I agree that implementing Iterable is pretty convenient for users. Perhaps have both (implement Iterable and the two other methods)? This would allow applications based on TinyVertex to be optimized when not requiring sorting.
          Hide
          Jake Mannix added a comment -

          Also, to contradict my 1st and 3rd points, Dmitriy pointed out (in an out-of-band chat) that if we don't want to expose Edge to the user, because a) don't want to store it in memory (as his test showed that even switching TreeMap<I, Edge<I,E>> to TreeMap<I, E> reduced memory usage by a fair amount), and b) don't want to have to instantiate tons of useless objects by lazily creating them, we could instead just keep the getEdgeValue() and removeEdge() as they were, but also add a boolean hasEdge(I targetVertexId) to test for connection.

          Then you get everything you need without exposing the Edge class (which only gets used internally for its Writable nature):

          if(vertex.hasEdge(targetVertexId))

          { E value = vertex.getEdgeValue(targetVertexId); vertex.removeEdge(targetVertexId); }

          etc...

          Show
          Jake Mannix added a comment - Also, to contradict my 1st and 3rd points, Dmitriy pointed out (in an out-of-band chat) that if we don't want to expose Edge to the user, because a) don't want to store it in memory (as his test showed that even switching TreeMap<I, Edge<I,E>> to TreeMap<I, E> reduced memory usage by a fair amount), and b) don't want to have to instantiate tons of useless objects by lazily creating them, we could instead just keep the getEdgeValue() and removeEdge() as they were, but also add a boolean hasEdge(I targetVertexId) to test for connection. Then you get everything you need without exposing the Edge class (which only gets used internally for its Writable nature): if(vertex.hasEdge(targetVertexId)) { E value = vertex.getEdgeValue(targetVertexId); vertex.removeEdge(targetVertexId); } etc...
          Hide
          Jake Mannix added a comment -

          I'm suggesting that iterator() be always sorted. SortedMap implements Iterable (by way of Collection), and the iterator it returns is always in the sorted order. We'd have BasicVertex do the same thing.

          Show
          Jake Mannix added a comment - I'm suggesting that iterator() be always sorted. SortedMap implements Iterable (by way of Collection), and the iterator it returns is always in the sorted order. We'd have BasicVertex do the same thing.
          Hide
          Avery Ching added a comment -

          Jake, I agree with all 3 of your points. The only question I have is regarding implementing Iterable. Will the user expect to get back a sorted iterator or a non-sorted one? I'm guessing we'll lose some functionality here? I don't think we would want to add a "switch" of sorts to choose. The other way we had discussed way the methods

          Iterator<Edge<I, E>> getOutEdgeIterator();
          Iterator<Edge<I, E>> getSortedOutEdgeIterator();

          as opposed to Iterable.

          Show
          Avery Ching added a comment - Jake, I agree with all 3 of your points. The only question I have is regarding implementing Iterable. Will the user expect to get back a sorted iterator or a non-sorted one? I'm guessing we'll lose some functionality here? I don't think we would want to add a "switch" of sorts to choose. The other way we had discussed way the methods Iterator<Edge<I, E>> getOutEdgeIterator(); Iterator<Edge<I, E>> getSortedOutEdgeIterator(); as opposed to Iterable.
          Hide
          Jake Mannix added a comment -

          Ok, so I went ahead and made a 'straw-man' refactoring branch (on GitHub: https://github.com/jakemannix/giraph/tree/vertex_map_refactor ), removing the getDestEdgeMap() method, and having BasicVertex implement Iterable, as well as the random-access read method getEdgeValue(targetVertexId).

          I got it passing tests, but ran into a few things we may want to consider:

          testing for existence of a target vertex is no longer possible: getEdgeValue(targetVertexId) returns the value associated with the edge. Edges are allowed to have null values and still denote a connection between the source and target vertex, right? Maybe we should just have an Edge<I, E> getEdge(I targetVertexId) method instead?

          Secondly, far less importantly, is we need to have getNumOutEdges(), because clients often want to know the out-degree of the vertex, and they used to call getDestEdgeMap().size().

          Thirdly: for the same reason that getEdgeValue() can return superfluous nulls, removeEdge(), used as a boolean, can trick the caller into thinking there was no connection to the target (because removeEdge() returned null), but really it's because I was trying to be clever and return the value which could be null. Having removeEdge() return the actual Edge fixes this.

          I'll open another ticket for this stuff, as patching this patch seems a bit silly.

          Show
          Jake Mannix added a comment - Ok, so I went ahead and made a 'straw-man' refactoring branch (on GitHub: https://github.com/jakemannix/giraph/tree/vertex_map_refactor ), removing the getDestEdgeMap() method, and having BasicVertex implement Iterable, as well as the random-access read method getEdgeValue(targetVertexId). I got it passing tests, but ran into a few things we may want to consider: testing for existence of a target vertex is no longer possible: getEdgeValue(targetVertexId) returns the value associated with the edge. Edges are allowed to have null values and still denote a connection between the source and target vertex, right? Maybe we should just have an Edge<I, E> getEdge(I targetVertexId) method instead? Secondly, far less importantly, is we need to have getNumOutEdges(), because clients often want to know the out-degree of the vertex, and they used to call getDestEdgeMap().size(). Thirdly: for the same reason that getEdgeValue() can return superfluous nulls, removeEdge(), used as a boolean, can trick the caller into thinking there was no connection to the target (because removeEdge() returned null), but really it's because I was trying to be clever and return the value which could be null. Having removeEdge() return the actual Edge fixes this. I'll open another ticket for this stuff, as patching this patch seems a bit silly.
          Hide
          Jake Mannix added a comment -

          newly regenerated patch, should apply cleanly against trunk. Tests still pass (yay), but still not terribly clean code. I would argue that until we nail down the getOutpuEdgeMap() API, this patch should be held off on merging.

          But let's play with it more, and I think I'll turn Dmitriy's TinyVertex into another option in here, as it's a common use-case, and even smaller still than the current primitive container class.

          Show
          Jake Mannix added a comment - newly regenerated patch, should apply cleanly against trunk. Tests still pass (yay), but still not terribly clean code. I would argue that until we nail down the getOutpuEdgeMap() API, this patch should be held off on merging. But let's play with it more, and I think I'll turn Dmitriy's TinyVertex into another option in here, as it's a common use-case, and even smaller still than the current primitive container class.
          Hide
          Jake Mannix added a comment -

          re: Iterable - I was thinking we'd just pick on way to iterate over the edges, not multiple: we could either have the getSortedOutEdges(), or simply iterator(), as a consequence of implementing Iterable, but not both.

          The benefits of Iterable is the nice for-loop, and plays nice with google collections api: Iterables.any() / .concat() / .transform() / .partition() / etc.

          The downside is the non-informative method name.

          Show
          Jake Mannix added a comment - re: Iterable - I was thinking we'd just pick on way to iterate over the edges, not multiple: we could either have the getSortedOutEdges(), or simply iterator(), as a consequence of implementing Iterable, but not both. The benefits of Iterable is the nice for-loop, and plays nice with google collections api: Iterables.any() / .concat() / .transform() / .partition() / etc. The downside is the non-informative method name.
          Hide
          Avery Ching added a comment -

          I like free stuff =)

          Show
          Avery Ching added a comment - I like free stuff =)
          Hide
          Avery Ching added a comment -

          In regards to Jake's comment on BasicVertex implement Iterable<Edge<I, E>>, I don't have a strong opinion. Maybe too many ways to do the same thing (iterate over the edges)?

          Show
          Avery Ching added a comment - In regards to Jake's comment on BasicVertex implement Iterable<Edge<I, E>>, I don't have a strong opinion. Maybe too many ways to do the same thing (iterate over the edges)?
          Hide
          Jake Mannix added a comment -

          One more possible change: instead of "boolean removeEdge(I targetVertexId)", how about "E removeEdge(I targetVertexId)" - returning the value of the edge removed (or null if there wasn't an edge to the target). Might come in handy to return the edge, even if it's usually ignored, and it's a little more information for the caller "for free".

          Show
          Jake Mannix added a comment - One more possible change: instead of "boolean removeEdge(I targetVertexId)", how about "E removeEdge(I targetVertexId)" - returning the value of the edge removed (or null if there wasn't an edge to the target). Might come in handy to return the edge, even if it's usually ignored, and it's a little more information for the caller "for free".
          Hide
          Severin Corsten added a comment -

          I think this is a good idea. The most common use of the OutEdgeMap is to iterate over the values. The Map doesn't bring any additional information.

          And I like Jakes idea bud I don't know whether it brings any advantage and should be implemented in addition to a getter.

          Show
          Severin Corsten added a comment - I think this is a good idea. The most common use of the OutEdgeMap is to iterate over the values. The Map doesn't bring any additional information. And I like Jakes idea bud I don't know whether it brings any advantage and should be implemented in addition to a getter.
          Hide
          Jake Mannix added a comment -

          I like the Iterator more than ImmutableList, yeah, that's great. I wonder if then just making BasicVertex implement Iterable<Edge<I,E>> would be called for: for(Edge<I,E> edge : vertex)

          { ... }

          ? Not sure if that syntactic sugar is worth it.

          Show
          Jake Mannix added a comment - I like the Iterator more than ImmutableList, yeah, that's great. I wonder if then just making BasicVertex implement Iterable<Edge<I,E>> would be called for: for(Edge<I,E> edge : vertex) { ... } ? Not sure if that syntactic sugar is worth it.
          Hide
          Avery Ching added a comment -

          I certainly agree that providing interfaces that reduce memory consumption on the edges makes a lot of sense (thanks for the evidence).

          For supporting the all the implementations (especially primitive arrays implementation), perhaps

          Iterator<Edge<I, E>> getOutEdgeIterator();

          (Similar to the Pregel API makes sense). This would be easy to support with primitive arrays. Additionally, your suggested interfaces also make sense (slight modifications).

          E getEdgeValue(I targetVertexId);
          boolean removeEdge(I targetVertexId);
          Iterator<Edge<I, E>> getSortedOutEdgeIterator();

          What do you/others think? We should get others to weigh in as well, especially since it's a user facing API.

          Show
          Avery Ching added a comment - I certainly agree that providing interfaces that reduce memory consumption on the edges makes a lot of sense (thanks for the evidence). For supporting the all the implementations (especially primitive arrays implementation), perhaps Iterator<Edge<I, E>> getOutEdgeIterator(); (Similar to the Pregel API makes sense). This would be easy to support with primitive arrays. Additionally, your suggested interfaces also make sense (slight modifications). E getEdgeValue(I targetVertexId); boolean removeEdge(I targetVertexId); Iterator<Edge<I, E>> getSortedOutEdgeIterator(); What do you/others think? We should get others to weigh in as well, especially since it's a user facing API.
          Hide
          Jake Mannix added a comment -

          So Avery, the question I have for you is regarding the getOutEdgeMap() method - if we get rid of that, and instead maybe offer something like the other methods discussed on the list thread:

          E getEdge(I targetVertexId);
          ImmutableList<I> getSortedOutVertices();
          boolean removeEdge(I targetVertexId);

          we could do away with being tied to this TreeMap (although for now, keep it around in Vertex.java, as there's not much else possible in the generic object case, most likely), in addition to allowing me to remove my insane "pretend" SortedMap wrapper class.

          Show
          Jake Mannix added a comment - So Avery, the question I have for you is regarding the getOutEdgeMap() method - if we get rid of that, and instead maybe offer something like the other methods discussed on the list thread: E getEdge(I targetVertexId); ImmutableList<I> getSortedOutVertices(); boolean removeEdge(I targetVertexId); we could do away with being tied to this TreeMap (although for now, keep it around in Vertex.java, as there's not much else possible in the generic object case, most likely), in addition to allowing me to remove my insane "pretend" SortedMap wrapper class.
          Hide
          Jake Mannix added a comment -

          Yeah, I don't know what's eating up the space in my Primitive class, but I bet it has something to do with my "wrapper" SortedMap I return when you call getOutEdgeMap(). This object should be lazily instantiated, but as it has a "get**" name to it, I wonder if the ObjectSizeCalculator is considering the returned value to be part of the object's memory? If so, then it *should be what we see. Dmitriy, can you try re-running your original test, on the real LongDoubleFloatDoubleVertex, but with getOutEdgeMap() implemented as just "return null;", and similarly with getMsgList()? If you test doesn't use those methods. My guess is that the measured size will drop down to close to what your results for "Primitive Map" are.

          Regarding long[] and float[], it's actually not a totally crazy implementation, for the "final state" of the in-memory representation: most graph algorithms leave the graph structure alone (ie don't add or delete edges during iteration), in which case compacting the Vertex down to a pair of parallel sorted arrays after loading from the VertexReader is not such a unreasonable situation. Of course, it then requires that random-access reads to the outbound edges do a log(numOutEdges) binary search, but not all graph algorithms require lots of random access to specific edges (as opposed to bulk access to all edges).

          I could imagine two separate primitive implementations - one based on a map-like structure (a la OpenXxxYyyHashMap), which gives random access to the edges (but then not sorted access! Note that the current code in my primitive vertex throws UnsupportedOperationException for lots of SortedMap methods, which happen to be non exercised in the unit tests), and one based on parallel arrays, which allows for very fast sequential read-only access, but slower random read-only access, and very much slower mutation speed. I wrote implementations like this for the Mahout Vector interface, and named them almost exactly like that: RandomAccessSparseVector, and SequentialAccessSparseVector (although I should have called the latter something like ReadOnlySequentialAccessSparseVector, but it would have been a lie, as you can modify it, you just shouldn't).

          These numbers now make a lot of sense - Objects Are Big, that's the moral of the story. Any place you need to hang onto gazillions of things in memory on the JVM, you should do your best to stuff them into primitive arrays.

          Show
          Jake Mannix added a comment - Yeah, I don't know what's eating up the space in my Primitive class, but I bet it has something to do with my "wrapper" SortedMap I return when you call getOutEdgeMap(). This object should be lazily instantiated, but as it has a "get** " name to it, I wonder if the ObjectSizeCalculator is considering the returned value to be part of the object's memory? If so, then it *should be what we see. Dmitriy, can you try re-running your original test, on the real LongDoubleFloatDoubleVertex, but with getOutEdgeMap() implemented as just "return null;", and similarly with getMsgList()? If you test doesn't use those methods. My guess is that the measured size will drop down to close to what your results for "Primitive Map" are. Regarding long[] and float[], it's actually not a totally crazy implementation, for the "final state" of the in-memory representation: most graph algorithms leave the graph structure alone (ie don't add or delete edges during iteration), in which case compacting the Vertex down to a pair of parallel sorted arrays after loading from the VertexReader is not such a unreasonable situation. Of course, it then requires that random-access reads to the outbound edges do a log(numOutEdges) binary search, but not all graph algorithms require lots of random access to specific edges (as opposed to bulk access to all edges). I could imagine two separate primitive implementations - one based on a map-like structure (a la OpenXxxYyyHashMap), which gives random access to the edges (but then not sorted access! Note that the current code in my primitive vertex throws UnsupportedOperationException for lots of SortedMap methods, which happen to be non exercised in the unit tests), and one based on parallel arrays, which allows for very fast sequential read-only access, but slower random read-only access, and very much slower mutation speed. I wrote implementations like this for the Mahout Vector interface, and named them almost exactly like that: RandomAccessSparseVector, and SequentialAccessSparseVector (although I should have called the latter something like ReadOnlySequentialAccessSparseVector, but it would have been a lie, as you can modify it, you just shouldn't). These numbers now make a lot of sense - Objects Are Big, that's the moral of the story. Any place you need to hang onto gazillions of things in memory on the JVM, you should do your best to stuff them into primitive arrays.
          Hide
          Avery Ching added a comment -

          Thanks for isolating to the edge data only.

          These numbers are much more reasonable =). Getting rid of the Edge class for users (still useful for storing the requests) seems like an easy win for the Object based Vertex (around a 20% memory reduction for edges). I think that for many applications, using a pair of arrays might be reasonable and provide a big win for memory reduction (i.e. page rank).

          Now I'm really curious what was eating all the other memory though =)...

          Show
          Avery Ching added a comment - Thanks for isolating to the edge data only. These numbers are much more reasonable =). Getting rid of the Edge class for users (still useful for storing the requests) seems like an easy win for the Object based Vertex (around a 20% memory reduction for edges). I think that for many applications, using a pair of arrays might be reasonable and provide a big win for memory reduction (i.e. page rank). Now I'm really curious what was eating all the other memory though =)...
          Hide
          Dmitriy V. Ryaboy added a comment -

          There is something that's eating up memory in the LDFDVertex implementation that's not the OpenLongFloatHashMap.

          I changed my code to specifically test consumption of the edge map memory, and removing all the Vertex complexity, by just putting together 4 implementations of a most basic interface:

           public static abstract class Vertex {
              public abstract void addEdge(Edge<LongWritable, FloatWritable> edge);
            }
          

          The 4 implementations are "Current" (a map of LongWritables to Edge<LongWritable, FloatWritable>), "Just Value" (a map of LongWritable to FloatWritable), "Primitive Map" (an OpenLongFloatHashMap), and "Primitive" (two arrays of longs and floats, which resize and copy the whole array on every edge add – an obviously untenable, but least memory intensive, implementation).

          The code is at https://gist.github.com/1210524

          The results are much more sensible. Switching to the PrimitiveMap should be huge savings; even without that, getting rid of the duplicated LongWritable is quite noticeable.

          Current       : 0	144
          Just Value    : 0	144
          Primitive Map : 0	272
          Primitive     : 0	56
          Current       : 1	240
          Just Value    : 1	216
          Primitive Map : 1	272
          Primitive     : 1	72
          Current       : 10	1104
          Just Value    : 10	864
          Primitive Map : 10	528
          Primitive     : 10	176
          Current       : 100	10704
          Just Value    : 100	8304
          Primitive Map : 100	3728
          Primitive     : 100	1256
          Current       : 1000	104272
          Just Value    : 1000	80272
          Primitive Map : 1000	45976
          Primitive     : 1000	12056
          Current       : 10000	1025616
          Just Value    : 10000	785616
          Primitive Map : 10000	301192
          Primitive     : 10000	120056
          
          Show
          Dmitriy V. Ryaboy added a comment - There is something that's eating up memory in the LDFDVertex implementation that's not the OpenLongFloatHashMap. I changed my code to specifically test consumption of the edge map memory, and removing all the Vertex complexity, by just putting together 4 implementations of a most basic interface: public static abstract class Vertex { public abstract void addEdge(Edge<LongWritable, FloatWritable> edge); } The 4 implementations are "Current" (a map of LongWritables to Edge<LongWritable, FloatWritable>), "Just Value" (a map of LongWritable to FloatWritable), "Primitive Map" (an OpenLongFloatHashMap), and "Primitive" (two arrays of longs and floats, which resize and copy the whole array on every edge add – an obviously untenable, but least memory intensive, implementation). The code is at https://gist.github.com/1210524 The results are much more sensible. Switching to the PrimitiveMap should be huge savings; even without that, getting rid of the duplicated LongWritable is quite noticeable. Current : 0 144 Just Value : 0 144 Primitive Map : 0 272 Primitive : 0 56 Current : 1 240 Just Value : 1 216 Primitive Map : 1 272 Primitive : 1 72 Current : 10 1104 Just Value : 10 864 Primitive Map : 10 528 Primitive : 10 176 Current : 100 10704 Just Value : 100 8304 Primitive Map : 100 3728 Primitive : 100 1256 Current : 1000 104272 Just Value : 1000 80272 Primitive Map : 1000 45976 Primitive : 1000 12056 Current : 10000 1025616 Just Value : 10000 785616 Primitive Map : 10000 301192 Primitive : 10000 120056
          Hide
          Avery Ching added a comment -

          Thank you for providing an interesting set of results. I agree that this can/will be an issue when there are lots of outgoing edges (common in real graphs).

          The last grouping in your results especially is troubling:

          Tiny: 10000 123592
          Object: 10000 4290616
          Primitive: 10000 4431744

          It certainly appears that the edge relationship (dest id and edge value) data structure makes a big difference. However, I don't understand why the object and primitive results are similar, nor why using a TreeMap or org.apache.mahout.math.map.OpenLongFloatHashMap is so much higher overhead than a pair of primitive arrays.

          Show
          Avery Ching added a comment - Thank you for providing an interesting set of results. I agree that this can/will be an issue when there are lots of outgoing edges (common in real graphs). The last grouping in your results especially is troubling: Tiny: 10000 123592 Object: 10000 4290616 Primitive: 10000 4431744 It certainly appears that the edge relationship (dest id and edge value) data structure makes a big difference. However, I don't understand why the object and primitive results are similar, nor why using a TreeMap or org.apache.mahout.math.map.OpenLongFloatHashMap is so much higher overhead than a pair of primitive arrays.
          Hide
          Dmitriy V. Ryaboy added a comment -

          I tried to benchmark the memory footprint on this and was surprised to find that it's actually larger than the existing implementation!

          At Jake's suggestion, I also added a dummy "TinyVertex" that just does the simplest thing possible, by keeping an array of primitives and resizing it when needed. This gives us the lower bound on what our memory utilization could look like. The answer is, we could be 30x more efficient in terms of memory utilization, at the cost of some CPU.

          Code is here: https://gist.github.com/1210245

          Show
          Dmitriy V. Ryaboy added a comment - I tried to benchmark the memory footprint on this and was surprised to find that it's actually larger than the existing implementation! At Jake's suggestion, I also added a dummy "TinyVertex" that just does the simplest thing possible, by keeping an array of primitives and resizing it when needed. This gives us the lower bound on what our memory utilization could look like. The answer is, we could be 30x more efficient in terms of memory utilization, at the cost of some CPU. Code is here: https://gist.github.com/1210245
          Hide
          Jake Mannix added a comment -

          Yeah, I was going to mention that this patch is out of date now, as the github mirror of apache trunk hadn't caught up with the other commit. It should have caught up by now, so I'll re-pull and regenerate.

          Avery: I would like to do some memory testing before we move forward with this, yes, that's the next step, after I rebase.

          Show
          Jake Mannix added a comment - Yeah, I was going to mention that this patch is out of date now, as the github mirror of apache trunk hadn't caught up with the other commit. It should have caught up by now, so I'll re-pull and regenerate. Avery: I would like to do some memory testing before we move forward with this, yes, that's the next step, after I rebase.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Jake, patch doesn't apply to current trunk. Can you rebase?

          Show
          Dmitriy V. Ryaboy added a comment - Jake, patch doesn't apply to current trunk. Can you rebase?
          Hide
          Avery Ching added a comment -

          Jake, this is pretty cool that you got it to work. Any thoughts on what to do next? Perhaps add this to the examples directory? Maybe do some sort of memory test to see the gains of a primitive subclass?

          Show
          Avery Ching added a comment - Jake, this is pretty cool that you got it to work. Any thoughts on what to do next? Perhaps add this to the examples directory? Maybe do some sort of memory test to see the gains of a primitive subclass?
          Hide
          Jake Mannix added a comment -

          Note: this patch introduces a dependency on mahout-collections, a fairly tiny jar which contains a ton of primitive-collections stuff, pretty performance optimized (and with nearly every primitive/primitive pairing you might want, as it's autogenerated from a template)

          Show
          Jake Mannix added a comment - Note: this patch introduces a dependency on mahout-collections, a fairly tiny jar which contains a ton of primitive-collections stuff, pretty performance optimized (and with nearly every primitive/primitive pairing you might want, as it's autogenerated from a template)
          Hide
          Jake Mannix added a comment -

          This is a toy version of LongDoubleFloatDoubleVertex, a proof of concept that you can get "SimplePageRankVertex extends LongDoubleFloatDoubleVertex" to pass its current unit tests without subclassing Vertex (and only using primitives internally!)

          Show
          Jake Mannix added a comment - This is a toy version of LongDoubleFloatDoubleVertex, a proof of concept that you can get "SimplePageRankVertex extends LongDoubleFloatDoubleVertex" to pass its current unit tests without subclassing Vertex (and only using primitives internally!)
          Hide
          Jake Mannix added a comment -

          patch with respect to GIRAPH-27 (I can generate w.r.t. trunk if desired)

          Show
          Jake Mannix added a comment - patch with respect to GIRAPH-27 (I can generate w.r.t. trunk if desired)

            People

            • Assignee:
              Jake Mannix
              Reporter:
              Jake Mannix
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development