The VertexID of a vertex must be comparable. That is actually enough for everything.
I was just profiling the suspected memory leak using 1 million PageRank vertices with 10 edges each (50 MB).
Here is a much more detailed memory analysis:
After reading the vertices into RAM in setup (Superstep2)
600 MB raw heap usage.
418199256 bytes occupied by the vertices.
287999928 bytes occupied by Text objects (48000000 bytes used as vertex keys, the rest are edge bytes)
237999192 bytes occupied by edges (Text objects and null references)
In the first superstep
Vertex memory stays constant. Messages are as follows:
5 million GraphJobMessages (only half of the out edges) take 225 MB. So with all messages, this sums up to a bit less than 500 MB (10 times the graph size!).
Each vertex message contains ~40 bytes: 20 for the Text, 20 for the DoubleWritable.
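As a quick cross-check of the message numbers above, here is a small back-of-the-envelope calculation. It assumes decimal megabytes and one message per out edge; the class and method names are made up for illustration. With the measured 225 MB for 5 million messages it works out to 45 bytes per message and 450 MB for all 10 million, which matches "a bit less than 500 MB" and is close to the ~40 bytes of payload.

```java
// Back-of-the-envelope check of the message figures quoted above.
// Assumptions: decimal megabytes, one message per out edge.
final class MessageFootprint {

    // Average bytes per message for a measured total footprint.
    static double bytesPerMessage(long totalBytes, long messageCount) {
        return (double) totalBytes / messageCount;
    }

    // Projected footprint if every out edge carries exactly one message.
    static long projectedBytes(double bytesPerMessage, long vertices, long edgesPerVertex) {
        return Math.round(bytesPerMessage * vertices * edgesPerVertex);
    }

    public static void main(String[] args) {
        long mb = 1_000_000L;
        double perMessage = bytesPerMessage(225 * mb, 5_000_000L);
        long all = projectedBytes(perMessage, 1_000_000L, 10L);
        System.out.println(perMessage + " bytes per message");
        System.out.println((all / mb) + " MB for all messages");
    }
}
```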
In the fourth superstep (of 6 in total)
GC'd to 1.1 GB again
The BSPMessageBundle contains 4.1 million messages and exists only once in memory. However, the linked list in the bundle's hashmap holds 100 MB of data.
Maybe we can switch to an ArrayList again. It is much more compact in memory because it isn't doubly linked. We should also release the reference to it once it has been sent via RPC.
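A rough sketch of what I mean by the two suggestions above. This is not the real BSPMessageBundle; the class below is a hypothetical stand-in, and keying the map by a String is an assumption. The point is that an ArrayList keeps one reference per element in a backing array, while a LinkedList allocates a separate node object with previous/next pointers for every element. The drain() method shows one way to drop the bundle's own reference once the contents have been handed off for sending.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a message bundle, not the actual BSPMessageBundle.
final class ArrayBackedBundle<M> {

    // Messages grouped by key (assumption: something like the message class name).
    private Map<String, List<M>> messages = new HashMap<>();

    // Appends to an ArrayList instead of a LinkedList: no per-element node objects.
    void add(String key, M message) {
        messages.computeIfAbsent(key, k -> new ArrayList<>()).add(message);
    }

    // Total number of messages currently held by the bundle.
    int size() {
        int n = 0;
        for (List<M> list : messages.values()) {
            n += list.size();
        }
        return n;
    }

    // Hands the messages to the caller (e.g. the RPC sender) and drops the
    // bundle's own reference, so they can be collected after the send.
    Map<String, List<M>> drain() {
        Map<String, List<M>> out = messages;
        messages = new HashMap<>();
        return out;
    }

    public static void main(String[] args) {
        ArrayBackedBundle<Double> bundle = new ArrayBackedBundle<>();
        bundle.add("DoubleWritable", 0.15);
        bundle.add("DoubleWritable", 0.85);
        Map<String, List<Double>> toSend = bundle.drain();
        System.out.println("to send: " + toSend + ", left in bundle: " + bundle.size());
    }
}
```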
However, everything is collected properly, so there is no memory leak in my opinion.
BTW: is it intended that VerticesInfo does a linear search for every vertex? That is slow like hell.
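Since the vertex IDs are comparable anyway, one possible alternative to scanning the whole list is to sort the IDs once and use binary search. This is only a sketch with a hypothetical class name, not how VerticesInfo is actually implemented; a hash map keyed by vertex ID would be another option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical index over comparable vertex IDs; not the actual VerticesInfo.
final class SortedVertexIndex<V extends Comparable<V>> {

    private final List<V> ids;

    // Copies and sorts the IDs once, so each later lookup is O(log n) instead of O(n).
    SortedVertexIndex(List<V> vertexIds) {
        this.ids = new ArrayList<>(vertexIds);
        Collections.sort(this.ids);
    }

    // Returns the position of the ID in sorted order, or -1 if it is unknown.
    int indexOf(V id) {
        int pos = Collections.binarySearch(ids, id);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        SortedVertexIndex<String> index =
                new SortedVertexIndex<>(Arrays.asList("c", "a", "b"));
        System.out.println(index.indexOf("b"));
        System.out.println(index.indexOf("z"));
    }
}
```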