OK, so I've thought about this a little, and the implementation that Frank posted here (essentially the same as what I had on my GitHub branch) is probably a bad idea, for exactly the reasons Lance mentioned here.
So instead, we modify VectorDumper and VectorHelper to add a couple of static methods and options:
public static String vectorToJson(Vector vector, String dictionary, int maxEntries, boolean sort)
where the "sort" option orders the Vector entries by value (descending absolute magnitude), and maxEntries caps the number of vector entries that are emitted. If dictionary is supplied and not null, the vector indexes are replaced with their respective term entries in the dictionary.
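To make the intended behaviour concrete, here is a minimal standalone sketch of what such a method could do. It is not the actual patch: the class name VectorJsonSketch is hypothetical, a plain Map<Integer, Double> stands in for Mahout's Vector so the example runs without Mahout on the classpath, and the dictionary is assumed to be a String array indexed by vector index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; a Map<Integer, Double> stands in for Mahout's Vector.
class VectorJsonSketch {

    static String vectorToJson(Map<Integer, Double> vector, String[] dictionary,
                               int maxEntries, boolean sort) {
        List<Map.Entry<Integer, Double>> entries = new ArrayList<>(vector.entrySet());
        if (sort) {
            // Order by absolute magnitude, largest first.
            entries.sort(Comparator.comparingDouble(
                (Map.Entry<Integer, Double> e) -> Math.abs(e.getValue())).reversed());
        }
        // Keep at most maxEntries entries.
        int limit = Math.min(maxEntries, entries.size());
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < limit; i++) {
            Map.Entry<Integer, Double> e = entries.get(i);
            int index = e.getKey();
            // Use the dictionary term when one is available, else the raw index.
            boolean hasTerm = dictionary != null && index >= 0
                && index < dictionary.length && dictionary[index] != null;
            String key = hasTerm ? dictionary[index] : String.valueOf(index);
            if (i > 0) {
                json.append(',');
            }
            json.append('"').append(escape(key)).append("\":").append(e.getValue());
        }
        return json.append('}').toString();
    }

    // Escape backslashes and double quotes so the key stays a valid JSON string.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = new LinkedHashMap<>();
        v.put(0, 0.1);
        v.put(1, -2.5);
        v.put(2, 1.0);
        String[] dict = {"apple", "banana", "cherry"};
        // prints {"banana":-2.5,"cherry":1.0}
        System.out.println(vectorToJson(v, dict, 2, true));
    }
}
```

With sort off and a null dictionary, the same call just emits the first maxEntries entries in iteration order, keyed by index.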
This way, VectorDumper is modified with the following options:
Option sortVectorsOpt = obuilder.withLongName("sortVectors").withShortName("sort").withRequired(false).withDescription(
"Sort output key/value pairs of the vector entries in abs magnitude descending order").create();
Option numIndexesPerVectorOpt = obuilder.withLongName("vectorSize").withShortName("vs").withRequired(false)
.withDescription("Truncate vectors to <vs> length when dumping (most useful when in"
+ " conjunction with -sort)").create();
Then, if you have clusters represented as vector centroids (or distributions over terms/features, or anything else that is a collection of Vectors linked to a dictionary of String labels for the vector indexes), you don't really need a "ClusterDumper", as
s "path/to/vectors/part*" --dictionary "path/to/dictionary.file-0" -dt sequencefile -sort --vectorSize 100 -o local_vectors.json
writes each vector in "path/to/vectors/part-*" to local_vectors.json, one per line, in JSON format. The keys are the terms with the highest weight for the vector, the values are the corresponding vector values, and only the top 100 entries (by value) per vector are emitted.
I've found this modification to VectorDumper invaluable in inspecting LDA topic models, but doing it without modifying the Vector interface is even better.