Unification of the resulting models is probably much easier than the unification of the model building process itself.
Some of the problems I have seen include:
a) all of our clustering and classification models should be able to accept vectors and produce either a "most-likely" category or a vector of scores for all possible categories. Unfortunately, there is no uniform way to load a model from a file, no uniform object structure for these models, and no consistent way to call them. (A sketch of one possible interface follows this list.)
b) most of our learning algorithms would be happy with vectors, but there is a pretty fundamental difference between good ways to call Hadoop-based and sequential training algorithms. The sequential stuff is traditional Java, so the interface is very easy. The parallel stuff is considerably harder to make into a really good interface. We may learn some tricks with Plume or we may be able to use the Distributed Row Matrix, but it isn't an obvious answer. (The second sketch below shows the contrast.)
c) in some cases, the vectors are noticeably larger than the original data. This happens when the original data is very sparse and we are looking at lots of interaction variables, so each compact record expands into many features. Again, for sequential algorithms this is pretty easy to deal with, but for parallel ones it really might be better to store the original data and pass in a function that handles the vectorization on the fly. (The third sketch below shows one possible shape for that.)
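
To make (a) concrete, here is a rough sketch of the kind of uniform interface I have in mind. All of the names here are made up; nothing like this exists in trunk:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.mahout.math.Vector;

    // Every clustering/classification model would implement this.
    interface VectorClassifier {
      // number of categories this model can assign
      int numCategories();

      // most-likely category for one instance
      int classify(Vector instance);

      // scores for all categories; element i is the score for category i
      Vector classifyFull(Vector instance);
    }

    // ... and one uniform way to get any such model back off disk.
    interface ModelReader {
      VectorClassifier read(InputStream in) throws IOException;
    }

With something like that in place, callers could stop caring which algorithm produced the model.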
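For (b), the contrast between the two training styles looks roughly like this. Again, these are invented names, and VectorClassifier is the hypothetical interface from the first sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    import org.apache.mahout.math.Vector;

    // Sequential training is ordinary Java: feed it one example at a time.
    interface SequentialTrainer {
      void train(int actualCategory, Vector instance);
      VectorClassifier model();
    }

    // The Hadoop-based version can't be handed examples one at a time; the
    // natural unit of work is a whole pass over data already sitting in HDFS.
    interface ParallelTrainer {
      VectorClassifier train(Configuration conf, Path input) throws Exception;
    }

The asymmetry is exactly the problem: the first is call-by-example, the second is call-by-dataset, and it isn't obvious how to paper over that.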
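And for (c), the pass-a-function idea might look like this, with the same caveat that everything here is hypothetical:

    import java.io.Serializable;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    import org.apache.mahout.math.Vector;

    // Turns one raw record into a (possibly very large, sparse) vector.
    // It would have to ship to the mappers somehow, hence Serializable;
    // in practice maybe via the job configuration or the distributed cache.
    interface VectorEncoder<T> extends Serializable {
      Vector encode(T rawRecord);
    }

    // The trainer reads the compact raw records from HDFS and vectorizes
    // on the fly in each task instead of storing the expanded vectors.
    interface EncodingParallelTrainer<T> {
      VectorClassifier train(Configuration conf, Path rawInput,
                             VectorEncoder<T> encoder) throws Exception;
    }

That way we only ever persist the small representation and pay the expansion cost transiently, inside each task.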