Well, yes, I was thinking of batch classification (like constructing
confusion matrices, or running the training data back through the model to test it).
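For concreteness, the confusion matrix I have in mind is just a table of counts of (actual, predicted) label pairs from re-classifying labeled data. A toy sketch (the labels and counts here are made up for illustration):

```python
from collections import Counter

# Hypothetical labels from re-classifying a labeled set;
# real input would be the training data run back through the model.
actual    = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham",  "ham", "ham", "spam"]

# Count each (actual, predicted) pair -- that's the whole matrix.
confusion = Counter(zip(actual, predicted))

labels = sorted(set(actual) | set(predicted))
print("actual\\pred", *labels)
for a in labels:
    print(a.ljust(11), *(confusion[(a, p)] for p in labels))
```

The diagonal cells are the correct classifications; everything off-diagonal tells you which classes the model is confusing.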
But the problem I'm running into with the code is that the model
is too large to load in a single process, let alone in multiple mappers.
So classifying fast doesn't help if simply loading the model is very slow
(and I mean very slow), and it doesn't necessarily succeed anyway – it runs out of memory.
I also admit that "batch classification" – in the sense that there is overlap
among the feature sets of different documents –
makes things more interesting and perhaps saves some work, but you can't count on that overlap anyway.
Yes, you might want something fast for single-document classification,
but map-reduce isn't the right tool for that. Indexed structures are better.
The choices are either some indexed structure (like HBase) that can
handle large datasets / models, or just using map-reduce to join
the model to the data. The latter is definitely not useless –
use cases similarly divide into people who have a lot of data / docs to classify,
versus people who are building some kind of online system:
throughput versus round-trip time.
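By "join the model to the data" I mean something like a reduce-side join keyed on feature, so no single task ever has to hold the whole model. A minimal single-process sketch, assuming the model is flattened into (feature, class, weight) records, documents into (doc_id, feature, count) records, and scoring is a simple linear sum – none of which comes from an actual implementation:

```python
from collections import defaultdict

def map_phase(model_records, doc_records):
    """Key both inputs by feature, so a reducer sees the model weights
    and the document occurrences for one feature together."""
    for feature, cls, weight in model_records:
        yield feature, ("model", cls, weight)
    for doc_id, feature, count in doc_records:
        yield feature, ("doc", doc_id, count)

def reduce_phase(grouped):
    """For each feature, emit a partial score for every (doc, class) pair."""
    for feature, values in grouped.items():
        weights = [(c, w) for tag, c, w in values if tag == "model"]
        docs = [(d, n) for tag, d, n in values if tag == "doc"]
        for doc_id, count in docs:
            for cls, weight in weights:
                yield (doc_id, cls), weight * count

def classify(model_records, doc_records):
    # Simulate the shuffle: group mapped values by key.
    grouped = defaultdict(list)
    for key, value in map_phase(model_records, doc_records):
        grouped[key].append(value)
    # Sum partial scores, then pick the best class per document.
    scores = defaultdict(float)
    for (doc_id, cls), partial in reduce_phase(grouped):
        scores[(doc_id, cls)] += partial
    best = {}
    for (doc_id, cls), score in scores.items():
        if doc_id not in best or score > best[doc_id][1]:
            best[doc_id] = (cls, score)
    return best

# Made-up model and documents, just to exercise the join.
model = [("cheap", "spam", 2.0), ("cheap", "ham", 0.1),
         ("meeting", "ham", 1.5), ("meeting", "spam", 0.2)]
docs = [("d1", "cheap", 3), ("d2", "meeting", 2)]
best = classify(model, docs)
print(best)
```

The point of the shape is that each reducer only ever touches the model weights for its own features, which is exactly the property you want when the model doesn't fit in one process.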
Also, note that with an indexed solution, you might have contention for the indexed data –
if there's only one copy (which should probably be the case, for large models).
So I'd suggest implementing both, and considering the cases where the models are very large
(which is where map-reduce shines anyway). I might be the only person commenting
who has tried a lot of data (an 800MB input document file), and as I said it would
be nice to have some results (confusion matrices)
to see whether the method works for me and my particular data.
If nobody else agrees, I might have to try it myself, but I'm new at this
and sometimes get pulled away for other work.