Sorry to comment late here, but when indexing in Hadoop it is really nice to avoid any central dependency. It is also nice to focus the map-side join on items likely to match. Thirdly, reduce-side indexing is typically really important.
How these three considerations play out depends on the duplication rate. Reduce-side indexing gets rid of most of the problems with duplicate versions of a single document (with the same sort key), since the reducer can scan the values for a key to see whether it already has the final copy in hand before adding a document to the index.
There remain problems where we have to avoid indexing documents that already exist in the index, or have to generate a deletion list that can assist in applying the index update. The former problem is usually the more severe one, because it isn't unusual for data sources to simply ship a full dump of all documents and assume that the consumer will figure out which are new or updated. In that case you would like to index only new and modified documents.
My own preference here is to avoid the complication of the map-side join with Bloom filters and simply export a very simple list of stub documents corresponding to the documents already in the index. These stub documents should be much smaller than the average document (unless you are indexing tweets), which makes passing around great masses of stub documents not such a problem, since Hadoop shuffle, copy, and sort times are all dominated by Lucene indexing time anyway. Passing stub documents lets the reducer simply iterate through all documents with the same key, keeping the latest version, or the stub if one is encountered (a stub means the document is already indexed). For documents without a stub, normal indexing can be done, with the slight addition of exporting a list of stub documents for the new additions.
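To make that concrete, here is a minimal sketch of what such a reducer might look like (not our actual code). For illustration it assumes values arrive as Text of the form "timestamp<TAB>body", with a stub encoded as the literal string "STUB"; a real pipeline would use a proper Writable and write into a Lucene shard where the comment indicates.

    // Sketch of the stub-document reduce. Assumes (for illustration only) that
    // values are Text of the form "timestamp<TAB>body" and that a stub for an
    // already-indexed document is the literal string "STUB".
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StubAwareIndexReducer extends Reducer<Text, Text, Text, Text> {

      @Override
      protected void reduce(Text docId, Iterable<Text> versions, Context ctx)
          throws IOException, InterruptedException {
        boolean alreadyIndexed = false;
        long newestTs = Long.MIN_VALUE;
        String newestBody = null;

        // Every version of a document, plus its stub if it is already indexed,
        // arrives under the same key.
        for (Text v : versions) {
          String s = v.toString();
          if (s.equals("STUB")) {
            alreadyIndexed = true;          // the index already holds this document
            break;
          }
          int tab = s.indexOf('\t');
          long ts = Long.parseLong(s.substring(0, tab));
          if (ts > newestTs) {              // keep only the latest full version
            newestTs = ts;
            newestBody = s.substring(tab + 1);
          }
        }

        if (!alreadyIndexed && newestBody != null) {
          // Index newestBody into this reducer's Lucene shard here, then emit a
          // stub so the next incremental run knows to skip this document.
          ctx.write(docId, new Text("STUB"));
        }
      }
    }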
The same thing could be done with a map-side join, but the trade-off is that the mapper now needs considerably more memory to hold the entire Bloom filter bit vector in heap, as opposed to needing (somewhat) more time to pass the stub documents around. How that trade-off plays out in the real world isn't clear. My personal preference is to keep heap space small, since the time cost is pretty minimal for me.
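For comparison, here is a rough sketch of the map-side variant, again purely illustrative: the filter path and the plain "id<TAB>body" input format are made up, and a real job would ship the filter via the distributed cache rather than a hard-coded HDFS path.

    // Rough sketch of the map-side alternative: hold a Bloom filter of already
    // indexed document ids in each mapper's heap and drop probable duplicates
    // early. The filter path and input format are assumptions for illustration.
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;

    public class BloomFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
      private final BloomFilter indexed = new BloomFilter();

      @Override
      protected void setup(Context ctx) throws IOException {
        // Load a pre-built filter of indexed ids; the whole bit vector sits in heap.
        Path filterPath = new Path("/index/meta/indexed-ids.bloom");
        try (DataInputStream in =
                 FileSystem.get(ctx.getConfiguration()).open(filterPath)) {
          indexed.readFields(in);
        }
      }

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String s = line.toString();
        int tab = s.indexOf('\t');
        String docId = s.substring(0, tab);

        // Only pass through documents that are (probably) not yet indexed.
        if (!indexed.membershipTest(new Key(docId.getBytes(StandardCharsets.UTF_8)))) {
          ctx.write(new Text(docId), new Text(s.substring(tab + 1)));
        }
      }
    }

Note also that membershipTest can return false positives, so a genuinely new document can occasionally be skipped by this approach, whereas the stub comparison in the reducer is exact.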
This problem also turns up in our PDF conversion pipeline, where we keep checksums of each PDF that has already been converted to a viewable form. In that case, the ratio of real document size to stub size is even more lopsided.
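There the stub is essentially nothing more than a digest of the file, something like this sketch (the choice of SHA-256 is just for illustration, not necessarily what we use):

    // A checksum stub for a converted PDF: a few dozen hex characters stand in
    // for the whole file, so re-runs can cheaply skip files already converted.
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;

    public class PdfStub {
      public static String checksum(Path pdf) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(Files.readAllBytes(pdf));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }
    }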