Apparently the Dirichlet method returns a negative score if tf / docLen < corpusTf / corpusLen. Unfortunately, the negative score can be arbitrarily large in magnitude, so the fix is not as easy as adding a constant to the score. A negative score of course makes sense if all documents are scored: the function is monotone in tf, and consequently documents whose tf is 0 will always be ranked lower than those that contain the word. But this is not how IR engines work.
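To make the sign condition concrete, here is a minimal standalone sketch. It assumes the usual Dirichlet-smoothed form log(1 + tf / (mu * p)) + log(mu / (docLen + mu)) with p = corpusTf / corpusLen; the class name, the method signature and mu = 2000 are my own choices for illustration, not the actual Similarity code.

```java
// Standalone illustration; class and method names are hypothetical.
final class DirichletSketch {
    /** Assumed smoothing parameter. */
    static final double MU = 2000.0;

    /**
     * Dirichlet-smoothed score of one term in one document (assumed form):
     * log(1 + tf / (mu * p)) + log(mu / (docLen + mu)), with p = corpusTf / corpusLen.
     */
    static double score(double tf, double docLen, double corpusTf, double corpusLen) {
        double p = corpusTf / corpusLen; // collection probability of the term
        return Math.log(1.0 + tf / (MU * p)) + Math.log(MU / (docLen + MU));
    }

    public static void main(String[] args) {
        double corpusTf = 1_000, corpusLen = 1_000_000; // p = 0.001

        // tf / docLen = 0.002 > p: the score is positive.
        System.out.println(score(2, 1_000, corpusTf, corpusLen));

        // tf / docLen = 0.0005 < p: the score is negative.
        System.out.println(score(1, 2_000, corpusTf, corpusLen));

        // tf = 0: the score keeps falling as docLen grows, so no fixed offset covers it.
        System.out.println(score(0, 1_000_000, corpusTf, corpusLen));
        System.out.println(score(0, 1_000_000_000, corpusTf, corpusLen));
    }
}
```

Under this form the two logarithms combine into log((mu + tf / p) / (mu + docLen)), which is negative exactly when tf / p < docLen, i.e. when tf / docLen < p.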
Having said that, I believe we could simulate such a system. I don't know exactly how the query architecture works, but I presume that clauses that don't match a document are assigned a score of zero. Instead of this zero, the Scorer (or whatever class does this) could ask the Similarity for a default value. In this case, LMDirichletSimilarity could return score(stats, 0, Integer.MAX_VALUE), which is somewhere around -12.
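A rough sketch of what I mean, outside of any real query classes. Everything here is hypothetical: the defaultScore hook, the way clause scores are summed, and mu = 2000 are assumptions made only to show the effect on ranking.

```java
// Hypothetical sketch; none of these names exist in the real query code.
final class DefaultScoreSketch {
    static final double MU = 2000.0; // assumed smoothing parameter

    /** Assumed Dirichlet form; p is the collection probability of the term. */
    static double dirichlet(double tf, double docLen, double p) {
        return Math.log(1.0 + tf / (MU * p)) + Math.log(MU / (docLen + MU));
    }

    /** Hypothetical hook: what a non-matching clause contributes instead of 0. */
    static double defaultScore(double p) {
        return dirichlet(0, Integer.MAX_VALUE, p);
    }

    /** Sums per-clause scores for one document; tfs[i] == 0 means clause i did not match. */
    static double total(double[] tfs, double docLen, double[] ps, boolean useDefault) {
        double sum = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            if (tfs[i] > 0) {
                sum += dirichlet(tfs[i], docLen, ps[i]);
            } else if (useDefault) {
                sum += defaultScore(ps[i]);
            } // otherwise the unmatched clause contributes 0
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] ps = {0.001, 0.001};
        double[] matchesBoth = {1, 1}; // both clauses match, each with a negative score
        double[] matchesOne = {1, 0};  // only the first clause matches

        // With 0 for the unmatched clause, the one-term document outranks the two-term one.
        System.out.println(total(matchesBoth, 2_000, ps, false) < total(matchesOne, 2_000, ps, false));

        // With the default value, the two-term document ranks higher again.
        System.out.println(total(matchesBoth, 2_000, ps, true) > total(matchesOne, 2_000, ps, true));
    }
}
```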
If we don't do this, we have three options:
1. add the absolute value of score(stats, 0, Integer.MAX_VALUE) to the score
2. if (score < 0) return 0
3. add corpusTf / corpusLen * docLen to tf
All three ensure a non-negative score, but each has its own disadvantage.
1. adds a fairly large constant to the score, which may not play well with the other parts of the query
2. gives some documents that contain the term the same score of 0 as documents that don't (though I cannot say this is not in line with the LM approach)
3. introduces a transformation that is difficult to characterize
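For comparison, the three options can be sketched side by side. This is only an illustration under the same assumed Dirichlet form; the method names and mu = 2000 are hypothetical, and option 1 is taken to mean shifting by the magnitude of score(stats, 0, Integer.MAX_VALUE).

```java
// Hypothetical sketch of the three workarounds; not the actual Similarity code.
final class NonNegativeOptionsSketch {
    static final double MU = 2000.0; // assumed smoothing parameter

    /** Assumed Dirichlet form; p = corpusTf / corpusLen. */
    static double dirichlet(double tf, double docLen, double p) {
        return Math.log(1.0 + tf / (MU * p)) + Math.log(MU / (docLen + MU));
    }

    /** Option 1: shift every score by the magnitude of the lowest score considered reachable. */
    static double option1Shift(double tf, double docLen, double p) {
        double floor = dirichlet(0, Integer.MAX_VALUE, p); // negative
        return dirichlet(tf, docLen, p) - floor;
    }

    /** Option 2: clamp negative scores to 0. */
    static double option2Clamp(double tf, double docLen, double p) {
        return Math.max(0.0, dirichlet(tf, docLen, p));
    }

    /** Option 3: add p * docLen to tf, so that tf / docLen never falls below p. */
    static double option3PseudoCount(double tf, double docLen, double p) {
        return dirichlet(tf + p * docLen, docLen, p);
    }

    public static void main(String[] args) {
        double p = 0.001;
        // A matching document whose raw score is negative (tf / docLen = 0.0005 < p).
        System.out.println(dirichlet(1, 2_000, p));
        System.out.println(option1Shift(1, 2_000, p));
        System.out.println(option2Clamp(1, 2_000, p));
        System.out.println(option3PseudoCount(1, 2_000, p));
    }
}
```

With option 2, the matching document above and a document with tf = 0 both end up at 0, which is the tie described in the list. With option 3, a document with tf = 0 scores 0 under this form and any match scores above that, at the cost of changing the shape of the function.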
For the time being, I'll go with 2, but we have to discuss this.