To my mind, there are three families of things that can go wrong:
1) Parser can fail
1a) throw an exception
1b) hang forever
2) Fail to extract text and/or metadata from documents
2a) nothing is extracted
2b) some document components or attachments are not extracted (see TIKA-1317 and TIKA-1228)
3) Extract junk (mojibake, too many spaces in PDFs, failure to add a space between runs in .docx, etc.), in which case there are two options:
3a) We can do better.
3b) We can't...the document is just plain broken.
We can easily count and compare 1). By easily, I mean that I haven't fully worked it out, but it should be fairly straightforward.
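For 1), the counting could be as simple as a per-exception tally with a time budget to flag hangs. A minimal sketch, in Python for brevity (the real harness would be Java and call Tika; `parse_fn` here is a hypothetical stand-in for the parser call, not Tika's actual API):

```python
import collections
import concurrent.futures

def tally_parse_outcomes(paths, parse_fn, timeout_s=60):
    """Run parse_fn over each path, tallying successes, exception
    types (1a), and timeouts (1b)."""
    counts = collections.Counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for path in paths:
            future = pool.submit(parse_fn, path)
            try:
                future.result(timeout=timeout_s)
                counts["ok"] += 1
            except concurrent.futures.TimeoutError:
                # 1b: parse exceeded the time budget. Note: a real
                # harness would run each parse in a child process it
                # can kill; a thread that truly hangs forever would
                # block pool shutdown here.
                counts["hang"] += 1
            except Exception as e:
                # 1a: parser threw; bucket by exception type
                counts[type(e).__name__] += 1
    return counts
```

The output is just a table of outcome counts, which is enough to compare two parser versions run over the same corpus.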
Without a truth set or a comparison parser, we cannot easily measure 2a or 2b. For 2a, if there is no text, maybe there really is no text (image-only PDFs, or a .docx that contains only images). For 2b, we're really out of luck without other resources.
For 3), there's lots of room for work. In short, I think we'd want to calculate how "languagey" the extracted text is. Some indicators that occur to me:
a) Type/token ratio or token entropy
b) Average word length (with an exception for languages that don't use whitespace to delimit words)
c) Ratio of alphanumerics to total string length
d) Analysis of language id confidence scores. If the string is long enough, you'd expect a langid component to return a very high score for the best language and far lower scores for the second- and third-best languages. If the langid component returns flat scores, that might be an indicator that something didn't go well.
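To make a) through d) concrete, here is a rough sketch of those indicators, again in Python. The naive `\w+` tokenization is an assumption (it won't work for non-whitespace-delimited languages), and any thresholds on these numbers would need tuning per corpus:

```python
import math
import re
from collections import Counter

def junk_indicators(text):
    """Rough 'languagey-ness' indicators for extracted text.
    Returns None if no tokens were found."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return None
    counts = Counter(tokens)
    n = len(tokens)
    # a) type/token ratio and token entropy
    ttr = len(counts) / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # b) average word length
    avg_word_len = sum(len(t) for t in tokens) / n
    # c) share of alphanumeric characters in the raw string
    alnum_ratio = sum(ch.isalnum() for ch in text) / len(text)
    return {"ttr": ttr, "entropy": entropy,
            "avg_word_len": avg_word_len, "alnum_ratio": alnum_ratio}

def langid_margin(scores):
    """d) margin between the best and second-best language id
    scores; a small margin (flat scores) suggests the text may
    not be natural language."""
    top = sorted(scores, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]
```

Mojibake and run-together text should drive the type/token ratio and entropy up and the alphanumeric ratio down relative to clean text from the same corpus, which is what makes these usable as relative signals even without a truth set.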
What do you think? Are there other things that can go wrong? What else should we try to measure, in a supervised (not ideal), semi-supervised (better), or unsupervised (best) way?