While doing regression testing for PDFBox, I found the report
container_files_missing_in_B_by_mime.xlsx
which contains:
MIME_STRING CNT
application/pdf 4
The report does not say which four files these are, and none of the other reports do either. I was able to find out by accessing the H2 database and running this query:
select pa.file_name from profiles_a pa left join profiles_b pb on pa.id=pb.id where pb.id is null and pa.is_embedded=false
and got
GHOSTSCRIPT-690526-0.pdf
GHOSTSCRIPT-692591-0.pdf
GHOSTSCRIPT-692591-2.pdf
PDFBOX-4319-0.zip-0.pdf
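
For anyone who wants to see how that query picks out the missing files, here is a minimal sketch of the same left join. It is only an illustration: an in-memory SQLite database stands in for the H2 database, the table and column names are taken from the query above, and the rows are made up. Since is_embedded is stored as an integer here, the filter compares against 0 instead of false.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the tika-eval H2 database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles_a (id INTEGER, file_name TEXT, is_embedded INTEGER);
    CREATE TABLE profiles_b (id INTEGER, file_name TEXT, is_embedded INTEGER);
""")

# Made-up rows: ids 1 and 2 are container files, id 3 is an embedded document.
conn.executemany(
    "INSERT INTO profiles_a VALUES (?, ?, ?)",
    [(1, "ok.pdf", 0), (2, "missing.pdf", 0), (3, "embedded.pdf", 1)],
)
# Run B only produced a profile for id 1.
conn.executemany(
    "INSERT INTO profiles_b VALUES (?, ?, ?)",
    [(1, "ok.pdf", 0)],
)

# Same shape as the query above: keep rows of A that have no partner in B,
# and ignore embedded documents so only container files are listed.
query = """
    SELECT pa.file_name
    FROM profiles_a pa
    LEFT JOIN profiles_b pb ON pa.id = pb.id
    WHERE pb.id IS NULL AND pa.is_embedded = 0
    ORDER BY pa.file_name
"""
missing_in_b = [row[0] for row in conn.execute(query)]
print(missing_in_b)
```

The embedded document is also absent from B in this toy data, but the is_embedded filter keeps it out of the result, which matches what a "container files" report should count.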
So my suggestion is to add two files to the report directory that list the names of the affected files.
I have attached one of the "bad" PDF files. The B extract is empty; Tika runs forever on it. I'll investigate that separately. (Update: PDFBOX-5049. Will probably be solved by TIKA-3246.)