  1. Tika
  2. TIKA-3253

improve "attachments" tika-eval report directory


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.25
    • Fix Version/s: 2.0.0, 1.26
    • Component/s: tika-eval
    • Labels:
      None
    • Environment:

      W10

      Description

      While doing regression testing for PDFBox I found

      container_files_missing_in_B_by_mime.xlsx

      which has

      MIME_STRING      CNT
      application/pdf  4

      I have no idea which files this refers to, and the other reports don't say either. I was able to find out by accessing the H2 database and running this query:

      select pa.file_name
      from profiles_a pa
      left join profiles_b pb on pa.id=pb.id
      where pb.id is null and pa.is_embedded=false
      

      and got
      GHOSTSCRIPT-690526-0.pdf
      GHOSTSCRIPT-692591-0.pdf
      GHOSTSCRIPT-692591-2.pdf
      PDFBOX-4319-0.zip-0.pdf
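
The query above is a left anti-join: it keeps the container (non-embedded) rows of profiles_a that have no row with the same id in profiles_b. The following is a minimal, self-contained sketch of the same pattern, not the tika-eval implementation. It uses Python's built-in sqlite3 module as a stand-in for the H2 database, a toy two-table schema that only assumes the three columns used in the query, and invented file names. Because SQLite stores booleans as integers, the filter is written as `is_embedded = 0` instead of `is_embedded=false`.

```python
import sqlite3

# Assumption: only the columns referenced by the query in this issue are
# modelled here; the real tika-eval tables have more columns.
SCHEMA = """
CREATE TABLE profiles_a (id INTEGER PRIMARY KEY, file_name TEXT, is_embedded INTEGER);
CREATE TABLE profiles_b (id INTEGER PRIMARY KEY, file_name TEXT, is_embedded INTEGER);
"""

# Same anti-join as in the issue; ORDER BY is added only to make the
# output deterministic.
MISSING_IN_B = """
SELECT pa.file_name
FROM profiles_a pa
LEFT JOIN profiles_b pb ON pa.id = pb.id
WHERE pb.id IS NULL AND pa.is_embedded = 0
ORDER BY pa.file_name
"""


def build_toy_db():
    """Create an in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO profiles_a VALUES (?, ?, ?)",
        [
            (1, "present.pdf", 0),           # container file, also in B
            (2, "missing-1.pdf", 0),         # container file, absent from B
            (3, "missing-2.pdf", 0),         # container file, absent from B
            (4, "embedded-missing.png", 1),  # embedded file, filtered out
        ],
    )
    conn.execute("INSERT INTO profiles_b VALUES (1, 'present.pdf', 0)")
    conn.commit()
    return conn


def container_files_missing_in_b(conn):
    """Return the names of container files that appear in A but not in B."""
    return [row[0] for row in conn.execute(MISSING_IN_B)]


if __name__ == "__main__":
    for name in container_files_missing_in_b(build_toy_db()):
        print(name)
```

Swapping the roles of the two tables in the query would give the corresponding list of container files that are in B but missing from A.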

      So my suggestion is to add two files to the report directory that list the names of these files.

      I have attached one of the "bad" PDF files. The B extract is empty; Tika runs forever on this file. I'll investigate that separately. (Update: see PDFBOX-5049. This will probably be solved by TIKA-3246.)


            People

            • Assignee:
              Unassigned
            • Reporter:
              Tilman Hausherr (tilman)
