Tika / TIKA-2597

Attachment Extraction Case Sensitivity

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17
    • Fix Version/s: None
    • Component/s: app
    • Labels: None
    • Environment: windows

    Description

      Using the --extract option on a PDF with embedded files, I am seeing that not all of the attachments are extracted.  Several of the embedded files share the same name.  Names that are exactly identical are handled by appending a suffix such as (1).  However, when two names differ only in case, the parser does not add a suffix, so the later file overwrites the earlier one on disk.  Example:
      FW Letter,.msg
      FW letter.msg

      This results in only one attachment being extracted.  Would it be possible to update the filename comparison to account for Windows file systems, which treat those two names as the same file?
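
      To illustrate the requested behaviour, here is a minimal sketch of a case-insensitive uniqueness check.  It is not Tika's actual extraction code: the class CaseInsensitiveNamer and the method uniqueName are hypothetical names, and placing the (n) suffix before the extension is an assumption about the naming scheme.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper, not part of Tika: hands out output file names
 * that are unique even on a case-insensitive file system.
 */
final class CaseInsensitiveNamer {
    // Lower-cased forms of every name handed out so far.
    private final Set<String> used = new HashSet<>();

    String uniqueName(String requested) {
        // Split off the extension so the suffix lands before it (assumed layout).
        int dot = requested.lastIndexOf('.');
        String base = dot > 0 ? requested.substring(0, dot) : requested;
        String ext = dot > 0 ? requested.substring(dot) : "";

        String candidate = requested;
        int n = 0;
        // Set.add returns false when the lower-cased name is already taken,
        // so keep trying (1), (2), ... until one is free.
        while (!used.add(candidate.toLowerCase(Locale.ROOT))) {
            n++;
            candidate = base + "(" + n + ")" + ext;
        }
        return candidate;
    }
}
```

      With this sketch, requesting "FW Letter.msg" and then "FW letter.msg" from the same namer returns the first name unchanged and a suffixed name for the second, so neither file would overwrite the other.  Comparing lower-cased names via Locale.ROOT avoids locale-specific case rules; it only approximates how a given file system folds case.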

      Thanks!

          People

            Assignee: Unassigned
            Reporter: Todd Dixon (toaddixon)
            Votes: 0
            Watchers: 2