TIKA-1771: Lower XHTML magic priority to ensure emails are detected as message/rfc822

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11
    • Component/s: detector
    • Labels:
      None

      Description

      Some emails I have (happy to share if you want) contain XHTML as one part of a multipart message. Prior to this pull request, the priority of the application/xhtml+xml magic detector was 50, equal to the priority of the message/rfc822 detector. Because of the relative position of the two detectors in tika-mimetypes.xml, the emails were incorrectly detected as XHTML documents.

      This PR downgrades the priority of application/xhtml+xml to 40, so the more-sensitive email magic detectors take precedence and the emails are properly detected as message/rfc822.
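
To illustrate the effect of the priority change, here is a minimal, self-contained sketch. It is not Tika's code: the `MagicRule` class, the `detect` function, and the two toy matchers are hypothetical, and the rule that a tie goes to whichever entry is listed first is an assumption made only to mirror the behaviour described above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MagicRule:
    """Hypothetical stand-in for one magic entry: a type, a priority, a matcher."""
    mime_type: str
    priority: int
    matches: Callable[[bytes], bool]


def detect(data: bytes, rules: List[MagicRule]) -> Optional[str]:
    """Return the type of the highest-priority matching rule.

    Assumption for this sketch: on equal priority, the rule listed first wins
    (the strict '>' below never replaces an earlier match of equal priority).
    """
    best = None
    for rule in rules:
        if rule.matches(data) and (best is None or rule.priority > best.priority):
            best = rule
    return best.mime_type if best else None


def looks_like_xhtml(data: bytes) -> bool:
    # Toy matcher: an XHTML namespace declaration anywhere in the data.
    return b'xmlns="http://www.w3.org/1999/xhtml"' in data


def looks_like_email(data: bytes) -> bool:
    # Toy matcher: a typical mail header at the very start of the data.
    return data.startswith(b"Received:")


# A multipart email whose body contains an XHTML part, so both matchers fire.
SAMPLE = (
    b"Received: from mail.example.org\n"
    b"Content-Type: multipart/alternative; boundary=x\n\n"
    b"--x\nContent-Type: application/xhtml+xml\n\n"
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>hi</body></html>\n'
    b"--x--\n"
)


def build_rules(xhtml_priority: int) -> List[MagicRule]:
    # XHTML is listed before the email rule, as in the scenario described above.
    return [
        MagicRule("application/xhtml+xml", xhtml_priority, looks_like_xhtml),
        MagicRule("message/rfc822", 50, looks_like_email),
    ]


before = detect(SAMPLE, build_rules(50))  # both at 50: the first-listed rule wins
after = detect(SAMPLE, build_rules(40))   # XHTML lowered to 40: the email rule wins
print(before, after)
```

With both priorities at 50 the sample resolves to the XHTML type; lowering the XHTML priority to 40 lets the email rule win regardless of ordering.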

      I have not run this through the govdocs tester or anything other than my own documents, so, full disclosure, this could cause false-negative XHTML detections elsewhere.

      I should note this occurs on trunk, from GitHub, up-to-date as of Tuesday-ish.

            People

            • Assignee: Chris A. Mattmann (chrismattmann)
            • Reporter: Jeremy B. Merrill (jeremybmerrill)