Tika / TIKA-3466

Cannot detect mimetype of xhtml file when script is first node instead of html

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.27
    • Fix Version/s: None
    • Component/s: detector, mime
    • Labels: None

    Description

      The MIME type of the XHTML file below is detected as 'application/xml' instead of 'application/xhtml+xml'. The root element is 'script' in the XHTML namespace rather than 'html':

      <?xml version="1.0" encoding="UTF-8" ?>
      <script xmlns="http://www.w3.org/1999/xhtml"><![CDATA[
        alert(555);
        ]]></script>
      

       

      One possible solution is to add a root-XML entry for 'script' to the 'application/xhtml+xml' definition in tika-mimetypes.xml, for example:

      <mime-type type="application/xhtml+xml">
        <!-- The magic priority for xhtml+xml needs to be lower than that of -->
        <!--  files that contain HTML within them, e.g. mime emails -->
        <magic priority="40">
          <match value="&lt;html xmlns=" type="string" offset="0:8192"/>
        </magic>
        <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="html"/>
        <root-XML namespaceURI="http://www.w3.org/1999/xhtml" localName="script"/>
        <glob pattern="*.xhtml"/>
        <glob pattern="*.xht"/>
      </mime-type>
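
      To illustrate why the extra root-XML entry would change the result, here is a minimal sketch of root-element matching using only the JDK's StAX parser. It is not Tika's actual detector code: the class name RootXmlSketch, the lookup table, and the 'application/xml' fallback are assumptions made for illustration. The sketch reads the first start element of the document and looks up its (namespace URI, local name) pair in a table that stands in for the root-XML entries above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, NOT Tika's implementation: maps the root element's
// qualified name to a media type, falling back to generic XML.
final class RootXmlSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // The sample document from this issue.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
            + "<script xmlns=\"" + XHTML_NS + "\"><![CDATA[\n"
            + "  alert(555);\n"
            + "  ]]></script>\n";

    // Stand-in for the root-XML entries of the application/xhtml+xml type.
    // includeScript=false models the current definition, true the proposed one.
    static Map<QName, String> rootXmlTable(boolean includeScript) {
        Map<QName, String> table = new HashMap<>();
        table.put(new QName(XHTML_NS, "html"), "application/xhtml+xml");
        if (includeScript) {
            table.put(new QName(XHTML_NS, "script"), "application/xhtml+xml");
        }
        return table;
    }

    // Returns the media type registered for the document's root element,
    // or "application/xml" when the root element is not in the table.
    static String detect(String xml, Map<QName, String> table) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // QName equality compares namespace URI and local name only.
                        return table.getOrDefault(reader.getName(), "application/xml");
                    }
                }
                return "application/xml";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(detect(SAMPLE, rootXmlTable(false))); // application/xml
        System.out.println(detect(SAMPLE, rootXmlTable(true)));  // application/xhtml+xml
    }
}
```

      In this sketch the sample resolves to 'application/xhtml+xml' only once the 'script' entry is present. The magic rule in the definition above matches the string '<html xmlns=', which the sample does not contain, so the magic rule does not help with this file either.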
      

       

      People

        Assignee: Unassigned
        Reporter: Packiaraj Sakkanan (psakkanan)
        Votes: 0
        Watchers: 4