Tika / TIKA-3150

MimeType Regex End of Binary File Fails

      Description

      Summary

      Regular expressions for matching mime types in custom Tika config files fail when they try to match exactly up to the end of a binary file using the regex $ anchor or {n} repetition counts.

      Steps to reproduce

      Suppose, for example, we have a binary file that begins with 3 bytes followed by four 0x00 bytes, and this whole 7-byte pattern repeats 5 times. The following should work for that situation:

      <mime-type type='application/MY_CUSTOM_FORMAT'>
          <acronym>custom</acronym>
          <magic priority='50'>
              <match value="(?s)^(.{3}(\\x00){4}){5}$" type="regex" offset="24"/>
          </magic>
      </mime-type>
      

      The $ anchor causes this regex to fail. Additionally, changing the repetition count from exactly 5 to 6 does not cause the regex to fail, even though the pattern would then have to match past the end of the file. Is this because the regex is wrapping around the end of the file back to the beginning?
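
      For comparison, here is a minimal sketch of what the two patterns do in plain java.util.regex, outside of Tika. It assumes the repeated group is written as `(.{3}(\x00){4})`, that the file is exactly 35 bytes long, and that each byte is mapped to one char via ISO-8859-1. The class and method names are hypothetical, the `offset` attribute is not modeled, and how Tika itself hands the bytes to the regex is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class EndAnchorSketch {

    // Builds the sample file described above: 3 bytes followed by
    // four 0x00 bytes, repeated `groups` times.
    static byte[] sample(int groups) {
        byte[] data = new byte[groups * 7];
        for (int g = 0; g < groups; g++) {
            data[g * 7] = 'A';
            data[g * 7 + 1] = 'B';
            data[g * 7 + 2] = 'C';
            // the remaining 4 bytes of each group stay 0x00
        }
        return data;
    }

    // Maps each byte to one char (ISO-8859-1) and searches for the pattern.
    static boolean found(String regex, byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        byte[] data = sample(5);
        // Exactly 5 groups, anchored at both ends: prints true
        System.out.println(found("(?s)^(.{3}(\\x00){4}){5}$", data));
        // 6 groups would need 42 chars but only 35 exist: prints false
        System.out.println(found("(?s)^(.{3}(\\x00){4}){6}$", data));
    }
}
```

      In this standalone setting the 5-repetition pattern matches with `$` and the 6-repetition pattern does not, which is the behaviour the report expects from the Tika config as well.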

            People

            • Assignee: Unassigned
            • Reporter: David Margolis (david.margolis)