Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.13
    • Component/s: mime
    • Labels:

      Description

      Updated the MIME magic for 6 MIME types:
      1. application/postscript: files begin with the pattern "%!PS-Adobe-3.0 EPSF-3.0".
      2. application/wordperfect: files begin with the pattern "ÿWPC" (byte 0xFF followed by "WPC").
      3. image/tiff: updated the pattern "MM.+" for the big-endian format (it occurs at the beginning of TIFF files).
      4. application/rdf+xml: updated the pattern "rdf" (from byte offset 5 to 400).
      5. application/atom+xml: updated the pattern "feed" (from byte offset 5 to 50).
      6. application/rss+xml: updated the pattern "rss" (from byte offset 5 to 50).

      https://github.com/NamithaGS/tika/commit/780100767e24505a24595ea6db43978d0700e220
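
The patterns listed in the description can be pictured as byte matches at a fixed offset or within an offset window. The following is only a minimal sketch of that idea, not Tika's implementation: the helper `magic_matches` and the sample byte strings are hypothetical, and reading "from byte offset 5 to 50" as a window of allowed start offsets is an assumption.

```python
from typing import Optional


def magic_matches(data: bytes, pattern: bytes,
                  start: int = 0, end: Optional[int] = None) -> bool:
    """Return True if `pattern` begins at some offset in [start, end] of `data`.

    With `end` omitted, the pattern must begin exactly at `start`.
    """
    if end is None:
        end = start
    for offset in range(start, end + 1):
        if data[offset:offset + len(pattern)] == pattern:
            return True
    return False


# Hypothetical sample inputs, made up for illustration.
eps_sample = b"%!PS-Adobe-3.0 EPSF-3.0\n%%BoundingBox: 0 0 10 10\n"
wpd_sample = b"\xffWPC" + b"\x00" * 12
rss_sample = b'<?xml version="1.0"?>\n<rss version="2.0"><channel/></rss>'
text_sample = b"Note: rss is handy"  # plain text, not a feed

print(magic_matches(eps_sample, b"%!PS-Adobe-3.0 EPSF-3.0"))  # match at offset 0
print(magic_matches(wpd_sample, b"\xffWPC"))                  # match at offset 0
print(magic_matches(rss_sample, b"rss", 5, 50))               # match inside the window
print(magic_matches(text_sample, b"rss", 5, 50))              # also matches
```

The last call shows why a short pattern in a wide window can be risky: the plain-text sample matches "rss" just as the real feed does.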

        Activity

        githubbot ASF GitHub Bot added a comment -

        GitHub user NamithaGS opened a pull request:

        https://github.com/apache/tika/pull/83

        Fix for TIKA-1881

        Updated Mime-Magic for 6 mime types:
        1. application/postscript : files begin with pattern "%!PS-Adobe-3.0 EPSF-3.0".
        2. application/wordperfect: files begin with pattern "ÿWPC" .
        3. image/tiff : updated pattern for "MM.+" for Big endian format.(occur at the beginning of files of tiff mime type)
        4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400)
        5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50)
        6. application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50)

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/NamithaGS/tika master

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tika/pull/83.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #83


        commit 780100767e24505a24595ea6db43978d0700e220
        Author: NamithaGS <gs.namitha@gmail.com>
        Date: 2016-03-01T07:21:28Z

        Update tika-mimetypes.xml

        Updated Mime-Magic for 6 mime types:
        1. application/postscript : files begin with pattern "%!PS-Adobe-3.0 EPSF-3.0".
        2. application/wordperfect: files begin with pattern "ÿWPC" .
        3. image/tiff : updated pattern for "MM.+" for Big endian format.(occur at the beginning of files of tiff mime type)
        4. application/rdf+xml : updated pattern "rdf" ( from byte offset 5 to 400)
        5. application/atom+xml : updated pattern "feed" ( from byte offset 5 to 50)
        6. application/rss+xml : updated pattern "rss" ( from byte offset 5 to 50)


        gagravarr Nick Burch added a comment -

        As mentioned on the Github pull request:

        For the Atom, RSS and RDF ones - is the magic required? Doesn't the XML detector get them already via the namespace? And without risk of mis-detecting text files which happen to mention feed or rss or rdf near the start?

        For the Postscript one - could you re-do this as text rather than hex, so it's easier to read?

        (Others look fine!)
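
To make the question above concrete, here is a rough sketch of detection by XML root element and namespace. It is a stand-in written with Python's `xml.etree.ElementTree`, not Tika's XML detector; the `ROOTS` table and the function `detect_by_root` are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# (namespace, local root name) -> media type. Illustrative table only.
ROOTS = {
    ("http://www.w3.org/2005/Atom", "feed"): "application/atom+xml",
    ("http://www.w3.org/1999/02/22-rdf-syntax-ns#", "RDF"): "application/rdf+xml",
    ("", "rss"): "application/rss+xml",
}


def detect_by_root(data: bytes) -> Optional[str]:
    """Return a media type based on the XML root element, or None."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return None  # not well-formed XML, so no feed type is claimed
    tag = root.tag
    if tag.startswith("{"):
        # ElementTree writes namespaced tags as "{namespace}local"
        namespace, local = tag[1:].split("}", 1)
    else:
        namespace, local = "", tag
    return ROOTS.get((namespace, local))


atom_doc = b'<?xml version="1.0"?><feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>'
rdf_doc = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
rss_doc = b'<rss version="2.0"><channel/></rss>'
plain_text = b"Note: feed the cat, then read the rss."

for sample in (atom_doc, rdf_doc, rss_doc, plain_text):
    print(detect_by_root(sample))
```

A text file that merely mentions feed or rss near the start is not well-formed XML, so this approach returns no feed type for it, whereas a bare substring pattern would match.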

        ganiga@usc.edu Namitha Sanjeeva Ganiga added a comment -

        For the Atom, RSS and RDF :
        This came from the FHT analysis of these files. We found some of these files classified as Octet-Stream, and in all 3 types the pattern occurred many times within the first 50 bytes or so. I based this purely on the analysis and could not find any supporting information on the web. As you mention, if your advice is to remove these patterns to be on the safer side, I will modify the pull request to remove them.

        For the Postscript one : I will redo this in the pull request itself.
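
The kind of check described in this comment can be approximated by counting how many sample files contain a pattern near the start. This is a simplified stand-in, not the FHT analysis itself; the function `header_hit_rate`, the default window of 50 bytes, and the sample data are assumptions made for illustration.

```python
from typing import Iterable


def header_hit_rate(samples: Iterable[bytes], pattern: bytes,
                    window: int = 50) -> float:
    """Fraction of samples whose first `window` bytes fully contain `pattern`."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for data in samples if pattern in data[:window])
    return hits / len(samples)


# Hypothetical file contents, made up for illustration.
rss_samples = [
    b'<rss version="2.0"><channel/></rss>',
    b'<?xml version="1.0"?><rss version="2.0"/>',
    b"\x00" * 60 + b"rss",  # pattern appears only after the window
]

print(header_hit_rate(rss_samples, b"rss"))
```

A high hit rate on known feeds says the pattern is common in that type. It does not say how often the same pattern shows up in other file types, which is the false-positive concern raised earlier in the thread.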

        chrismattmann Chris A. Mattmann added a comment -

        Nick Burch, if there is valid magic for XML files, why not use it? I'd say include this.

        chrismattmann Chris A. Mattmann added a comment -

        Which files would mention feed or rss or rdf near the start? I'd say that's a low enough probability that we should consider including the magic.

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tika/pull/83

        chrismattmann Chris A. Mattmann added a comment -

        Based on comments and the updates by Namitha I went ahead and committed this. Thanks Nick Burch and Namitha Sanjeeva Ganiga!

        hudson Hudson added a comment -

        UNSTABLE: Integrated in tika-trunk-jdk1.7 #957 (See https://builds.apache.org/job/tika-trunk-jdk1.7/957/)
        Record entry for TIKA-1881. (mattmann: rev e4dc21ce27ba861370c37f5a5408ef90b895a622)

        • CHANGES.txt

          People

          • Assignee:
            chrismattmann Chris A. Mattmann
            Reporter:
            ganiga@usc.edu Namitha Sanjeeva Ganiga
          • Votes:
            0
            Watchers:
            5