  1. Tika
  2. TIKA-1955 MIME types updates and additions for Scientific Data based on TREC-DD-Polar
  3. TIKA-1885

Tika MIME updates for *.cdf and *.xar and custom zero length file detector based on TREC-DD-Polar

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.13
    • Component/s: core, detector, mime
    • Labels:
    • Environment: Windows OS X64, Java
    • Flags: Patch

      Description

      Updated tika-mimetypes.xml and detector to identify new file types in TREC DD Polar dataset.

          Activity

          adeshgup Adesh Gupta added a comment -

          Added a custom detector and an updated tika-mimetypes.xml file.

          gagravarr Nick Burch added a comment -

          Did you mean to close this? Is there a matching pull request or patch that needs to be applied to implement the changes? And what file types are you working with?

          adeshgup Adesh Gupta added a comment -

          The updates are for identifying .cdf, .xar, WordPerfect and application/zerosize files. A pull request for this will be submitted today.

          chrismattmann Chris A. Mattmann added a comment -

          re-opening per comments

          chrismattmann Chris A. Mattmann added a comment -

          please provide the pull request Adesh Gupta

          gagravarr Nick Burch added a comment -

          Any luck with the pull request?

          adeshgup Adesh Gupta added a comment -

          Pull request has been created. Sorry for the delay.

          chrismattmann Chris A. Mattmann added a comment -

          Adesh Gupta, you need to update your PR so that it includes only your specific changes. Please rebase and review your PR.

          chrismattmann Chris A. Mattmann added a comment -

          ping Adesh Gupta

          githubbot ASF GitHub Bot added a comment -

          Github user adeshgupta closed the pull request at:

          https://github.com/apache/tika/pull/89

          githubbot ASF GitHub Bot added a comment -

          GitHub user adeshgupta opened a pull request:

          https://github.com/apache/tika/pull/115

          fix for TIKA-1885

          Hello @chrismattmann , I guess this is what you were looking for. Thanks and do update on the same.
          @aishward @radhikachandwadkar

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/adeshgupta/tika fixTIKA-1885

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/115.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #115


          commit 96ec428ccf20edf4eecf995c48129367ffc8bbe1
          Author: adeshgupta <adeshgup@usc.edu>
          Date: 2016-05-03T04:08:54Z

          fix for TIKA-1885


          davemeikle Dave Meikle added a comment -

          Hi Adesh Gupta - Just reviewing the pull request. Do you have any tests with this change? If not, I can drop some in and, assuming they pass, include this in 1.13.

          davemeikle Dave Meikle added a comment - - edited

          Have incorporated this code so as not to block TIKA-1955.

          Ended up making the following changes:

          • Renamed the detector to ZeroSizeFileDetector
          • Moved it from tika-parsers into tika-core under the org.apache.tika.detect package
          • Added a test class in tika-core
          • Changed mime type to application/x-zerosize
          • Added in the ASF header to all files.

          Adesh Gupta - there were no tika-mimetypes.xml updates in the PR. Has this been done elsewhere?

          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #979 (See https://builds.apache.org/job/tika-trunk-jdk1.7/979/)
          TIKA-1885: Addition of ZeroSizeFileDetector based on Pull Request from (dmeikle: rev d447193f29531df3022f5137b8f0ec1c73e58cc8)

          • tika-core/src/main/java/org/apache/tika/mime/MediaType.java
          • tika-core/src/main/java/org/apache/tika/mime/MediaTypeRegistry.java
          • tika-core/src/test/java/org/apache/tika/detect/ZeroSizeFileDetectorTest.java
          • tika-core/src/main/java/org/apache/tika/detect/ZeroSizeFileDetector.java
          Added CHANGE information for TIKA-1885 and TIKA-1965 (dmeikle: rev 5f0e9303929b34b018e6857c32bed87c80f0c9d2)
          • CHANGES.txt
          davemeikle Dave Meikle added a comment -

          OK - checking your GitHub account there appears to be nothing newer there, so I am going to roll the 1.13 release.

          If I am wrong, just shout and we can incorporate in a second RC.

          hudson Hudson added a comment -

          FAILURE: Integrated in tika-trunk-jdk1.7 #980 (See https://builds.apache.org/job/tika-trunk-jdk1.7/980/)
          TIKA-1885: Updated test to specify charset in getBytes() (dmeikle: rev eede044eab86b5a380ddd8a585e6dda563dc42d3)

          • tika-core/src/test/java/org/apache/tika/detect/ZeroSizeFileDetectorTest.java
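
The commit message above mentions specifying a charset in getBytes(). As a minimal sketch of what that kind of change usually looks like (the class and method names here are hypothetical and not taken from the Tika test), passing an explicit charset makes the resulting bytes independent of the platform default encoding:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the actual ZeroSizeFileDetectorTest code.
final class ExplicitCharsetSketch {

    /** Encodes text as UTF-8 regardless of the JVM's default charset. */
    static byte[] utf8Bytes(String text) {
        // text.getBytes() with no argument would use the platform default charset,
        // which can differ between a Windows desktop and a Linux build server.
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("Testing").length);
        System.out.println(utf8Bytes("\u00e9").length);
    }
}
```

With UTF-8 fixed, a test that feeds these bytes to a detector sees the same input on every machine.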
          davemeikle Dave Meikle added a comment -

          Code committed in d447193f29531df3022f5137b8f0ec1c73e58cc8

          nicholasc Nick C added a comment -

          I was looking at the code for ZeroSizeFileDetector and noticed the use of InputStream.available(). I don't think that method is a very reliable way to see if a stream is empty. The Javadoc says "Returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream", and the default implementation returns 0. Also, wouldn't "application/x-empty" be better than "application/x-zerosize"? It's what the Linux file command returns.
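
The concern about available() can be reproduced with the JDK alone. The sketch below is illustrative and is not the Tika detector: PlainStream and looksEmptyByAvailable are hypothetical names. It shows a stream that still holds data yet reports 0 from available(), because it inherits the default implementation in java.io.InputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not the actual ZeroSizeFileDetector code.
final class AvailableSketch {

    /** Serves a fixed number of bytes but keeps InputStream's default available(). */
    static final class PlainStream extends InputStream {
        private int remaining;

        PlainStream(int size) {
            this.remaining = size;
        }

        @Override
        public int read() {
            if (remaining <= 0) {
                return -1; // end of stream
            }
            remaining--;
            return 'x';
        }
    }

    /** The check under discussion: treats "nothing available right now" as "empty". */
    static boolean looksEmptyByAvailable(InputStream in) throws IOException {
        return in.available() == 0;
    }

    public static void main(String[] args) throws IOException {
        // ByteArrayInputStream overrides available(), so the check happens to work here.
        System.out.println(looksEmptyByAvailable(new ByteArrayInputStream(new byte[] {1, 2, 3})));
        // PlainStream holds three bytes but inherits the default available(), which returns 0.
        System.out.println(looksEmptyByAvailable(new PlainStream(3)));
    }
}
```

The second call reports the stream as empty even though three bytes can still be read, which is the false positive described in the comment.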

          davemeikle Dave Meikle added a comment -

          Good point re Stream. Checking for -1 from read() will be more accurate.

          Re mimetype, not sure if this matters to you and the team from USC Chris A. Mattmann?

          chrismattmann Chris A. Mattmann added a comment -

          Happy to use the Linux file command one

          gagravarr Nick Burch added a comment -

          I think we need to check with a read. Blocking won't matter, as any of the other detectors would block too. x-empty feels better to me for the mimetype, and if that's what others already use all the better!

          nicholasc Nick C added a comment -

          Just saw the changes and noticed a bug: you need to add a mark(1) and reset() call, like the other detectors, so it doesn't actually consume the byte. (Can't wait for 1.13.)
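
The mark/read/reset pattern suggested above can be sketched with the JDK alone. This is an illustration under stated assumptions, not the committed Tika code: EmptyStreamSketch and isEmpty are hypothetical names, and the sketch assumes the caller passes a stream that supports mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not the actual ZeroSizeFileDetector code.
final class EmptyStreamSketch {

    /**
     * Returns true if the stream has no bytes left, without consuming any.
     * Assumes the stream supports mark/reset; rejects streams that do not.
     */
    static boolean isEmpty(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(1); // remember the position; at most one byte is read before reset
        try {
            return in.read() == -1; // -1 signals end of stream, so nothing to read
        } finally {
            in.reset(); // rewind so the probed byte is not consumed
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        InputStream data = new ByteArrayInputStream(new byte[] {42});
        System.out.println(isEmpty(empty));
        System.out.println(isEmpty(data));
        System.out.println(data.read()); // the first byte is still there after the probe
    }
}
```

Because the probe rewinds the stream, any detector that runs afterwards still sees the first byte. Without the reset() call, a non-empty stream would reach later detectors one byte short.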

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #985 (See https://builds.apache.org/job/tika-trunk-jdk1.7/985/)
          TIKA-1885: Updated MimeType to application/x-empty to match Unix file (dmeikle: rev 114b6044a74fe463b90712345ea8e3b2cb085b62)

          • tika-core/src/main/java/org/apache/tika/mime/MediaType.java
          • tika-core/src/main/java/org/apache/tika/detect/ZeroSizeFileDetector.java
          • tika-core/src/main/java/org/apache/tika/mime/MediaTypeRegistry.java
          • tika-core/src/test/java/org/apache/tika/detect/ZeroSizeFileDetectorTest.java
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/115


            People

            • Assignee: chrismattmann Chris A. Mattmann
            • Reporter: adeshgup Adesh Gupta
            • Votes: 0
            • Watchers: 8
