Tika / TIKA-1946

Add mime detection and parser for WordPerfect

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: mime, parser
    • Labels:
      None

      Description

      I noticed some code on GitHub for parsing WordPerfect files (https://github.com/Norconex/importer). It also looks like the author, Pascal Essiembre, has contributed to Tika before.

      1. TIKA-1946-pascal.essiembre-01.patch
        7 kB
        Pascal Essiembre
      2. wordperfect_mimes_fuller.zip
        6 kB
        Tim Allison
      3. wordperfect_signatures_by_versions.xlsx
        9 kB
        Pascal Essiembre

        Issue Links

          Activity

          pascal.essiembre Pascal Essiembre added a comment -

          This is code imported from one of our existing projects (Norconex Importer), so I thought it would be treated the same as the IBM files for CharsetDetector. That being said, I am perfectly fine with complying with the general practice (the NOTICE file), and you can remove all copyright statements from the file headers.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #188 (See https://builds.apache.org/job/tika-2.x/188/)
          TIKA-1946 updates, detection of wordperfect 5.0 and 5.1 as well as (tallison: rev d8fa3c2a821c99c3580cbed978fdb6d02f003d57)

          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/QuattroProParser.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWordPerfect_5_1.wp
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/WordPerfectTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6TextExtractor.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/QuattroProTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WordPerfectParser.java
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java
          • (edit) tika-app/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/QPWTextExtractor.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWordPerfect_5_0.wp
          tallison@mitre.org Tim Allison added a comment -

          Ah, ok. Pascal Essiembre, do you mind if we remove the copyright statements or move them into the NOTICE file [1]?

          [1] "Consider moving copyright attributions from source documents to the NOTICE. Read Apache policy on headers."

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #89 (See https://builds.apache.org/job/tika-2.x-windows/89/)
          TIKA-1946 updates, detection of wordperfect 5.0 and 5.1 as well as (tallison: rev d8fa3c2a821c99c3580cbed978fdb6d02f003d57)

          • (add) tika-test-resources/src/test/resources/test-documents/testWordPerfect_5_1.wp
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WordPerfectParser.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/QuattroProTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/QPWTextExtractor.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6TextExtractor.java
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (add) tika-test-resources/src/test/resources/test-documents/testWordPerfect_5_0.wp
          • (edit) tika-app/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/QuattroProParser.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/WordPerfectTest.java
          tallison@mitre.org Tim Allison added a comment -

          Great. Please open another issue for 5.x.

          I don't know if we have any other WP besides 5.1 and 5.0.

          I checked for a DROID-detected mime (like '%949%') for 4.x (see PRONOM), and I couldn't find anything...

          4.x takes us back to 1984-1988, so I'm not surprised we don't have any.

          PRONOM lists 6.0 as the last version of WordPerfect...so I think we're good.

          tallison@mitre.org Tim Allison added a comment -

          More work to do, but I think we're done now for the initial added capability. Thank you, Pascal Essiembre. Thank you Rackspace for the vm which supported this work! Thank you govdocs1 and Common Crawl for the data!

          tallison@mitre.org Tim Allison added a comment -

          As mentioned, I went with more specific identification of file version via mime types...the overall mime is still application/vnd.wordperfect. This ensures that only files of the right version are directed to the new parsers. They will throw exceptions if they are somehow given the wrong version.

          This behavior seems to be in line with how we're handling file types for which we don't have parsers, which, in effect, is what we have here.

          I'd like to leave UnsupportedFormatException for those cases where we can't determine the mime before sending it to the general parser that should generally handle it.

          I'm happy to change this behavior if desired. Thank you, all!
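
A minimal sketch of the guard described in this comment: the parser accepts only the 6.x subtype and throws when it is handed anything else. The class names `Wp6Guard` and `UnsupportedWpVersionException` are hypothetical stand-ins so the example compiles without Tika on the classpath; this is not the code that was committed.

```java
import java.util.Locale;

/** Hypothetical guard, not Tika's actual parser code. */
final class Wp6Guard {
    // The only subtype this sketch treats as parseable.
    static final String SUPPORTED = "application/vnd.wordperfect;version=6.x";

    /** Strips whitespace and lowercases, so "; version=6.x" and ";version=6.x" compare equal. */
    static String normalize(String mime) {
        return mime == null ? "" : mime.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    static boolean isSupported(String detectedMime) {
        return SUPPORTED.equals(normalize(detectedMime));
    }

    /** Throws if the detected type is not the 6.x subtype. */
    static void check(String detectedMime) throws UnsupportedWpVersionException {
        if (!isSupported(detectedMime)) {
            throw new UnsupportedWpVersionException("Not a WordPerfect 6.x file: " + detectedMime);
        }
    }

    public static void main(String[] args) throws Exception {
        check("application/vnd.wordperfect; version=6.x"); // accepted, no exception
        try {
            check("application/vnd.wordperfect; version=5.1");
        } catch (UnsupportedWpVersionException e) {
            System.out.println(e.getMessage());
        }
    }
}

/** Stand-in exception type (assumption) used in place of a Tika exception class. */
class UnsupportedWpVersionException extends Exception {
    UnsupportedWpVersionException(String message) {
        super(message);
    }
}
```

Normalizing the type string before comparing is a choice made for this sketch, because the thread writes the parameter both with and without a space after the semicolon.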

          tallison@mitre.org Tim Allison added a comment -

          Much better. I put the updated extracts in http://vmaddress/share as well as the updated db.

          select mime_string, count(1) cnt
          from profiles p
          join mimes m on m.mime_type_id=p.mime_type_id
          group by mime_string
          order by cnt desc;
          
          MIME                                        CNT
          application/vnd.wordperfect; version=5.1    299
          application/vnd.wordperfect; version=6.x    178
          application/x-quattro-pro; version=9         54
          application/x-quattro-pro; version=7-8       17
          application/vnd.wordperfect; version=5.0      9
          application/x-tika-msoffice                   2

          14 parse exceptions.

          A few files look to have dodgy content.

          Try

          select * from contents c
          join profiles p on c.id = p.id
          order by token_length_mean;
          

          At least four appear to have missing charsets (note this only pulls back those where "todo" was in the top 10 terms...we'll need to do a full grep on the extracts).

          select * from contents c
          join profiles p on c.id = p.id
          where top_n_words like '%todo%';
          
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1168 (See https://builds.apache.org/job/Tika-trunk/1168/)
          TIKA-1946 – updates, add detection for wp 5.0 and 5.1, and quattropro (tallison: rev 84a37209e1c6e27491c9e828799919bc32f8787e)

          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/QPWTextExtractor.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/wordperfect/QuattroProTest.java
          • (add) tika-parsers/src/test/resources/test-documents/testWordPerfect_5_0.wp
          • (edit) tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WordPerfectParser.java
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/QuattroProParser.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP6TextExtractor.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/wordperfect/WordPerfectTest.java
          • (add) tika-parsers/src/test/resources/test-documents/testWordPerfect_5_1.wp
          tallison@mitre.org Tim Allison added a comment -

          Thank you, Nick! I found that we can specify version info in the magic and in the supported types. The soon-to-be-committed version correctly detects 5.0, 5.1 and 6.x, and sends 6.x to the WordPerfectParser and 5.0 and 5.1 to the EmptyParser. The WordPerfectParser has a check for 6.x and throws an UnsupportedFormatException if it somehow winds up with the wrong file type.

            <mime-type type="application/vnd.wordperfect">
              <acronym>WPD</acronym>
              <_comment>WordPerfect - Corel Word Processing</_comment>
              <tika:link>http://en.wikipedia.org/wiki/WordPerfect</tika:link>
              <tika:uti>com.corel.wordperfect.doc</tika:uti>
              <magic priority="50">
                <match value="application/vnd.wordperfect;" type="string" offset="0"/>
              </magic>
              <magic priority="40">
                <match value="0xFF575043" type="big32" offset="0"/> <!-- ÿWPC -->
              </magic>
              <!-- We have magic coverage for these, so we shouldn't need them
              <glob pattern="*.wpd"/>
              <glob pattern="*.wp"/>
              <glob pattern="*.wp5"/>
              <glob pattern="*.wp6"/>
              <glob pattern="*.w60"/>
              <glob pattern="*.wp61"/>
              <glob pattern="*.wpt"/>
              -->
            </mime-type>
            <mime-type type="application/vnd.wordperfect;version=5.0">
              <sub-class-of type="application/vnd.wordperfect"/>
              <magic priority="50">
                <match value="0xFF575043" type="big32" offset="0"> <!-- ÿWPC -->
                  <match value="0x0000" type="big16" offset="10"/>
                </match>
              </magic>
            </mime-type>
            <mime-type type="application/vnd.wordperfect;version=5.1">
              <sub-class-of type="application/vnd.wordperfect"/>
              <magic priority="50">
                <match value="0xFF575043" type="big32" offset="0"> <!-- ÿWPC -->
                  <match value="0x0001" type="big16" offset="10"/>
                </match>
              </magic>
            </mime-type>
            <mime-type type="application/vnd.wordperfect;version=6.x">
              <!--TODO: figure out how to distinguish 6.x versions -->
              <sub-class-of type="application/vnd.wordperfect"/>
              <magic priority="50">
                <match value="0xFF575043" type="big32" offset="0"> <!-- ÿWPC -->
                  <match value="0x0201" type="big16" offset="10"/>
                </match>
              </magic>
            </mime-type>
          

          Does this work?

          tallison@mitre.org Tim Allison added a comment - - edited

          Thank you! This is very helpful. Unless we have documentation on FBFF05003200, I wonder if we should just rely on major and minor version. From your spreadsheet, we have:

          Version  Major  Minor
          5.0      00     00
          5.1      00     01
          6.x      02     01

          It looks like PRONOM largely agrees. Can't believe I didn't see the "signature" button yesterday...duh. They require 00 for encrypted and 00 for indexAreaPointer:

          5.0 FF575043{5}0A0000{2}0000 PRONOM

          5.1 FF575043{5}0A0001{2}0000 PRONOM

          5.2 no PRONOM signature

          6.1 FF575043{4}010A0201 PRONOM
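
As a rough illustration of relying on the major and minor version bytes, the following sketch checks the FF 57 50 43 magic and then reads the two bytes at offsets 10 and 11, which is where the nested `big16` matches in the mime definitions look. `WpVersionSniffer` is a hypothetical helper, not part of Tika, and it only knows the three byte pairs listed above.

```java
/** Hypothetical helper, not part of Tika: maps WordPerfect header bytes to a version label. */
final class WpVersionSniffer {
    // 0xFF 'W' 'P' 'C'
    private static final int[] MAGIC = {0xFF, 0x57, 0x50, 0x43};

    /** Returns "5.0", "5.1", "6.x", "unknown", or null when the magic bytes are absent. */
    static String detectVersion(byte[] header) {
        if (header == null || header.length < 12) {
            return null;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if ((header[i] & 0xFF) != MAGIC[i]) {
                return null;
            }
        }
        int major = header[10] & 0xFF; // first byte of the 16-bit value at offset 10
        int minor = header[11] & 0xFF; // second byte of that value
        if (major == 0x00 && minor == 0x00) return "5.0";
        if (major == 0x00 && minor == 0x01) return "5.1";
        if (major == 0x02 && minor == 0x01) return "6.x";
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] header = new byte[16];
        header[0] = (byte) 0xFF;
        header[1] = 0x57;
        header[2] = 0x50;
        header[3] = 0x43;
        header[9] = 0x0A;  // file type byte, as in the PRONOM patterns
        header[10] = 0x02; // major
        header[11] = 0x01; // minor
        System.out.println(detectVersion(header));
    }
}
```

Unlike the PRONOM patterns, this sketch does not require the encryption or index pointer bytes to be zero.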

          pascal.essiembre Pascal Essiembre added a comment -

          FYI, I found relevant information about the 5.x file format. With the sample files you provided, I'll look at adapting the existing WordPerfect parser to support 5.x files as well.

          In your corpus, do you know if you have WP files with other versions as well?

          pascal.essiembre Pascal Essiembre added a comment -

          I also like the idea of an UnsupportedFormatException or equivalent instead of the more generic TikaException.

          pascal.essiembre Pascal Essiembre added a comment -

          In case you are curious, I am attaching a spreadsheet with my signature pattern findings.

          pascal.essiembre Pascal Essiembre added a comment -

          I created a patch that now throws a TikaException whenever any of several checks is not met.

          Also, thanks to the batch of WP files you shared and your latest spreadsheet with versions identified, I was able to compare the first 32 bytes of a number of files of different versions. I found that they all start with FF 57 50 43, but the rest varies, except for version 5.x files, which all have the same signature at bytes 17 to 22: FB FF 05 00 32 00. So I also added code to check for this signature and throw an exception if it is encountered, stating that versions older than 6.0 are not yet supported. That should be tested against your corpus to validate that my findings are correct.
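
The signature check could look roughly like the sketch below. It is not the code from the patch: `Wp5SignatureCheck` is a hypothetical name, and the offset is passed in because the comment does not say whether "17 to 22" counts from 0 or from 1. The example call assumes counting from 1, which puts the first signature byte at zero-based offset 16.

```java
/** Hypothetical helper, not the patch's code: looks for the 5.x byte pattern at a given offset. */
final class Wp5SignatureCheck {
    // FB FF 05 00 32 00, the pattern reported for 5.x files
    private static final int[] SIGNATURE = {0xFB, 0xFF, 0x05, 0x00, 0x32, 0x00};

    /** Returns true when all six signature bytes are present starting at zeroBasedOffset. */
    static boolean hasWp5Signature(byte[] header, int zeroBasedOffset) {
        if (header == null || zeroBasedOffset < 0
                || header.length < zeroBasedOffset + SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if ((header[zeroBasedOffset + i] & 0xFF) != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] header = new byte[32];
        int offset = 16; // assumption: "byte 17" counted from 1
        byte[] sig = {(byte) 0xFB, (byte) 0xFF, 0x05, 0x00, 0x32, 0x00};
        System.arraycopy(sig, 0, header, offset, sig.length);
        System.out.println(hasWp5Signature(header, offset));
    }
}
```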

          Feel free to accept/reject or modify as you wish.

          gagravarr Nick Burch added a comment -

          Ideally different file formats would have different mimetypes, but when there's a well-known mimetype shared by several file types which are logically similar but format-wise different, there's not much we can do...

          In the past, we have added our own tika-custom parent types, to group together a handful of similar well-known ones, but normally we have to go for type/format suffixes on well-known ones to clarify/differentiate when formats actually differ

          Maybe we should add something along the lines of EncryptedDocumentException for parsers to throw when they can't handle a specific file. I think for now we might have a few places where they just throw a straight Tika exception. Perhaps UnsupportedFormatException? That would also make it easier to catch and re-try with another parser, especially if we get the proper fallback / multiple parser stuff going on 2.x

          For now, adding formats/types to the mimetypes and have the parser claim just those seems a good first step though

          gagravarr Nick Burch added a comment -

          I believe it's only normal to have non-ASF headers for code that we're importing from elsewhere (that's under a suitable license of course!). Generally for code we write, and code we receive as direct contributions, the headers should be the normal ASF ones. (eg Incubator code grants have the headers updated, see http://incubator.apache.org/guides/mentor.html#initial-clean-up)

          tallison@mitre.org Tim Allison added a comment -

          DROID/Pronom is able to tell the difference between 5.0, 5.1 and 6.x, but I can't seem to find mime definitions quickly on their site. So, y, I agree, we should throw an exception if < 6.x. I'm hesitant to have different mimes for different versions, but they are clearly very different formats.

          pascal.essiembre Pascal Essiembre added a comment - - edited

          WordPerfect extensions vary quite a bit, but the parser I wrote is based on WP Version 6. I suspect it supports higher versions as well, but definitely not lower. According to this document, http://www.corel.com/content/pdf/wpx4/corel-wordperfect-office-X4-reviewers-guide.pdf (page 27), .wp extensions can be used for both WP5.x and WP6.x, so we can't rely on the extension as an indicator of anything. Since the major version should definitely be 2, I agree we should throw an exception when it is anything else. I could not find enough evidence in my earlier research of older version signatures, other than 0xD0CF11E0A1B11AE1 for some, which conflicts with MS Word (maybe that is why Word can open those). I wonder if we should then remove these mime types from tika-mimetypes.xml, or give them separate entries?

              <alias type="application/wordperfect"/>
              <alias type="application/wordperfect5.1"/>
          

          That's assuming application/wordperfect is for older versions as well.
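
To make the signature conflict concrete, here is a small sketch that tells the two leading byte patterns apart. `ContainerMagic` is a hypothetical class, not Tika's detector; it only looks at the first bytes and says nothing about what is stored inside a file that starts with 0xD0CF11E0A1B11AE1.

```java
import java.nio.ByteBuffer;

/** Hypothetical classifier for the two leading signatures mentioned in this thread. */
final class ContainerMagic {
    enum Kind { WPC, OLE2, OTHER }

    // Signature shared with MS Word files, per the comment above.
    private static final long OLE2_MAGIC = 0xD0CF11E0A1B11AE1L;

    static Kind classify(byte[] head) {
        if (head == null) {
            return Kind.OTHER;
        }
        // FF 57 50 43 is the WordPerfect magic.
        if (head.length >= 4 && (head[0] & 0xFF) == 0xFF
                && head[1] == 0x57 && head[2] == 0x50 && head[3] == 0x43) {
            return Kind.WPC;
        }
        // ByteBuffer reads big-endian by default, matching the byte order of the literal.
        if (head.length >= 8 && ByteBuffer.wrap(head, 0, 8).getLong() == OLE2_MAGIC) {
            return Kind.OLE2;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        byte[] ole2 = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                       (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
        System.out.println(classify(ole2));
    }
}
```

A file classified as `OLE2` here would still need a look inside the container to decide whether it holds WordPerfect content, which this sketch does not attempt.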

          tallison@mitre.org Tim Allison added a comment -

          Not sure if the format is compatible with the MSOffice parser. May be worth a try?

          No luck.

          Y. Let's throw an Exception for major version 0, or is there an earlier indicator that these aren't going to be parseable?

          tallison@mitre.org Tim Allison added a comment -

          Gah. No, you're right. Major version 2 vs. major version 0.

          tallison@mitre.org Tim Allison added a comment - - edited

          If we look at 978849.wp and 506544.wp, both are apparently the same version as our testWordPerfect.wpd: filetype=10, productype=1, minorversion=1

          This is wrong; the major version for the ones that work is 2, as you point out above. Sorry. For at least these two test files, the major version is '0'.

          If we revert readWP() to read(), the parse finishes without exception, but the content is corrupt – çä.

          If the length that is stored in the header is meant to be close or even to equal the actual file length (testWordPerfect.wpd has a stored file size of 2395, but an actual file size of 2044), then something may already be going wrong in the header. File 506544.wp stores 17825842 as its length, but the file is actually only 3117. File 978849.wp has a stored length of 50, but an actual length of 1389.
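
One way to turn that observation into a check is sketched below. It is only a heuristic under stated assumptions: the class name and the tolerance factor are hypothetical, and nothing in this thread confirms that the stored length is meant to match the file size.

```java
/** Sketch of a plausibility check on the length stored in the header. */
final class WpHeaderLengthCheck {

    /**
     * Returns true when the stored and actual lengths differ by no more
     * than maxFactor (a hypothetical tolerance, e.g. 2.0).
     */
    static boolean isPlausible(long storedLength, long actualLength, double maxFactor) {
        if (storedLength <= 0 || actualLength <= 0) {
            return false;
        }
        double ratio = (double) Math.max(storedLength, actualLength)
                / Math.min(storedLength, actualLength);
        return ratio <= maxFactor;
    }
}
```

With a factor of 2.0, the testWordPerfect.wpd pair (2395 stored, 2044 actual) passes, while the pairs for 506544.wp and 978849.wp do not.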

          pascal.essiembre Pascal Essiembre added a comment -

          I also checked. Looks like a version issue. Files that pass are .wpd and when I debug, I see the "major version" number obtained by parsing the file is 2. Those that are .wp files have a major version of 0 which is wrong. .wpd files are more recent. .wp and wp[7654] are for older WordPerfect formats. Shall we throw a TikaException when an unsupported format is encountered? So far those that are invalid (older ones) do appear to open up fine in Word as you mention. Not sure if the format is compatible with the MSOffice parser. May be worth a try?

          tallison@mitre.org Tim Allison added a comment -

          Fix eof

          tallison@mitre.org Tim Allison added a comment -

          234 of the EOFs are triggered by

                          // Variable-Length Multi-Byte Functions
                          int subgroup = in.readWP();
                          int functionSize = in.readWPShort();
                          for (int i = 0; i < functionSize - 4; i++) {
                              in.readWP();
                          }
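
A guarded variant of that loop might look like the sketch below. It is an illustration only, not the fix that was committed: the class and method names are hypothetical, it works on a plain InputStream rather than WPInputStream, and it assumes the 16-bit size is little-endian.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Sketch of skipping a variable-length function with explicit bounds checks. */
final class WpFunctionSkipper {

    /** Reads one unsigned byte, failing at end of stream. */
    static int readByte(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) {
            throw new EOFException("Unexpected end of stream");
        }
        return b;
    }

    /** Reads an unsigned 16-bit value, assumed to be little-endian. */
    static int readShortLE(InputStream in) throws IOException {
        int low = readByte(in);
        int high = readByte(in);
        return (high << 8) | low;
    }

    /**
     * Skips the body of a function whose group byte was already consumed.
     * Returns the number of body bytes skipped.
     */
    static int skipVariableLengthFunction(InputStream in) throws IOException {
        int subgroup = readByte(in);
        int functionSize = readShortLE(in);
        int remaining = functionSize - 4;
        if (remaining < 0) {
            throw new IOException("Implausible function size " + functionSize
                    + " (subgroup " + subgroup + ")");
        }
        for (int i = 0; i < remaining; i++) {
            if (in.read() < 0) {
                throw new EOFException("Function declared " + functionSize
                        + " bytes but stream ended after " + i + " body bytes");
            }
        }
        return remaining;
    }
}
```

The message on the EOF path records the declared size, which would help tell a truncated file from a size field that was misread because the version is unsupported.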
          
          tallison@mitre.org Tim Allison added a comment -

          Y, for at least the three smallest govdocs1 files, Word opens them without complaining: 506544.wp, 978853.wp, 978849.wp.

          tallison@mitre.org Tim Allison added a comment -

          My initial thought is wrong. CommonCrawl docs are often truncated at 1MB. However, the docs with the EOF exceptions include quite a few from govdocs1, which should be whole files, and many files from CommonCrawl are not near the 1MB threshold.

          select count(1) from profiles;
          select count(1) from parse_exceptions;
          

          yields 535 files with 265 exceptions

          select file_path, c.length 
          from parse_exceptions e
          join profiles p on e.id=p.id
          join containers c on c.container_id=p.container_id
          where orig_stack_trace not like '%QPWTextExtractor.java:179%'
          order by c.length desc
          

          yields the 248 files with the EOF.

          I'm going to take a look at some now.
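
For reference, the proportions implied by those counts can be worked out as in the sketch below. It assumes each row in profiles corresponds to one file and that the 265 - 248 = 17 remaining exceptions are the ones excluded by the QPWTextExtractor filter; the class name is hypothetical.

```java
import java.util.Locale;

/** Sketch: proportions implied by the counts reported above. */
final class ParseStats {

    static double percent(int part, int total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        int files = 535;       // rows in profiles
        int exceptions = 265;  // rows in parse_exceptions
        int eof = 248;         // exceptions not matching QPWTextExtractor.java:179
        int other = exceptions - eof;

        System.out.printf(Locale.ROOT, "parsed without exception: %.1f%%%n",
                percent(files - exceptions, files));
        System.out.printf(Locale.ROOT, "EOF exceptions: %.1f%%%n", percent(eof, files));
        System.out.printf(Locale.ROOT, "other exceptions: %.1f%%%n", percent(other, files));
    }
}
```

So roughly half of the files parse without an exception, and nearly all of the failures are EOFs.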

          pascal.essiembre Pascal Essiembre added a comment -

          So what would be the percentage that are parsed properly? Am I reading the results right that the majority failed due to EOF exceptions? Is there a way to confirm these are indeed all truncated files and whether at least some content was extracted for them? Given how many EOF exceptions there are, I wonder if in some cases it may be the error reported when encountering an unsupported file version?

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #187 (See https://builds.apache.org/job/tika-2.x/187/)
          TIKA-1946 – initial commit to add parsers for WordPerfect and (tallison: rev 4383e3da783acae7b05cf3b6a452a16303fb7483)

          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6Constants.java
          • (edit) CHANGES.txt
          • (add) tika-core/src/main/java/org/apache/tika/metadata/WordPerfect.java
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/QPWTextExtractor.java
          • (edit) tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/QuattroProTest.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WordPerfectParser.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6FileHeader.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/QuattroProParser.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6TextExtractor.java
          • (add) tika-core/src/main/java/org/apache/tika/metadata/QuattroPro.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/WPInputStreamTest.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/WordPerfectTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • (add) tika-test-resources/src/test/resources/test-documents/testWordPerfect.wpd
          tallison@mitre.org Tim Allison added a comment -

          And for exceptions:

          select sort_stack_trace, count(1) as cnt from parse_exceptions
          group by sort_stack_trace
          order by cnt desc;
          

          Exceptions look to be caused by truncated files (very common in our common crawl data).

          These 17 exceptions are from non-supported QuattroPro:

          select * from parse_exceptions e
          join profiles p on p.id = e.id
          join containers c on p.container_id=c.container_id
          where orig_stack_trace like '%QPWTextExtractor.java:179%'
          
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #88 (See https://builds.apache.org/job/tika-2.x-windows/88/)
          TIKA-1946 – initial commit to add parsers for WordPerfect and (tallison: rev 4383e3da783acae7b05cf3b6a452a16303fb7483)

          • (edit) tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWordPerfect.wpd
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/QPWTextExtractor.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6TextExtractor.java
          • (add) tika-core/src/main/java/org/apache/tika/metadata/QuattroPro.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/QuattroProParser.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6Constants.java
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • (add) tika-core/src/main/java/org/apache/tika/metadata/WordPerfect.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/WPInputStreamTest.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WordPerfectParser.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/QuattroProTest.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/wordperfect/WordPerfectTest.java
          • (edit) CHANGES.txt
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6FileHeader.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
          pascal.essiembre Pascal Essiembre added a comment -

          H2 works for me. I downloaded the files you shared and will give it a shot when I have a chance. Thanks for sharing.

          tallison@mitre.org Tim Allison added a comment - - edited

          We use it to run comparisons before releases for POI, PDFBox and Tika. I also used it to compare 'file', Droid and Tika on mime detection.

          We also use it when developing new parsers, of course. Should have checked for relevant files earlier...Sorry.

          If you are familiar with h2, this query is informative on the db that I shared:

          select * from contents c
          join profiles p on c.id=p.id
          join containers cont on p.container_id=cont.container_id
          order by lang_id_prob_1
          

          If you aren't familiar with it and want to give it a shot, download the jar and kick off the web interface:
          java -cp h2.jar org.h2.tools.Console -web

          For the db, enter the full path to the file and remove .mv.db from the name, e.g.:

          jdbc:h2:/C:/data/working/eval_work/wordperfect
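
The same naming rule can be applied programmatically when connecting from code instead of the web console. The sketch below only builds the URL; the class and method names are hypothetical, and opening a connection with it (for example through java.sql.DriverManager.getConnection) would additionally need the H2 jar on the classpath.

```java
/** Sketch: derive an H2 JDBC URL from the path of an .mv.db file. */
final class H2Url {

    private static final String SUFFIX = ".mv.db";

    /** Strips the .mv.db suffix, if present, and prepends the jdbc:h2: scheme. */
    static String toJdbcUrl(String dbFilePath) {
        String base = dbFilePath.endsWith(SUFFIX)
                ? dbFilePath.substring(0, dbFilePath.length() - SUFFIX.length())
                : dbFilePath;
        return "jdbc:h2:" + base;
    }
}
```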

          pascal.essiembre Pascal Essiembre added a comment -

          I am not sure when I may have time to benefit from it, but it is good to know the offer is there. What is the VM used for exactly? Mainly as a way to access the files only or also as a sandbox to run tests on the corpus?

          tallison@mitre.org Tim Allison added a comment -

          If access to the vm would help you help us, let me know and I'll give you a login.

          Many thanks to Rackspace for hosting this vm for us!!!

          pascal.essiembre Pascal Essiembre added a comment -

          Thanks!

          tallison@mitre.org Tim Allison added a comment -

          It is ephemeral, but available.

          Main site: http://162.242.228.174/

          All the docs are available here: http://162.242.228.174/docs

          I bundled the wordperfect/quattropro docs and extracts into tar.bz2 files here for you:
          http://162.242.228.174/share

          I included the h2 database that gathered some metrics on the runs against our files under the share directory.

          Quite a few are truncated or not the right version.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1166 (See https://builds.apache.org/job/Tika-trunk/1166/)
          New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by (pascal.essiembre: rev 87c2ef3191d0a86502dc249240022b3cc973aaa4)

          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP6Constants.java
          • (edit) tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP6FileHeader.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/QuattroPro.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/QPWTextExtractor.java
          • (add) tika-parsers/src/test/resources/test-documents/testWordPerfect.wpd
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WordPerfect.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WordPerfectParser.java
          • (add) tika-parsers/src/test/java/org/apache/tika/parser/wordperfect/WordPerfectTest.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/QuattroProParser.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP6TextExtractor.java
          • (add) tika-parsers/src/test/java/org/apache/tika/parser/wordperfect/QuattroProTest.java
            TIKA-1946 – initial commit of QuattroPro and WordPerfect parsers. Many (tallison: rev d011d708c21669759af86e855b61d98dae19492e)
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP6Constants.java
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (add) tika-core/src/main/java/org/apache/tika/metadata/QuattroPro.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WordPerfect.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP6TextExtractor.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/wordperfect/WordPerfectTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/wordperfect/QuattroProTest.java
          • (add) tika-parsers/src/test/java/org/apache/tika/parser/wordperfect/WPInputStreamTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WPInputStream.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/QPWTextExtractor.java
          • (add) tika-core/src/main/java/org/apache/tika/metadata/WordPerfect.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/QuattroPro.java
          • (edit) CHANGES.txt
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP6FileHeader.java
          pascal.essiembre Pascal Essiembre added a comment -

          You are welcome! I am glad to contribute to such a great product.

          We may have to make some adjustments if the WordPerfect parser tries to parse versions it does not support.

          Is your regression corpus publicly accessible? I could use it to do more tests before submitting new work in the future. I would have loved to have 600 WordPerfect files to play with when writing the parser!

          tallison@mitre.org Tim Allison added a comment -

          Many thanks, again, Pascal Essiembre for these parsers! Your code was in great shape. I switched out 3 metadata items for DublinCore and/or MSOffice. I hardened the EOF checking when a byte is expected. That was about it.

          I look forward to running these against our regression corpus...according to DROID we have ~600 WordPerfect files and ~75 quattro-pro.

          Thank you, again!

          tallison@mitre.org Tim Allison added a comment -

          That's right. Thank you.

          pascal.essiembre Pascal Essiembre added a comment -

          I noticed you have some corporate copyright notices already. For instance, the CharsetDetector and other classes in org.apache.tika.parser.txt package (IBM).

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/141

          pascal.essiembre Pascal Essiembre added a comment -

          I am OK to remove it as you are correct, .wb2 is not handled by the parser I submitted. I am less familiar with Lotus 123 so I cannot validate what you propose but it appears to be a good idea.

          tallison@mitre.org Tim Allison added a comment -

          Ah, ok, so 0x00000200 is for wb2 according to Gary Kessler's file sigs. If I understand your note correctly above, the QuattroProParser doesn't handle those...let's remove it from the mime, or, better, add more bytes to the x-123 at a higher priority to get the different versions of Lotus and then back off to QuattroPro wb2 if there isn't a match on x-123?

          tallison@mitre.org Tim Allison added a comment - - edited

          On the mime detection for quattro-pro:

              <!-- Conflicts with MS Word .doc format:
              <magic priority="90">
                <match value="0xD0CF11E0A1B11AE1" type="string" offset="0"/>
              </magic>
               -->
              <magic priority="50">
                <match value="0x00000200" type="big32" offset="0"/>
              </magic>
          

          We are checking for NativeContent_MAIN in our POIFSContainerDetector so I'm not sure we need any magic. Sorry for the ignorance, do we need 0x00000200...or which version is that for? We have that for Lotus 123...will this parser handle that? How about 0x00001a00?

          If the QuattroPro parser will handle only the first, let's remove that from quattro-pro, and add x-123 to the supported types for the QuattroPro parser? Perhaps differentiate the ...1a00 via mime type?

          tallison@mitre.org Tim Allison added a comment -

          Overall, this looks great. I'll modify some of the metadata keys to map to dublin core, or potentially office where appropriate.

          To fellow Tika devs, are we ok with a corporate copyright notice in the license header?

          tallison@mitre.org Tim Allison added a comment -

          No problem at all. Off I go...

          lfcnassif Luis Filipe Nassif added a comment -

          I am sorry, Tim Allison, I am currently out of town and will not have time in the near future. I also still have to configure keys and other things, and I want to test my commit privileges first with some small commits I have in mind. Thank you!

          tallison@mitre.org Tim Allison added a comment -

          Luis Filipe Nassif, would you like the honors of reviewing and committing?

          tallison@mitre.org Tim Allison added a comment - - edited

          For dbf we did something like this: application/x-dbf; format=Visual_FoxPro and application/x-dbf; format=dBASE_IV; type=memo. Maybe application/x-quattro-pro-wb3 or application/x-quattro-pro; format=wb3?

          Nick Burch, any recommendations?
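
          To illustrate the second option, here is a minimal sketch of how a parameterized type string such as application/x-quattro-pro; format=wb3 splits into a base type plus parameters, so that code can check the base type and still tell the variants apart. The class and method names are hypothetical, and this is plain string handling, not Tika's MediaType API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: splits "type/subtype; key=value; ..." into its parts.
class MediaTypeParamSketch {

    // Everything before the first ';' is the base type.
    static String baseType(String mediaType) {
        int semi = mediaType.indexOf(';');
        return (semi < 0 ? mediaType : mediaType.substring(0, semi)).trim();
    }

    // Each ';'-separated segment after the base type is read as key=value.
    static Map<String, String> parameters(String mediaType) {
        Map<String, String> params = new LinkedHashMap<>();
        String[] parts = mediaType.split(";");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq > 0) {
                String key = parts[i].substring(0, eq).trim();
                String value = parts[i].substring(eq + 1).trim();
                params.put(key, value);
            }
        }
        return params;
    }
}
```

          Under this scheme all QuattroPro variants keep the same base type, and only the format parameter differs between them.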

          pascal.essiembre Pascal Essiembre added a comment -

          It now throws a TikaException as you suggest. For child mime-types, I am not sure what they would be. Given that the different QuattroPro formats seem to share the same mimetype, should we come up with some ourselves? I am not sure what the general practice is in this case.

          lfcnassif Luis Filipe Nassif added a comment -

          Thank you, Pascal!

          I think it may be better to throw a TikaException when parsing unsupported files, so client code will know about it and can take other action, e.g. run a fallback parser like Latin1StringsParser.

          If the files have different magic it would be better to break the mimetype into child ones and configure the parser only with the supported child mimetype.
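
          The client-side pattern described above could look roughly like the sketch below. It is self-contained and does not use Tika classes: UnsupportedFormatException stands in for TikaException, TextParser and parseWithFallback are hypothetical names, and printableRuns is only a crude stand-in for a strings-style fallback (it keeps runs of printable ASCII), not the Latin1StringsParser itself.

```java
import java.util.function.Function;

// Hypothetical sketch: try the format-specific parser, fall back if it refuses.
class FallbackParseSketch {

    // Stand-in for TikaException (checked, so callers must react to it).
    static class UnsupportedFormatException extends Exception {
        UnsupportedFormatException(String message) {
            super(message);
        }
    }

    // Stand-in for a format-specific parser that may reject its input.
    interface TextParser {
        String parse(byte[] data) throws UnsupportedFormatException;
    }

    // Runs the primary parser; on an unsupported format, runs the fallback instead.
    static String parseWithFallback(TextParser primary,
                                    Function<byte[], String> fallback,
                                    byte[] data) {
        try {
            return primary.parse(data);
        } catch (UnsupportedFormatException e) {
            return fallback.apply(data);
        }
    }

    // Crude strings-style extraction: printable ASCII runs of at least minLen chars,
    // joined by newlines.
    static String printableRuns(byte[] data, int minLen) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= data.length; i++) {
            int b = i < data.length ? data[i] & 0xFF : -1; // -1 flushes the last run
            if (b >= 0x20 && b < 0x7F) {
                run.append((char) b);
            } else {
                if (run.length() >= minLen) {
                    if (out.length() > 0) {
                        out.append('\n');
                    }
                    out.append(run);
                }
                run.setLength(0);
            }
        }
        return out.toString();
    }
}
```

          If the parser only logged a message and returned normally, the caller in this sketch would have no signal to trigger the fallback, which is the reason for preferring the exception.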

          tallison@mitre.org Tim Allison added a comment -

          W00t! Christmas came early. I'll take a look tomorrow. Thank you!

          pascal.essiembre Pascal Essiembre added a comment - - edited

          I finally had a bit of time to port the WordPerfect parser to the project. I also added a Quattro Pro parser (also from the WordPerfect Office suite). As this is the first time I have made a pull request for Tika, let me know if anything is not right.

          The QuattroPro parser only supports .qpw files, but since other QuattroPro formats share the same mime-type, the parser will be invoked for other formats as well (.wb?). I added a check in the parser code that simply logs a message stating that the format is unsupported when one is encountered. If you have a better approach to suggest, let me know.

          githubbot ASF GitHub Bot added a comment -

          GitHub user essiembre opened a pull request:

          https://github.com/apache/tika/pull/141

          New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by pascal.essiembre

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/essiembre/tika TIKA-1946

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/141.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #141


          commit 87c2ef3191d0a86502dc249240022b3cc973aaa4
          Author: Pascal Essiembre <pascal.essiembre@norconex.com>
          Date: 2016-12-20T20:42:39Z

          New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by
          pascal.essiembre


          nicholasc Nick C added a comment -

          Some is always better than none. We could also try to use wpd2html from libwpd (which OpenOffice/LibreOffice use) if it's installed; if not, fall back to the Java code.

          gagravarr Nick Burch added a comment -

          I agree - some text is better than waiting forever for a "perfect" set of text with nothing in the meantime!

          tallison@mitre.org Tim Allison added a comment -

          Fantastic! Unless others disagree, I strongly support "some > none". trunk would be great. Thank you!

          pascal.essiembre Pascal Essiembre added a comment -

          I certainly can, but they ain't perfect as I do not think they support all possible versions of the file formats. If you feel some WP/QP support is better than none, I will try to provide a github pull-request when I have free cycles to do it. Which branch should I be targeting for this contribution?

          tallison@mitre.org Tim Allison added a comment -

          Pascal Essiembre, any interest in contributing your WordPerfect and QuattroPro parsers to Tika?

          gagravarr Nick Burch added a comment -

          Looks to be Apache licensed, but the WordPerfect code is mixed in with their own Tika tweaks and their own parsing tool, so it is not that easy to re-use as-is.

          Probably best if someone reached out to the author, and asked if they'd mind contributing their WordPerfect parser and QuattroPro parser to Tika?


            People

            • Assignee:
              Unassigned
              Reporter:
              nicholasc Nick C
            • Votes:
              2
              Watchers:
              8


                Development