Tika / TIKA-1601

Integrate Jackcess to handle MSAccess files

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Recently, James Ahlborn, the current maintainer of Jackcess, kindly agreed to relicense Jackcess to Apache 2.0. Brian ONeill, the CTO at Health Market Science, a LexisNexis® Company, also agreed with this relicensing and led the charge to obtain all necessary corporate approval to deliver a CCLA for Jackcess to Apache. As anyone who has tried to get corporate approval for anything knows, this can require no small amount of effort.

      If I may speak on behalf of Tika and the larger Apache community, I offer a sincere thanks to James, Brian and the other developers and contributors to Jackcess!!!

      Once the licensing info has been changed in Jackcess and the new release is available in Maven, we can integrate Jackcess into Tika and add the capability to process MSAccess files.

      As a side note, I reached out to the developers and contributors to determine whether there were any objections. I couldn't find addresses for everyone, and not everyone replied, but those who did offered their support for this move.

      1. jackcess_nocommit_v1.patch
        32 kB
        Tim Allison
      2. testAccess2.zip
        362 kB
        Tim Allison

        Activity

        lfcnassif Luis Filipe Nassif added a comment -

        Hi Tim,

        I already have a parser based on jackcess. If you can wait, I can submit a preliminary patch next week.

        tallison@mitre.org Tim Allison added a comment -

        Great! We have to wait for the change in license headers and a new release. Please do submit your parser whenever you have a chance, but there's no rush. Thank you!

        tallison@mitre.org Tim Allison added a comment -

        Hi Luis Filipe Nassif, James Ahlborn carried out the relicensing and released Jackcess 2.1.0 under ASL 2.0. If you'd be able to submit your patch, that'd be great. Thank you!

        lfcnassif Luis Filipe Nassif added a comment -

        Great! Give me 3 more days to submit the patch. Do you have any Apache 2.0-licensed MDB files for unit tests?

        tallison@mitre.org Tim Allison added a comment -

        I don't. That's half the fun of a patch, right? For the SQLite parser, I tried to have at least one column for each data type, non-ASCII text to confirm there were no encoding problems, and an embedded doc.

        Happy to generate this if it would help. Thank you, again.

        tallison@mitre.org Tim Allison added a comment -

        Hi Luis Filipe Nassif, have you had a chance to work on this at all? Let me know if a test file or two would be of use. Thank you!

        tallison@mitre.org Tim Allison added a comment -

        Simple test files attached. Luis Filipe Nassif, let me know if you'd still like to contribute your parser. If not, I'll start from scratch. Thank you!

        tallison@mitre.org Tim Allison added a comment - edited

        Not anywhere near committing, but this is a rough start.

        Some TODOs:

        • Figure out how to get non-ASCII text out correctly
        • Figure out how to grab attachments from the accdb file
        • Figure out if there's a flag for HTML-marked-up text cells so that we can strip the markup [0]
        • Figure out if there's a way to prevent Jackcess from trying to open linked files [0]
        • Add unit tests

        I used Dominik Stadler's code [1] to pull ~3k mdb files from CommonCrawl for testing. Those tests were invaluable for identifying a potentially serious security issue: by default, the table iterator tried to load linked files. Our code is now configured to skip linked tables.
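
To illustrate the idea of skipping linked tables, here is a minimal, library-independent sketch. It is not the code from the patch and does not use the Jackcess API: `TableEntry`, `linkedFilePath`, and `localTableNames` are hypothetical names introduced only for this example. The assumption is that each catalog entry can be identified as linked before anything tries to open the external file it points to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-independent model of "skip linked tables".
// TableEntry is a hypothetical stand-in, not a Jackcess or Tika type.
class LinkedTableFilter {

    /** Hypothetical catalog entry: a table name plus the external file it links to, if any. */
    static final class TableEntry {
        final String name;
        final String linkedFilePath; // null for tables stored in the database file itself

        TableEntry(String name, String linkedFilePath) {
            this.name = name;
            this.linkedFilePath = linkedFilePath;
        }

        boolean isLinked() {
            return linkedFilePath != null;
        }
    }

    /** Returns the names of local tables only; linked entries are dropped and their paths never opened. */
    static List<String> localTableNames(List<TableEntry> catalog) {
        List<String> names = new ArrayList<>();
        for (TableEntry entry : catalog) {
            if (entry.isLinked()) {
                continue; // never resolve or open the external file
            }
            names.add(entry.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<TableEntry> catalog = Arrays.asList(
                new TableEntry("Orders", null),
                new TableEntry("Remote", "\\\\fileserver\\share\\other.mdb"),
                new TableEntry("Customers", null));
        // Only the two local tables are listed; "Remote" is skipped.
        System.out.println(localTableNames(catalog));
    }
}
```

The point of filtering on the catalog entry, rather than catching a failure after an open attempt, is that an untrusted database file never gets to choose which path on the host is read.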

        Many thanks, again, to James Ahlborn for his patience in answering the above.

        [0]: https://sourceforge.net/p/jackcess/discussion/456474/thread/038878e6/
        [1]: https://github.com/centic9/CommonCrawlDocumentDownload

        tallison@mitre.org Tim Allison added a comment -

        Committed in r1688337.

        There are some areas for improvement in our wrapper of Jackcess, but I'll open separate issues for those.

        Thank you, again, Brian ONeill and James Ahlborn!

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #774 (See https://builds.apache.org/job/tika-trunk-jdk1.7/774/)
        TIKA-1601: integrate Jackcess to parse MSAccess files (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1688337)

        • /tika/trunk/CHANGES.txt
        • /tika/trunk/tika-bundle/pom.xml
        • /tika/trunk/tika-parsers/pom.xml
        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessExtractor.java
        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessParser.java
        • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
        • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/JackcessParserTest.java
        • /tika/trunk/tika-parsers/src/test/resources/test-documents/testAccess2.accdb
        • /tika/trunk/tika-parsers/src/test/resources/test-documents/testAccess2_2000.mdb
        • /tika/trunk/tika-parsers/src/test/resources/test-documents/testAccess2_2002-2003.mdb
        • /tika/trunk/tika-server/pom.xml

          People

          • Assignee: Unassigned
          • Reporter: tallison@mitre.org Tim Allison
          • Votes: 1
          • Watchers: 4
