TIKA-1958

Add mime detection and lightweight parsers for Office 2003 Word and Excel formats

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.14
    • Component/s: None
    • Labels: None

      Description

      Over on POI, a user asked if we supported 2003 xls (xml) files. It would be neat if we could add mime detection and a "good enough" parser to handle 2003 xls and doc files.

      This could be a great task for someone wanting to get started in contributing to Tika.

      references:
      https://mail-archives.apache.org/mod_mbox/poi-user/201604.mbox/%3Calpine.BSO.2.20.1604210825140.22929%40ref.nmedia.net%3E
      https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats
      https://msdn.microsoft.com/en-us/library/bb226687(v=office.11).aspx

      Attachments

      1. 2010-cal-eu.xls (82 kB, Tim Allison)
      2. excel_msword_2003.tar.bz2 (4 kB, Tim Allison)

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Output of grep on our corpus as it is today. We have several handfuls of both Excel and Word 2003 (single XML) files there.

        tallison@mitre.org Tim Allison added a comment -

        Original file submitted via link in POI user's mail.

        gagravarr Nick Burch added a comment -

        The format for the Excel XML file is broadly similar to (but much simpler than) the OOXML one, so potentially a similar set of SAX code could be used. Otherwise, given the relatively simple set of XML used, perhaps the approach taken for the ODF formats could be used?
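
        To make the SAX idea concrete, here is a minimal sketch using only the JDK's SAX API. It is not the parser that was eventually committed; the class and method names (SpreadsheetMlSketch, extractText) are made up for illustration. It assumes the Excel 2003 XML layout of Workbook/Worksheet/Table/Row/Cell/Data in the urn:schemas-microsoft-com:office:spreadsheet namespace and simply emits tab-separated cells, one row per line.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only; names are hypothetical, not Tika classes.
class SpreadsheetMlSketch {
    // Namespace of the Excel 2003 XML (SpreadsheetML) elements.
    static final String SS_NS = "urn:schemas-microsoft-com:office:spreadsheet";

    /** Collects cell text: a tab between cells, a newline after each row. */
    static class CellTextHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private boolean inData = false;
        private boolean firstCellInRow = true;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (!SS_NS.equals(uri)) {
                return; // ignore elements from other namespaces
            }
            if ("Row".equals(local)) {
                firstCellInRow = true;
            } else if ("Cell".equals(local)) {
                if (!firstCellInRow) {
                    out.append('\t');
                }
                firstCellInRow = false;
            } else if ("Data".equals(local)) {
                inData = true;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (!SS_NS.equals(uri)) {
                return;
            }
            if ("Data".equals(local)) {
                inData = false;
            } else if ("Row".equals(local)) {
                out.append('\n');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only text inside <Data> is cell content.
            if (inData) {
                out.append(ch, start, length);
            }
        }
    }

    /** Parses the given XML string and returns the extracted cell text. */
    static String extractText(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to get uri/localName
            CellTextHandler handler = new CellTextHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<?mso-application progid=\"Excel.Sheet\"?>"
                + "<Workbook xmlns=\"" + SS_NS + "\"><Worksheet><Table>"
                + "<Row><Cell><Data>a</Data></Cell><Cell><Data>1</Data></Cell></Row>"
                + "</Table></Worksheet></Workbook>";
        System.out.print(extractText(xml));
    }
}
```

        A real parser would also need to deal with metadata, hyperlinks, multiple worksheets and untrusted input, none of which this sketch attempts.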

        tallison@mitre.org Tim Allison added a comment - - edited

        Y, that's what I was thinking. For mime detection, can we somehow key on the processing instruction <?mso-application progid="Excel.Sheet"?>, or should we go with <root-XML localName="Workbook"/>?

        Also, what do we want to call the mime types: application/wordml and application/spreadsheetml?

        gagravarr Nick Burch added a comment -

        On the detection, can't remember, probably best just try + unit test!

        For the mime type, I'd suggest something like application/vnd.ms-spreadsheetml to be more in keeping with our other related formats
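
        For anyone who wants to experiment with the two detection options discussed above, the following is a rough JDK-only sketch, not how Tika's mime detection is implemented. It prefers the mso-application processing instruction and falls back to the root element's local name. Everything named here is an assumption for illustration: the class Office2003XmlSniffer, the Word equivalents ("Word.Document" progid, "wordDocument" root), and the type name application/vnd.ms-wordml, which simply mirrors the spreadsheet name suggested above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only; names and type strings are hypothetical.
class Office2003XmlSniffer {
    static final String SPREADSHEETML = "application/vnd.ms-spreadsheetml";
    static final String WORDML = "application/vnd.ms-wordml"; // assumed name
    static final String GENERIC_XML = "application/xml";
    static final String UNKNOWN = "application/octet-stream";

    // Pulls the value out of PI data such as: progid="Excel.Sheet"
    private static final Pattern PROGID =
            Pattern.compile("progid\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Records the mso-application progid and the root element, then stops. */
    static class SniffHandler extends DefaultHandler {
        String progId;
        String rootLocalName;

        @Override
        public void processingInstruction(String target, String data) {
            if ("mso-application".equals(target) && data != null) {
                Matcher m = PROGID.matcher(data);
                if (m.find()) {
                    progId = m.group(1);
                }
            }
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            rootLocalName = local;
            // Nothing past the root element is needed, so abort the parse.
            throw new SAXException("root element reached");
        }
    }

    static String detect(String xml) {
        SniffHandler handler = new SniffHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Either the deliberate stop above or a genuine parse error.
            if (handler.rootLocalName == null) {
                return UNKNOWN;
            }
        } catch (IOException | ParserConfigurationException e) {
            return UNKNOWN;
        }
        // Option 1: the processing instruction is the most specific signal.
        if ("Excel.Sheet".equals(handler.progId)) {
            return SPREADSHEETML;
        }
        if ("Word.Document".equals(handler.progId)) {
            return WORDML;
        }
        // Option 2: fall back to the root element's local name.
        if ("Workbook".equals(handler.rootLocalName)) {
            return SPREADSHEETML;
        }
        if ("wordDocument".equals(handler.rootLocalName)) {
            return WORDML;
        }
        return GENERIC_XML;
    }

    public static void main(String[] args) {
        System.out.println(detect(
                "<?mso-application progid=\"Excel.Sheet\"?><Workbook/>"));
    }
}
```

        Matching on the local name alone is loose, since any XML file with a Workbook root would match; checking the namespace as well would tighten it. Either way, a unit test against real sample files is the way to confirm what works.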

        tallison@mitre.org Tim Allison added a comment -

        Better grep, cleaner results. There's even one InfoPath 2003 doc in our corpus.

        I have an initial patch for Word and Excel. Some work remains. I'll commit once we have a successful vote for 1.13.

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #991 (See https://builds.apache.org/job/tika-trunk-jdk1.7/991/)
        TIKA-1958 - add mime detection and parsers for 2003 MSWord XML (wordml) (tallison: rev bc0b1f7f7e0b854a119779fc3f806e0d9490c08a)

        • CHANGES.txt
        • tika-parsers/src/test/resources/test-documents/testWORD2003.xml
        • tika-parsers/src/test/resources/test-documents/testEXCEL2003.xml
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/xml/HyperlinkHandler.java
        • tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
        • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/xml/XML2003ParserTest.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/xml/WordMLParser.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/xml/AbstractXML2003Parser.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/xml/SpreadsheetMLParser.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #93 (See https://builds.apache.org/job/tika-2.x/93/)
        TIKA-1958: add mime detection and parsers for MSOffice 2003 wordml and (tallison: rev a882a3242f4c94728a0129643bb52381e0e4c096)

        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/xml/SpreadsheetMLParser.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/xml/WordMLParser.java
        • tika-test-resources/src/test/resources/test-documents/testEXCEL2003.xml
        • tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        • tika-test-resources/src/test/resources/test-documents/testWORD2003.xml
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/xml/HyperlinkHandler.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/xml/AbstractXML2003Parser.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/xml/XML2003ParserTest.java
        • CHANGES.txt
        • tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java
        • tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

          People

          • Assignee: Tim Allison (tallison@mitre.org)
          • Reporter: Tim Allison (tallison@mitre.org)
          • Votes: 0
          • Watchers: 3
