  1. Tika
  2. TIKA-2254

Provide chart support for MS Office documents

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.16
    • Component/s: core
    • Labels:
      None

      Description

      I would like to be able to extract text from the charts in XLSX documents. Tim Allison has recommended that I raise this ticket which should cover it:

      On 23 Jan 2017, at 12:18, Allison, Timothy B. <tallison@mitre.org> wrote:
      
      Hi Chris,
      Good to hear from you. I don't know if it would help at all, but I'm planning to add chart support to Tika soon(ish). I haven't yet opened the ticket on Tika's JIRA, so please open one there too if that would be of any use to you.
      
      Best,
      
      Tim
      
      Attachments

      1. ForChrisB.xlsx (31 kB, Chris Bamford)

        Issue Links

          Activity

          tallison@mitre.org Tim Allison added a comment -

          Any chance you could share a test file? Thank you!

          bammers Chris Bamford added a comment -

          Attached a chart file.

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Tika-trunk #1289 (See https://builds.apache.org/job/Tika-trunk/1289/)
          TIKA-2254 – extract text from charts in ooxml. (tallison: https://github.com/apache/tika/commit/d2820ce62545c847a2d3e79b7b4b8a3f2022a619)

          • (add) tika-parsers/src/test/resources/test-documents/testPPT_charts.pptx
          • (add) tika-parsers/src/test/resources/test-documents/testEXCEL_charts.xlsb
          • (edit) CHANGES.txt
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXSLFExtractorTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          • (add) tika-parsers/src/test/resources/test-documents/testEXCEL_charts.xlsx
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_charts.docx
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLWordAndPowerPointTextHandler.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFBExcelExtractorDecorator.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          tallison@mitre.org Tim Allison added a comment -

          Somewhat related issues

          tallison@mitre.org Tim Allison added a comment -

          Added basic extraction. We're currently skipping formula (<c:f>) elements, and we're not recording the full file path of the underlying/source xlsx file, which the chart can store.
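
          To illustrate the kind of extraction described in this comment, here is a minimal sketch that uses only the JDK's SAX parser. It is not Tika's actual implementation. The class name ChartTextSketch, the method extractChartText, the SAMPLE chart XML, and the choice to read only <a:t> and <c:v> elements are all assumptions made for illustration. Formula text in <c:f> is skipped simply because it sits outside the collected elements.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the code committed for TIKA-2254.
class ChartTextSketch {

    // DrawingML namespaces that appear in OOXML chart parts.
    static final String C_NS = "http://schemas.openxmlformats.org/drawingml/2006/chart";
    static final String A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Made-up sample: a rich-text title plus one series name with a formula reference.
    static final String SAMPLE =
        "<c:chartSpace xmlns:c=\"" + C_NS + "\" xmlns:a=\"" + A_NS + "\">"
        + "<c:chart>"
        + "<c:title><c:tx><c:rich><a:p><a:r><a:t>Sales by Region</a:t></a:r></a:p></c:rich></c:tx></c:title>"
        + "<c:ser><c:tx><c:strRef><c:f>Sheet1!$B$1</c:f>"
        + "<c:strCache><c:pt idx=\"0\"><c:v>North</c:v></c:pt></c:strCache>"
        + "</c:strRef></c:tx></c:ser>"
        + "</c:chart></c:chartSpace>";

    // True for the two elements this sketch treats as text carriers (an assumption).
    static boolean isTextElement(String uri, String localName) {
        return (A_NS.equals(uri) && "t".equals(localName))
            || (C_NS.equals(uri) && "v".equals(localName));
    }

    // Returns one line per <a:t> or <c:v> element; <c:f> content is never collected.
    static String extractChartText(String chartXml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();
        final StringBuilder out = new StringBuilder();

        parser.parse(new InputSource(new StringReader(chartXml)), new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (isTextElement(uri, localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (inText && isTextElement(uri, localName)) {
                    inText = false;
                    out.append('\n');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only character data inside <a:t> or <c:v> is kept.
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractChartText(SAMPLE));
    }
}
```

          In a real OOXML file the chart XML lives in its own package part (for example under xl/charts/ in an xlsx), so a caller would first have to locate and open that part before handing its contents to a routine like this one.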


            People

            • Assignee:
              tallison@mitre.org Tim Allison
            • Reporter:
              bammers Chris Bamford