Tika / TIKA-1400

Extract Excel (xls, xlsx) headers and footers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

      Description

      When I parse an xls file,
      the content of the headers and footers is not extracted.
      The xlsx file has the same problem.

      Attachments

      1. headers and footers.xls (10 kB, sunxingzhe)
      2. SpreadsheetWithHeadersAndFooters.xls (33 kB, Aeham Abushwashi)
      3. SpreadsheetWithHeadersAndFooters.xlsx (11 kB, Aeham Abushwashi)
      4. TIKA-1400.patch (63 kB, Aeham Abushwashi)

          Activity

          gagravarr Nick Burch added a comment -

          When you say "content can not be parsed", do you mean you're getting an error, an exception, no text, incorrect text, something else?

          Which headers/footers are you expecting to see? Print ones? On screen ones? Something else?

          Do you have a very small sample file with headers in that can be used to show the problem?

          sunxingzhe359 sunxingzhe added a comment -

          No error occurred. The txt file was produced.
          I have attached an xls file.
          The file's headers contain the word "cakes" three times and the footers contain it once.
          But the produced txt file does not contain the word "cakes" at all.

          gagravarr Nick Burch added a comment -

          Ah, I see, it's the print headers/footers you're expecting that are missing.

          This will need additional code in both the XLS and XLSX parsers to find the print headers and footers and then extract them.
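
To illustrate the "find, then extract" step, here is a minimal sketch in plain Java. It is not the code from the attached patch and it does not use the Tika or POI APIs. It assumes the raw print header/footer text has already been read from the file, and that it uses Excel's `&L`, `&C` and `&R` codes to mark the left, center and right sections. The class name `HeaderFooterSections`, the method `split` and the sample input are hypothetical, chosen only to mirror the "cakes" example from the report.

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not part of Tika or POI): splits a raw Excel
 * print header/footer string into its left, center and right sections.
 */
class HeaderFooterSections {

    /** Returns {left, center, right}; absent sections are empty strings. */
    static String[] split(String raw) {
        StringBuilder[] parts = {
            new StringBuilder(), new StringBuilder(), new StringBuilder()
        };
        // Assumption: text before any section code belongs to the center section.
        int current = 1;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '&' || i + 1 >= raw.length()) {
                parts[current].append(c);
                continue;
            }
            char code = raw.charAt(++i);
            switch (code) {
                case 'L': current = 0; break;
                case 'C': current = 1; break;
                case 'R': current = 2; break;
                case '&': parts[current].append('&'); break; // "&&" is a literal ampersand
                default: break; // simplification: drop other one-character codes such as &P or &D
            }
        }
        return new String[] {
            parts[0].toString().trim(),
            parts[1].toString().trim(),
            parts[2].toString().trim()
        };
    }

    public static void main(String[] args) {
        // Illustrative input only; the real attachment's raw header string may differ.
        String[] header = split("&Lcakes&Ccakes&Rcakes");
        System.out.println(Arrays.toString(header)); // prints [cakes, cakes, cakes]
    }
}
```

Multi-character codes (font names, font sizes) are deliberately not handled here; a real parser would more likely rely on the header/footer accessors that the underlying spreadsheet library already provides.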

          sunxingzhe359 sunxingzhe added a comment - - edited

          Could you provide a patch for the extraction?
          Thank you!

          aeham.abushwashi Aeham Abushwashi added a comment - - edited

          I've attached a patch which includes the fix, a unit test to verify the fix for XLS files and another unit test to verify that header and footer extraction from XLSX files already works OK.
          The test data files are attached separately in case they can't be extracted out of the patch file.

          tallison@mitre.org Tim Allison added a comment -

          r1688834. Thank you, Aeham Abushwashi!

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #779 (See https://builds.apache.org/job/tika-trunk-jdk1.7/779/)
          TIKA-1400 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1688834)

          • /tika/trunk/CHANGES.txt
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/testEXCEL_headers_footers.xls
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/testEXCEL_headers_footers.xlsx

            People

            • Assignee: tallison@mitre.org Tim Allison
            • Reporter: sunxingzhe359 sunxingzhe
            • Votes: 1
            • Watchers: 6
