Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-189

Text extraction from Excel files juxtaposes cells

VotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 0.3
    • 0.3
    • general
    • None
    • Tika revision is svn-20090116, platform is Windows XP Pro SP3, JDK version is 1.6.0_06.

    Description

      I plan on using Tika to extract text from Excel (both .xls and .xlsx) files for indexing. But, I found that Tika juxtaposes cells on output. The example worksheets are in the attached .zip file.
      I took the time to run Apache POI and it does not have this bug i.e. cells are properly separated.

      When I run

      -begin-
      java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xls
      -end-

      I get the following output:

      -begin-
      Plan1
      NameEmailSanta Claussanta@claus.org
      Tooth Fairytooth@fairy.org
      -end-

      Same thing with a .xlxs file:
      -begin-
      java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xlsx
      -end-

      The output is:

      -begin-
      [Content_Types].xml

      _rels/.rels

      xl/_rels/workbook.xml.rels

      xl/workbook.xml

      xl/theme/theme1.xml

      xl/worksheets/_rels/sheet1.xml.rels

      xl/worksheets/sheet2.xml

      xl/worksheets/sheet3.xml

      xl/sharedStrings.xml
      NameEmailSanta Claussanta@claus.orgTooth Fairytooth@fairy.org

      xl/styles.xml

      xl/worksheets/sheet1.xml
      012345

      docProps/core.xml
      GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z

      docProps/app.xml
      Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
      -end-

      Also note that the values from docProps/app.xml have been juxtaposed as well.

      This way, after indexing these files using the output from Tika, a search engine will only find "Fairy" when substring matching is used, because "Tooth Fairy" becomes "Tooth Fairytooth@fairy.org". This is suboptimal and wrong.

      Thanks for your attention. Best regards,

      Georger

      Attachments

        1. no_cell_separators_when_extracted.zip
          10 kB
          Georger Araújo
        2. TIKA-189.patch
          0.1 kB
          Uwe Schindler
        3. TIKA-189.patch
          0.8 kB
          Uwe Schindler

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            jukkaz Jukka Zitting
            georger_br Georger Araújo
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment