  2. TIKA-2025

Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: parser
    • Labels:
      None

      Description

      If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.13 emits that sequence in scientific notation.

      For example, the credit card number “340229177292566” is extracted from the attached spreadsheet as 3.40229E+14, which clearly is not the desired output.
      This works as expected in 1.12 and earlier. I suspect POI’s recent use of org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.
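
The reported output can be reproduced outside Tika with plain java.text.DecimalFormat. This is an illustration only: the class and method names are hypothetical, and the six-significant-digit pattern is an assumption chosen to match the output quoted above, not code copied from POI.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class GeneralFormatDemo {
    // Hypothetical helper: scientific notation with at most six significant
    // digits (one integer digit plus up to five fraction digits).
    static String sixDigitScientific(double value) {
        DecimalFormat fmt = new DecimalFormat(
                "0.#####E0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        String s = fmt.format(value);
        // DecimalFormat writes "E14"; Excel-style output writes "E+14".
        return s.contains("E-") ? s : s.replace("E", "E+");
    }

    public static void main(String[] args) {
        // The 15-digit number from the report loses nine digits.
        System.out.println(sixDigitScientific(340229177292566d)); // 3.40229E+14
    }
}
```

The digits after the sixth are dropped by the formatter, so they cannot be recovered from the extracted text.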

      I think the impact of this issue is significant. There’s plenty of information that can no longer be reliably extracted from spreadsheets. Think credit card numbers, telephone numbers and product identifiers to name a few.

      1. Credit Card Numbers.xlsx
        481 kB
        Aeham Abushwashi

          Activity

          tallison@mitre.org Tim Allison added a comment -

          Y, this was caused by a patch that makes POI more closely reflect Excel's spec in org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat.

          poi-58471 and rev1706971

          tallison@mitre.org Tim Allison added a comment - - edited

          "If an Excel spreadsheet contains a long sequence of digits, such as a credit card number,"

          Not quite. As Javen O'Neal pointed out, Excel can't store numbers with more than 15 digits as numerals. So you'd never be able to store a 16-digit credit card number as a number; it would be stored as text, and then this wouldn't be a problem.

          The issue/change of behavior still holds for numbers with fewer than 16 digits, and we need to find a workaround.

          gagravarr Nick Burch added a comment -

          We could always test the formatted value for E+ (or E-?) on the end, and re-do with our own no-exponent formatter for that case. Would need a note about why we're doing it though!
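
A minimal sketch of the check-and-redo idea suggested above, under stated assumptions: the class, method and parameter names are hypothetical, and BigDecimal.toPlainString stands in for "our own no-exponent formatter". It is not the approach that was eventually committed.

```java
import java.math.BigDecimal;
import java.util.regex.Pattern;

final class ExponentFallback {
    // Matches a trailing exponent such as "E+14", "E-7" or "E14".
    private static final Pattern EXPONENT = Pattern.compile("E[+-]?\\d+$");

    // 'formatted' is whatever the upstream formatter returned;
    // 'raw' is the numeric cell value it was given.
    static String withoutExponent(String formatted, double raw) {
        if (!EXPONENT.matcher(formatted).find()) {
            return formatted; // no exponent: keep the upstream output
        }
        // BigDecimal.valueOf goes through Double.toString, and
        // toPlainString never emits an exponent.
        return BigDecimal.valueOf(raw).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(withoutExponent("3.40229E+14", 340229177292566d)); // 340229177292566
        System.out.println(withoutExponent("42", 42d));                       // 42
    }
}
```

A drawback of this variant is that it also rewrites cells whose author deliberately chose a scientific format, which is presumably why a comment explaining the workaround would be needed.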

          tallison@mitre.org Tim Allison added a comment -

          Y, I'm wondering about a narrower fix...perhaps go back to the old behavior only if the format is "General".

          aeham.abushwashi Aeham Abushwashi added a comment - - edited

          Thanks for looking into this.
          Just wanted to add that 'payment cards' can have anywhere between 12 and 19 digits; for example, American Express uses 15 digits.

          tallison@mitre.org Tim Allison added a comment -

          Thank you for the ping. Are you able to view numbers with > 15 digits in Excel when the value is a number? I can only see numbers with > 15 digits when the cell is "Text".

          For example, in my version of Excel, when I type a 17 digit integer, e.g. 12345678901234567, the value that is stored is: 12345678901234500. If I prepend a ', to tell Excel to treat it as text, the value is correctly stored as text: 12345678901234567.

          The above is personal curiosity. We still need to fix this.

          aeham.abushwashi Aeham Abushwashi added a comment -

          I'm getting the same behaviour as you, i.e. 12345678901234567 is 'floored' to 12345678901234500.
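
The flooring described in the comments above can be imitated with BigDecimal. This is a sketch, assuming Excel keeps the first 15 significant digits of a typed number and zero-fills the rest, as both commenters observed; the class and method names are hypothetical and nothing here calls Excel or POI.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

final class FifteenDigitDemo {
    // Keep 15 significant digits and truncate (not round) the remainder.
    static String keepFifteenDigits(String typedDigits) {
        MathContext fifteen = new MathContext(15, RoundingMode.DOWN);
        return new BigDecimal(typedDigits).round(fifteen).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(keepFifteenDigits("12345678901234567")); // 12345678901234500
        System.out.println(keepFifteenDigits("340229177292566"));   // 340229177292566
    }
}
```

A 15-digit value passes through unchanged, which is why the change in extraction behaviour matters for numbers of up to 15 digits but not for longer ones stored as numerals.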

          aeham.abushwashi Aeham Abushwashi added a comment -

          Is this likely to be fixed in 1.14?

          tallison@mitre.org Tim Allison added a comment - - edited

          Up to 15 digits are now extracted for numbers with "General" format, contrary to the MS spec. Beyond 15 digits, we use scientific notation with more significant digits than we had before.

                  assertContains("123456789012345", xml);//15 digit number
                  assertContains("123456789012346", xml);//15 digit formula
                  assertContains("1.23456789012345E+15", xml);//16 digit number is treated as scientific notation
                  assertContains("1.23456789012345E+15", xml);//16 digit formula, ditto
          

          Thank you, Aeham Abushwashi for noticing this, opening the issue and pointing me in the right direction within POI!

          Apologies for my delay...I thought I'd have to modify POI, but I found a way to do this at the Tika level.
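
The behaviour described in this comment can be approximated as follows. This is a sketch, not the code of TikaExcelGeneralFormat: the class name, the 1e15 threshold and both DecimalFormat patterns are assumptions chosen so that the output matches the two kinds of assertion quoted above.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class GeneralFormatSketch {
    private static final DecimalFormatSymbols SYMBOLS =
            DecimalFormatSymbols.getInstance(Locale.ROOT);

    static String format(double value) {
        if (Math.abs(value) < 1e15) {
            // Up to 15 integer digits: plain digits, no exponent.
            return new DecimalFormat("0.##############", SYMBOLS).format(value);
        }
        // 16 or more integer digits: scientific notation with up to
        // 15 significant digits. The exponent is non-negative here,
        // so "E" can safely become "E+".
        String s = new DecimalFormat("0.##############E0", SYMBOLS).format(value);
        return s.replace("E", "E+");
    }

    public static void main(String[] args) {
        System.out.println(format(123456789012345d));  // 123456789012345
        // A 16-digit value as Excel would store it (last digit zeroed).
        System.out.println(format(1234567890123450d)); // 1.23456789012345E+15
    }
}
```

The sketch ignores cell format strings entirely; per the comment above, the real change applies only to cells whose format is "General".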

          hudson Hudson added a comment -

          FAILURE: Integrated in tika-2.x-windows #26 (See https://builds.apache.org/job/tika-2.x-windows/26/)
          TIKA-2025 increase number of significant digits extracted in "general" (tallison: rev f4bacf859650abbe438d7e19d6c0abdcd72a5b34)

          • tika-test-resources/src/test/resources/test-documents/testEXCEL_big_numbers.xlsx
          • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/TikaExcelDataFormatter.java
          • tika-test-resources/src/test/resources/test-documents/testEXCEL_big_numbers.xls
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
          • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/TikaExcelGeneralFormat.java
          • CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in tika-2.x #122 (See https://builds.apache.org/job/tika-2.x/122/)
          TIKA-2025 increase number of significant digits extracted in "general" (tallison: rev f4bacf859650abbe438d7e19d6c0abdcd72a5b34)

          • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • CHANGES.txt
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/TikaExcelGeneralFormat.java
          • tika-test-resources/src/test/resources/test-documents/testEXCEL_big_numbers.xlsx
          • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/TikaExcelDataFormatter.java
          • tika-test-resources/src/test/resources/test-documents/testEXCEL_big_numbers.xls
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
          • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Tika-trunk #1083 (See https://builds.apache.org/job/Tika-trunk/1083/)
          TIKA-2025 – override general format in excel to extract 15 digit (tallison: rev a383567c2c947603c4c7aa12d3578d771bb58413)

          • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java
          • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/TikaExcelDataFormatter.java
          • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          • tika-parsers/src/test/resources/test-documents/testEXCEL_big_numbers.xls
          • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/TikaExcelGeneralFormat.java
          • tika-parsers/src/test/resources/test-documents/testEXCEL_big_numbers.xlsx
          • CHANGES.txt
          aeham.abushwashi Aeham Abushwashi added a comment -

          Thank you very much Tim!

          I’ve tested the change with my example file and it works beautifully for both 15- and 16-digit credit card numbers.
          Out of interest, what’s the difference between my 16-digit numbers and the numbers you’ve used in the unit test data file? I imagine it’s cell formatting but I thought it’s not possible to format a 16-digit sequence as a number in Excel.

          tallison@mitre.org Tim Allison added a comment - - edited

          16 digit? Is the last digit 0? If so, that might be truncation/flooring. If not, are you sure it is stored as a number? Perhaps it is formatted as text.

          aeham.abushwashi Aeham Abushwashi added a comment -

          You're right, they were formatted as text and I hadn't factored in the truncation for numeric cells with large values.
          Thanks

          githubbot ASF GitHub Bot added a comment -

          GitHub user vulpes8 opened a pull request:

          https://github.com/apache/tika/pull/151

          fix for TIKA-2025 contributed by vulpes8

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/vulpes8/tika fix/TIKA-2025

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/151.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #151


          commit 0c5d609e0175dffb93c1a325c9a872c5e6945eb0
          Author: Cataldo Mazzilli <cataldo@studiostorti.com>
          Date: 2017-02-02T14:32:42Z

          fix for TIKA-2025 contributed by vulpes8


          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/151

          tallison@mitre.org Tim Allison added a comment -

          Test now uses user locale to format the expected string. Let us know if this doesn't work for you.
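
Why the locale matters for such a test can be shown with DecimalFormatSymbols. This is a sketch with hypothetical names and an assumed pattern; the actual test code in ExcelParserTest and OOXMLParserTest may build its expected string differently.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class LocaleAwareExpectation {
    // Build the expected string with the same locale the formatter under
    // test would use, instead of hard-coding a '.' decimal separator.
    static String expected(double value, Locale locale) {
        DecimalFormat fmt = new DecimalFormat(
                "0.##############E0", DecimalFormatSymbols.getInstance(locale));
        return fmt.format(value);
    }

    public static void main(String[] args) {
        double sixteenDigits = 1234567890123450d;
        System.out.println(expected(sixteenDigits, Locale.US));      // decimal point
        System.out.println(expected(sixteenDigits, Locale.GERMANY)); // decimal comma
    }
}
```

A test that hard-codes the decimal point passes on a US-locale machine and fails on a German one; passing Locale.getDefault() to a helper like this avoids that.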

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1192 (See https://builds.apache.org/job/Tika-trunk/1192/)
          TIKA-2025 – fix xls/x testBigIntegersWGeneralFormat to work in multiple (tallison: rev 3c0cd647571d23665056de9adcaf5d58dc087fb9)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java

            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              aeham.abushwashi Aeham Abushwashi
            • Votes:
              0
              Watchers:
              5
