  Tika / TIKA-2025

Extraction of long sequences of digits from Excel spreadsheets using Tika 1.13 doesn’t yield the expected results

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: parser
    • Labels:
      None

      Description

      If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.13 emits that sequence in scientific notation.

      For example, the credit card number “340229177292566” is extracted from the attached spreadsheet as “3.40229E+14”, so every digit after the sixth is lost.
      This works as expected in 1.12 and earlier. I suspect POI’s recent use of org.apache.poi.ss.usermodel.ExcelGeneralNumberFormat is to blame.
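
      To illustrate the difference, here is a minimal sketch in plain Java. It is not the Tika or POI code path; the class and method names are hypothetical, and the five-fraction-digit scientific format is only an assumption chosen to match the output reported above.

```java
import java.math.BigDecimal;
import java.util.Locale;

// Hypothetical illustration only; this is not Tika's or POI's implementation.
public class LongDigitFormatSketch {

    // Mimics the lossy rendering: scientific notation with 5 fraction digits.
    static String toScientific(double value) {
        return String.format(Locale.ROOT, "%.5E", value);
    }

    // Renders the same value as plain digits, without an exponent.
    static String toPlainDigits(double value) {
        return BigDecimal.valueOf(value).toPlainString();
    }

    public static void main(String[] args) {
        // Excel stores numeric cells as doubles; this value is exactly representable.
        double cell = 340229177292566d;
        System.out.println(toScientific(cell));   // scientific form, digits truncated
        System.out.println(toPlainDigits(cell));  // all 15 digits preserved
    }
}
```

      The second call is the kind of output I would expect from the extraction: the full digit sequence rather than a rounded mantissa and exponent.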

      I think the impact of this issue is significant. Plenty of information can no longer be reliably extracted from spreadsheets, for example credit card numbers, telephone numbers and product identifiers.

              People

              • Assignee:
                tallison@apache.org Tim Allison
                Reporter:
                aeham.abushwashi Aeham Abushwashi
              • Votes:
                0
                Watchers:
                5
