Tika / TIKA-3544

Extraction of long sequences of digits from Excel spreadsheets using Tika 1.20 doesn’t yield the expected results

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.20
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      If an Excel spreadsheet contains a long sequence of digits, such as a credit card number, Tika 1.20 emits that sequence in scientific notation.

      For example, the credit card number “6480195344642784” is extracted from the attached spreadsheet as 6.480195344642784E15, which is clearly not the desired output.
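      The scientific-notation form reported above is what Java's default double-to-string conversion produces for a 16-digit value. The sketch below only illustrates that formatting behavior; it does not use Tika or POI, and the class and method names are hypothetical. It assumes the digits sit in a numeric cell whose value reaches the formatter as a `double`.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of Tika or POI: contrasts two ways of
// turning a numeric cell value (a double) into text.
final class LongDigitFormatting {

    // Java's default conversion. For magnitudes of 10^7 and above it
    // switches to computerized scientific notation (mantissa, 'E', exponent).
    static String defaultText(double cellValue) {
        return Double.toString(cellValue);
    }

    // Plain decimal rendering without an exponent. new BigDecimal(double)
    // expands the exact binary value, so this is only tidy for whole numbers;
    // a fraction such as 0.1 would print its long exact expansion.
    static String plainText(double cellValue) {
        return new BigDecimal(cellValue).toPlainString();
    }

    public static void main(String[] args) {
        // The value behind the output quoted in this report.
        double cellValue = 6480195344642784d;
        System.out.println(defaultText(cellValue)); // scientific notation with exponent E15
        System.out.println(plainText(cellValue));   // 6480195344642784
    }
}
```

      Note that a `double` represents integers exactly only up to 2^53 (about 9.0E15), so longer identifiers cannot survive a numeric cell at all and would need to be stored as text in the spreadsheet.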

      I think the impact of this issue is significant. There’s plenty of information that can no longer be reliably extracted from spreadsheets. Think credit card numbers, telephone numbers and product identifiers to name a few.

      Attachments

        1. Credit Card Numbers.xlsx
          489 kB
          Jitin Jindal

            People

              Assignee: Unassigned
              Reporter: Jitin Jindal
              Votes: 0
              Watchers: 5
