Tika / TIKA-2646

Tika parse["content"] returns jumbled text across cells of a table in a pdf

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 1.18
    • None
    • Component/s: parser
    • Environment: MacOS Sierra 10.12.6

    Description

      When text is extracted from a table in a PDF, the order of the cells is sometimes mixed up and adjacent words are concatenated without a separator. For example:

       

      HOURS DUR
      (hr)
      PHASE CODE SUB DESCRIPTION

      becomes: Hours Dur Code Sub DescriptionPhase

       

      In other, more serious cases, the text within a cell becomes scrambled with text from another cell. For example:

      HOURS DUR
      (hr)
      PHASE CODE SUB
      00:00 - 17:00 17.00 FLOWBK 33 P - FLOWBACK /
      TESTING
      E - RIG OUT
      TESTERS

      the second row becomes:

      17.00-00:00 17:00 FLOWBK E - RIG OUT

       

      TESTERS

       

      33 P -

       

      FLOWBACK /

       

      TESTING

      Note that the value of the second column has moved to the first column, and the "-" within the first column is misordered. The last two columns have switched places.
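
The first example above can be turned into a small ordering check. This is only a sketch: the helper name `first_out_of_order` and the `EXPECTED_HEADER` list are hypothetical, and the `content` string is assumed to be whatever the parse call returns under the "content" key. The check walks the expected tokens left to right and reports the first one that cannot be found after its predecessor.

```python
# Expected reading order of the header cells, taken from the first example.
# (EXPECTED_HEADER is a hypothetical name used only for this sketch.)
EXPECTED_HEADER = ["HOURS", "DUR", "PHASE", "CODE", "SUB", "DESCRIPTION"]


def first_out_of_order(expected_tokens, extracted):
    """Return the first expected token that cannot be found after its
    predecessor in the extracted text, or None if the order is preserved.

    Matching is case-insensitive and substring-based, so a token glued to
    a neighbour (e.g. "DescriptionPhase") still counts as present.
    """
    text = extracted.lower()
    cursor = 0
    for token in expected_tokens:
        idx = text.find(token.lower(), cursor)
        if idx == -1:
            # Token is missing, or it only occurs before the previous token.
            return token
        cursor = idx + len(token)
    return None


# Assumption: in practice `content` would come from the extraction call,
# e.g. parsed["content"]; here the strings from the report are used directly.
jumbled = "Hours Dur Code Sub DescriptionPhase"
intact = "HOURS DUR (hr) PHASE CODE SUB DESCRIPTION"

print(first_out_of_order(EXPECTED_HEADER, jumbled))
print(first_out_of_order(EXPECTED_HEADER, intact))
```

Because the check is substring-based, it flags ordering problems but does not detect the missing separator between "Description" and "Phase"; that would need a separate whitespace check.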

            People

              Assignee: Unassigned
              Reporter: Annie Didier (adidier)
              Votes: 0
              Watchers: 4
