  1. Tika
  2. TIKA-613

PDF parser changes letter positions

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
    • Environment:

      Running Tika inside VB.NET 2010 with IKVM

      Description

      The PDF parser changes the position of some letters and inserts spaces inside words.

      For example:
      Parsed text
      "O fluox de caixa e os ganhos econmô icos referentes à estocagem dos RSD no aterro sanitário"

      Original
      "O fluxo de caixa e os ganhos econômicos referentes à estocagem dos RSD no aterro sanitário"

      I've parsed the same text with iTextSharp and the result was correct.

      The original pdf file is here:
      http://www.teses.usp.br/teses/disponiveis/8/8135/tde-04072008-113118/publico/DISSERTACAO_JOSE_EDUARDO_ABBAS.pdf

      [UPDATE]

      It looks like the letter reordering is fixed in the new version (1.0), but spurious spaces are still inserted inside words:

      Parsed
      "Os processos econômicos e polít icos causadores da atual forma de geração dos RSD"

      Original
      "Os processos econômicos e políticos causadores da atual forma de geração dos RSD"
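
      The two symptoms above can be told apart mechanically. The following is a minimal sketch, not part of Tika: the class name ExtractionDiff and its methods are hypothetical helpers introduced only for illustration. It strips all whitespace from both strings and compares what is left. If the strings then match, the extraction only inserted spaces; if they still differ, letters were also moved.

```java
import java.text.Normalizer;

// Hypothetical helper, not a Tika API: classifies how extracted text
// differs from the expected text.
class ExtractionDiff {

    // Normalise to NFC (so precomposed and decomposed accents compare
    // equal), then remove every whitespace run.
    static String stripWhitespace(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).replaceAll("\\s+", "");
    }

    // True when the strings are not identical but become identical once
    // whitespace is ignored, i.e. only spacing was changed.
    static boolean differsOnlyByWhitespace(String parsed, String original) {
        return !parsed.equals(original)
                && stripWhitespace(parsed).equals(stripWhitespace(original));
    }

    public static void main(String[] args) {
        // Sample from the 0.9 report: letters swapped and a space inserted.
        String parsed09 = "O fluox de caixa e os ganhos econmô icos referentes à estocagem dos RSD no aterro sanitário";
        String original09 = "O fluxo de caixa e os ganhos econômicos referentes à estocagem dos RSD no aterro sanitário";

        // Sample from the 1.0 update: only a space inserted.
        String parsed10 = "Os processos econômicos e polít icos causadores da atual forma de geração dos RSD";
        String original10 = "Os processos econômicos e políticos causadores da atual forma de geração dos RSD";

        System.out.println("0.9 sample, spacing only: " + differsOnlyByWhitespace(parsed09, original09));
        System.out.println("1.0 sample, spacing only: " + differsOnlyByWhitespace(parsed10, original10));
    }
}
```

      Run on the samples quoted in this report, the check is false for the 0.9 text (letters are transposed) and true for the 1.0 text (only a space was added), which matches the observation in the update.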

            People

            • Assignee: Unassigned
            • Reporter: lexmooze Alex