  1. Tika
  2. TIKA-613

PDF parser is changing letter positions

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Environment: running Tika inside VB.NET 2010 with IKVM

    Description

      The PDF parser changes the position of some letters and inserts spaces inside words in the extracted text.

      For example:
      Parsed text
      "O fluox de caixa e os ganhos econmô icos referentes à estocagem dos RSD no aterro sanitário"

      Original
      "O fluxo de caixa e os ganhos econômicos referentes à estocagem dos RSD no aterro sanitário"

      I've parsed the same text with iTextSharp and the result was OK.

      The original pdf file is here:
      http://www.teses.usp.br/teses/disponiveis/8/8135/tde-04072008-113118/publico/DISSERTACAO_JOSE_EDUARDO_ABBAS.pdf

      [UPDATE]

      It looks like the letter reordering is solved in the new version (1.0), but spurious spaces still appear inside some words:

      Parsed
      "Os processos econômicos e polít icos causadores da atual forma de geração dos RSD"

      Original
      "Os processos econômicos e políticos causadores da atual forma de geração dos RSD"
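The two symptoms reported above (reordered letters in 0.9, spacing-only differences in 1.0) can be told apart mechanically when comparing extracted text against a known-good reference. The following is a minimal sketch using only the Java standard library; the class name `ExtractionDiff` and its methods are hypothetical helpers written for illustration, not part of Tika or PDFBox.

```java
// Hypothetical helper (not part of Tika): classifies how an extracted
// string differs from the expected text.
final class ExtractionDiff {
    enum Kind { IDENTICAL, SPACING_ONLY, REORDERED, OTHER }

    // Remove all whitespace so that only the character sequence is compared.
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    // Sort the code points so that two strings containing the same
    // characters in a different order compare equal.
    static String sortedChars(String s) {
        int[] cps = s.codePoints().sorted().toArray();
        return new String(cps, 0, cps.length);
    }

    static Kind classify(String parsed, String original) {
        if (parsed.equals(original)) {
            return Kind.IDENTICAL;
        }
        String p = stripWhitespace(parsed);
        String o = stripWhitespace(original);
        if (p.equals(o)) {
            return Kind.SPACING_ONLY;   // same characters, same order, extra or missing spaces
        }
        if (sortedChars(p).equals(sortedChars(o))) {
            return Kind.REORDERED;      // same characters, different order
        }
        return Kind.OTHER;              // characters were lost, added, or substituted
    }

    public static void main(String[] args) {
        // Excerpts of the two examples from this issue.
        System.out.println(classify(
            "O fluox de caixa e os ganhos econmô icos",
            "O fluxo de caixa e os ganhos econômicos"));
        System.out.println(classify(
            "Os processos econômicos e polít icos",
            "Os processos econômicos e políticos"));
    }
}
```

Under these assumptions, the first excerpt is classified as `REORDERED` and the second as `SPACING_ONLY`, which matches the distinction drawn in the update between the two versions.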

          People

            Assignee: Unassigned
            Reporter: Alex (lexmooze)