Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-611

PDFParser mixes the text from separate columns

VotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.9
    • 0.10
    • parser
    • None

    Description

      As reported on the dev list by Michael Schmitz :

      I don't think the current snapshot is parsing articles (pdfs with columns/beads) correctly. The text is not in the write order as it intermixes text from different beads. Try it on an academic paper. http://turing.cs.washington.edu/papers/acl08.pdf

      This can be fixed by changing the value of setSortByPosition to false, which is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been added as part of the commit rev 1029510, see https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787

      Ideally we could specify what value to set for these parameters via the Context object, but for the time being wouldn't it make sense to set setSortByPosition to the default value of false? I think that this would be the best option for most cases where docs have columns.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            jnioche Julien Nioche
            jnioche Julien Nioche
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment