PDFBox / PDFBOX-448

Columns in text not extracted separately

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.7, 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: Text extraction
    • Labels: None

      Description

      The paper attached to PDFBOX-80 has two columns of text, but the extracted text is not separated by column. Instead, each extracted line combines the text from both columns.

      PDFTextStripper has a notion of columns and "articles / beads", but they are not being used with this file.

      1. WBPaper00003120.pdf
        407 kB
        Arun Rangarajan

        Activity

        Arun Rangarajan added a comment -

        Thanks for the pdfbox project.

        I have attached a PDF in which the contents of the left and right columns are merged. I am using pdfbox ver. 1.2.0 and extracting text from the command line:

        java org.apache.pdfbox.ExtractText

        Vincent Breitmoser added a comment -

        This issue appears to have been resolved somewhere along the way. The attached document parses correctly.

        Oliver Kopp added a comment -

        The PDF does not cause any errors. However, the columns are NOT separated in version 1.6.0.

        As a first step, I would really like the columns to be separated by two spaces.
        Afterwards, one could try to re-align the text into columns.

        Mel Martinez added a comment -

        By default, PDFTextStripper has its "shouldSeparateByBeads" attribute set to "true", which means that it will try to extract the text flowing from one column to another as contiguous text. Thus it will extract/render the text from column 1 first, followed by the text for column 2.

        If you set that flag to 'false', the stripper will try to extract the beads in rendered order, 'rendering' the vertically correlated lines from each column side by side, i.e., on the same line.
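        To make the difference between the two modes concrete, here is a toy model in plain Java. It does not call PDFBox and is not how PDFTextStripper is implemented; the class BeadOrderSketch and its methods byBead and byRenderedLine are hypothetical names used only for illustration, and each column is assumed to be already available as a list of lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: this is NOT PDFBox code and does not call PDFBox.
// All names here are hypothetical and exist only for illustration.
final class BeadOrderSketch {

    // Models shouldSeparateByBeads == true:
    // all lines of column 1 first, then all lines of column 2.
    static List<String> byBead(List<List<String>> columns) {
        List<String> out = new ArrayList<>();
        for (List<String> column : columns) {
            out.addAll(column);
        }
        return out;
    }

    // Models shouldSeparateByBeads == false:
    // lines at the same height are joined into one output line,
    // left column first, with nothing marking the column boundary.
    static List<String> byRenderedLine(List<List<String>> columns, String wordSeparator) {
        int rows = 0;
        for (List<String> column : columns) {
            rows = Math.max(rows, column.size());
        }
        List<String> out = new ArrayList<>();
        for (int row = 0; row < rows; row++) {
            List<String> parts = new ArrayList<>();
            for (List<String> column : columns) {
                if (row < column.size()) {
                    parts.add(column.get(row));
                }
            }
            out.add(String.join(wordSeparator, parts));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> page = Arrays.asList(
                Arrays.asList("left 1", "left 2"),
                Arrays.asList("right 1", "right 2"));
        System.out.println(byBead(page));
        System.out.println(byRenderedLine(page, " "));
    }
}
```

        In the second mode the two columns end up on the same output line, joined only by the ordinary word separator.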

        However, the text extraction does not currently mark the point at which the text in a line stops coming from the first bead and starts coming from the second. So it is currently not possible to tell which words in the line came from which column.

        The writePage() code detects a gap in a line of words and inserts the singleton WordSeparator object between words. When the text is 'rendered' it is replaced with the return value of the 'getWordSeparator()' method (which can be modified using the 'setWordSeparator(String)' method). It may be possible to do something similar with detecting the bead change.

        That is, if we detect that the bead count has been incremented since the last WordSeparator was inserted, we could also insert a 'BeadSeparator'. We could then similarly add the ability to customize the string used to render the BeadSeparator (it would default to an empty string to maintain the current behavior).
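        A rough sketch of that idea, written as standalone Java rather than as a patch to PDFTextStripper. Everything here is an assumption for illustration: the class BeadSeparatorSketch, the Word type with its bead index, and setBeadSeparator do not exist in PDFBox, and the real writePage() logic is more involved than this loop.

```java
import java.util.List;

// Standalone sketch of the proposed BeadSeparator. NOT PDFBox code:
// every class and method name here is hypothetical.
final class BeadSeparatorSketch {

    // A word on a line, tagged with the index of the bead (column) it came from.
    static final class Word {
        final String text;
        final int bead;

        Word(String text, int bead) {
            this.text = text;
            this.bead = bead;
        }
    }

    private String wordSeparator = " ";
    // Empty by default so the output matches the current behavior.
    private String beadSeparator = "";

    void setWordSeparator(String separator) {
        this.wordSeparator = separator;
    }

    void setBeadSeparator(String separator) {
        this.beadSeparator = separator;
    }

    // Joins the words of one rendered line. A word separator goes between
    // every pair of words; a bead separator is added as well whenever the
    // bead index changes from one word to the next.
    String renderLine(List<Word> words) {
        StringBuilder line = new StringBuilder();
        Word previous = null;
        for (Word word : words) {
            if (previous != null) {
                line.append(wordSeparator);
                if (word.bead != previous.bead) {
                    line.append(beadSeparator);
                }
            }
            line.append(word.text);
            previous = word;
        }
        return line.toString();
    }
}
```

        With the bead separator set to a single space, words from different columns end up two spaces apart, which is the interim output requested earlier in this thread.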

        I unfortunately do not have time to work on this myself right now. If someone else wants to run with this idea and try to implement it, that would be cool.

        For most users, the default behavior of 'shouldSeparateByBeads==true' accomplishes what is needed because it tries to keep the text logically contiguous. Are you sure this isn't what you want?

        John Hewson added a comment -

        I can confirm that the text is still extracted incorrectly with 2.0.


          People

          • Assignee: Unassigned
          • Reporter: Brian Carrier
          • Votes: 2
          • Watchers: 5
