  1. PDFBox
  2. PDFBOX-5411

PDFTextStripper could use text size in reconstruction

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.25, 3.0.0 PDFBox
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      When two text runs partially overlap, PDFTextStripper seems to return a mix of their glyphs ordered simply by "leftmost x coordinate of the glyph". That makes sense in general, but it could also use glyph size to disambiguate "easy" cases like this one:

      Currently, this is the first argument passed to PDFTextStripper.writeString(String string, List<TextPosition> textPositions):

      "T0510E09620_S368b3aT92-29fa -4Leef-80I5e-N53c23efE7979f"

      I would of course hope for two calls:

      "TEST LINE"
      "051009620_368b3a92-29fa-4eef-805e-53c23ef7979f"

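      To illustrate the idea, here is a minimal sketch of size-based disambiguation. It is not PDFBox code: `Glyph` is a hypothetical stand-in for the values a `TextPosition` would supply through `getUnicode()`, `getXDirAdj()` and `getFontSizeInPt()`, and the coordinates and font sizes in the demo are invented. It assumes the two overlapping runs differ clearly in font size, which is the "easy" case described above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SizeAwareSplit {

    /** Hypothetical stand-in for the TextPosition values used here. */
    static final class Glyph {
        final String text;       // what getUnicode() would return
        final double x;          // what getXDirAdj() would return
        final double fontSizePt; // what getFontSizeInPt() would return

        Glyph(String text, double x, double fontSizePt) {
            this.text = text;
            this.x = x;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Buckets glyphs by font size (rounded to 0.1 pt), then orders each bucket by x. */
    static List<String> splitBySize(List<Glyph> glyphs) {
        // LinkedHashMap keeps the buckets in order of first appearance.
        Map<Long, List<Glyph>> bySize = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            long key = Math.round(g.fontSizePt * 10.0);
            bySize.computeIfAbsent(key, k -> new ArrayList<>()).add(g);
        }
        List<String> runs = new ArrayList<>();
        for (List<Glyph> bucket : bySize.values()) {
            bucket.sort(Comparator.comparingDouble(g -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : bucket) {
                sb.append(g.text);
            }
            runs.add(sb.toString());
        }
        return runs;
    }

    /** Places one glyph per character at a fixed advance (invented geometry). */
    static List<Glyph> layout(String text, double startX, double advance, double sizePt) {
        List<Glyph> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(new Glyph(String.valueOf(text.charAt(i)), startX + i * advance, sizePt));
        }
        return out;
    }

    /** Two overlapping runs, merged and sorted by x only, as in the report. */
    static List<Glyph> demoGlyphs() {
        List<Glyph> all = new ArrayList<>(layout("TEST LINE", 0.0, 10.0, 12.0));
        all.addAll(layout("051009620_368b3a92-29fa-4eef-805e-53c23ef7979f", 1.0, 2.0, 3.0));
        all.sort(Comparator.comparingDouble(g -> g.x));
        return all;
    }

    public static void main(String[] args) {
        List<Glyph> mixed = demoGlyphs();
        StringBuilder byXOnly = new StringBuilder();
        for (Glyph g : mixed) {
            byXOnly.append(g.text);
        }
        System.out.println("x only:  " + byXOnly);
        for (String run : splitBySize(mixed)) {
            System.out.println("by size: " + run);
        }
    }
}
```

      In a real stripper, a subclass that overrides `writeString` could presumably apply the same bucketing to its `textPositions` list and emit each bucket separately. Whether a fixed 0.1 pt rounding is the right tolerance is an open question.
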
      Attachments

        1. textDoubleText.pdf
          1 kB
          Lapo Luchini
        2. image-2022-04-15-09-26-20-917.png
          3 kB
          Lapo Luchini
        3. image-2022-04-08-16-13-17-334.png
          39 kB
          Lapo Luchini

          People

            Assignee: Unassigned
            Reporter: Lapo Luchini (lapo@lapo.it)