  1. PDFBox
  2. PDFBOX-3176

Add a removeRegion method to the PDFTextStripperByArea class

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.10, 1.8.11, 2.0.0
    • Fix Version/s: 1.8.11, 2.0.0
    • Component/s: Text extraction
    • Labels:
      None
    • Environment:
      All

      Description

      Hi,

      I am parsing a very complicated PDF for which I had to enable position sorting (setSortByPosition(true)); otherwise the parser cannot extract the text in sequence.

      So I decided to use the PDFTextStripperByArea class and define rectangles to extract text from. The problem is that if I define many rectangles on a single page, the extracted text again has no logical sequence. To get around this, it would be great to have a method to remove regions: we could add a region, extract its text, remove that region, then add a new region, and so on.

      I have already done a proof of concept on my local machine and it works fine. I added this method and tested it:

      public void removeRegion(String regionName) {
          this.regions.remove(regionName);
          this.regionArea.remove(regionName);
      }
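
      The add, extract, remove cycle described above can be sketched without PDFBox on the classpath. The class below is a hypothetical stand-in (RegionRegistry is not a PDFBox class) and it assumes that regions is a list of names and regionArea is a map from name to rectangle, which is what the proposed method implies. It only illustrates the bookkeeping, not the text extraction itself.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the region bookkeeping; not the PDFBox class itself.
class RegionRegistry {
    // Assumed field shapes: an ordered list of names plus a name -> rectangle map.
    private final List<String> regions = new ArrayList<>();
    private final Map<String, Rectangle2D> regionArea = new HashMap<>();

    void addRegion(String regionName, Rectangle2D rect) {
        regions.add(regionName);
        regionArea.put(regionName, rect);
    }

    // Mirrors the proposed method: drop the name and its rectangle together.
    void removeRegion(String regionName) {
        regions.remove(regionName);
        regionArea.remove(regionName);
    }

    // Returns a copy so callers cannot modify the internal list.
    List<String> getRegions() {
        return new ArrayList<>(regions);
    }

    boolean hasArea(String regionName) {
        return regionArea.containsKey(regionName);
    }

    public static void main(String[] args) {
        RegionRegistry registry = new RegionRegistry();
        registry.addRegion("header", new Rectangle2D.Double(0, 0, 600, 100));
        // ... text extraction for "header" would happen here ...
        registry.removeRegion("header");
        registry.addRegion("body", new Rectangle2D.Double(0, 100, 600, 600));
        System.out.println(registry.getRegions()); // prints [body]
    }
}
```

      Removing a name that was never added is a no-op in this sketch, since both List.remove(Object) and Map.remove(Object) ignore missing entries.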

      I can contribute this code myself if you like; just let me know. Thanks and regards,
      Praveer

            People

            • Assignee: Tilman Hausherr
            • Reporter: Praveer
            • Votes: 0
            • Watchers: 3
