  1. PDFBox
  2. PDFBOX-3176

Add a removeRegion method to the PDFTextStripperByArea class

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.10, 1.8.11, 2.0.0
    • Fix Version/s: 1.8.11, 2.0.0
    • Component/s: Text extraction
    • None
    • All

    Description

      Hi,

      I am parsing a very complicated PDF, for which I had to enable sorting by position (setSortByPosition(true)); otherwise the parser cannot extract the text in sequence.

      So I decided to use the PDFTextStripperByArea class and define rectangles to extract text from. The problem is that if I define many rectangles on a single page, the extracted text again has no logical sequence. To get around this, it would be great to have a method to remove regions: we could add a region, extract its text, remove that region, then add a new region, and so on.

      I have already done a proof of concept on my local computer and it works fine. I added this method and tested it:

      public void removeRegion(String regionName) {
          this.regions.remove(regionName);
          this.regionArea.remove(regionName);
      }

      I can contribute this code myself if you like. Let me know. Thanks and regards,
      Praveer
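
      The add, extract, remove cycle described above can be sketched in isolation. The block below is a minimal stand-in, not the PDFBox class: RegionRegistry, getRegions, hasArea and RemoveRegionSketch are hypothetical names, and the two fields only mirror the regions and regionArea fields that appear in the proposed method. It assumes regions is a list of names and regionArea maps each name to its rectangle.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the region bookkeeping; NOT the PDFBox class itself.
class RegionRegistry {
    // Region names in insertion order (assumed to mirror the `regions` field).
    private final List<String> regions = new ArrayList<>();
    // Region name -> rectangle (assumed to mirror the `regionArea` field).
    private final Map<String, Rectangle2D> regionArea = new HashMap<>();

    public void addRegion(String regionName, Rectangle2D rect) {
        regions.add(regionName);
        regionArea.put(regionName, rect);
    }

    // Same two steps as the proposed method: drop the name, then drop its rectangle.
    public void removeRegion(String regionName) {
        regions.remove(regionName);
        regionArea.remove(regionName);
    }

    // Returns a copy so callers cannot modify the internal list.
    public List<String> getRegions() {
        return new ArrayList<>(regions);
    }

    public boolean hasArea(String regionName) {
        return regionArea.containsKey(regionName);
    }
}

class RemoveRegionSketch {
    public static void main(String[] args) {
        RegionRegistry registry = new RegionRegistry();

        // Add one region; this is where its text would be extracted.
        registry.addRegion("header", new Rectangle2D.Double(0, 0, 600, 80));
        System.out.println(registry.getRegions()); // prints [header]

        // Remove it before adding the next one, so only one region is active at a time.
        registry.removeRegion("header");
        registry.addRegion("body", new Rectangle2D.Double(0, 80, 600, 600));
        System.out.println(registry.getRegions()); // prints [body]
    }
}
```

      Keeping a single active region per pass means each extraction covers exactly one rectangle, so the caller decides the order in which text is read.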

          People

            Assignee: Tilman Hausherr
            Reporter: Praveer