  1. PDFBox
  2. PDFBOX-586

Text Extraction on Android


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Environment: Windows XP + Eclipse + PDFBox sources

    Description

      Hi,

      I have noticed that I can extract text from some PDF files with PDFBox 0.7.4, but for the same file and the same page, PDFBox 1.1.0 either doesn't retrieve any text or the extraction is worse.

      Am I the only one who thinks there is a regression in text extraction?

      My code looks like this:

         // Load the PDF from the SD card
         PDDocument document = PDDocument.load("/sdcard/internals.pdf");
         int numberOfPages = document.getNumberOfPages();

         // Android code: get the app resources that hold the bundled files
         resources = this.getResources();
         android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);
         resourceGlyphList = R.raw.glyphlist;

         // Load the PDFBox property file from the raw resources
         InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);
         android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
         Properties properties = new Properties();
         properties.load(rawResource);

         PDFTextStripper stripper = new PDFTextStripper(properties);

         // Extract a single page: start and end are the same page (1 or any other page)
         stripper.setStartPage(pageNumber);
         stripper.setEndPage(pageNumber);

         String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
         android.util.Log.d(TEST_PDFBOX, "readerPDF()  stripper extract pages text : " + s);
      

      Maybe I should use page.getContents().getStream(), stripper.getTextForRegion( "class1" ) or stripper.writeText(doc, outputStream) instead?

      I want the text as a String, not written to a newly created file.
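
      On the "String, not a file" point: an API that writes to a java.io.Writer can be pointed at a StringWriter, which keeps everything in memory. The sketch below only illustrates that idea with plain JDK classes. TextSource and InMemoryText are hypothetical names made up for this example and are not part of PDFBox. It also assumes that writeText accepts a java.io.Writer, which should be checked against the Javadoc of the PDFBox version in use.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical stand-in for any API that writes extracted text to a Writer
// (for example a text stripper's writeText method). Not a PDFBox type.
interface TextSource {
    void writeText(Writer out) throws IOException;
}

// Hypothetical helper for this example only.
final class InMemoryText {
    private InMemoryText() {}

    // Runs the source against a StringWriter so the output stays in memory
    // and is returned as a String. No file is created.
    static String capture(TextSource source) {
        StringWriter buffer = new StringWriter();
        try {
            source.writeText(buffer);
        } catch (IOException e) {
            // StringWriter itself does not fail, but the source might.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }
}
```

      Under that assumption, a call could look like InMemoryText.capture(out -> stripper.writeText(document, out)). Note that stripper.getText(document), as used in the code above, already returns a String without creating a file.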

      Attachments

        1. ASEB-Camping_Car_ou_Bateau.pdf (86 kB, Bernard)
        2. EncryptedFileTest_AES.pdf (44 kB, Eddie B)
        3. EncryptedFileTest_RC4.pdf (43 kB, Eddie B)
        4. Eval.pdf (192 kB, Bernard)
        5. internals.pdf (43 kB, Bernard)
        6. PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt (2 kB, Andreas Lehmkühler)
        7. PDFBOX586-Eval.txt (26 kB, Andreas Lehmkühler)
        8. PDFBOX586-internals.txt (15 kB, Andreas Lehmkühler)
        9. TestPDFBox.zip (5.93 MB, Bernard)


          People

            Assignee: Unassigned
            Reporter: Bernard (bsegonnes)
            Votes: 0
            Watchers: 3
