  1. PDFBox
  2. PDFBOX-586

Text Extraction on Android

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Environment: Windows XP + Eclipse + PDFBox sources

    Description

      Hi,

      I have noticed that PDFBox 0.7.4 can extract text from some PDF files, but for the same file and the same page, PDFBox 1.1.0 retrieves no text at all, or the extraction is worse.

      Am I the only one who thinks there is a regression in text extraction?

      My code looks like this:

         PDDocument document = PDDocument.load("/sdcard/internals.pdf");
         int numberOfPages = document.getNumberOfPages();

         // Android code to get the raw resources bundled with the app
         resources = this.getResources();
         android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);
         resourceGlyphList = R.raw.glyphlist;
         InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);   // PDFBox property file
         android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
         Properties properties = new Properties();
         properties.load(rawResource);

         PDFTextStripper stripper = new PDFTextStripper(properties);

         stripper.setStartPage(pageNumber);   // 1 or any other page
         stripper.setEndPage(pageNumber);     // same page as above

         String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
         android.util.Log.d(TEST_PDFBOX, "readerPDF()  stripper extract pages text : " + s);
      

      Maybe I should use page.getContents().getStream(), stripper.getTextForRegion( "class1" ), or stripper.writeText(doc, outputStream) instead?

      I want the text as a String, not as a newly created file.
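
      On the String-versus-file point: stripper.getText(document) in the code above already returns a String. As far as I can tell, writeText(doc, outputStream) takes a java.io.Writer rather than a file, so passing a StringWriter should also keep everything in memory. The sketch below shows only that pattern; the TextSource interface and the collect helper are hypothetical stand-ins so the example runs without PDFBox.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class InMemoryTextSketch {

    /** Hypothetical stand-in for anything that writes extracted text to a Writer. */
    interface TextSource {
        void writeText(Writer out) throws IOException;
    }

    /** Runs the source against an in-memory StringWriter; no file is created. */
    static String collect(TextSource source) throws IOException {
        StringWriter buffer = new StringWriter();
        source.writeText(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: with PDFBox the lambda would be
        //   out -> stripper.writeText(document, out)
        String text = collect(out -> out.write("Page text"));
        System.out.println(text);
    }
}
```

      If that assumption about writeText holds, the result should match what getText returns, so it would not by itself explain the difference between 0.7.4 and 1.1.0.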

      Attachments

        1. TestPDFBox.zip (5.93 MB, Bernard)
        2. PDFBOX586-internals.txt (15 kB, Andreas Lehmkühler)
        3. PDFBOX586-Eval.txt (26 kB, Andreas Lehmkühler)
        4. PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt (2 kB, Andreas Lehmkühler)
        5. internals.pdf (43 kB, Bernard)
        6. Eval.pdf (192 kB, Bernard)
        7. EncryptedFileTest_RC4.pdf (43 kB, Eddie B)
        8. EncryptedFileTest_AES.pdf (44 kB, Eddie B)
        9. ASEB-Camping_Car_ou_Bateau.pdf (86 kB, Bernard)

            People

              Assignee: Unassigned
              Reporter: Bernard (bsegonnes)
              Votes: 0
              Watchers: 3