  1. PDFBox
  2. PDFBOX-1507

Getting Issue at text reading


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.7.1
    • Fix Version/s: None
    • Component/s: .NET, Text extraction
    • Labels: None
    • Environment: Windows, running PDFBox in .NET through an ikvm-7.2.4630.5 conversion; we are converting PDF files into ALTO files

    Description

      <?xml version="1.0" encoding="UTF-8"?>
      <alto xmlns="http://www.loc.gov/standards/alto/alto-v2.0.xsd">
      <Description><MeasurementUnit>inch1200</MeasurementUnit></Description>
      <Layout>
      <Page>
      <PrintSpace>
      <TextBlock>
      <TextLine>
      Feb 04, 2013 8:40:03 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
      WARNING: java.lang.NullPointerException
      java.lang.NullPointerException
      at org.apache.pdfbox.util.PDFTextStripper.processTextPosition(PDFTextStripper.java:954)
      at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:498)
      at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
      at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:556)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:271)
      at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
      at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:218)
      at cli.org.apache.pdfbox.examples.util.PrintWordLocations.processDocuments(PrintWordLocation.cs:185)
      at cli.org.apache.pdfbox.examples.util.PrintWordLocations.Main(PrintWordLocation.cs:228)
      at cli.System.AppDomain._nExecuteAssembly(Unknown Source)
      at cli.System.AppDomain.ExecuteAssembly(Unknown Source)
      at cli.Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly(Unknown Source)

      [The identical warning and stack trace are logged three more times, all with the timestamp Feb 04, 2013 8:40:03 PM.]

      </TextLine>
      </TextBlock>
      </PrintSpace>
      </Page>

      We converted the Java code from https://github.com/cokernel/pdf2alto to C#.
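
In the log above, each NullPointerException is reported as a WARNING from PDFStreamEngine.processOperator and the run continues, which would be consistent with the `<TextLine>` element coming out empty. The following is a minimal, standard-library-only Java sketch of that catch, log, and continue pattern. It does not use PDFBox: `CatchAndContinueSketch`, `processAll`, and the glyph handler are hypothetical names invented for illustration, and the uninitialised list standing in for the failing state is an assumption, since the report does not identify what was null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for an operator loop; this is not PDFBox's actual code. */
class CatchAndContinueSketch {

    /**
     * Passes each glyph to the handler. A failure is recorded as a warning
     * and the loop moves on to the next glyph instead of aborting.
     * Returns the number of glyphs handled without an exception.
     */
    static int processAll(List<String> glyphs, Consumer<String> handler, List<String> warnings) {
        int handled = 0;
        for (String glyph : glyphs) {
            try {
                handler.accept(glyph);
                handled++;
            } catch (RuntimeException e) {
                // Same shape as the log lines in the report: warn and continue.
                warnings.add("WARNING: " + e.getClass().getName());
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        StringBuilder line = new StringBuilder("<TextLine>");
        List<String> warnings = new ArrayList<>();

        // Assumption for illustration: some state the handler needs was never initialised.
        List<String> uninitialised = null;

        int handled = processAll(
                List.of("a", "b", "c", "d"),
                glyph -> {
                    uninitialised.add(glyph); // throws NullPointerException before any text is written
                    line.append(glyph);
                },
                warnings);

        line.append("</TextLine>");
        System.out.println(line);                                          // <TextLine></TextLine>
        System.out.println(handled + " handled, " + warnings.size() + " warnings"); // 0 handled, 4 warnings
    }
}
```

Because every glyph fails before its text is appended, the sketch ends with an empty element and one warning per glyph, the same overall shape as the output in this report. Finding the actual null reference would still require inspecting PDFTextStripper.java line 954 in the 1.7.1 sources.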

      Attachments

        1. Pdf2Text.zip
          9.46 MB
          Tanmay Mandal


          People

            Assignee: Unassigned
            Reporter: Tanmay Mandal (tanmay.mandal@dreamztech.com)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 1h
                Remaining: 1h
                Logged: Not Specified