Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Implemented
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:

      Description

      I wrote a sample standalone application for reading PDFs with version 1.6. The parser returns ??? characters for one particular PDF, while a few other PDFs work fine.
      Is there a problem with the PDF file? I have checked it with parsers from other vendors and they return the proper text. I am getting these ??? characters only from PDFBox.

      1. aaa1.pdf
        88 kB
        Ravi Kumar

        Activity

        Ravi Kumar created issue -
        Ravi Kumar added a comment -

        In this file the header comes out as proper English text, but the description comes out as ?? characters.

        Ravi Kumar made changes -
        Field Original Value New Value
        Attachment aaa1.pdf [ 12521812 ]
        Ravi Kumar added a comment -

        And if I use the Tika parser, Chinese (CJK) characters come out, but the PDF doesn't contain any CJK characters.

        Ravi Kumar added a comment -

        Is there any solution?

        Andreas Lehmkühler made changes -
        Labels ??? PDFBox textextraction
        John Hewson made changes -
        Component/s Text extraction [ 12312228 ]
        Component/s Parsing [ 12312226 ]
        Tilman Hausherr added a comment -

        Whatever the problem was, it has been solved. The only "?" I get is for the (C), where Adobe reader returns nothing.

        Tilman Hausherr made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Implemented [ 10 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Ravi Kumar
          • Votes:
            0
            Watchers:
            2
