Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.8.13
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

      Description

      I converted the attached Word document "DummyDoc.docx" to PDF ("DummyDoc.pdf").

      I then extracted the text with PDFBox 1.8:
      java -jar pdfbox-app-1.8.13.jar ExtractText "DummyDoc.pdf" txt.txt

      The generated output is:

      Dummy document for tag extraction

      Section 1


      DummyTagOne_01
      This is text body one


      DummyTagOne_02
      This is text body two

      Section 2
      DummyTagTwo_01
      This is text body three


      DummyTagTwo_02
      This is text body four


      DummyTagTwo_03
      This is text body five

      As you can see, the output contains "This is text body one " instead of "This is text body one ", and so on.
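
Whitespace differences like the one described above are hard to see in plain text. One way to pin them down is to compare the expected and extracted lines with spaces, tabs, and non-breaking spaces rendered as visible marks. The following is only a minimal sketch using the Java standard library, not part of PDFBox; the class name `WhitespaceDiff`, its methods, and the sample strings in `main` are hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a PDFBox class): shows where two texts differ in whitespace.
final class WhitespaceDiff {

    // Replace whitespace with visible marks: space -> middle dot,
    // tab -> right arrow, non-breaking space -> open-box mark.
    static String visible(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case ' ':      sb.append('\u00B7'); break;
                case '\t':     sb.append('\u2192'); break;
                case '\u00A0': sb.append('\u237D'); break;
                default:       sb.append(c);
            }
        }
        return sb.toString();
    }

    // Return one message per line position where the two texts differ.
    // A missing line on either side is treated as an empty string.
    static List<String> diff(List<String> expected, List<String> actual) {
        List<String> out = new ArrayList<>();
        int n = Math.max(expected.size(), actual.size());
        for (int i = 0; i < n; i++) {
            String e = i < expected.size() ? expected.get(i) : "";
            String a = i < actual.size() ? actual.get(i) : "";
            if (!e.equals(a)) {
                out.add("line " + (i + 1) + ": expected [" + visible(e)
                        + "] got [" + visible(a) + "]");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; the real lines would come from the
        // source document and from the ExtractText output file.
        List<String> expected = Arrays.asList("DummyTagOne_01", "This is text body one");
        List<String> actual = Arrays.asList("DummyTagOne_01", "This is  text body one ");
        for (String line : diff(expected, actual)) {
            System.out.println(line);
        }
    }
}
```

Pasting a few lines from the Word document and the matching lines from txt.txt into the two lists shows exactly which characters differ, which makes it easier to tell whether they come from the Word-to-PDF conversion or from the extraction step.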

        Attachments

        1. DummyDoc.pdf
          16 kB
          Ahmed Eltayeb
        2. DummyDoc.docx
          17 kB
          Ahmed Eltayeb

            People

            • Assignee: Unassigned
            • Reporter: aeltayeb Ahmed Eltayeb
            • Votes: 0
            • Watchers: 2
