  1. Tika
  2. TIKA-2650

Soft-hyphen is not extracted properly

Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 1.18
    • Fix Version/s: None
    • Component/s: app
    • Labels: None
    • Flags: Important

    Description

      We are trying to extract text from a PDF. If the PDF has a long word at the end of a line, the word is split: a soft hyphen follows the first part and the rest of the word continues on the next line. When this text is extracted, Tika automatically replaces the hyphen with a space, so the word comes out as two separate fragments.
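
To make the expected result concrete, here is a minimal Java sketch of the behaviour the report seems to ask for: a word broken by a soft hyphen (U+00AD) at a line end is rejoined instead of being split by a space. It uses only the Java standard library. The class name SoftHyphenJoin and the method joinSoftHyphenBreaks are hypothetical names chosen for illustration; this is not Tika's API and not a description of how Tika or PDFBox handle soft hyphens internally. It also assumes the soft hyphen is still present in the text being processed.

```java
// Hypothetical helper for illustration only; not part of Tika.
final class SoftHyphenJoin {

    /**
     * Removes a soft hyphen (U+00AD) that sits at a line break, together with
     * the break and any spaces or tabs around it, so the two word halves are joined.
     */
    static String joinSoftHyphenBreaks(String text) {
        // soft hyphen, optional spaces/tabs, a line break, optional leading spaces/tabs
        return text.replaceAll("\u00AD[ \\t]*\\r?\\n[ \\t]*", "");
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch: a word split across two lines.
        String raw = "The rabbit was very mis\u00AD\nchievous.";
        System.out.println(joinSoftHyphenBreaks(raw));
    }
}
```

With the sample string above, the two halves of the word are joined into "mischievous", whereas the reported behaviour would correspond to "mis chievous". Soft hyphens that are not followed by a line break are left untouched by this sketch.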

       

       

      Attachments

        1. document_example_w_sort.txt
          43 kB
          Yauheni Salopiy
        2. document_example_wo_sort.txt
          43 kB
          Yauheni Salopiy
        3. document_example.pdf
          139 kB
          Yauheni Salopiy
        4. document_example.txt
          45 kB
          Yauheni Salopiy
        5. output.txt
          5 kB
          Saurabh Patil
        6. Peter Rabbit.pdf
          3.12 MB
          Saurabh Patil

          People

            Assignee: Unassigned
            Reporter: Saurabh Patil (saurabh.patil@clariontechnologies.co.in)
            Votes: 1
            Watchers: 5

            Dates

              Created:
              Updated: