Tika / TIKA-532

missing spaces in text extraction of BodyContentHandler

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels: None

    Description

      BodyContentHandler extracts the text from pages correctly,
      except for this page:

      http://www.lucidimagination.com/developers/whitepapers/whats-new-solr-14

      The page contains a selection list (a country drop-down), and
      the text returned by BodyContentHandler for it contains

      "...Country: *
      – Select a Country – United StatesCanadaArgentinaAustraliaBrazilChinaFranceGermanyIndiaIndonesiaItalyJapanMexicoRussiaSaudi"

      A space between the country names would be preferable.
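
      The effect can be illustrated without Tika. The following is a minimal sketch that uses only the SAX classes shipped with the JDK, on the assumption that the extracted text is built by concatenating SAX character events. The class names OptionSpacingSketch, PlainTextHandler and SpacedOptionHandler are hypothetical, and the markup is a shortened stand-in for the select list on the page, not the real page source. It is not Tika's implementation and not necessarily how the duplicate issue was resolved.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OptionSpacingSketch {

    /** Collects character data only; element boundaries add nothing to the text. */
    static class PlainTextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Hypothetical workaround: append one space whenever an option element closes. */
    static class SpacedOptionHandler extends PlainTextHandler {
        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("option".equals(qName)) {
                text.append(' ');
            }
        }
    }

    /** Parses a well-formed XHTML fragment and returns the trimmed collected text. */
    static String extract(String xhtml, PlainTextHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in markup: adjacent option elements with no whitespace between them.
        String xhtml = "<select><option>United States</option>"
                + "<option>Canada</option><option>Argentina</option></select>";

        // Text runs together, as in the report.
        System.out.println(extract(xhtml, new PlainTextHandler()));
        // Text is separated by single spaces.
        System.out.println(extract(xhtml, new SpacedOptionHandler()));
    }
}
```

      With the first handler the option texts run together as in the output quoted above; with the second they are separated by single spaces. Whether a fix belongs in the content handler or in the HTML parser that emits the events is a separate design question that this sketch does not settle.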

            People

              Assignee: Unassigned
              Reporter: Reinhard Pötz