  1. Lucene - Core
  2. LUCENE-2246

While indexing Turkish web pages, "Parse Aborted: Lexical error...." occurs

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/examples
    • Labels: None
    • Lucene Fields: New

    Description

      When I try to index a Turkish page, the HTML parser fails with a "Parse Aborted: Lexical error.on ... line" error if a Turkish-specific character appears inside an HTML tag (for example, in an attribute value).
      For example, for "<IMG SRC="../images/head.jpg" WIDTH=570 HEIGHT=47 BORDER=0 ALT="ş">" the exception reports the "ş" character (Unicode code point 351, U+015F, which is outside the ASCII range) as the error. The same happens with the ı character in a title attribute:
      <a title="(ııı)">

      Turkish characters in the page content do not cause any problem.
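
To make the character values in the report concrete, here is a minimal sketch in plain Java that locates the first character above the 7-bit ASCII range in an attribute value. The class name NonAsciiAttributeCheck and the helper firstNonAsciiIndex are hypothetical and are not part of Lucene. The sketch does not call or reproduce the demo HTML parser; it only shows which characters in the two examples above fall outside ASCII.

```java
public class NonAsciiAttributeCheck {

    // Hypothetical helper: returns the index of the first char above the
    // 7-bit ASCII range (0x7F), or -1 if the string is pure ASCII.
    public static int firstNonAsciiIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of source-file encoding.
        String alt = "\u015F";                 // the ALT value from the IMG example
        String title = "(\u0131\u0131\u0131)"; // the title value from the <a> example

        System.out.println((int) alt.charAt(0));       // 351 (U+015F)
        System.out.println((int) title.charAt(1));     // 305 (U+0131)
        System.out.println(firstNonAsciiIndex(alt));   // 0
        System.out.println(firstNonAsciiIndex(title)); // 1
    }
}
```

Both characters are above 127, so neither has an ASCII value; 351 and 305 are their Unicode code points.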

      Attachments

        Activity

          People

            Assignee: Robert Muir (rcmuir)
            Reporter: Selim Nadi (selimnadi)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: