TIKA-140

HTML parser unable to extract text

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2
    • Fix Version/s: 0.2
    • Component/s: parser
    • Labels: None

    Description

      At revision 648732, the current HTML parser does not parse the attached file (1.html) properly: ParseUtils.getStringContent() returns an empty string. Saving the same document as .txt from Firefox produces text, so the document does contain extractable content.

      Attachments

        1. 1.html
          54 kB
          Julien Nioche
        2. anynamespace.diff
          1 kB
          Julien Nioche
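
The issue does not state the cause, but the name of the attached patch (anynamespace.diff) suggests that element namespaces may be involved. The sketch below is only an illustration under that assumption and uses nothing from Tika: a plain SAX handler (the hypothetical class NamespaceBodyText) collects text inside <body>, and it returns nothing for an XHTML document when it insists on an empty namespace, but returns the text when it accepts any namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not Tika code and not the content of the attached patch.
public class NamespaceBodyText {

    /** Collects character data inside <body>, optionally ignoring the namespace. */
    static class BodyTextHandler extends DefaultHandler {
        private final boolean anyNamespace;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0; // greater than 0 while inside a matched <body>

        BodyTextHandler(boolean anyNamespace) {
            this.anyNamespace = anyNamespace;
        }

        private boolean isBody(String uri, String localName) {
            // Strict mode only matches <body> elements that have no namespace.
            return "body".equals(localName) && (anyNamespace || uri.isEmpty());
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0 || isBody(uri, localName)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString().trim();
        }
    }

    /** Parses the XML string with a namespace-aware SAX parser and returns the body text. */
    static String extract(String xml, boolean anyNamespace) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        BodyTextHandler handler = new BodyTextHandler(anyNamespace);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body><p>Hello</p></body></html>";
        System.out.println("strict namespace: [" + extract(xhtml, false) + "]");
        System.out.println("any namespace:    [" + extract(xhtml, true) + "]");
    }
}
```

Whether this mechanism is what actually caused the empty result for 1.html cannot be confirmed from the issue text alone.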

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Julien Nioche (jnioche)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved: