  1. Solr
  2. SOLR-10337

HTMLStripCharFilterFactory does not seem to handle <script> section inside a <body> section

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 6.4.1
    • None
    • None
    • Windows 7 Professional 64-bit (current patch/release)

    Description

      HTMLStripCharFilterFactory does not remove <script> sections from the <body> section of an HTML document, but it removes them correctly from the <head> section.
      NOTE (03/22/2017): This occurs when using ExtractingRequestHandler via a curl command, e.g.:
      curl http://localhost:8983/solr/test_core/update/extract?literal.id=33 -F "myfile=@test_data/a_simple_html_page_jira.htm"
      The same configuration works correctly in the Analysis tab of the Solr Admin tool.

      Fails to remove <script> section content (removes the tags, leaves the content):
      <body>
      <script>
      function myFunctionInsideBody() {
      document.getElementById("demo_body").innerHTML = "Paragraph changed.";
      }
      </script>
      ...
      </body>

      Works - removes entire <script> section:
      <head>
      <script>
      function myFunctionInsideHead() {
      document.getElementById("demo_head").innerHTML = "Paragraph changed.";
      }
      </script>
      ...
      </head>
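
      The two outcomes described above can be made concrete with a small standalone sketch. This is not Solr, Lucene, or Tika code; it uses only Python's standard html.parser, and the class names, the strip() helper, and the <p> element standing in for the elided "..." are all hypothetical. It contrasts a stripper that drops only the tags (the reported behaviour for <body>) with one that also drops the text inside <script> (the expected behaviour).

```python
from html.parser import HTMLParser


class TagOnlyStripper(HTMLParser):
    """Drops markup but keeps every text node, including script bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


class ScriptAwareStripper(TagOnlyStripper):
    """Additionally discards text that appears inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            super().handle_data(data)


def strip(parser_cls, html):
    """Run one of the strippers and collapse whitespace in the result."""
    parser = parser_cls()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# The <p> line is a hypothetical stand-in for the "..." in the report.
SAMPLE = """<body>
<script>
function myFunctionInsideBody() {
document.getElementById("demo_body").innerHTML = "Paragraph changed.";
}
</script>
<p id="demo_body">A paragraph.</p>
</body>"""

reported = strip(TagOnlyStripper, SAMPLE)      # script source leaks into the text
expected = strip(ScriptAwareStripper, SAMPLE)  # only the visible text remains

print("reported:", reported)
print("expected:", expected)
```

      Comparing the indexed field value against these two strings is one way to tell which behaviour a given request path produces.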

          People

            Assignee: Steven Rowe (sarowe)
            Reporter: NW Brad (nwbrad)
