Solr / SOLR-10337

HTMLStripCharFilterFactory does not seem to handle <script> section inside a <body> section

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 6.4.1
    • Fix Version/s: None
    • Labels:
      None
    • Environment:

      Windows 7 Professional 64-bit (current patch/release)

      Description

      HTMLStripCharFilterFactory does not remove <script> sections from the <body> section of an HTML document, but works fine for <script> sections in the <head> section.
      NOTE (03/22/2017): This occurs when using ExtractingRequestHandler via a curl command,
      e.g. curl http://localhost:8983/solr/test_core/update/extract?literal.id=33 -F "myfile=@test_data/a_simple_html_page_jira.htm"
      The same configuration works correctly in the Analysis tab of the Solr Admin tool.

      Fails to remove <script> section content (removes tags, leaves content):
      <body>
      <script>
      function myFunctionInsideBody() {
      document.getElementById("demo_body").innerHTML = "Paragraph changed.";
      }
      </script>
      ...
      </body>

      Works - removes entire <script> section:
      <head>
      <script>
      function myFunctionInsideHead() {
      document.getElementById("demo_head").innerHTML = "Paragraph changed.";
      }
      </script>
      ...
      </head>
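
      To make the two outcomes above concrete, here is a minimal Python sketch. It is not Solr or Lucene code and says nothing about where the difference arises in Solr. It only contrasts a stripper that removes tags but keeps <script> text (the reported result for <body>) with one that drops the whole <script> section (the expected result). The names TagStripper and strip_html, and the placeholder paragraph standing in for the elided "..." content, are hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text content; optionally drops everything inside <script>."""

    def __init__(self, drop_script_content):
        super().__init__()
        self.drop_script_content = drop_script_content
        self._in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text arrives here too; skip it only when asked to.
        if self._in_script and self.drop_script_content:
            return
        self.chunks.append(data)


def strip_html(html, drop_script_content):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = TagStripper(drop_script_content)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# The <body> sample from the report; the <p> line is a hypothetical
# placeholder for the content elided as "..." in the original.
BODY_SAMPLE = """<body>
<script>
function myFunctionInsideBody() {
document.getElementById("demo_body").innerHTML = "Paragraph changed.";
}
</script>
<p>Placeholder paragraph.</p>
</body>"""

# Reported behavior: tags removed, script text left in the output.
leaked = strip_html(BODY_SAMPLE, drop_script_content=False)
# Expected behavior: the entire <script> section is gone.
clean = strip_html(BODY_SAMPLE, drop_script_content=True)
```

      Checking whether an indexed field contains a token such as myFunctionInsideBody is a quick way to tell which of the two outcomes a given indexing path produced.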

            People

            • Assignee: Steven Rowe (sarowe)
            • Reporter: NW Brad (nwbrad)
            • Votes: 0
            • Watchers: 3