Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Cannot Reproduce
-
6.4.1
-
None
-
None
-
Windows 7 Professional 64-bit (current patch/release)
Description
HTMLStripCharFilterFactory does not remove <script> sections from the <body> section of an HTML document, but works fine for <script> sections in the <head> section.
NOTE (03/22/2017): This occurs when using ExtractingRequestHandler via a curl command:
e.g. curl http://localhost:8983/solr/test_core/update/extract?literal.id=33 -F "myfile=@test_data/a_simple_html_page_jira.htm"
It will work correctly in the Analysis tab of the Solr Admin tool for the same configuration.
Fails to remove <script> section content (removes the tags, leaves the content):
<body>
<script>
function myFunctionInsideBody() {
document.getElementById("demo_body").innerHTML = "Paragraph changed.";
}
</script>
...
</body>
Works - removes entire <script> section:
<head>
<script>
function myFunctionInsideHead() {
document.getElementById("demo_head").innerHTML = "Paragraph changed.";
}
</script>
...
</head>
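For comparison, the expected result (script content dropped whether the <script> section sits in <head> or <body>) can be sketched with Python's standard html.parser module. This only illustrates the expected output. It is not how HTMLStripCharFilterFactory or ExtractingRequestHandler works internally, and the names TextOnly, strip_html and SAMPLE, as well as the <p> element in the sample page, are made up for this sketch.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_html(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    # Collapse the whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())


# Hypothetical page combining both cases from the report.
SAMPLE = """<html>
<head>
<script>
function myFunctionInsideHead() {
document.getElementById("demo_head").innerHTML = "Paragraph changed.";
}
</script>
</head>
<body>
<script>
function myFunctionInsideBody() {
document.getElementById("demo_body").innerHTML = "Paragraph changed.";
}
</script>
<p id="demo_body">A paragraph.</p>
</body>
</html>"""

print(strip_html(SAMPLE))
```

With this sample, only the paragraph text should remain. If the indexed field still contains myFunctionInsideBody after the curl upload, that matches the failure described above.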