- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Cannot Reproduce
- Affects Version/s: 6.4.1
- Fix Version/s: None
- Component/s: contrib - Solr Cell (Tika extraction)
- Labels: None
- Environment: Windows 7 Professional 64-bit (current patch/release)
HTMLStripCharFilterFactory does not remove <script> sections that appear in the <body> of an HTML document, but it does remove them when they appear in the <head>.
NOTE (03/22/2017): This occurs when using ExtractingRequestHandler via a curl command,
e.g. curl "http://localhost:8983/solr/test_core/update/extract?literal.id=33" -F "myfile=@test_data/a_simple_html_page_jira.htm"
With the same configuration, the filter works correctly in the Analysis tab of the Solr Admin tool.
Fails to remove <script> section content (removes the tags, leaves the content):
<body>
<script>
function myFunctionInsideBody() {
document.getElementById("demo_body").innerHTML = "Paragraph changed.";
}
</script>
...
</body>
Works - removes entire <script> section:
<head>
<script>
function myFunctionInsideHead() {
document.getElementById("demo_head").innerHTML = "Paragraph changed.";
}
</script>
...
</head>
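To make the expected and observed results easier to compare, here is a minimal Python sketch that produces both from the <body> example. It uses the standard library's html.parser, not Solr's HTMLStripCharFilterFactory, so it only illustrates the two outcomes and says nothing about the cause. The names TextExtractor, strip_html and BODY_SAMPLE are hypothetical, and the <p id="demo_body"> line is an assumed stand-in for the elided "..." part of the page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from HTML; optionally drops <script> content."""

    def __init__(self, drop_script_content=True):
        super().__init__(convert_charrefs=True)
        self.drop_script_content = drop_script_content
        self._in_script = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Expected behaviour: text inside <script> is discarded.
        if self._in_script and self.drop_script_content:
            return
        self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html, drop_script_content=True):
    parser = TextExtractor(drop_script_content)
    parser.feed(html)
    parser.close()
    return parser.text()


# The <p> element is an assumption; the report elides this part with "...".
BODY_SAMPLE = """<html><body>
<script>
function myFunctionInsideBody() {
document.getElementById("demo_body").innerHTML = "Paragraph changed.";
}
</script>
<p id="demo_body">A paragraph.</p>
</body></html>"""

if __name__ == "__main__":
    # Expected: script content removed along with the tags.
    print(strip_html(BODY_SAMPLE))
    # Reported symptom: tags removed, script content left in the text.
    print(strip_html(BODY_SAMPLE, drop_script_content=False))
```

Searching the indexed field for a string such as myFunctionInsideBody is a quick way to tell which of the two outputs a given Solr setup actually produced.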