[NUTCH-2770] Subcollection logic allows empty string as a whitelist value, thus matching every incoming document. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.16
Fix Version/s: 1.17
Component/s: indexer, plugin
Labels:
None

Description

If the whitelist element in subcollections.xml contains empty lines at the end (e.g. because the XML was formatted nicely), those lines can become empty strings in the string matching logic. That logic uses String.contains, which returns true when given an empty string as input.

This then causes that subcollection to be tagged on EVERY incoming document.

Here is a POC that shows the issue in isolation, since I do not yet have a dev environment set up for Nutch.

/**
This is a snippet that does the same logic as Subcollection.java in nutch.
https://github.com/apache/nutch/blob/fdee94d8e0894384f1fca7c9f16c7593a5bc928c/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
**/

import java.util.StringTokenizer;

public class HelloWorld
{
  public static void main(String[] args)
  {
    String urlToTest = "https://www.example.com/test/url/here";
    // Whitelist text as read from a nicely formatted XML element; note the trailing "\r\n\t".
    String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
    StringTokenizer st = new StringTokenizer(text, "\n\r");
    while (st.hasMoreElements()) {
      // The final token is "\t", which trims down to the empty string.
      String line = ((String) st.nextElement()).trim();
      // String.contains("") is true for any string, so an empty line matches every URL.
      boolean matched = urlToTest.contains(line);
      System.out.println("line: [" + line + "] = " + matched);
    }
  }
}


/**
output:
line: [//research.xyz.com/] = false
line: [/research/] = false
line: [] = true
As we can see, the trailing whitespace in our XML config produces an extra empty line, which matches EVERYTHING.
**/

There is a workaround: collapse the whitespace in the XML file. I think we should fix this anyway, so I will try to submit a patch soon that filters out empty strings.
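
For illustration, here is a minimal sketch of what filtering out empty strings could look like. It is not the attached patch, and the class and method names (WhitelistSketch, parsePatterns, matchesAny) are hypothetical; it only assumes the same tokenizing approach as the POC above and skips any line that is empty after trimming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for illustration only; not part of Nutch.
class WhitelistSketch {

  /** Splits the whitelist text into trimmed patterns, dropping blank lines. */
  static List<String> parsePatterns(String text) {
    List<String> patterns = new ArrayList<>();
    StringTokenizer st = new StringTokenizer(text, "\n\r");
    while (st.hasMoreTokens()) {
      String line = st.nextToken().trim();
      // Skip empty lines: "".contains-style matching would accept every URL.
      if (!line.isEmpty()) {
        patterns.add(line);
      }
    }
    return patterns;
  }

  /** Returns true if the URL contains at least one of the patterns. */
  static boolean matchesAny(String url, List<String> patterns) {
    for (String pattern : patterns) {
      if (url.contains(pattern)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
    List<String> patterns = parsePatterns(text);
    System.out.println("patterns: " + patterns);
    System.out.println(matchesAny("https://www.example.com/test/url/here", patterns));
    System.out.println(matchesAny("https://www.example.com/research/page", patterns));
  }
}
```

With the same whitelist text as the POC, only the two real patterns survive, so the test URL no longer matches, while a URL containing /research/ still does.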

Attachments

NUTCH-2770.patch
27/Feb/20 23:22
0.6 kB
Jason Grey

Issue Links

links to

GitHub Pull Request #503

People

Assignee: Sebastian Nagel
Reporter: Jason Grey
Votes: 0
Watchers: 4

Dates

Created: 26/Feb/20 22:43
Updated: 28/Jan/21 13:16
Resolved: 13/Mar/20 08:33