Nutch / NUTCH-2381

In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.

    Details

    • Patch Info:
      Patch Available
    • Flags:
      Patch
    • Docs Text:
      My suggestion is to change the compare method of TokenComparator to:

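        private static class TokenComparator implements Comparator<Token> {
          @Override
          public int compare(Token t1, Token t2) {
            // Primary key: higher frequency first.
            if (t2.cnt != t1.cnt) return t2.cnt - t1.cnt;
            // Tie-break on the token text so equal-frequency tokens have a fixed order.
            return t2.val.compareTo(t1.val);
          }
        }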

      This orders the tokens by frequency and then by name.

      Description

      In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.

      The method TextProfileSignature.calculate uses a HashMap to store the tokens; after some processing, the tokens are sorted by decreasing frequency.

      For some pages, such as "http://curia.europa.eu/jcms/", the text "profile" is the same but the signature differs on each fetch.

      This happens because the tokens are sorted only by decreasing frequency. Tokens with the same frequency may not appear in the same order across fetches.

      HashMap makes no guarantees as to the order of the map, and in particular does not guarantee that the order will remain constant over time.

      My suggestion is to change the method TokenComparator.compare so that it sorts by frequency and then by name.
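      The effect of the tie-break can be illustrated with a small, self-contained sketch. It is not the Nutch source: the class TokenOrderDemo, its simplified Token class, and the helper sortedProfile are hypothetical stand-ins, and the two differently ordered input lists only simulate a HashMap handing back the same tokens in a different iteration order on another fetch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; not part of Nutch.
class TokenOrderDemo {

  // Simplified stand-in for the Token used by TextProfileSignature (assumed fields: cnt, val).
  static final class Token {
    final int cnt;
    final String val;

    Token(int cnt, String val) {
      this.cnt = cnt;
      this.val = val;
    }
  }

  // Frequency only: tokens with equal counts compare as equal,
  // so a stable sort leaves them in their input order.
  static final Comparator<Token> BY_FREQUENCY =
      (t1, t2) -> Integer.compare(t2.cnt, t1.cnt);

  // Frequency, then token text: ties are broken, so the result
  // no longer depends on the input order.
  static final Comparator<Token> BY_FREQUENCY_AND_NAME = (t1, t2) -> {
    if (t2.cnt != t1.cnt) return Integer.compare(t2.cnt, t1.cnt);
    return t2.val.compareTo(t1.val);
  };

  // Sorts a copy of the tokens and joins their text with single spaces.
  static String sortedProfile(List<Token> tokens, Comparator<Token> order) {
    List<Token> copy = new ArrayList<>(tokens);
    copy.sort(order);
    StringBuilder sb = new StringBuilder();
    for (Token t : copy) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(t.val);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Same tokens, two different iteration orders (simulating two fetches).
    List<Token> fetch1 = Arrays.asList(
        new Token(2, "court"), new Token(1, "law"), new Token(1, "case"));
    List<Token> fetch2 = Arrays.asList(
        new Token(1, "case"), new Token(2, "court"), new Token(1, "law"));

    System.out.println(sortedProfile(fetch1, BY_FREQUENCY));
    System.out.println(sortedProfile(fetch2, BY_FREQUENCY));
    System.out.println(sortedProfile(fetch1, BY_FREQUENCY_AND_NAME));
    System.out.println(sortedProfile(fetch2, BY_FREQUENCY_AND_NAME));
  }
}
```

      In this sketch the frequency-only ordering gives "court law case" for the first list and "court case law" for the second, while the two-key ordering gives "court law case" for both, so any hash computed over the sorted tokens would be stable.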

      Rodrigo

              People

              • Assignee:
                Sebastian Nagel (snagel)
              • Reporter:
                Rodrigo Joni Sestari (rodrigo.sestari)
              • Votes:
                0
              • Watchers:
                4
