Nutch / NUTCH-1693

TextMD5Signature computed on textual content

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3, 1.10
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      I created a new MD5-based signature, TextMD5Signature, that is computed on the textual content of a page. In our case we use boilerpipe to extract the main text from the content, so this signature is more effective for deduplication.
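
The idea can be illustrated with a minimal, standalone sketch. It is not the attached patch and does not use Nutch's Signature API; the class name TextMd5Sketch and the helpers signature and hex are hypothetical, and the text extraction step (boilerpipe in the reporter's setup) is assumed to have already happened upstream.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TextMd5Sketch {

    // Hypothetical helper: MD5 over the already-extracted text,
    // not over the raw page bytes.
    static byte[] signature(String extractedText) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(extractedText.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Render a digest as lowercase hex for easy comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two pages with the same main text but different surrounding markup.
        String mainText = "Same article body";
        String rawA = "<html><nav>Site A</nav><p>" + mainText + "</p></html>";
        String rawB = "<html><nav>Site B</nav><p>" + mainText + "</p></html>";

        // Signatures over the raw content differ ...
        System.out.println(hex(signature(rawA)).equals(hex(signature(rawB))));
        // ... while signatures over the extracted text match.
        System.out.println(hex(signature(mainText)).equals(hex(signature(mainText))));
    }
}
```

Hashing the extracted text means that pages differing only in navigation, ads, or other boilerplate markup collapse to one signature, which is what makes this approach more effective for deduplication than hashing the raw content.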

      Attachments

        1. NUTCH-1693-trunk.patch
          2 kB
          Markus Jelsma
        2. NUTCH-1693-trunk.patch
          2 kB
          Markus Jelsma
        3. NUTCH-1693-2x-v2.patch
          3 kB
          Sebastian Nagel
        4. NUTCH-1693.patch
          2 kB
          Tien Nguyen Manh

            People

              Assignee: Markus Jelsma (markus17)
              Reporter: Tien Nguyen Manh (tiennm)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: