  1. Nutch
  2. NUTCH-1375

Extract the main content of an HTML file

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.4
    • Fix Version/s: 1.8
    • Component/s: parser
    • Labels: None
    • Patch Info: Patch Available

    Description

      I wrote code that extracts the main content of an HTML page (usually a weblog).
      This content usually appears in a <body><p> tag, but there is no guarantee of that. There may also be several tags of the form <body><p>, of which only one holds the main content. The code first finds the body node and then computes a weight for each child node, based on the node's text volume and height. In this way it finds the lowest node that has the maximum text volume.
      I hope that improving this code will lead to ways of detecting fake or duplicated pages.
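
The idea described above can be sketched roughly as follows. This is not the code from NUTCH-1375.patch; it is a minimal illustration under several assumptions: the class `MainContentSketch`, its method names, and the `minShare` threshold of 0.5 are all hypothetical, "text volume" is taken to mean the number of text characters in a subtree, and the input is well-formed XHTML so that the JDK's XML parser can stand in for a real HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the attached patch.
public class MainContentSketch {

    // Text volume (assumed definition): total length of the trimmed text
    // nodes found anywhere in the subtree rooted at `node`.
    static int textVolume(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue().trim().length();
        }
        int total = 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            total += textVolume(children.item(i));
        }
        return total;
    }

    // Walk down from <body>, always following the element child with the
    // largest text volume, and stop once no single child holds at least
    // `minShare` of the current node's text. The node where the walk stops
    // is the lowest node that still contains most of the text.
    static Node findMainContent(Node body, double minShare) {
        Node current = body;
        while (true) {
            int currentVolume = textVolume(current);
            Node best = null;
            int bestVolume = 0;
            NodeList children = current.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // only element children are candidates
                }
                int volume = textVolume(child);
                if (volume > bestVolume) {
                    best = child;
                    bestVolume = volume;
                }
            }
            if (best == null || bestVolume < minShare * currentVolume) {
                return current;
            }
            current = best;
        }
    }

    // Parses well-formed XHTML and returns the text of the selected node.
    static String extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        Node body = doc.getElementsByTagName("body").item(0);
        return findMainContent(body, 0.5).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
            + "<div><p>Home</p><p>About</p></div>"
            + "<div><p>short teaser</p>"
            + "<p>This is the long main post text of the weblog entry.</p></div>"
            + "</body></html>";
        System.out.println(extract(page));
    }
}
```

In this example the walk goes from the body into the second div and then into its longer paragraph, so the program prints the long sentence. When the text is spread evenly across siblings, no child reaches the threshold and the walk stops at the parent. A real parse filter would probably also need to skip script and style elements and to tune the threshold.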

      Attachments

        1. NUTCH-1375.patch
          9 kB
          behnam nikbakht

            People

              Assignee: Unassigned
              Reporter: behnam nikbakht
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: