[NUTCH-1375] extract main content of a html file - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.4
Fix Version/s: 1.8
Component/s: parser
Labels: None

Patch Info: Patch Available

Description

I wrote code that extracts the main content of an HTML page (usually a weblog post).
This content usually appears in a <body><p> element, but there is no guarantee of that. There may also be multiple elements of the form <body><p>, of which only one holds the main content. The code first finds the body node, then computes a weight for each child node based on its text volume and height. In this way the code finds the lowest node that has the maximum text volume.
I hope that improving this code will lead to ways of detecting fake or duplicated pages.
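
The approach described above can be illustrated with a small sketch. This is not the attached patch (which targets Nutch and is not reproduced here); it is a minimal Python illustration using only the standard library. The names `Node`, `TreeBuilder`, `find_tag`, and `find_main_content`, as well as the `dominance` threshold, are assumptions made for this example. "Height" is interpreted here as preferring the deepest node that still holds most of its parent's text.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}  # their text never counts as content


class Node:
    """Hypothetical minimal DOM node: items holds str and Node in order."""

    def __init__(self, tag, parent=None, attrs=None):
        self.tag = tag
        self.parent = parent
        self.attrs = dict(attrs or [])
        self.items = []

    def child_nodes(self):
        return [i for i in self.items if isinstance(i, Node)]

    def volume(self):
        # Text volume: number of characters of text in this subtree.
        return sum(len(i) if isinstance(i, str) else i.volume()
                   for i in self.items)

    def text(self):
        parts = [i if isinstance(i, str) else i.text() for i in self.items]
        return " ".join(" ".join(parts).split())


class TreeBuilder(HTMLParser):
    """Builds a rough tree; it does not repair badly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current, attrs)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        cleaned = " ".join(data.split())
        if cleaned and self.current.tag not in SKIP_TAGS:
            self.current.items.append(cleaned)


def find_tag(node, tag):
    """Depth-first search for the first element with the given tag."""
    if node.tag == tag:
        return node
    for child in node.child_nodes():
        found = find_tag(child, tag)
        if found is not None:
            return found
    return None


def find_main_content(html, dominance=0.5):
    """Return the lowest node that still holds most of the body text.

    Starting at <body>, descend into the child with the largest text
    volume for as long as that child holds more than `dominance` of its
    parent's text. The 0.5 threshold is an assumption of this sketch.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = find_tag(builder.root, "body") or builder.root
    while True:
        children = node.child_nodes()
        total = node.volume()
        if not children or total == 0:
            return node
        best = max(children, key=Node.volume)
        if best.volume() <= dominance * total:
            return node
        node = best
```

Calling `find_main_content(page_html).text()` returns the text of the selected node. On a page with a short navigation block, a long post container, and a short footer, the sketch stops at the post container, because no single paragraph inside it holds more than half of its text. A known limitation is that a paragraph consisting mostly of one inline element (for example a long link) would cause the descent to continue into that inline element.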

Attachments

NUTCH-1375.patch
22/May/12 12:20
9 kB
behnam nikbakht

Issue Links

is related to

NUTCH-961 Expose Tika's boilerpipe support

Closed

People

Assignee: Unassigned

Reporter: behnam nikbakht

Votes: 0

Watchers: 3

Dates

Created: 22/May/12 12:19

Updated: 25/Aug/13 15:52

Resolved: 25/Aug/13 15:52