[NUTCH-978] A Plugin for extracting certain element of a web page on html page parsing. - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.2
Fix Version/s: None
Component/s: parser
Labels:
- gsoc2012
- mentor
Environment:

Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9

Description

Nutch uses the parse-html plugin to parse web pages. The plugin strips HTML tags and components such as JavaScript and CSS, leaving the extracted text to be stored in the index. By default, Nutch cannot select individual elements of an HTML page, such as particular tags, particular content, or a specific part of the page.

An HTML page has a tree-like XML structure, with HTML tags as its branches and text as its leaf nodes. These branches and nodes can be extracted using XPath. XPath lets us select a particular branch or node of an XML document, so it can be used to extract specific information and treat it differently based on its content and the user's requirements. Furthermore, the pages of a single website, such as a news site, usually share the same HTML structure for presenting information. Pages with the same structure can be parsed with the same XPath queries to retrieve the same content elements. All of the XPath queries for selecting the various content elements can be stored in an XPath configuration file.
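
As a rough illustration of the idea above, the following sketch evaluates XPath queries against a page using only the JDK's built-in `javax.xml.xpath` API. It is not the plugin's code: the class name `XPathExtractSketch`, the sample markup, and the two queries are all hypothetical, and the sketch assumes the page is already well-formed XML. Real-world HTML would first need to pass through a tolerant HTML parser to produce a DOM.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: selects page elements with XPath.
public class XPathExtractSketch {

    /** Returns the trimmed text content of every node matched by the XPath expression. */
    public static List<String> extract(String xhtml, String expression) throws Exception {
        // Assumption: the input is well-formed XML (XHTML-like), so the JDK parser accepts it.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up page: navigation, a headline, and an article body.
        String page = "<html><body>"
                + "<div id=\"nav\">Home | World | Sport</div>"
                + "<h1 class=\"headline\">Example headline</h1>"
                + "<div class=\"article\"><p>First paragraph.</p><p>Second paragraph.</p></div>"
                + "</body></html>";

        // Select only the headline, ignoring the navigation block.
        System.out.println(extract(page, "//h1[@class='headline']"));
        // Select each paragraph of the article body.
        System.out.println(extract(page, "//div[@class='article']/p"));
    }
}
```

Each query returns only the matched elements, so text such as the navigation block never reaches the extracted result.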

Nutch is meant to crawl a variety of web sources, and pages retrieved from different sources do not all share the same HTML structure, so each must be treated with the correct XPath configuration. The correct XPath configuration can be selected automatically by matching the URL of the web page against a regular expression that describes the valid URLs for that configuration.
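
The URL-based selection described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class `XPathConfigSelector`, its `register` and `select` methods, the representation of a configuration as a map from field name to XPath query, and the choice to return `null` when no pattern matches are not taken from the plugin.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: picks an XPath configuration by matching the page URL against regexes.
public class XPathConfigSelector {

    // Insertion-ordered, so the first registered pattern that matches wins.
    private final Map<Pattern, Map<String, String>> configs =
            new LinkedHashMap<Pattern, Map<String, String>>();

    /** Associates a URL regex with a configuration (field name -> XPath query). */
    public void register(String urlRegex, Map<String, String> fieldToXPath) {
        configs.put(Pattern.compile(urlRegex), fieldToXPath);
    }

    /** Returns the configuration whose regex matches the whole URL, or null if none does. */
    public Map<String, String> select(String url) {
        for (Map.Entry<Pattern, Map<String, String>> entry : configs.entrySet()) {
            if (entry.getKey().matcher(url).matches()) {
                return entry.getValue();
            }
        }
        return null; // Assumption: the caller falls back to ordinary parsing.
    }

    public static void main(String[] args) {
        XPathConfigSelector selector = new XPathConfigSelector();

        // Made-up configuration for a made-up news site.
        Map<String, String> newsConfig = new LinkedHashMap<String, String>();
        newsConfig.put("title", "//h1[@class='headline']");
        newsConfig.put("body", "//div[@class='article']/p");
        selector.register("https?://news\\.example\\.com/.*", newsConfig);

        System.out.println(selector.select("http://news.example.com/world/story.html"));
        System.out.println(selector.select("http://other.example.org/index.html"));
    }
}
```

Keeping the patterns in registration order makes the outcome predictable when more than one regex could match the same URL.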

This automatic mechanism allows Nutch users to process a variety of web pages and keep only the information they want, making the index more accurate and its content more flexible.

The components for this idea have been tested on Nutch 1.2 for selecting certain elements on various news websites for the purpose of document clustering. This includes a Configuration Editor application built using the NetBeans 6.9 Application Framework, though it still needs some debugging.

http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip

Attachments

- app_screenshoot_configuration_result.png (06/Apr/11 20:06, 200 kB, Ammar Shadiq)
- app_screenshoot_configuration_result_anchor.png (06/Apr/11 20:06, 323 kB, Ammar Shadiq)
- app_screenshoot_source_view.png (06/Apr/11 20:06, 157 kB, Ammar Shadiq)
- app_screenshoot_url_regex_filter.png (06/Apr/11 20:06, 205 kB, Ammar Shadiq)
- [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf (07/Apr/11 08:38, 51 kB, Ammar Shadiq)
- app_guardian_ivory_coast_news_exmpl.png (08/Apr/11 22:36, 199 kB, Ammar Shadiq)
- for_GSoc.zip (19/Feb/12 18:44, 2.22 MB, Lewis John McGibbney)
- version_alpha2.zip (19/Mar/12 08:26, 7.80 MB, Ammar Shadiq)

People

Assignee: Chris A. Mattmann

Reporter: Ammar Shadiq

Votes: 2

Watchers: 4

Dates

Created: 06/Apr/11 15:09

Updated: 04/Oct/16 17:13

Time Tracking

Estimated: 1,680h

Remaining: 1,680h

Logged: Not Specified