[NUTCH-466] Flexible segment format - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.0.0
Fix Version/s: None
Component/s: None
Labels: None

Description

In many situations it is necessary to store more data associated with pages than the current segment format allows. Quite often this is binary data. There are two common workarounds: one is to use per-page metadata, either in Content or ParseData; the other is to use an external, independent database with page IDs as foreign keys.

Currently, segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, which is a natural extension of the existing segment format: introduce the ability to add arbitrarily named segment "parts", the only requirement being that they are MapFiles that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.

The existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
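
To make the proposed layout concrete, here is a minimal sketch of how code might tell the predefined parts apart from arbitrarily named ones in a segment directory. It is not taken from the attached patch and does not use the Hadoop or Nutch APIs: the class `SegmentParts`, its methods, and the part names `thumbnails` and `html_preview` are hypothetical, and each part is modeled as a plain subdirectory on the local file system rather than a real MapFile.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper (not part of Nutch): separates predefined segment parts from custom ones. */
public class SegmentParts {

    /** The six predefined part names listed in the issue description. */
    public static final Set<String> PREDEFINED = new LinkedHashSet<>(Arrays.asList(
            "content", "crawl_fetch", "crawl_generate",
            "crawl_parse", "parse_text", "parse_data"));

    /** Returns true if the name is one of the predefined segment parts. */
    public static boolean isPredefined(String partName) {
        return PREDEFINED.contains(partName);
    }

    /**
     * Returns the sorted names of all subdirectories of segmentDir that are
     * not predefined parts, i.e. the arbitrarily named parts of the proposal.
     * Assumption: every part is a directory directly under the segment.
     */
    public static SortedSet<String> customParts(Path segmentDir) throws IOException {
        SortedSet<String> custom = new TreeSet<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(segmentDir)) {
            for (Path child : children) {
                String name = child.getFileName().toString();
                if (Files.isDirectory(child) && !isPredefined(name)) {
                    custom.add(name);
                }
            }
        }
        return custom;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway segment with two predefined and two custom parts.
        Path segment = Files.createTempDirectory("segment");
        Files.createDirectory(segment.resolve("content"));
        Files.createDirectory(segment.resolve("parse_text"));
        Files.createDirectory(segment.resolve("thumbnails"));    // hypothetical custom part
        Files.createDirectory(segment.resolve("html_preview"));  // hypothetical custom part
        System.out.println(customParts(segment));
    }
}
```

In a real implementation the parts would live on a Hadoop file system and be opened as MapFiles; the sketch only illustrates that any directory name outside the predefined set would be treated as an additional part.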

Example applications:

- storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
- storing a pre-tokenized version of plain text for faster snippet generation
- storing linguistically tagged text for sophisticated data mining
- storing image thumbnails

etc, etc ...

I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

Attachments


- ParseFilters.java (3 kB), 31/May/07 19:54, Andrzej Bialecki
- segmentparts.patch (29 kB), 31/May/07 18:40, Andrzej Bialecki

People

Assignee: Andrzej Bialecki

Reporter: Andrzej Bialecki

Votes: 2

Watchers: 2

Dates

Created: 01/Apr/07 20:42

Updated: 08/Jun/11 21:34

Resolved: 01/Apr/11 14:35