Nutch / NUTCH-466

Flexible segment format


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      In many situations it is necessary to store more data associated with pages than the current segment format allows. Quite often this is binary data. There are two common workarounds: one is to use per-page metadata, either in Content or ParseData; the other is to use an external, independent database with page IDs as foreign keys.

      Currently, segments can consist of the following predefined parts: content, crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I propose a third option, a natural extension of the existing segment format: the ability to add arbitrarily named segment "parts", with the only requirement being that they are MapFiles that store Writable keys and values. Alternatively, we could define a SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
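
      To make the idea of arbitrarily named parts concrete, here is a minimal in-memory sketch. It is not taken from the attached patch: the class names SegmentPartsSketch, Segment and Part are hypothetical, and a TreeMap of byte arrays stands in for a MapFile of Writable keys and values (only the sorted key-to-value shape is mirrored, nothing is written to disk).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class SegmentPartsSketch {

    /** Hypothetical stand-in for one segment part: a sorted key-to-bytes map. */
    static final class Part {
        // TreeMap keeps keys sorted, loosely mirroring the sorted layout of a MapFile.
        private final TreeMap<String, byte[]> entries = new TreeMap<>();

        void put(String key, byte[] value) {
            entries.put(key, value.clone());
        }

        /** Returns a copy of the stored value, or null if the key is absent. */
        byte[] get(String key) {
            byte[] value = entries.get(key);
            return value == null ? null : value.clone();
        }
    }

    /** Hypothetical segment holding the predefined parts plus any extra named parts. */
    static final class Segment {
        // The six part names listed in the issue description.
        private static final Set<String> PREDEFINED = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(
                "content", "crawl_fetch", "crawl_generate",
                "crawl_parse", "parse_text", "parse_data")));

        private final Map<String, Part> parts = new TreeMap<>();

        static boolean isPredefined(String name) {
            return PREDEFINED.contains(name);
        }

        /** Returns the named part, creating it on first use. */
        Part part(String name) {
            if (name == null || name.isEmpty() || name.contains("/")) {
                throw new IllegalArgumentException("invalid part name: " + name);
            }
            return parts.computeIfAbsent(name, n -> new Part());
        }

        boolean hasPart(String name) {
            return parts.containsKey(name);
        }

        Set<String> partNames() {
            return Collections.unmodifiableSet(parts.keySet());
        }
    }

    public static void main(String[] args) {
        Segment segment = new Segment();
        String url = "http://example.com/report.pdf";

        // A predefined part and a custom one are addressed in the same way.
        segment.part("parse_text").put(url, "plain text".getBytes(StandardCharsets.UTF_8));
        segment.part("html_preview").put(url, "<p>preview</p>".getBytes(StandardCharsets.UTF_8));

        System.out.println(segment.partNames());
        System.out.println(new String(segment.part("html_preview").get(url), StandardCharsets.UTF_8));
    }
}
```

      In this sketch a custom part such as html_preview differs from parse_text only by name, which is the property the proposal relies on.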

      The existing segment API and searcher API (NutchBean, DistributedSearch Client/Server) should be extended to handle such arbitrary parts.
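
      One possible shape for the searcher-side extension is a single lookup call keyed by segment name, part name and page URL. The sketch below is an assumption for illustration only: PartLookup, LocalLookup and FanOutLookup are hypothetical names, not existing NutchBean or DistributedSearch methods, and nested in-memory maps stand in for segments stored on disk or on remote nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartLookupSketch {

    /** Hypothetical searcher-side interface; not part of the existing Nutch API. */
    interface PartLookup {
        /** Returns the stored bytes, or null if the segment, part or URL is unknown. */
        byte[] getPart(String segment, String partName, String url);
    }

    /** In-memory stand-in for a node that serves some segments locally. */
    static final class LocalLookup implements PartLookup {
        // segment name -> part name -> URL -> value
        private final Map<String, Map<String, Map<String, byte[]>>> data = new HashMap<>();

        void store(String segment, String partName, String url, byte[] value) {
            data.computeIfAbsent(segment, s -> new HashMap<>())
                .computeIfAbsent(partName, p -> new HashMap<>())
                .put(url, value);
        }

        @Override
        public byte[] getPart(String segment, String partName, String url) {
            Map<String, Map<String, byte[]>> parts = data.get(segment);
            if (parts == null) {
                return null;
            }
            Map<String, byte[]> part = parts.get(partName);
            return part == null ? null : part.get(url);
        }
    }

    /** Stand-in for a distributed client: asks each node in turn, first answer wins. */
    static final class FanOutLookup implements PartLookup {
        private final List<PartLookup> nodes = new ArrayList<>();

        void addNode(PartLookup node) {
            nodes.add(node);
        }

        @Override
        public byte[] getPart(String segment, String partName, String url) {
            for (PartLookup node : nodes) {
                byte[] value = node.getPart(segment, partName, url);
                if (value != null) {
                    return value;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        LocalLookup nodeA = new LocalLookup();
        LocalLookup nodeB = new LocalLookup();
        // The segment name is an example value only.
        nodeB.store("20070401123456", "thumbnail", "http://example.com/a.png", new byte[] {1, 2, 3});

        FanOutLookup client = new FanOutLookup();
        client.addNode(nodeA);
        client.addNode(nodeB);

        byte[] thumb = client.getPart("20070401123456", "thumbnail", "http://example.com/a.png");
        System.out.println(thumb == null ? "not found" : thumb.length + " bytes");
    }
}
```

      The caller passes the part name as plain data, so adding a new kind of part would not require a new method on the searcher interface.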

      Example applications:

      • storing HTML previews of non-HTML pages, such as PDF, PS and Office documents
      • storing a pre-tokenized version of plain text for faster snippet generation
      • storing linguistically tagged text for sophisticated data mining
      • storing image thumbnails

      etc.

      I'm going to prepare a patchset shortly. Any comments and suggestions are welcome.

      Attachments

        1. ParseFilters.java
          3 kB
          Andrzej Bialecki
        2. segmentparts.patch
          29 kB
          Andrzej Bialecki


          People

            Assignee: Andrzej Bialecki
            Reporter: Andrzej Bialecki
            Votes: 2
            Watchers: 2
