
Key: NUTCH-520
Type: Improvement
Status: Closed
Resolution: Duplicate
Priority: Major
Assignee: Unassigned
Reporter: Doğacan Güney
Votes: 0
Watchers: 0
Nutch

A common infrastructure for different index backends

Created: 19/Jul/07 08:47 AM   Updated: 31/Jul/07 01:20 PM
Component/s: indexer
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  RFC_multiple_index_backends.patch (67 kB) - 2007-07-19 09:25 AM - Doğacan Güney - licensed for inclusion in ASF works
  RFC_multiple_index_backends_v2.patch (68 kB) - 2007-07-20 12:09 PM - Doğacan Güney - licensed for inclusion in ASF works
  RFC_multiple_index_backends_v3.patch (71 kB) - 2007-07-23 10:57 AM - Doğacan Güney - licensed for inclusion in ASF works

Resolution Date: 31/Jul/07 01:20 PM


Description
With Solr under discussion as a possible index and search backend, I think we need a new indexing architecture that doesn't depend on Lucene and can index to multiple backends.

Comments
Doğacan Güney added a comment - 19/Jul/07 09:25 AM - edited
Here is my proposal on how we can do it along with a patch:

i) Add a NutchDocument class:

A NutchDocument contains its fields as a mapping from String to List<String>, plus metadata (to be explained later) and a score. NutchDocument fields don't carry any information about how they are meant to be indexed or stored (not entirely true, explained later). These options are left out because different backends may not support the same options. For example, Solr doesn't (AFAIK) allow you to change how a field is stored at runtime. Also, one may want to index to a MySQL database (I don't know why, but it is possible), which again doesn't provide storage or indexing options.
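
To make the shape of such a document concrete, here is a minimal sketch. It is not the code in the attached patch: the class name NutchDocumentSketch and every method on it are assumptions chosen for illustration, and only the three parts (fields, metadata, score) come from the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and signatures are assumptions, not the patch's API.
class NutchDocumentSketch {
  // Field name -> every value added under that name, in insertion order.
  private final Map<String, List<String>> fields = new LinkedHashMap<>();
  // Free-form per-document hints that a backend may or may not understand.
  private final Map<String, String[]> metadata = new LinkedHashMap<>();
  private float score = 1.0f;

  // Appends a value; a field can hold several values (multi-valued fields).
  void add(String name, String value) {
    fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
  }

  // Returns a read-only view of the values, or an empty list for unknown fields.
  List<String> getFieldValues(String name) {
    List<String> values = fields.get(name);
    return values == null
        ? Collections.<String>emptyList()
        : Collections.unmodifiableList(values);
  }

  Map<String, String[]> getMetadata() { return metadata; }

  float getScore() { return score; }

  void setScore(float score) { this.score = score; }
}
```

Note that nothing in this class says whether a field is stored or tokenized; that decision is left to each backend.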

ii) Add a NutchIndexWriter interface:

NutchIndexWriter is the interface to implement if you want to add another indexing backend to Nutch. A NutchIndexWriter writes, not so surprisingly, NutchDocument-s. Implementations are meant to take the NutchDocument, convert it into their internal format, and then write the converted data. This patch adds two NutchIndexWriter-s: LuceneWriter and SolrWriter.

Also, Indexer.OutputFormat is updated to use NutchIndexWriter instead of Lucene's index writer. After this patch, it is possible to index to more than one backend simultaneously. Indexer is now used like this:

bin/nutch index -lucene crawl/indexes -solr "http://...." crawl/crawldb crawl/linkdb crawl/segments...

You can use the Lucene backend, the Solr backend, or both.
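
The writer abstraction and the fan-out to several backends can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: IndexWriterSketch, RecordingWriter and MultiWriterSketch are hypothetical names, a plain Map stands in for NutchDocument, and error handling is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; a plain Map stands in for NutchDocument, and all names
// and signatures are assumptions. Error handling is left out for brevity.
interface IndexWriterSketch {
  void write(Map<String, List<String>> doc);
  void close();
}

// Stand-in backend that only records what it was asked to write.
// A real backend would convert the document to its own format here.
class RecordingWriter implements IndexWriterSketch {
  final String name;
  final List<Map<String, List<String>>> written = new ArrayList<>();
  boolean closed = false;

  RecordingWriter(String name) { this.name = name; }

  @Override public void write(Map<String, List<String>> doc) { written.add(doc); }

  @Override public void close() { closed = true; }
}

// Passes every document to each configured backend in turn.
class MultiWriterSketch implements IndexWriterSketch {
  private final List<IndexWriterSketch> backends;

  MultiWriterSketch(IndexWriterSketch... backends) {
    this.backends = Arrays.asList(backends);
  }

  @Override public void write(Map<String, List<String>> doc) {
    for (IndexWriterSketch backend : backends) {
      backend.write(doc);
    }
  }

  @Override public void close() {
    for (IndexWriterSketch backend : backends) {
      backend.close();
    }
  }
}
```

With this shape, the output format only needs to hold one writer; whether it wraps one backend or several is decided when the command-line options are parsed.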

iii) Allow indexing filters to define index-level and document-level metadata:

NutchDocument fields are simple key/value pairs, and LuceneWriter can't determine how to store or index them just by looking at the fields. There are two ways to pass this information to index backends:

1) Through configuration: Options specified in configuration are meant to be valid for all documents. A new method "addIndexBackendOptions" is added to IndexingFilter. This is used by indexing filters to add 'hints' to index backends.

For example, the index-basic plugin calls:

LuceneWriter.addFieldOptions("title", LuceneWriter.STORE.YES, LuceneWriter.INDEX.TOKENIZED, conf);

This tells the Lucene backend to store and tokenize the title field.

2) Document-level: Per-document free-form <String, String[]> pairs. For example, if you normally want to store field "foo" in a Lucene index but do not want to store it for a specific document, you can add a <"lucene.field.foo", "lucene.store.no"> pair to that document's metadata, and LuceneWriter will not store the value of "foo" for that particular document.
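
The interplay between the two levels can be sketched as a small lookup: the configured option is the default, and a document-level pair overrides it. This is a hypothetical illustration, not the patch's code. Only the "lucene.field.foo" and "lucene.store.no" strings come from the example above; the class StoreOptionResolver, the "lucene.store.yes" value, and the fallback of not storing unconfigured fields are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of how a Lucene backend might decide whether to store a
// field. Names, the "lucene.store.yes" hint and the fallback are assumptions.
class StoreOptionResolver {
  enum Store { YES, NO }

  // Index-wide defaults, as an indexing filter would register via configuration.
  private final Map<String, Store> defaults = new HashMap<>();

  void addFieldOption(String field, Store store) {
    defaults.put(field, store);
  }

  // Document metadata, when present, wins over the index-wide default.
  Store resolve(String field, Map<String, String[]> docMetadata) {
    String[] hints = docMetadata.get("lucene.field." + field);
    if (hints != null) {
      for (String hint : hints) {
        if ("lucene.store.no".equals(hint)) {
          return Store.NO;
        }
        if ("lucene.store.yes".equals(hint)) {
          return Store.YES;
        }
      }
    }
    // Assumed fallback: fields without any configured option are not stored.
    return defaults.getOrDefault(field, Store.NO);
  }
}
```

A backend that does not understand the "lucene." keys would simply ignore them, which is what keeps the document itself backend-neutral.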

Extra notes:

  • This patch is a very early draft. I am sure that a lot of stuff doesn't work. However, I tested indexing a 30000-URL segment to both Solr and Lucene and didn't run into any problems. When indexing only to Lucene, there is no noticeable performance difference from earlier Nutch versions.
  • NutchDocument has an add(Field) method to make upgrading older indexing filters easy. However, it is comparatively slow and should only be used for upgrading.
  • I believe that this is a very important feature for Nutch. (I don't know why I am writing this as a note)

Comments, suggestions, reviews and other feedback are welcome.

Edit: Updated to reflect the latest patch.


Doğacan Güney added a comment - 20/Jul/07 12:10 PM
New version. Mostly API cleanups.

Doğacan Güney added a comment - 23/Jul/07 10:57 AM
New version.
  • Index metadata is replaced with Configuration.
  • Some helper methods are added to make specifying field options easier. For example, this is how one can specify that the "url" field is stored and indexed:

LuceneWriter.addFieldOptions("url", STORE.YES, INDEX.TOKENIZED, conf);

Enums in LuceneWriter are written in all caps to avoid confusion with Lucene's Field class. I guess we can give them better names instead.

  • Some cleanups.

Doğacan Güney added a comment - 31/Jul/07 01:20 PM
I am closing this as a duplicate, since NUTCH-442 (whose patch includes the latest patch here) encompasses this issue.