Hadoop Common / HADOOP-2951

A contrib package to update an index using Map/Reduce

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels: None

      Description

      This contrib package provides a utility to build or update an index
      using Map/Reduce.

      A distributed "index" is partitioned into "shards". Each shard corresponds
      to a Lucene instance. org.apache.hadoop.contrib.index.main.UpdateIndex
      contains the main() method which uses a Map/Reduce job to analyze documents
      and update Lucene instances in parallel.

      The Map phase of the Map/Reduce job formats, analyzes and parses the input
      (in parallel), while the Reduce phase collects and applies the updates to
      each Lucene instance (again in parallel). The updates are applied on the
      local file system of the node where a Reduce task runs, and the result is
      then copied back to HDFS.
      For example, if the updates caused a new Lucene segment to be created, the
      new segment would be created on the local file system first, and then
      copied back to HDFS.
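
      The two phases described above can be sketched in plain Python. This is
      only an illustration of the data flow, not the contrib package's code and
      not the Hadoop or Lucene API: the record layout, the hash-based routing
      rule in choose_shard, and the use of dictionaries as stand-ins for Lucene
      instances are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def choose_shard(doc_id, num_shards):
    # Hypothetical routing rule: a stable hash of the document id.
    # The issue text does not say how documents are assigned to shards.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def map_phase(records, num_shards):
    """Analyze each input record and emit (shard id, operation) pairs."""
    for op, doc_id, text in records:
        # Stand-in for analysis: lowercase and split into terms.
        terms = text.lower().split() if text is not None else None
        yield choose_shard(doc_id, num_shards), (op, doc_id, terms)


def reduce_phase(grouped_ops, shards):
    """Apply the collected operations to each shard."""
    for shard_id, ops in grouped_ops.items():
        # Stand-in for working on the reducer's local file system.
        local_copy = dict(shards.get(shard_id, {}))
        for op, doc_id, terms in ops:
            if op == "delete":
                local_copy.pop(doc_id, None)
            else:  # "insert" or "update"
                local_copy[doc_id] = terms
        # Stand-in for copying the updated shard back to HDFS.
        shards[shard_id] = local_copy
    return shards


def update_index(records, shards, num_shards):
    """Run the map phase, group by shard, then run the reduce phase."""
    grouped = defaultdict(list)
    for shard_id, operation in map_phase(records, num_shards):
        grouped[shard_id].append(operation)
    return reduce_phase(grouped, shards)
```

      In this sketch every operation for a given document id lands in the same
      shard, so a later delete finds the document that an earlier job inserted.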

      When the Map/Reduce job completes, a "new version" of the index is ready
      to be queried. Note that the new version of the index is not built from
      scratch. By leveraging Lucene's update algorithm, the new version of each
      Lucene instance shares as many files as possible with the previous
      version.

      The main() method in UpdateIndex requires the following information for
      updating the shards:

      • Input formatter. This specifies how to format the input documents.
      • Analysis. This defines the analyzer to use on the input. The analyzer
        determines whether a document is being inserted, updated, or deleted.
        For inserts or updates, the analyzer also converts each input document
        into a Lucene document.
      • Input paths. This provides the location(s) of updated documents,
        e.g., HDFS files or directories, or HBase tables.
      • Shard paths, or index path with the number of shards. Either specify
        the path for each shard, or specify an index path and the number of
        shards, in which case the shards are sub-directories of the index
        directory.
      • Output path. When the update to a shard is done, a message is put here.
      • Number of map tasks.
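
      The two ways of naming shards in the list above can be illustrated with a
      small Python helper. It is a sketch under stated assumptions: the function
      name, its parameters, and in particular the "shard-NNNNN" sub-directory
      naming are hypothetical, since the issue text does not say how the
      sub-directories are named.

```python
import posixpath


def resolve_shard_paths(shard_paths=None, index_path=None, num_shards=None):
    """Return one path per shard from either form of input."""
    if shard_paths:
        # Explicit form: the caller lists every shard path.
        return list(shard_paths)
    if index_path is None or num_shards is None or num_shards < 1:
        raise ValueError("give shard_paths, or index_path with num_shards >= 1")
    # Derived form: shards are sub-directories of the index directory.
    # The "shard-00000" naming pattern is an assumption for illustration.
    return [
        posixpath.join(index_path, f"shard-{i:05d}")
        for i in range(num_shards)
    ]
```

      For example, resolve_shard_paths(index_path="/index", num_shards=2) yields
      two sub-directories of /index under the hypothetical naming used here.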

      All of this information can be specified in a configuration file. All but
      the first two items can also be specified as command line options. See
      conf/index-config.xml.template for other configurable parameters.

      Note: Because of the parallel nature of Map/Reduce, the behaviour of
      multiple inserts, deletes or updates to the same document is undefined.
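
      A small Python example shows why the order matters. It is not the
      package's code: the tuple layout and the dictionary used as an index are
      assumptions for illustration. The same two operations on one document
      give different results depending on the order in which they are applied,
      and a parallel job does not guarantee either order.

```python
def apply_ops(ops):
    """Apply (operation, doc id, value) tuples in the order given."""
    index = {}
    for op, doc_id, value in ops:
        if op == "delete":
            index.pop(doc_id, None)
        else:  # "insert" or "update"
            index[doc_id] = value
    return index


ops = [("insert", "doc1", "v1"), ("delete", "doc1", None)]

# Insert first, then delete: the document is gone.
insert_then_delete = apply_ops(ops)

# Delete first, then insert: the document survives.
delete_then_insert = apply_ops(list(reversed(ops)))
```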

      1. contrib_index.tar.gz
        616 kB
        Ning Li
      2. contrib_index.tar.gz
        621 kB
        Ning Li
      3. contrib_index_javadoc.patch
        0.7 kB
        Ning Li

        Activity

        Hudson added a comment - Integrated in Hadoop-trunk #434 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/434/ )
        Doug Cutting added a comment -

        I committed this. Thanks, Ning!

        Ning Li added a comment -

        The new patch adds the Apache license and javadoc to the source files, and modifies the top-level build.xml to include this contrib package in the javadoc.

        Doug Cutting added a comment -

        > As to the top-level build.xml, I only need to change the javadoc target, right?

        Yes, that's right. Thanks!

        Ning Li added a comment -

        I'll add the Apache license and javadoc to the sources and submit a new patch.
        As to the top-level build.xml, I only need to change the javadoc target, right?

        Doug Cutting added a comment -

        +1 for including this in contrib.

        Some minor nits:

        • the source files lack the Apache license
        • the sources lack javadoc
        • we should modify the top-level build.xml to include this in the javadoc

        Other than that, this looks excellent! It has unit tests, examples, a good README, etc.

        Enis Soztutar added a comment -

        I have not examined the patch in sufficient detail, but it seems good. I think we can include this in the contrib directory unless anyone objects.


          People

          • Assignee: Doug Cutting
          • Reporter: Ning Li
          • Votes: 0
          • Watchers: 3
