Nutch / NUTCH-193

move NDFS and MapReduce to a separate project

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels:
      None

      Description

      The NDFS and MapReduce code should move from Nutch to a new Lucene sub-project named Hadoop.

      My plan is to do this as follows:

      1. Move all code in the following packages from Nutch to Hadoop:

      org.apache.nutch.fs
      org.apache.nutch.io
      org.apache.nutch.ipc
      org.apache.nutch.mapred
      org.apache.nutch.ndfs

      These packages will all be renamed to org.apache.hadoop, and Nutch code will be updated to reflect this.

      2. Move selected classes from Nutch to Hadoop, as follows:

      org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
      org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable
      org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured

      org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
      org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
      org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon

      3. Add a jar containing all of the above to Nutch's lib directory.

      Does this plan sound reasonable?
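
As an illustration of steps 1 and 2, here is a minimal sketch of a lookup that maps an old Nutch name to its planned Hadoop name. `PackageRenamer` and `rename` are hypothetical names invented for this example and are not part of Nutch or Hadoop. The `ndfs` to `dfs` mapping is taken from the follow-up comments on this issue rather than from the list above, and the class targets are copied from the plan as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps old Nutch names to the Hadoop names in the plan. */
final class PackageRenamer {
    // Step 2: individually moved classes. Exact matches win over package prefixes.
    private static final Map<String, String> CLASSES = new LinkedHashMap<>();
    // Step 1: whole packages moved under org.apache.hadoop.
    private static final Map<String, String> PACKAGES = new LinkedHashMap<>();

    static {
        CLASSES.put("org.apache.nutch.util.NutchConf", "org.apache.hadoop.conf.Configuration");
        CLASSES.put("org.apache.nutch.util.NutchConfigurable", "org.apache.hadoop.Configurable");
        CLASSES.put("org.apache.nutch.util.NutchConfigured", "org.apache.hadoop.Configured");
        CLASSES.put("org.apache.nutch.util.Progress", "org.apache.hadoop.util.Progress");
        CLASSES.put("org.apache.nutch.util.LogFormatter", "org.apache.hadoop.util.LogFormatter");
        CLASSES.put("org.apache.nutch.util.Daemon", "org.apache.hadoop.util.Daemon");

        PACKAGES.put("org.apache.nutch.fs", "org.apache.hadoop.fs");
        PACKAGES.put("org.apache.nutch.io", "org.apache.hadoop.io");
        PACKAGES.put("org.apache.nutch.ipc", "org.apache.hadoop.ipc");
        PACKAGES.put("org.apache.nutch.mapred", "org.apache.hadoop.mapred");
        // Assumption from the later comments: NDFS code lands in the dfs package.
        PACKAGES.put("org.apache.nutch.ndfs", "org.apache.hadoop.dfs");
    }

    private PackageRenamer() {}

    /** Returns the planned new name, or the input unchanged if it is not moved. */
    static String rename(String name) {
        String moved = CLASSES.get(name);
        if (moved != null) {
            return moved;
        }
        for (Map.Entry<String, String> e : PACKAGES.entrySet()) {
            String oldPkg = e.getKey();
            // Match the package itself or anything below it, never a mere string prefix.
            if (name.equals(oldPkg) || name.startsWith(oldPkg + ".")) {
                return e.getValue() + name.substring(oldPkg.length());
            }
        }
        return name;
    }
}
```

Names outside the listed packages, such as anything that stays in Nutch proper, pass through unchanged, which is the behaviour a bulk import rewrite would need.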

          Activity

          Sami Siren added a comment -

          closing issues for released versions

          Mike Cafarella added a comment -

          It should be noted that the name "Nutch" also comes from one of Doug's children.
          They seem to have a proud future in advertising and product naming.

          Doug Cutting added a comment -

          I just committed this. Phew!

          Doug Cutting added a comment -

          Okay, I've moved the code from Nutch to Hadoop. Now I need to repair Nutch so that it still works!

          One remaining problem is the need to separate nutch config files from hadoop config files. There's now a hadoop-default.xml and hadoop-site.xml, which are separate from the similarly-named nutch files. For now, I'll fix this by adding the following methods to Hadoop's Configuration class:

          void addDefaultResource(String name);
          void addFinalResource(String name);

          Then add a Nutch utility class like:

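          import org.apache.hadoop.conf.Configuration;

          public class NutchConfiguration {
            // Create a Configuration that already includes the Nutch resources.
            public static Configuration create() {
              Configuration conf = new Configuration();
              addNutchResources(conf);
              return conf;
            }

            // Add the Nutch default and site files to an existing Configuration.
            public static Configuration addNutchResources(Configuration conf) {
              conf.addDefaultResource("nutch-default.xml");
              conf.addFinalResource("nutch-site.xml");
              return conf;
            }
          }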

          Then all of the places which currently call 'new NutchConf()' can be replaced with 'NutchConfiguration.create()'.

          Longer-term we might consider a more radical re-design of the configuration API. But first we need to get Hadoop and Nutch split.
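
To make the default/final split concrete, here is a minimal, self-contained sketch of one way layered resources could resolve a property. It is not Hadoop's Configuration class: `LayeredConf` is a hypothetical stand-in, resources are in-memory maps instead of XML files, and the precedence rules (final resources override default ones, and later additions override earlier ones) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a layered configuration; not Hadoop's Configuration. */
final class LayeredConf {
    private final List<Map<String, String>> defaults = new ArrayList<>();
    private final List<Map<String, String>> finals = new ArrayList<>();

    /** Registers a resource that is consulted only if no final resource has the key. */
    void addDefaultResource(Map<String, String> resource) {
        defaults.add(resource);
    }

    /** Registers a resource whose values override every default resource. */
    void addFinalResource(Map<String, String> resource) {
        finals.add(resource);
    }

    /** Looks the key up in final resources first, then in default resources. */
    String get(String key) {
        String value = lookup(finals, key);
        return value != null ? value : lookup(defaults, key);
    }

    // Walks a layer list from newest to oldest, so later additions win (an assumption).
    private static String lookup(List<Map<String, String>> layers, String key) {
        for (int i = layers.size() - 1; i >= 0; i--) {
            String value = layers.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

Under these assumptions, a value set in the stand-in for nutch-site.xml would shadow the same key in both default layers, while a key defined only in the Hadoop defaults would still be visible to Nutch code.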

          Doug Cutting added a comment -

          The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term.

          John Xing added a comment -

          what's in the name hadoop? Because "had oops"?

          Doug Cutting added a comment -

          Otis: yes, thanks, I meant org.apache.hadoop.dfs.

          Andrzej: I'm awaiting Mike's commit of NUTCH-183, which should happen today. I'll then try to make the split tomorrow.

          Andrzej Bialecki added a comment -

          Ok, the sooner the better from my POV. I didn't have anything in mind that would be included in Hadoop, rather Nutch patches that I'm working on. Affected patches include some of the recent larger ones: the adaptive fetch schedule thing and crawl metadata. No big deal, but we need to know what to shoot for.

          Otis Gospodnetic added a comment -

          I assume Doug meant org.apache.hadoop.dfs, not org.apache.nutch.dfs.

          Doug Cutting added a comment -

          Andrzej: I'd like to do this soon, this week or next. No matter how long I wait, there will probably always be a few patches queued that will need to be updated. But hopefully we can avoid large patches like NUTCH-169. What other patches are you concerned about in particular?

          Sami: yes, the fuse stuff would then make a great hadoop contrib package.

          Sami Siren added a comment -

          +1

          I guess the fuse-j / NDFS work from John and me could be part of Hadoop's contrib directory after this change?

          Andrzej Bialecki added a comment -

          What timeframe did you have in mind? There are a few patches in the queue that will be affected by this split.

          Other than that - emphatic yes!

          Doug Cutting added a comment -

          NDFS, the Nutch Distributed Filesystem, will be renamed HDFS, the Hadoop Distributed Filesystem. Its code will live in the package org.apache.nutch.dfs, and its fs implementation class will be named DistributedFileSystem.

          Doug Cutting added a comment -

          Link to the related Nutch issue.


            People

            • Assignee:
              Doug Cutting
              Reporter:
              Doug Cutting
            • Votes:
              1
              Watchers:
              0