Nutch / NUTCH-2531

Unclear steps in Nutch2 Tutorial

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Auto Closed
    • None
    • 2.5
    • None
    • None

    Description

      I was trying to install Nutch by following this tutorial: https://wiki.apache.org/nutch/Nutch2Tutorial

      Issues I've found:

      In "Obtaining Software and Configuration":

      1. "Specify the [...] along with all of the other Configuration options suggested within the Nutch 1.x tutorial."
          It would be better to copy the necessary configuration into this tutorial. I have no idea which settings exactly should be copied.

      2. "In addition add the missing hbase-common-0.98.8-hadoop2.jar transitive dependency, this is a bug in gora-hbase 0.6.1 as described here. This bug is removed in current Gora development."
          What does this step require from me? Should I add something to the dependencies? In which file? This point is written as information rather than as an instruction. It should either be deleted or be replaced with a clear instruction.

      3. "N.B. It's probably worth checking and setting all your usual configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before progressing."
          It's my first install. There is no such thing as a "usual configuration" for me yet.

      In "Invoke Nutch":

      1. "nutch readdb" doesn't return anything meaningful apart from the usage message:
        ./nutch readdb
        Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
        [-crawlId <id>] [-content] [-headers] [-links] [-text]
        -crawlId <id> - the id to prefix the schemas to operate on,
        (default: storage.crawl.id)
        -stats [-sort] - print overall statistics to System.out
        [-sort] - list status sorted by host
        -url <url> - print information on <url> to System.out
        -dump <out_dir> [-regex regex] - dump the webtable to a text file in
        <out_dir>
        -content - dump also raw content
        -headers - dump protocol headers
        -links - dump links
        -text - dump extracted text
        [-regex] - filter on the URL of the webtable entry
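
Going by the usage text above, the tutorial step presumably expects readdb to be run with one of the listed options. Below is a minimal sketch of what that might look like. It assumes the launcher is at ./nutch; the NUTCH variable, the function name readdb_examples, the example URL, and the dump directory name are placeholders of mine and do not come from the tutorial.

```shell
#!/bin/sh
# Path to the nutch launcher script (assumption: current directory, as in the report).
NUTCH="${NUTCH:-./nutch}"

# Hypothetical helper that groups three readdb invocations built from the usage text.
readdb_examples() {
  # Print overall statistics for the webtable.
  "$NUTCH" readdb -stats

  # Print the stored information for one URL (placeholder URL).
  "$NUTCH" readdb -url http://nutch.apache.org/

  # Dump the webtable, including extracted text, into a directory (placeholder name).
  "$NUTCH" readdb -dump readdb_dump -text
}
```

Calling readdb_examples runs the three commands in order. If this is what the tutorial intends, it would help to state the exact command there instead of a bare "nutch readdb".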

          People

            Assignee: Unassigned
            Reporter: Krzysztof Madejski (krzysztofmadejski)