Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels: None

      Description

      This issue will track Nutch/HBase integration.

      1. hbase_v2.patch
        364 kB
        Doğacan Güney
      2. hbase-integration_v1.patch
        333 kB
        Doğacan Güney
      3. latest-nutchbase-vs-original-branch-point.patch
        3.01 MB
        Doğacan Güney
      4. latest-nutchbase-vs-svn-nutchbase.patch
        2.75 MB
        Doğacan Güney
      5. malformedurl.patch
        4 kB
        Andrew McCall
      6. meta.patch
        7 kB
        Andrew McCall
      7. meta2.patch
        3 kB
        Andrew McCall
      8. nb-design.txt
        2 kB
        Doğacan Güney
      9. nb-installusage.txt
        2 kB
        Doğacan Güney
      10. nofollow-hbase.patch
        6 kB
        Andrew McCall
      11. NUTCH-650.patch
        10 kB
        Xiao Yang
      12. nutch-habase.patch
        8 kB
        Andrew McCall
      13. searching.diff
        26 kB
        Andrew McCall
      14. slash.patch
        2 kB
        Andrew McCall

        Issue Links

          Activity

          Doğacan Güney added a comment -

          This patch is what I have done so far. Right now, hbase integration is functional enough that you can inject, generate, fetch http pages, parse html pages and create a basic index (only parse-html, protocol-http and index-basic are updated for hbase).

          Before I go into design, first, a note: don't worry about the size of the patch. I know that it is huge, but for simplicity I created a new package (org.apache.nutchbase) and moved code there instead of modifying it directly, so the bulk of the patch is really just old code. If you are interested in reviewing this patch (and I hope you are), the interesting parts are: InjectorHbase, GeneratorHbase, FetcherHbase, ParseTable, UpdateTable, IndexerHbase and anything under util.hbase.

          A) Why integrate with hbase?

          • All your data in a central location
          • No more segment/crawldb/linkdb merges.
          • No more "missing" data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary with hbase integration.
          • A much simpler data model. If you want to update a small part of a single record, right now you have to write a MR job that reads the relevant directory, changes the single record, removes the old directory and renames the new one. With hbase, you can just update that record. Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language, may be easier for people to understand and use.

          B) Design
          Design is actually rather straightforward.

          • We store everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in hbase. I have written a small utility class that creates "webtable" with necessary columns.
          • So now most jobs just take the name of the table as input.
          • There are two main classes for interfacing with hbase. ImmutableRowPart wraps around a RowResult and has helper getters (getStatus(), getContent(), etc.). RowPart is similar to ImmutableRowPart but also has setters. The idea is that RowPart also wraps a RowResult but additionally keeps a list of updates done to that row. So when getSomething() is called, it first checks whether Something has already been updated (if so, it returns the updated version); otherwise it returns the value from the RowResult. RowPart can also create a BatchUpdate from its list of updates (a minimal sketch of this follows the list).
          • URLs are stored in reversed host order. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same tld/host/domain are stored closer to each other. TableUtil has methods for reversing and unreversing URLs (a sketch of the key scheme also follows the list).
          • CrawlDatum statuses are simplified. Since everything is in a central location now, there is no point in having separate DB and FETCH statuses.
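
          Below is a minimal sketch of the buffered-update idea behind RowPart, with a plain Map standing in for the wrapped RowResult; the method names are illustrative assumptions, not the patch's actual API.

          import java.util.HashMap;
          import java.util.Map;

          // Minimal sketch of the RowPart idea (not the patch's actual class): reads fall through
          // to the stored row unless the column has a pending local update, and the pending
          // updates are kept so they can later be flushed as a single BatchUpdate.
          public class RowPartSketch {
            private final Map<String, byte[]> storedRow;                  // stands in for the wrapped RowResult
            private final Map<String, byte[]> pendingUpdates = new HashMap<String, byte[]>();

            public RowPartSketch(Map<String, byte[]> storedRow) {
              this.storedRow = storedRow;
            }

            public byte[] get(String column) {
              byte[] updated = pendingUpdates.get(column);
              return updated != null ? updated : storedRow.get(column);   // updated version wins
            }

            public void put(String column, byte[] value) {
              pendingUpdates.put(column, value);
            }

            public Map<String, byte[]> pendingUpdates() {                 // material for a BatchUpdate
              return pendingUpdates;
            }
          }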
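
          And a small sketch of the reversed-host row key described above; the exact placement of the protocol and port is inferred from the example and may differ from the real TableUtil code.

          import java.net.MalformedURLException;
          import java.net.URL;

          // Sketch of the reversed-host key scheme, not the actual TableUtil implementation.
          public class ReverseUrlSketch {

            // http://bar.foo.com:8983/to/index.html?a=b -> com.foo.bar:http:8983/to/index.html?a=b
            public static String reverseUrl(String urlString) throws MalformedURLException {
              URL url = new URL(urlString);
              String[] hostParts = url.getHost().split("\\.");
              StringBuilder key = new StringBuilder();
              for (int i = hostParts.length - 1; i >= 0; i--) {           // reverse the host segments
                key.append(hostParts[i]);
                if (i > 0) key.append('.');
              }
              key.append(':').append(url.getProtocol());
              if (url.getPort() != -1) key.append(':').append(url.getPort());
              key.append(url.getFile());                                  // path plus query string
              return key.toString();
            }

            public static void main(String[] args) throws Exception {
              System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
            }
          }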

          Jobs:

          • Each job marks rows so that the next job knows which rows to read. For example, if GeneratorHbase decides that a URL should be generated, it marks the URL with a TMP_FETCH_MARK (marking a URL is simply creating a special metadata field). When FetcherHbase runs, it skips over anything without this special mark. A small sketch of this marker hand-off follows the list.
          • InjectorHbase: First, a job runs where injected URLs are marked. Then, in the next job, if a row has the mark but nothing else (here, I assumed that if a row has a "status:" column, it already exists), InjectorHbase initializes the row.
          • GeneratorHbase: Supports max-per-host configuration and topN. Marks generated URLs with a marker.
          • FetcherHbase: Very similar to the original Fetcher. Marks URLs successfully fetched. Skips over URLs not marked by GeneratorHbase.
          • ParseTable: Similar to original Parser. Outlinks are stored "outlinks:<fromUrl>" -> "anchor".
          • UpdateTable: Does updatedb's and invertlink's job. Also clears any markers.
          • IndexerHbase: Indexes the entire table. Skips over URLs not parsed successfully.
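
          The marker hand-off sketched (hypothetically) in code; RowPart/ImmutableRowPart come from the patch, but the metadata accessors and the mark name used here are assumptions for illustration.

          // Illustrative only; putMeta/hasMeta and the mark name are assumptions.
          public class FetchMarkSketch {
            private static final String TMP_FETCH_MARK = "__ftcmrk__";    // hypothetical mark name

            // GeneratorHbase side: select a row for fetching by writing a special metadata field.
            public static void markForFetch(RowPart row) {
              row.putMeta(TMP_FETCH_MARK, new byte[] { 1 });
            }

            // FetcherHbase side: skip anything the generator did not mark.
            public static boolean shouldFetch(ImmutableRowPart row) {
              return row.hasMeta(TMP_FETCH_MARK);
            }
          }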

          Plugins:

          • Plugins now have a

          Set<String> getColumnSet();

          method. Before starting a job, we ask the relevant plugins what exactly they want to read from hbase and read those columns. For example, FetcherHbase reads some columns but doesn't read "modifiedTime:". However, protocol-httphbase needs this column, so the plugin adds it to its set and FetcherHbase reads "modifiedTime:" when protocol-httphbase is active. This way, plugins read exactly what they want, whenever they want it. For example, during parse, CrawlDatum's fields are normally not available. However, with this patch, a parse plugin can ask for any of those fields and it will get them.

          • Also, the plugin API is simpler now. Most plugins will look like a variation of this:

          public void doStuff(String url, RowPart row);

          So now a plugin can also choose to update any column it wants.
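
          Putting the two hooks together, a converted plugin might look roughly like the hypothetical sketch below; the class is invented for illustration and only the two method shapes come from the patch.

          import java.util.HashSet;
          import java.util.Set;

          // Hypothetical plugin shape; only getColumnSet() and doStuff() mirror the patch.
          public class ExamplePluginHbase {

            // Declare up front which columns this plugin needs the job to read for it.
            public Set<String> getColumnSet() {
              Set<String> columns = new HashSet<String>();
              columns.add("status:");                                     // column names taken from the comment above
              columns.add("modifiedTime:");
              return columns;
            }

            // The job hands the plugin the row; the plugin may read or update any column on it.
            public void doStuff(String url, RowPart row) {
              // e.g. inspect the columns requested above and buffer updates on the RowPart
            }
          }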

          C) What's missing

          • A LOT of plugins.
          • No ScoringFilters at all.
          • Converters from old data into hbase
          • GeneratorHbase: no byIP stuff. does not shuffle URLs for fetching. no -adddays
          • FetcherHbase: no byIP stuff. No parsing during fetch. Shuffling is important for performance, but can be fixed. (One solution that comes to mind is to randomly partition URLs into reducers during map, and perform the actual fetching during reduce.) Supports following redirects, but not immediately. Http headers are not stored. Since there is no parsing in the fetcher, the fetcher always stores content.
          • ParseTable: No multi-parse (i.e. ParseResult).
          • IndexerHbase: No way to choose a subset of urls to index. (There is a marker in UpdateTable but I haven't put it in yet)
          • FetchSchedule: prevModifiedTime, prev... stuff is missing as I haven't yet figured out a way to read older versions of the same column.

          Most of what's missing is stuff I didn't have time to code. Should be easy to add later on.

          As always, suggestions/reviews are welcome.

          Jim Kellerman added a comment -

          +1 from an HBase perspective. Very nice!

          Otis Gospodnetic added a comment -

          This sounds great, Doğacan! Simplification is good, so +1 for the approach.
          Are you running this code in some Dev/QA/Prod system? Any observations about any kind of performance or scaling differences?

          Doğacan Güney added a comment -

          Thanks Otis!

          > Are you running this code in some Dev/QA/Prod system? Any observations about any kind of performance or scaling differences?

          No, unfortunately it is just my laptop for now.

          Andrzej Bialecki added a comment -

          Take a look at this; it might be a useful inspiration to move this patch forward:

          http://code.google.com/p/hbase-writer/

          Doğacan Güney added a comment -

          New patch. Contains some fixes and:

          • Support page modification detection in nutchbase (store previous signature and fetch time as distinct columns until we get support for scanning multiple versions)
          • A new PluggableHbase interface for nutchbase plugins
          • Converted HtmlParseFilters for nutchbase
          • Index cache-policies in index-basichbase.
          • Added no-caching support to parse-htmlhbase.
          • Added support for content encoding auto detection to nutchbase
          • Do not instantiate a new MimeUtil for every content
          • Added support for (HTTP) headers
          Andrzej Bialecki added a comment -

          This is an important issue, and it would be good if more people could collaborate on it. I suggest that you make a branch, e.g. branches/nutch_hbase, apply this patch there, and continue developing it until it reaches a usable state. Those interested in this direction can contribute patches to the branch, and once we feel it's functionally good enough to replace the current custom DBs we can merge the code to trunk/ .

          Doğacan Güney added a comment -

          I am moving this issue to 1.1.

          I agree with Andrzej's comment about creating a branch to make collaboration easier, but I will wait until after 1.0 to do so.

          Doğacan Güney added a comment -

          OK, I didn't want to wait until after 1.0.

          I created a git repository in

          http://github.com/dogacan/nutch.dogacan/tree/master

          I will probably not update this tree until after 1.0, but feel free to play with it.

          Andrew McCall added a comment -

          Hi Doğacan,

          I've been running this on a pseudo-distributed hadoop/hbase install I set up for the purpose, and for testing and development on my own. To get it to run out of the box I needed to make a couple of changes; I've attached a patch with them in it.

          Patch is against the git repository, btw.

          1) I changed the nutch-default.xml to use the hbase classes.
          2) I changed the parse-plugins.xml file to use the hbase classes.
          3) I tweaked the IndexerHbase so that the output is wrapped in a hadoop.io.Text; otherwise it wouldn't work for me.
          4) Altered the WebTableCreator so that it's a command that can be executed from the command line like any other, and made the table name an option like the others.

          This should now check out, compile and allow the following commands to be run:

          ./bin/nutch org.apache.nutchbase.util.hbase.WebTableCreator webtable

          ./bin/nutch org.apache.nutchbase.crawl.InjectorHbase webtable file:///path/to/urls_dir

          ./bin/nutch org.apache.nutchbase.crawl.GeneratorHbase webtable

          ./bin/nutch org.apache.nutchbase.fetcher.FetcherHbase webtable

          ./bin/nutch org.apache.nutchbase.parse.ParseTable webtable

          ./bin/nutch org.apache.nutchbase.indexer.IndexerHbase /index webtable

          ./bin/nutch org.apache.nutchbase.crawl.UpdateTable webtable

          I've been running a test crawl using this code and it seems to be working well for me.

          Andrew McCall added a comment -

          Patch against the git repo.

          Andrew McCall added a comment -

          Patch for nutch-hbase that does the same as the patch supplied for NUTCH-693.

          Andrew McCall added a comment -

          Ran across a MalformedURLException when running GeneratorHbase.

          Stacktrace:

          java.net.MalformedURLException: no protocol: http?grp_name=MideastWebDialog&grp_spid=1600667023&grp_cat=://answers.yahoo.com/Regional/Regions/Middle_East/Cultures___Community&grp_user=0
          at java.net.URL.<init>(URL.java:567)
          at java.net.URL.<init>(URL.java:464)
          at java.net.URL.<init>(URL.java:413)
          at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:88)
          at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
          at org.apache.nutchbase.crawl.GeneratorHbase$GeneratorMapReduce.map(GeneratorHbase.java:135)
          at org.apache.nutchbase.crawl.GeneratorHbase$GeneratorMapReduce.map(GeneratorHbase.java:108)
          at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
          at org.apache.hadoop.mapred.Child.main(Child.java:155)

          The problem lies with URLs in links with no file and no / between the host and the querystring.

          e.g. http://answers.yahoo.com?grp_name=MideastWebDialog&grp_spid=1600667023&grp_cat=/Regional/Regions/Middle_East/Cultures___Community&grp_user=0

          The first / in the grp_cat field gets interpreted as the beginning of the file.

          The attached patch solves the problem by ensuring a / is added after host:port if it doesn't exist. It also includes tests and updates the build.xml to run the tests in org.apache.nutch*/** instead of just org.apache.nutch/**.
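
          A hypothetical sketch of the kind of fix described above (not the actual patch code), which ensures there is a '/' between host:port and the query string before the URL reaches java.net.URL:

          // Sketch only: insert a '/' when the authority runs straight into the query string.
          public class SlashFixSketch {

            static String ensureSlashAfterAuthority(String url) {
              int schemeEnd = url.indexOf("://");
              if (schemeEnd < 0) return url;                              // not an absolute URL, leave it alone
              int authorityStart = schemeEnd + 3;
              int firstSlash = url.indexOf('/', authorityStart);
              int firstQuery = url.indexOf('?', authorityStart);
              // a '?' appearing before any '/' means there is no path separator after host:port
              if (firstQuery >= 0 && (firstSlash < 0 || firstQuery < firstSlash)) {
                return url.substring(0, firstQuery) + "/" + url.substring(firstQuery);
              }
              return url;
            }

            public static void main(String[] args) {
              // prints http://answers.yahoo.com/?grp_name=MideastWebDialog
              System.out.println(ensureSlashAfterAuthority("http://answers.yahoo.com?grp_name=MideastWebDialog"));
            }
          }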

          Andrew McCall added a comment -

          Fixes an issue the above patch creates: a zero-length filename without a slash would throw an exception.

          Andrew McCall added a comment -

          I've updated the way the TMP_X_MARK metadata is handled to allow multiple fetch cycles to take place at the same time.

          • GeneratorHbase adds the TMP_FETCH_MARK as before
          • FetcherHbase
            • crawls any rows with TMP_FETCH_MARK set and sets TMP_PARSE_MARK so the Parser knows to parse the row as before
            • removes the column TMP_FETCH_MARK so that any other later fetch between now and calling UpdateTable won't re-fetch the row.
          • ParseTable
            • parses any rows with TMP_PARSE_MARK set and sets TMP_UPDATE_MARK as before
            • removes the column TMP_PARSE_MARK so that a later parse won't re-parse the row.
          • UpdateTable now only updates rows with TMP_UPDATE_MARK set by default leaving rows that have not been fetched or parsed yet in their current state.
          • calling UpdateTable with the new -all option forces UpdateTable to update all rows in the table and acts as it did before the patch removing any TMP_X_MARK rows.
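
          For reference, the mark transitions above in one compact, hypothetical snippet (the RowPart mutators and constant names are assumptions):

          //   generate:  sets TMP_FETCH_MARK
          //   fetch:     requires TMP_FETCH_MARK; deletes it and sets TMP_PARSE_MARK
          //   parse:     requires TMP_PARSE_MARK; deletes it and sets TMP_UPDATE_MARK
          //   update:    requires TMP_UPDATE_MARK (or processes every row with -all)
          public static void afterSuccessfulFetch(RowPart row) {
            row.deleteMeta(TMP_FETCH_MARK);                               // a concurrent cycle won't re-fetch this row
            row.putMeta(TMP_PARSE_MARK, new byte[] { 1 });                // hand the row off to ParseTable
          }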
          Andrew McCall added a comment -

          Slight update to the previous patch: calling deleteMeta during the map phase stopped the scanner from getting any more results. I've moved it into the reduce and it works properly now.

          Also patched a bug in UpdateTable.java where a bad URL could crash the whole update process. I've just wrapped it in a try/catch block and dumped a warning to the logs.

          Andrew McCall added a comment -

          Added a few bits so that you can now search. I've added a very basic NutchBeanHbase which does most of what the previous version of NutchBean does. I'm pretty sure it won't work for distributed searching yet. I've also altered the build.xml to include the hbase jar with the war and altered the jsp pages to use the new NutchBean and the ImmutableRow object I expose.

          I've been testing this on my local machine through a few crawls and it seems to be working well.

          Otis Gospodnetic added a comment -

          Doğacan, I think http://github.com/dogacan/nutch.dogacan/tree/master is gone.
          Nutch 1.0 is out.
          Is it time for that branch?

          The work you and Andrew have done so far looks too useful to drop!

          Doğacan Güney added a comment -

          I recreated the tree at

          http://github.com/dogacan/nutchbase/tree/nutchbase

          and

          git://github.com/dogacan/nutchbase.git

          Note that the master branch is now the vanilla 1.0 branch of nutch. All development is done in the nutchbase branch.

          I have also applied several of Andrew's patches.

          Doğacan Güney added a comment - edited

          Many changes.

          First, for simplicity, I changed master branch to be the main development branch. So to take a look at nutchbase simply do:

          git clone git://github.com/dogacan/nutchbase.git

          (sorry Andrew for the random change)

          • Upgraded to hbase trunk and hadoop 0.20.
          • FetcherHbase now fetches URLs in reduce(). I added a randomization part so that reduce does not get URLs from the same host one after another but in a random order. Politeness rules are still followed, and one host will always be in one reducer no matter how many URLs it has (at least, that's what I tried to do; testing is welcome). A rough sketch of the partitioning idea is shown below.
          • If your fetch is cut short, you lose almost no fetched URLs, as we immediately write the fetched content to the table*. For example, if you are doing a HUGE one-day fetch and at the 20th hour your fetch dies, then 20 hours' worth of fetched URLs will already be in hbase. The next execution of FetcherHbase will simply pick up where it left off.
          • Same thing for ParseTable. If parse crashes midstream, the next execution will continue at the crash point*.
          • Added a "-restart" option for ParseTable and FetcherHbase. If "-restart" is present then these classes start at the beginning instead of continuing from wherever the last run finished.
          • Added a "-reindex" option to IndexerHbase to reindex the entire table (normally only URLs successfully parsed in that iteration are processed).
          • Added a SolrIndexerHbase so you can use solr with hbase (which is awesome). It also has a "-reindex" option.

          *= We do not immediately write content as hbase client code uses a write buffer to buffer updates. Still, you will lose very few URLs as opposed to all (and write buffer size can be made smaller for more safety)
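
          A rough sketch of the partitioning idea (keep every URL of a host in one reducer; the random arrival order is handled separately); this is an assumption about the approach and assumes plain URL keys, not the branch's actual code:

          import java.net.URL;

          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.io.Writable;
          import org.apache.hadoop.mapreduce.Partitioner;

          // Sketch only: all URLs of one host land in the same reducer so politeness can be
          // enforced there; the branch additionally randomizes the order URLs reach the reducer.
          public class HostPartitionerSketch extends Partitioner<Text, Writable> {
            @Override
            public int getPartition(Text urlKey, Writable value, int numReduceTasks) {
              try {
                String host = new URL(urlKey.toString()).getHost();
                return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
              } catch (Exception e) {
                return 0;                                                 // malformed URLs all land in reducer 0
              }
            }
          }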

          There is still some more stuff to do (such as updating scoring for hbase) but most of it is, IMHO, ready. Can I get some reviews about what people think of the general direction, the API, etc.? Because this (and katta integration) are my priorities for the next nutch.

          Doğacan Güney added a comment -

          I forgot to add: I also made some API changes so now plugins have

          Collection<HbaseColumn> getColumns();

          This will allow us to support versioning (i.e. exposing multiple versions of the same column to plugins) in the future (this is easy to do, but not yet implemented).
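
          As a guess at what such a request could carry (the fields below are assumptions, not the actual HbaseColumn class), a column name plus the number of versions a plugin wants would be enough to drive versioned reads:

          import java.util.Arrays;
          import java.util.Collection;

          // Guess at the shape of a column request; field names are assumptions.
          public class HbaseColumnSketch {
            private final String column;                                  // e.g. "sig"
            private final int versions;                                   // how many historical versions to expose

            public HbaseColumnSketch(String column, int versions) {
              this.column = column;
              this.versions = versions;
            }

            // A plugin's getColumns() could then return something like:
            public static Collection<HbaseColumnSketch> exampleColumns() {
              return Arrays.asList(new HbaseColumnSketch("sig", 2),       // current + previous signature
                                   new HbaseColumnSketch("fcht", 2));     // current + previous fetch time
            }
          }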

          Andrzej Bialecki added a comment -

          You're making good progress! I think it's time to import this to Nutch SVN, on a branch, so that we can go through a regular patch review process.

          Doğacan Güney added a comment -

          Thanks Andrzej!

          I am also close to finishing the scoring API. Once scoring is finished, most of the major functionality will be in. One benefit is that I think we will finally be able to do real OPIC scoring*. Though I don't know how relevant that is after Dennis' pagerank work.

          How do you suggest we do the import? Create a new branch (with my choice of name being 'nutchbase'), close this issue and open new issues for new features or bug fixes?

          * With constant 'cash' distribution to pages.
          Andrzej Bialecki added a comment -

          "nutchbase" is ok for now, although it sounds cryptic. +1 on importing and closing this issue.

          I don't believe OPIC scoring can work well, even if we implement it as intended - the dynamic nature of the webgraph is IMHO not properly addressed even in the original paper (authors propose a smoothing schema based on a history of past values). In my opinion we should strive to create a more elegant scoring API than the current one (which owes much to the way Nutch passed bits of data between different data stores), and use PageRank as the default.

          Re: use of Katta for distributed indexing - let's discuss this on the list.

          Doğacan Güney added a comment -

          I am getting ready to merge nutchbase into an svn branch.

          I have moved the code into org.apache.nutch package (instead of org.apache.nutchbase). This causes many changes to nutch core (that would be needed anyway), so I ended up deleting a lot of old classes. I kept CrawlDatum, Content, etc... around for data conversion but it is possible that I missed something.

          Some packages such as arc, webgraph scoring etc... also had to go as they depended on old nutch structures. I will add them as I convert more code for nutchbase.

          Any objections to bringing the code into org.apache.nutch?

          Andrzej Bialecki added a comment -

          We already have some compat stuff in o.a.n.util.compat, mostly related to 0.7 and early 0.8 conversion. I guess we can drop this stuff from the new branch.

          This is a bigger question of back-compat. What data is worth converting and preserving? I'd say the following: CrawlDb and perhaps unparsed content. Everything else can be generated from this data.

          With such major changes I'm in favor of a limited back-compat based on converter tools, and not on back-compat shims scattered throughout the code. So feel free to morph the core classes as you see fit according to the requirements of the new design.

          And answering your question: no objections here.

          Doğacan Güney added a comment -

          I just committed code to branch nutchbase. The scoring API did not turn out as clean as I expected but I decided to put in what I have. Also, I made some changes so that web UI also works.

          I am leaving this issue open because I will add documentation tomorrow. Meanwhile,

          To download:

          svn co http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase

          Usage:

          After starting hbase 0.20 (checkout rev. 804408 from hbase branch 0.20), create a webtable with

          bin/nutch createtable webtable

          After that, usage is similar.

          bin/nutch inject webtable url_dir # inject urls

          for as many cycles as you want;
          bin/nutch generate webtable # -topN N works
          bin/nutch fetch webtable # -threads N works
          bin/nutch parse webtable
          bin/nutch updatetable webtable

          bin/nutch index <index> webtable
          or
          bin/nutch solrindex <solr url> webtable

          To use solr, use this schema file
          http://www.ceng.metu.edu.tr/~e1345172/schema.xml

          Again, a note of warning: This is extremely new code. I hope people will test and use it, but there is no guarantee that it will work.

          Xiao Yang added a comment -

          Exception:

          org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family mtdt: does not exist in region crawl,,1264048608430 in table {NAME => 'crawl', FAMILIES => [
          {NAME => 'bas', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'cnt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'cnttyp', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'fchi', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'fcht', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'hdrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'ilnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'modt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'olnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'prsstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'prtstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'prvfch', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'prvsig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'repr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'rtrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'scr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'sig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'stt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'ttl', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
          {NAME => 'txt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
          ]}
          at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2381)
          at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1241)
          at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1834)
          at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
          at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
          at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
          at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
          at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
          at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
          at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
          at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:995)
          at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
          at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
          at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
          at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
          at org.apache.hadoop.hbase.client.HTable.put(HTable.java:470)
          at org.apache.nutch.crawl.Injector$UrlMapper.map(Injector.java:92)
          at org.apache.nutch.crawl.Injector$UrlMapper.map(Injector.java:62)
          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
          at org.apache.hadoop.mapred.Child.main(Child.java:170)

          This may be caused by invalid column family names:
          I found that some names end with a colon while some don't in package org.apache.nutch.util.hbase.WebTableColumns.
          Is this a bug?
          public interface WebTableColumns {
          public static final String BASE_URL_STR = "bas";
          public static final String STATUS_STR = "stt";
          public static final String FETCH_TIME_STR = "fcht";
          public static final String RETRIES_STR = "rtrs";
          public static final String FETCH_INTERVAL_STR = "fchi";
          public static final String SCORE_STR = "scr";
          public static final String MODIFIED_TIME_STR = "modt";
          public static final String SIGNATURE_STR = "sig";
          public static final String CONTENT_STR = "cnt";
          public static final String CONTENT_TYPE_STR = "cnttyp:";
          public static final String TITLE_STR = "ttl:";
          public static final String OUTLINKS_STR = "olnk:";
          public static final String INLINKS_STR = "ilnk:";
          public static final String PARSE_STATUS_STR = "prsstt:";
          public static final String PROTOCOL_STATUS_STR = "prtstt:";
          public static final String TEXT_STR = "txt:";
          public static final String REPR_URL_STR = "repr:";
          public static final String HEADERS_STR = "hdrs:";
          public static final String METADATA_STR = "mtdt:";

          Show
          Xiao Yang added a comment - Exception: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family mtdt: does not exist in region crawl,,1264048608430 in table {NAME => 'crawl', FAMILIES => [ {NAME => 'bas', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'cnt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'cnttyp', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'fchi', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'fcht', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'hdrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'ilnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'modt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'olnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'prsstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'prtstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'prvfch', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'prvsig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'repr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'rtrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'scr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'sig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'stt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'ttl', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} , {NAME => 'txt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} ]} at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2381) at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1241) at 
org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208) at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1834) at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94) at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:995) at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193) at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115) at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201) at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605) at org.apache.hadoop.hbase.client.HTable.put(HTable.java:470) at org.apache.nutch.crawl.Injector$UrlMapper.map(Injector.java:92) at org.apache.nutch.crawl.Injector$UrlMapper.map(Injector.java:62) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) This may be caused by invalid column family names: I found some names end with colon while some doesn't in package org.apache.nutch.util.hbase.WebTableColumns Is this a bug? public interface WebTableColumns { public static final String BASE_URL_STR = "bas"; public static final String STATUS_STR = "stt"; public static final String FETCH_TIME_STR = "fcht"; public static final String RETRIES_STR = "rtrs"; public static final String FETCH_INTERVAL_STR = "fchi"; public static final String SCORE_STR = "scr"; public static final String MODIFIED_TIME_STR = "modt"; public static final String SIGNATURE_STR = "sig"; public static final String CONTENT_STR = "cnt"; public static final String CONTENT_TYPE_STR = "cnttyp:"; public static final String TITLE_STR = "ttl:"; public static final String OUTLINKS_STR = "olnk:"; public static final String INLINKS_STR = "ilnk:"; public static final String PARSE_STATUS_STR = "prsstt:"; public static final String PROTOCOL_STATUS_STR = "prtstt:"; public static final String TEXT_STR = "txt:"; public static final String REPR_URL_STR = "repr:"; public static final String HEADERS_STR = "hdrs:"; public static final String METADATA_STR = "mtdt:";
          Xiao Yang added a comment - edited

          Some instructions for NUTCH-650.patch:
          1. The API in hbase-0.20.0-r804408.jar is different from the final release.
          2. Avoid some NullPointerException errors.
          3. Change the invalid column family names.
          4. Add an "id" field to the index to avoid this error (a short illustration follows the stack trace below):
          java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
          at org.apache.lucene.document.Field.<init>(Field.java:279)
          at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:136)
          at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:245)
          at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:46)
          at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
          at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
          at org.apache.nutch.indexer.IndexerReducer.reduce(IndexerReducer.java:79)
          at org.apache.nutch.indexer.IndexerReducer.reduce(IndexerReducer.java:20)
          at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
          at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:563)
          at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
          at org.apache.hadoop.mapred.Child.main(Child.java:170)
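
          A standalone illustration of the Lucene rule behind item 4, using the plain Lucene 2.x/3.x Field API rather than the patch itself: a field must be stored, indexed, or both. The "id" value used here is made up.

            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;

            public class IdFieldSketch {
              public static void main(String[] args) {
                Document doc = new Document();

                // The constructor in the trace above rejects a field that is neither
                // stored nor indexed, e.g.:
                //   new Field("id", "http://example.com/", Field.Store.NO, Field.Index.NO);
                // throws IllegalArgumentException("it doesn't make sense to have a
                // field that is neither indexed nor stored").

                // Making the "id" field stored and indexed satisfies the constraint.
                doc.add(new Field("id", "http://example.com/",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
                System.out.println(doc.get("id"));
              }
            }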

          Piet Schrijver added a comment -

          Xiao's patch works for HBase 0.20.3

          Chris A. Mattmann added a comment -

          Pushing this out per http://bit.ly/c7tBv9

          Soila Pertet added a comment -

          I encountered the following NullPointerException while running nutchbase.

          2010-04-24 01:58:47,012 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
          java.lang.NullPointerException
            at org.apache.hadoop.hbase.io.ImmutableBytesWritable.<init>(ImmutableBytesWritable.java:59)
            at org.apache.nutch.fetcher.Fetcher$FetcherMapper.map(Fetcher.java:81)
            at org.apache.nutch.fetcher.Fetcher$FetcherMapper.map(Fetcher.java:77)
            at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
            at org.apache.hadoop.mapred.Child.main(Child.java:170)

          I downloaded nutchbase from svn co http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase and applied Xiao's patch. I am running hadoop-0.20.3, hbase-0.20.3 and zookeeper-3.2.2.

          In my application the error occurs after the first iteration of the fetch/generate cycle and is limited to the base url with a generator mark=csh, e.g.:

          keyvalues={host:http:8080/wikipedia/de/de/index.html/mtdt:_csh_/1272088691273/Put/vlen=4}

          But it works fine for values with generator mark=genmrk, e.g.:

          keyvalues={host:http:8080/wikipedia/de/de/images/wikimedia-button.png/mtdt:__genmrk__/1272088714395/Put/vlen=4, host:http:8080/wikipedia/de/de/images/wikimedia-button.png/mtdt:_csh_/1272088691109/Put/vlen=4}

          I modified my map function to check for null values in outKeyRaw in org.apache.nutch.fetcher.Fetcher$FetcherMapper.map. This masks the error but I am not sure if this is the right action to take. Please let me know.

          Thanks.
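
          A standalone sketch of the kind of null check described above. This is not the actual Fetcher$FetcherMapper code: the helper name toOutKey and the class around it are hypothetical, and dropping the record is exactly the "masking" the comment asks about rather than a fix for why the row has no key.

            import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

            public class NullKeyGuardSketch {

              // Returns null when the raw key is missing, so the caller can skip the
              // record instead of hitting the NullPointerException thrown by the
              // ImmutableBytesWritable constructor in the trace above.
              static ImmutableBytesWritable toOutKey(byte[] outKeyRaw) {
                if (outKeyRaw == null) {
                  return null;
                }
                return new ImmutableBytesWritable(outKeyRaw);
              }

              public static void main(String[] args) {
                System.out.println(toOutKey(null));                     // null -> record would be skipped
                System.out.println(toOutKey("key".getBytes()) != null); // true
              }
            }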

          Doğacan Güney added a comment -

          Here are two patches, generated using git, against two different branch points:

          1) Patch generated against svn revision 790789. This was the branch point for the original nutchbase work. This revision is slightly (but not much) newer than nutch-1.0.
          2) Patch generated against current svn nutchbase.

          Both should apply cleanly.

          Andrzej Bialecki added a comment -

          So far as one can digest such a giant patch, I think this is OK: at least from the legal POV it clarifies the situation, and it doesn't bring in any dependencies with incompatible licenses. As for the content itself, we'll need to resolve this incrementally, as discussed on the list.

          So, a cautious +1 from me to apply this on branches/nutchbase.

          Doğacan Güney added a comment -

          I have written a short installation guide and a short design guide for nutchbase. The design guide is especially short because nutchbase's current design is mostly still the same, so reading through this issue should give you a good idea of the design.

          If anything is unclear, please ask and I will try to clarify as best as I can.

          Julien Nioche added a comment -

          The patch has been committed in revision 959259. The content of https://svn.apache.org/repos/asf/nutch/branches/nutchbase is now the same as github.

          Julien Nioche added a comment -

          NutchBase is now in trunk, and most of the issues listed in this JIRA refer to an older, pre-GORA version of NutchBase.
          Close?

          Chris A. Mattmann added a comment -

          +1, this should be wrapped up.

          Doğacan Güney added a comment -

          +1 and a YEY! from me.

          Markus Jelsma added a comment -

          Bulk close of resolved issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

            People

            • Assignee:
              Doğacan Güney
            • Reporter:
              Doğacan Güney
            • Votes:
              12
            • Watchers:
              16
