  1. Nutch
  2. NUTCH-1340

Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: nutchgora
    • Component/s: None
    • Labels: None
    • Patch Info: Patch Available

    Description

      After applying GORA-120 (which is already a huge performance boost by itself), one of the major bottlenecks of the DbUpdaterReducer is the deletion of the markers. The update reducer simply sets every row to delete its markers. Many rows do not actually have the markers, but the deletes are fired off anyway. Because the markers are always already part of the input, a simple check to see whether they exist greatly improves performance.
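
      The check described above can be sketched as follows. This is a minimal, self-contained illustration and not the attached patch: the Row class, its pendingDeletes counter, the removeMarkerIfExists helper and the marker name "example_mark" are all hypothetical stand-ins, assuming only that a row exposes its markers on read and that every unconditional removal results in a delete being sent to the store.

```java
import java.util.HashMap;
import java.util.Map;

public class MarkerCleanupSketch {

    /** Hypothetical stand-in for a stored row; NOT the Nutch WebPage class. */
    public static class Row {
        private final Map<String, String> markers = new HashMap<>();
        // Counts the deletes that would be sent to the backing store.
        private int pendingDeletes = 0;

        public void putMarker(String name, String value) {
            markers.put(name, value);
        }

        public String getMarker(String name) {
            return markers.get(name);
        }

        /** Unconditional removal: queues a delete even if the marker is absent. */
        public String removeMarker(String name) {
            pendingDeletes++;
            return markers.remove(name);
        }

        public int pendingDeletes() {
            return pendingDeletes;
        }
    }

    /**
     * Hypothetical helper: only issues the removal when the marker is present
     * on the row that was read, so absent markers cost nothing.
     */
    public static String removeMarkerIfExists(Row row, String name) {
        if (row.getMarker(name) == null) {
            return null; // nothing to delete, no store round trip
        }
        return row.removeMarker(name);
    }

    public static void main(String[] args) {
        Row marked = new Row();
        marked.putMarker("example_mark", "batch-1");
        Row unmarked = new Row();

        removeMarkerIfExists(marked, "example_mark");
        removeMarkerIfExists(unmarked, "example_mark");

        System.out.println("deletes for marked row: " + marked.pendingDeletes());
        System.out.println("deletes for unmarked row: " + unmarked.pendingDeletes());
    }
}
```

      In this sketch the row that carries the marker queues exactly one delete and the row without it queues none, whereas calling removeMarker directly would queue one delete per row regardless.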

      In particular, this is very expensive in HBase, because every single Delete immediately triggers a connection to the regionservers. (Deletes ignore the "autoflush=false" directive.) Although deletes can be done in batch, this is currently not supported by Gora. For one, it is very difficult to implement in the current HBaseStore with regard to multithreading; secondly, I noticed that performance did not increase significantly.

      Performance debugging on a real-life cluster indicates that this is currently the biggest bottleneck of the DbUpdaterReducer. (Keep in mind that this holds only after applying GORA-120.)

      Attachments

        1. NUTCH-1340-v1.txt
          2 kB
          Ferdy
        2. NUTCH-1340-v2.txt
          2 kB
          Ferdy

            People

              Assignee: Unassigned
              Reporter: Ferdy (ferdy.g)