Nutch / NUTCH-2005

Implement HTrace'ing in Nutch

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Auto Closed
    • Affects Version/s: None
    • Fix Version/s: 2.5
    • Component/s: build

    Description

      Recent developments within the tracing community have brought projects such as Apache HTrace (Incubating) into the Apache Incubator, opening up the possibility of using tracing to better understand distributed applications, systems and systems-of-systems. Tracing is a specialized use of logging that records information about a program's execution. Although tracing is commonly applied to distributed systems such as Hadoop and databases, few tracing experiments have been carried out in the field of large-scale, distributed Web search.
      This issue will combine the comprehensive tracing mechanisms of Apache HTrace (Incubating) with the scalable, flexible crawling architecture of Apache Nutch 2.X.
      As essentially every job (Inject, Generate, Fetch, Parse, UpdateDB, etc.) in Nutch 2.X interacts with a stack of complex underlying components (known as the search stack), comprehensive tracing would provide insight into system performance, latency, etc.
      Every job (a class which extends NutchTool and implements Tool) within Nutch 2.X therefore needs to be analyzed for its suitability for tracing. Once this is understood, a ranked list of tools should be produced, ordered by how well suited each tool is to tracing. I would suggest that FetcherJob be at the top, as it enables us to trace not only the HTTPSocketConnections but also the writing of data through Gora --> DataStore.
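
      To make the FetcherJob suggestion concrete, here is a rough sketch of what nested trace spans around a fetch cycle might look like. It deliberately does not use the HTrace API: Tracer, Scope, Span and traceFetch are hypothetical stand-ins written for this illustration, the span names "http-fetch" and "gora-write" are assumptions, and the two inner blocks are placeholders rather than real Nutch or Gora calls. The tracer is single-threaded, which the real fetcher is not.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TraceSketch {

    /** One timed unit of work; parentName is null for a root span. */
    static final class Span {
        final String name;
        final String parentName;
        final long startNanos;
        long durationNanos;

        Span(String name, String parentName) {
            this.name = name;
            this.parentName = parentName;
            this.startNanos = System.nanoTime();
        }
    }

    /** Closing a scope ends its span. */
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    /** Hypothetical single-threaded tracer; a stand-in, not the HTrace API. */
    static final class Tracer {
        private final Deque<Span> open = new ArrayDeque<>();
        private final List<Span> finished = new ArrayList<>();

        /** Opens a span whose parent is the innermost span still open. */
        Scope start(String name) {
            String parent = open.isEmpty() ? null : open.peek().name;
            Span span = new Span(name, parent);
            open.push(span);
            return () -> {
                span.durationNanos = System.nanoTime() - span.startNanos;
                open.remove(span);
                finished.add(span);
            };
        }

        /** Spans in the order they were closed. */
        List<Span> finished() {
            return finished;
        }
    }

    /** Simulates one fetch cycle: a root span with two child spans. */
    static List<Span> traceFetch(Tracer tracer) {
        try (Scope job = tracer.start("FetcherJob")) {
            try (Scope http = tracer.start("http-fetch")) {
                // placeholder: open the HTTP connection and read the page
            }
            try (Scope store = tracer.start("gora-write")) {
                // placeholder: write the fetched page through Gora to the DataStore
            }
        }
        return tracer.finished();
    }

    public static void main(String[] args) {
        for (Span s : traceFetch(new Tracer())) {
            System.out.println(s.name + " parent=" + s.parentName
                    + " ns=" + s.durationNanos);
        }
    }
}
```

      The point of the sketch is the shape of the data: each child span records its parent, so time spent on the network can be separated from time spent writing to the DataStore. An actual implementation would replace the stand-in tracer with HTrace's own span and scope types.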

          People

            Assignee: lewismc Lewis John McGibbney
            Reporter: lewismc Lewis John McGibbney
            Votes: 0
            Watchers: 6