Hive / HIVE-492

Allow for output inspection in realtime; perhaps in log files, but somewhere?

Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Logging
    • Labels: None

    Description

      Many queries take a long time to complete, and then fail (either because the job fails or because the output data is not what was desired).

      The cause is almost always an error in a mapper or a reducer. We can track it down in several ways, most often by running the query piece by piece and checking where the "wrong" output first appears. This is time-consuming and puts considerable load on the system (e.g., re-running big queries while debugging transformers or syntax). The problem is worse when a single query uses multiple transforms and several MapReduce steps.

      One way to reduce the debugging overhead would be to expose actual output through some logging mechanism. Specifically, EVERY mapper and/or reducer would write its first five lines of output to a user-readable file. A user could then see what each part of the system is doing and potentially locate the error from ONE failed query statement.
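To make the request concrete, here is a rough sketch of what such a capture could look like inside a user-supplied transform script (the kind that reads rows on stdin and writes rows on stdout). This is not an existing Hive feature: `tee_head`, `transform`, `run`, and the idea of a per-task sample file are all hypothetical names chosen for illustration.

```python
import io
from typing import Iterable, Iterator, TextIO


def tee_head(records: Iterable[str], sample: TextIO, limit: int = 5) -> Iterator[str]:
    """Yield every record unchanged, copying only the first `limit` to `sample`."""
    for index, record in enumerate(records):
        if index < limit:
            # Make sure each sampled record ends up on its own line.
            sample.write(record if record.endswith("\n") else record + "\n")
        yield record


def transform(line: str) -> str:
    """Placeholder for the user's own mapper or reducer logic (hypothetical)."""
    return line.upper()


def run(stdin: TextIO, stdout: TextIO, sample: TextIO, limit: int = 5) -> None:
    """Stream stdin through `transform`, sampling the first `limit` output rows."""
    outputs = (transform(line) for line in stdin)
    for out in tee_head(outputs, sample, limit):
        stdout.write(out)


# Small in-memory demonstration: eight input rows, five of them sampled.
_src = io.StringIO("".join(f"row{i}\n" for i in range(8)))
_out = io.StringIO()
_sample = io.StringIO()
run(_src, _out, _sample)
```

In a real script, `run` would be called with `sys.stdin`, `sys.stdout`, and a file opened at whatever per-task location the system decided on. The full output is untouched; only the first few rows are duplicated, so the extra cost per task is small and bounded.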

      Of course, five lines from each of 20000 mappers and 300 reducers is a lot of overhead. Making this user-configurable and/or estimated beforehand (e.g., at least 5 lines from at least 5 mappers and at least 5 reducers) would be fine, as would having these output logs auto-delete after some timeframe (a day, perhaps).
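The limits and cleanup described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the choice to sample only the lowest-numbered tasks, the flat directory of sample files, and the one-day default are not taken from Hive.

```python
import time
from pathlib import Path
from typing import List, Optional


def should_sample(task_index: int, max_tasks: int = 5) -> bool:
    """Sample only the first `max_tasks` tasks of a stage (hypothetical policy)."""
    return 0 <= task_index < max_tasks


def sampled_line_count(mappers: int, reducers: int, lines_per_task: int = 5,
                       max_tasks: Optional[int] = None) -> int:
    """Estimate how many lines a query would log, optionally capped per stage."""
    if max_tasks is not None:
        mappers = min(mappers, max_tasks)
        reducers = min(reducers, max_tasks)
    return lines_per_task * (mappers + reducers)


def purge_old_samples(directory: Path, max_age_seconds: float = 86400.0,
                      now: Optional[float] = None) -> List[Path]:
    """Delete sample files whose modification time is older than the cutoff."""
    now = time.time() if now is None else now
    removed = []
    for path in sorted(directory.iterdir()):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed


# Uncapped: every one of 20000 mappers and 300 reducers logs 5 lines.
uncapped = sampled_line_count(20000, 300)
# Capped at 5 mappers and 5 reducers, as suggested in the description.
capped = sampled_line_count(20000, 300, max_tasks=5)
```

Estimating the count before the job starts would let the system decide whether to sample every task or fall back to the capped policy.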

          People

            Assignee: Unassigned
            Reporter: Adam Kramer (akramer)