[HADOOP-3376] [HOD] HOD should have a way to detect and deal with clusters that violate/exceed resource manager limits - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.18.0
Component/s: contrib/hod
Labels: None

Hadoop Flags: Reviewed
Release Note:

Modified HOD client to look for specific messages related to resource limit overruns and take appropriate actions - such as either failing to allocate the cluster, or issuing a warning to the user. A tool is provided, specific to Maui and Torque, that will set these specific messages.
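The message-driven behaviour described in this note can be sketched roughly as follows. This is a minimal illustration, not the HOD implementation: the message format, the `LimitOverrun` type, and the `choose_action` helper are all assumptions made for the example, and the text that the Maui/Torque tool actually sets may differ.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical message format, assumed for illustration only.
LIMIT_MESSAGE = re.compile(
    r"User-limits exceeded\. Requested:(\d+) Used:(\d+) MaxLimit:(\d+)"
)


class LimitOverrun(NamedTuple):
    requested: int   # nodes requested by the new cluster
    used: int        # nodes already held by the same user
    max_limit: int   # limit configured in the resource manager/scheduler


def parse_limit_message(comment: str) -> Optional[LimitOverrun]:
    """Return the overrun details if the job comment carries a limit message."""
    match = LIMIT_MESSAGE.search(comment)
    if match is None:
        return None
    return LimitOverrun(*(int(group) for group in match.groups()))


def choose_action(comment: str) -> str:
    """Map a job comment to one of 'wait', 'warn', or 'fail'."""
    overrun = parse_limit_message(comment)
    if overrun is None:
        return "wait"   # no limit message: keep waiting as usual
    if overrun.requested > overrun.max_limit:
        return "fail"   # the request can never be satisfied
    return "warn"       # satisfiable once the user's other jobs free nodes
```

Under these assumptions, a client polling the job's status would fail the allocation on `"fail"` and print a warning while continuing to wait on `"warn"`.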


Description

Currently, if resource manager/scheduler limits are set on submitted jobs, any HOD cluster that exceeds or violates these limits may either 1) get queued indefinitely or 2) stay blocked until resources occupied by older clusters are freed. HOD should detect these scenarios and handle them intelligently instead of waiting for a long time or forever. This means giving the submitter more, and more accurate, information.

(Internal) Use Case:
If there are no resource limits, users can flood the resource manager queue, preventing other users from using it. To avoid this, various types of limits can be set up in either the resource manager or the scheduler: a max node limit in Torque (per job), a maxproc limit in Maui (per user/class), a maxjob limit in Maui (per user/class), etc. But there is one problem with the current setup. For example, if we set a maxproc limit in Maui to cap the aggregate number of nodes used by any user across all jobs, then 1) a job is queued indefinitely if it requests more than the max limit, and 2) a job is blocked if it asks for fewer nodes than the max limit but some of the resources are already used by jobs from the same user. This issue addresses how to deal with scenarios like these.
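The two cases in the use case come down to simple arithmetic on the requested node count, the nodes the user already holds, and the configured limit. The sketch below is illustrative only; `Outcome` and `classify_request` are hypothetical names, and it assumes a single aggregate per-user node limit like the maxproc example.

```python
from enum import Enum


class Outcome(Enum):
    RUNNABLE = "fits within the limit"
    BLOCKED = "blocked until the user's other jobs free nodes"
    NEVER_RUNNABLE = "queued indefinitely"


def classify_request(requested: int, nodes_in_use: int, max_limit: int) -> Outcome:
    """Classify a node request against an aggregate per-user limit."""
    if requested > max_limit:
        # Case 1: the request alone exceeds the limit.
        return Outcome.NEVER_RUNNABLE
    if nodes_in_use + requested > max_limit:
        # Case 2: within the limit, but the user's running jobs hold too many nodes.
        return Outcome.BLOCKED
    return Outcome.RUNNABLE


# Example: a user with a limit of 10 nodes already runs jobs holding 4 and 3 nodes.
in_use = sum([4, 3])
for requested in (12, 5, 3):
    print(requested, classify_request(requested, in_use, 10).name)
```

With these example numbers, a request for 12 nodes falls under case 1, a request for 5 nodes falls under case 2, and a request for 3 nodes fits.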

Attachments


- HADOOP-3376 (13/May/08 08:58, 7 kB, Vinod Kumar Vavilapalli)
- checklimits.sh (13/May/08 11:04, 0.7 kB, Vinod Kumar Vavilapalli)
- HADOOP-3376.1 (19/May/08 13:46, 15 kB, Vinod Kumar Vavilapalli)
- HADOOP-3376.2 (23/May/08 12:01, 16 kB, Vinod Kumar Vavilapalli)

People

Assignee: Vinod Kumar Vavilapalli

Reporter: Vinod Kumar Vavilapalli

Votes: 0

Watchers: 0

Dates

Created: 13/May/08 05:03

Updated: 22/Aug/08 19:50

Resolved: 03/Jun/08 13:39