The Network Weather Service (NWS) was developed for use by schedulers in a networked computational environment.
NWS measures CPU availability and network performance.
It can also forecast network performance and the percentage of CPU available on each machine that it monitors.
NWS addresses a hard problem: to measure how much of a resource is available, we must consume part of what we measure. This resembles Heisenberg's uncertainty principle. The NWS design is simple and comprehensive. We can expect any similar monitoring tool to apply the same methods. When designing such tools, it is also useful to be aware of the NWS design, to avoid reinventing the wheel or frantically designing a sort of perpetual motion machine.
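NWS uses four component processes: the Name server process, Sensor processes, the Persistent state process and the forecaster process.
A monitored host only needs Sensor processes. There are two kinds of Sensor process: the CPU Sensor and the Network Sensor.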
Sensors are the only operating-system-specific part of NWS. They run on Unix and Linux. Sensor processes send their measurements to Persistent state processes, which can be remote.
Sensors attach time stamps to their measurements but they don’t try to correlate their measurements. This latter task is typically performed by a forecaster process.
The forecaster process acts as a proxy for the NWS client.
NWS processes are stateless. They are written in C and communicate using TCP/IP.
Name server process
The address of the NWS Name server is the only well-known address used by the system.
All other NWS processes register their name-location binding with the Name Server.
These bindings time out according to a time-to-live parameter that must accompany each registration. Therefore other processes renew their registration periodically.
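The registration and expiry behaviour described above can be sketched as a small in-memory table. This is only an illustration of bindings that time out, not the NWS Name server itself: the class name `NameRegistry`, its methods, and the example name and address are all hypothetical, and the real processes exchange these messages over TCP/IP.

```python
import time


class NameRegistry:
    """Toy name-to-location table whose entries expire after a time-to-live.

    Hypothetical sketch; it does not reproduce the NWS Name server protocol.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable clock, so expiry can be tested
        self._bindings = {}      # name -> (location, expiry time)

    def register(self, name, location, ttl_seconds):
        """Create or renew a binding; every registration carries its own TTL."""
        self._bindings[name] = (location, self._clock() + ttl_seconds)

    def lookup(self, name):
        """Return the location bound to name, or None if missing or expired."""
        entry = self._bindings.get(name)
        if entry is None:
            return None
        location, expires_at = entry
        if self._clock() >= expires_at:
            del self._bindings[name]   # the binding timed out
            return None
        return location
```

A process that wants to stay visible would call `register` again before its time-to-live runs out; one that dies simply stops renewing and disappears from the table without any explicit clean-up.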
The CPU Sensor combines information from uptime and vmstat with periodic active occupancy tests to measure the CPU availability.
uptime reports the average number of processes in the run queue over
vmstat gives the idle time, the user time and the system time.
An active occupancy test consists of running a compute-intensive "probe" program and calculating the CPU availability as the ratio of its observed CPU occupancy time to the wall-clock time of its execution. The CPU Sensor adaptively adjusts the frequency with which probes are conducted.
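The ratio used by an active occupancy test can be illustrated with a short probe. This is a minimal sketch of the idea, not the NWS probe: the function name `cpu_availability_probe` and the default duration are assumptions, and the adaptive scheduling of probes is not modelled.

```python
import time


def cpu_availability_probe(duration_seconds=0.2):
    """Spin for duration_seconds and return CPU time used / wall-clock time.

    Hypothetical sketch: a value near 1.0 suggests the probe had a CPU to
    itself, while a lower value suggests it competed with other processes.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    wall_start = time.perf_counter()
    cpu_start = time.process_time()   # user + system CPU time of this process
    counter = 0
    while time.perf_counter() - wall_start < duration_seconds:
        counter += 1                  # compute-intensive busy work
    cpu_used = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    # Clamp to 1.0 because the two clocks have different resolutions.
    return min(1.0, cpu_used / wall_elapsed)
```

The probe itself consumes the resource it measures for the whole of its duration, which is why the frequency of such tests has to be kept under control.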
The NWS Network Sensor can only rely on probes when determining network load. It makes three kinds of measurement:
To avoid introducing too much network load, network Sensors are organized as a hierarchy of Sensor sets called cliques. This word probably comes from graph theory, where a clique is a subset of the nodes of a graph in which every node is connected to every other. For more details you can look at our graph introduction and at our algorithm page.
Each Sensor participating in a clique conducts inter-machine experiments with every other clique member but not with Sensors outside the clique. It is possible to define different cliques at each level of the hierarchy and to promote one representative Sensor from each clique to also participate in the clique at the next higher level. A clique can map to a sub-network or a region.
To reduce contention within a clique, only a single clique member conducts experiments at a given time. This policy is implemented by passing a clique token among member Sensors.
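The effect of the clique token can be sketched as a single-process simulation in which the token visits each member in turn and only the holder runs experiments. The function `run_clique_round` and the `probe` callback are hypothetical names; in NWS the token travels over the network between Sensor processes, and details such as a lost token are not modelled here.

```python
def run_clique_round(members, probe):
    """Pass a token once around the clique; only the holder probes its peers.

    members: list of Sensor names in token-passing order.
    probe:   callable(holder, peer) that performs one experiment and
             returns its measurement (hypothetical interface).
    Returns a dict mapping (holder, peer) to the measurement.
    """
    results = {}
    for holder in members:            # the token visits each member in turn
        for peer in members:
            if peer != holder:        # experiments stay inside the clique
                results[(holder, peer)] = probe(holder, peer)
    return results
```

With n members, one round yields n * (n - 1) measurements, and at no point do two members probe at the same time.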
Persistent state process
A persistent state process provides a text-string storage and retrieval service and keeps the data in a file managed as a circular queue.
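The circular-queue behaviour can be sketched in memory with a bounded deque: once the store is full, each new record overwrites the oldest one. This is only an illustration, since the real process keeps its data in a file; the class name `CircularStore` and its methods are hypothetical.

```python
from collections import deque


class CircularStore:
    """Fixed-capacity store of text records; the oldest are dropped first.

    Hypothetical in-memory sketch of a circular queue, not the NWS file format.
    """

    def __init__(self, capacity):
        self._records = deque(maxlen=capacity)

    def store(self, text):
        """Append a record, silently discarding the oldest one when full."""
        self._records.append(text)

    def retrieve(self, count=None):
        """Return the most recent count records (all of them if count is None)."""
        records = list(self._records)
        if count is None:
            return records
        return records[max(0, len(records) - count):]
```

A fixed capacity keeps storage bounded however long a Sensor runs, at the price of forgetting old measurements.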
A forecaster process requests measurements from Persistent state processes.
Then the forecaster process orders the measurements by time stamp and calls different prediction modules.
The forecaster process keeps track of which prediction module gives the lowest aggregate error measure over time and reports the forecast returned by that module.
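The selection step can be sketched as follows: each prediction module is replayed over the ordered history, its absolute errors are summed, and the forecast of the module with the lowest total is reported. The three predictors below (last value, running mean, sliding median) and the use of summed absolute error are assumptions chosen for illustration; the text above does not say which modules or which error measure NWS uses.

```python
from statistics import mean, median


def last_value(history):
    return history[-1]


def running_mean(history):
    return mean(history)


def sliding_median(history, window=5):
    return median(history[-window:])


# Hypothetical set of prediction modules.
PREDICTORS = {
    "last_value": last_value,
    "running_mean": running_mean,
    "sliding_median": sliding_median,
}


def forecast(measurements):
    """Return (module name, forecast) for the module with the lowest error.

    measurements: list of (timestamp, value) pairs, in any order.
    """
    ordered = sorted(measurements, key=lambda m: m[0])   # order by time stamp
    values = [value for _, value in ordered]
    if len(values) < 2:
        raise ValueError("need at least two measurements")
    errors = {name: 0.0 for name in PREDICTORS}
    for i in range(1, len(values)):
        history = values[:i]
        for name, predict in PREDICTORS.items():
            # Error of predicting values[i] from everything seen before it.
            errors[name] += abs(predict(history) - values[i])
    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](values)
```

Because the errors are recomputed over the whole history, the chosen module can change as new measurements arrive, for example from the running mean on a noisy but stable series to the last value on a series with a steady trend.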
The C API implements two functions:
You can download NWS from ftp://nws.cs.utk.edu/pub/nws.
You can find NWS papers at http://nws.cs.ucsb.edu/publications.html.