PageBox Grid

Foreword

If you are not familiar with the Grid computing idea you can read "The Grid: Blueprint for a New Computing Infrastructure", edited by Ian Foster and Carl Kesselman (Morgan Kaufmann). I had a trainee one year ago, Leo Tingvall, who wrote a thesis that is a good introduction. For years designers have tried to prove the value of their ideas through comparisons with ideas that worked in other domains. Design pattern authors claim to reuse an architect's idea. For years we have heard managers who want to industrialize software development. In the same way, grid authors try to find grid roots in railways and electric power grids. It is not the same thing. A computational grid is much cheaper and much harder to use than an electric grid:
Grid computing is a new buzzword; the technology and even products existed before, when we were speaking about Global Operating Systems and Distributed Computing. Grid computing borrows from middleware (Tuxedo, CORBA). The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The Grid architecture has five layers:
Grid is appealing because:
"When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances," says a Gilder Technology Report. Network throughput now tends to grow even faster than computer speed. It is generally believed that sometime in the future Moore's law (2 times faster every 18 months) will no longer apply to computers. Though such predictions may prove as elusive as oil-shortage forecasts, one fact remains: in the computer industry small fishes always eat big ones. Unix servers killed supercomputers. CMOS mainframes killed ECL mainframes. The difference in processing power between a PDA and a PC is already shrinking. However, a higher throughput doesn't necessarily imply a higher speed. On the Internet, when we must exchange large numbers of small messages, latency is more important. Latency is caused by routers, and routers are computers. The long-term solution is therefore to distribute the workload to large sets of computers as close as possible to each other to avoid routing. We will need to cache more and better than we do today. We need something reliable and easy to install and configure. We believe that processing power will move from the corporate glasshouses to the ASPs and to the user PCs. To enable that evolution we need a set of solutions:
PageBox needs a Grid capability:
This document describes the PageBox Grid function.

Definition

Grid computing applies the resources of many computers in a network to a single problem at the same time. Grid computing requires the use of software that can divide and farm out pieces of a program to as many as several thousand computers. Grid computing can be thought of as distributed, large-scale cluster computing and as a form of network-distributed parallel processing. The idea is to offer in computational grids the same computer services as conventional computers.

Computer services

Conventional computers provide services that involve shared resources. A computation then includes the following steps:
The conventional computer implements a resource allocation policy and a scheduler that deals with competing demands and quality of service requirements.

Parallel programming models

Parallel programming models find their origin in parallel processors and Flynn's taxonomy:
Master-Worker

One node (typically the requestor node) acts as the controlling master for the parallel application and sends pieces of work out to worker nodes. The worker node performs the computation and sends the result back to the master node. The master has a pool of work that needs to be done, so it assigns the next piece of work to the next worker that becomes available. Because this model only requires one messaging request per piece of work, it can be effective even on high latency networks and doesn't require a high performance API. A parallel application designed for the master-worker model has three steps:
In some cases we must distribute. Let's consider the case of document search. You have a document on your machine and you need to find similar documents in your organization. You define an algorithm that measures a distance between two documents. Then you use the Master-Worker model to distribute your search code to the document repositories in your organization. Your workers return the documents at the shortest distance from your reference document and your master displays them.

SPMD

You can see the workers in the example above as a coarse-grained Single Program Multiple Data (SPMD) model. Let's consider a simple example of code:
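Such code could look like the following sketch, where the Zorglub class and its Process method are illustrative names:

    using System;

    public class Zorglub
    {
        // Per-instance computation (illustrative)
        public void Process() { Console.WriteLine("processed"); }
    }

    public class Program
    {
        public static void Main()
        {
            Zorglub[] zorglubs = new Zorglub[100];
            for (int i = 0; i < zorglubs.Length; i++) zorglubs[i] = new Zorglub();
            // Standard programming: process every instance sequentially.
            // With SPMD, each Process call would instead be distributed
            // to an available worker and run in parallel.
            foreach (Zorglub z in zorglubs) z.Process();
        }
    }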
In standard programming we process all Zorglub instances sequentially. With SPMD, we distribute the process requests among the available workers and the Zorglub instances are processed in parallel. The model is effective if the messaging cost is smaller than the processing cost. It can imply:
MIMD

Multiple Instruction Multiple Data architectures consist of a number of processors that can each execute individual instruction streams. They can use a shared memory or distributed memories. In the latter case each processor has its own memory. In a Grid architecture there is no hardware cost but processors necessarily have their own memory. In order to access remote data the processors must communicate via explicit message passing. A Grid architecture is intrinsically a MIMD architecture that we use most of the time in SPMD mode with a Master-Worker model.

Interfaces

You can find a comparison of PVM and MPI at http://www.epm.ornl.gov/pvm/PVMvsMPI.ps. Today MPI is more popular.

PVM

The PVM computing model is simple yet very general, and accommodates a wide variety of application program structures. The programming interface is deliberately straightforward, thus permitting simple program structures to be implemented in an intuitive manner. The user writes his application as a collection of cooperating tasks. Tasks access PVM resources through a library of standard interface routines. These routines allow the initiation and termination of tasks across the network as well as communication and synchronization between tasks. The PVM message-passing primitives are oriented towards heterogeneous operation, involving strongly typed constructs for buffering and transmission. Communication constructs include those for sending and receiving data structures as well as high-level primitives such as broadcast, barrier synchronization, and global sum. PVM has three key principles:
Drawbacks:
With Harness, PVM should evolve toward a pluggable computing environment that could be called a Generalized Plug-in Machine (GPM). You can read http://www.csm.ornl.gov/harness/nextGen.ps for more information. Here is an example of a PVM program.
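The listing below is a minimal sketch in C of that pattern; the executable name pvm_example and the message tag are illustrative:

    #include <stdio.h>
    #include "pvm3.h"

    #define NCHILD 4

    int main()
    {
        char msg[64];
        int mytid = pvm_mytid();       /* task id of this process */
        int parent = pvm_parent();     /* task id of the parent, or PvmNoParent */

        if (parent == PvmNoParent) {
            /* Parent: spawn child copies of this program on other nodes.
               numt is the number of children whose creation succeeded. */
            int tids[NCHILD];
            int i, numt = pvm_spawn("pvm_example", NULL, PvmTaskDefault, "",
                                    NCHILD, tids);
            for (i = 0; i < numt; i++) {
                pvm_recv(-1, -1);      /* wait for a message from any child */
                pvm_upkstr(msg);
                printf("%s\n", msg);
            }
        } else {
            /* Child: send one message back to the parent. */
            sprintf(msg, "hello from task %x", mytid);
            pvm_initsend(PvmDataDefault);
            pvm_pkstr(msg);
            pvm_send(parent, 1);       /* 1 is the message tag */
        }
        pvm_exit();
        return 0;
    }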
Each process has a task id. A process dynamically creates child processes that run on other nodes. Then it receives messages from the children whose creation succeeded. Child processes send messages to their parent. You can find more information at http://www.netlib.org/pvm3/book/pvm-book.html.

MPI

MPI stands for Message Passing Interface. This interface is designed for MIMD and SPMD programming. The MPI standard includes:
MPI was designed for scientists and therefore defines bindings for C and Fortran. In part, the complexity of MPI is due to the support of those languages and to interoperability requirements. In that respect MPI compares to CORBA. The MPI official site is http://www.mpi-forum.org/. We recommend reading the MPI version 2 document, the MPI version 1.1 document and the journal of development. Here is a short informal introduction to MPI. MPI is made of a daemon (or Windows service) and of a function library that you install on every node. Then you use an MPI command to start a process in parallel on all nodes. Here is an example of process code:
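The listing below is a minimal sketch in C of such process code; the message content is illustrative:

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, i;
        char msg[64];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) {
            /* The destination: receive one message from every other process. */
            for (i = 1; i < size; i++) {
                MPI_Recv(msg, sizeof(msg), MPI_CHAR, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &status);
                printf("%s\n", msg);
            }
        } else {
            /* Every process with a rank > 0 sends one message to rank 0. */
            sprintf(msg, "hello from rank %d", rank);
            MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }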
All processes run this code. The processes with a rank > 0 send a message to the process with rank 0 (the destination). The process with rank 0 receives the messages of all other processes. MPI_Send and MPI_Recv are typical middleware verbs. MPI also supports collective verbs. These verbs, such as MPI_Gather and MPI_Scatter, take a root parameter and act differently when they are called from the process whose rank is root. For instance with MPI_Gather, processes with a rank different from root send a message and the process with rank root receives the messages. You no longer need to code conditions and loops, and the MPI daemon and library can optimize your request. In the same way you can specify reduction operations that can take place in intermediate nodes to achieve a higher level of parallelism. MPI is efficient but fault tolerance is not its strong point: if a process crashes all other processes get an error and have no choice but to end. With the LAM implementation it is possible, though not easy, to implement a limited form of fault tolerance. The situation is the same with MPI-2 but it requires extra coding. FT-MPI aims to address this drawback. With FT-MPI communicators have different failure modes: the communicator can shrink when a process fails or force the creation of new processes to fill the emptied ranks. You can find interesting documents about FT-MPI (part of Harness, which stands for Heterogeneous Adaptable Reconfigurable Networked Systems) at http://icl.cs.utk.edu/harness/.

Grid environments

Grid environments are typically available on Unix (Solaris) and on Linux. Sometimes they also run on Windows NT.

WebOS

We recommend reading first the PhD dissertation of Amin Vahdat. This document has been a great source of inspiration for us. NOW, the WebOS parent project, wrapped up in 1998, officially ending the WebOS project. WebOS introduces an Active Naming mechanism. Active Naming interposes programs behind the naming interface. This approach has several benefits compared to the existing DNS:
A local Active Naming program can check which servers are up, use a growing or shrinking number of servers and balance requests among the least loaded servers. Active Naming provides a framework for binding names to programs and for chains of programs to cooperate in translating a name into the data that name represents. Active Names are strings of the form namespace/name. The namespace identifies a Namespace program that knows how to translate the name field of an Active Name into the resource it represents. To find the resource location, the Namespace program typically calls a hierarchy of name resolvers.

Condor

Condor is a specialized workload management system for compute-intensive jobs developed by the University of Wisconsin. "The goal of the Condor Project is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources." The Condor team introduced a distinction between High Performance Computing (HPC) and High Throughput Computing (HTC). The key to HTC is effective management and exploitation of all available computing resources. The main challenge a typical HTC environment faces is how to maximize the amount of resources accessible to its customers, not how to provide high performance. As a workload management system, Condor only knows jobs. This allows configuring Condor to use only idle machines and to suspend jobs when the machine is used for another task. With parallel programs using PVM or MPI the participating machines must be dedicated to computation or at least have spare resources. You can download Condor at http://www.cs.wisc.edu/condor/downloads/. Condor runs on Windows NT.

PVM

PVM is a product developed by the Oak Ridge National Laboratory that you can download at http://www.csm.ornl.gov/pvm/. Adam Ferrari has written a JPVM package that you can download at http://www.cs.virginia.edu/jpvm/src/jpvm.zip. It is not interoperable with PVM but it is portable (100% Java) and it provides a similar API with better thread support.

MPICH

MPICH is a product implementing the MPI specification, developed by the Argonne National Laboratory. MPICH runs on Unix, Linux and Windows. The Windows version supports Visual C++ and Visual Fortran as well as gcc. MPICH is free and you can download its sources. On Unix and Linux MPICH can run on top of Globus (MPICH-G2). You can download MPICH at http://www-unix.mcs.anl.gov/mpi/mpich/download.html. The site contains very interesting documents on its design. You can check http://www-unix.mcs.anl.gov/mpi/mpich/papers/mpicharticle/paper.html#Node0 and http://www-unix.mcs.anl.gov/mpi/mpich/workingnote/adi2impl/note.html#Node0.

LAM

LAM stands for Local Area Multicomputer and also implements MPI. LAM runs on Unix and Linux. Like MPICH, LAM is free and you can download its sources. Compared to MPICH,
You can download LAM at http://www.lam-mpi.org/download/.

Globus

Globus is the most important project in Grid computing and the Globus toolkit is the de facto standard for major protocols and services. The Globus toolkit is the most comprehensive implementation of the Grid concept. The Globus site is http://www.globus.org/. The Globus toolkit is big and complex. For that reason it is only available on Linux and on some Unixes (Solaris, AIX, IRIX and Tru64). The Globus toolkit has three pillars:
Once you have downloaded and installed the Globus toolkit, you can also install optional components:
The Globus site contains a huge amount of documentation and the toolkit itself is an open architecture, open source software toolkit. The Open Grid Services Architecture (OGSA) is a proposed evolution of the current Globus Toolkit towards a Grid system architecture based on an integration of Grid and Web services concepts and technologies. It could be called Globus toolkit 3.0. The National Science Foundation uses the Globus toolkit (as well as Condor) in its middleware initiative NMI-R1. You can download that package at http://www.nsf-middleware.org/.

Glossary
Legion

Legion aims to provide a single, coherent virtual machine addressing scalability, programming ease, fault tolerance, security and site autonomy within a reflective, object-based meta-system. It is an ambitious project: Legion could consist of millions of hosts. Organizations that provide a set of hosts can keep control of their resources. Legion must be extensible, scalable and easy to use. It must hide machine boundaries. Legion must support multiple languages and even legacy code. Perhaps the most interesting part of the Legion effort is the Mentat language. The name is a bit surprising. You perhaps remember that in Dune computers were forbidden by the Butlerian Jihad and some humans, the Mentats, were trained to replace them. Mentat is an object-oriented, control-parallel programming system designed to address three problems:
Mentat has two primary components:
Mentat has a couple of competitors:
The Legion site is http://legion.virginia.edu. The Legion software is currently available to qualified educational, research, and commercial customers. You need to send a mail to get a copy. You may also need to contact Avaki, the corporate distributor of the Legion software.

NetSolve

Overview

Low-level APIs such as MPI have several drawbacks:
Many generic modules are available. You can find lists at http://www.netlib.org/master/expanded_liblist.html, http://rib.cs.utk.edu/cgi-bin/catalog.pl?rh=222, http://www.nhse.org/software.html and http://www.netlib.org/utk/people/JackDongarra/projects.htm. NetSolve was designed to provide a fast, efficient, easy-to-use system to effectively solve large computational problems using such modules. Issues such as networking, heterogeneity, portability, numerical computing, fault tolerance and load balancing are dealt with by NetSolve. NetSolve provides a simple client API in Fortran, C, Matlab and Mathematica. A Java flavor could be available sometime in the future. The simplest function has this prototype: int netsl(char *problem_name, ... < argument list > ...) where problem_name is a module supported by the invoked NetSolve environment. When you call this function, NetSolve analyzes your arguments (typically the size of your input matrices), the available instances of the module and how busy their hosts are. Then NetSolve distributes your work to the most appropriate instance. In case of failure NetSolve can retry the computation. NetSolve supports parallel processing; however, its purpose is to allow distributed computing.

Design
NetSolve has three components:
The NetSolve agent maintains a database of NetSolve servers along with their capabilities (hardware performance and allocated software). It uses an NWS Forecaster to get the dynamic characteristics of the NetSolve servers (available CPU and network performance). NWS is a very interesting system that we describe here. It is also used by Globus and Legion. For the understanding of this design it is enough to say that the NWS Forecaster uses measurements made by the NWS Sensors. When we call a function of the NetSolve API,
The problem description file (PDF) is the mechanism through which NetSolve enables modules for the user. It gives the processing cost of the module. Using the information returned by the NWS Forecaster, the input parameters and the processing cost of the module, the NetSolve Agent can select the most suitable server. For more information or to download NetSolve, look at http://icl.cs.utk.edu/netsolve/. NetSolve can use the Internet Backplane Protocol (IBP). IBP has two kinds of components, IBP clients and IBP servers (depots). Depots provide storage space whereas clients can read or append data. In the simplest scenario the NetSolve client allocates a byte array and moves its data to a depot, then it uses the NetSolve API to send a request with the location of its IBP datastream. Then the module can retrieve the data from the depot. IBP also allows moving the byte array close to the NetSolve server. In a third scenario the client requires a computation that involves two modules and a depot is used to store the intermediate result. You can find documentation about IBP at http://loci.cs.utk.edu/ibp/documents/. You can download IBP from http://loci.cs.utk.edu/ibp/downloads/. In the future the NetSolve API could be replaced by a standard GridRPC API. For more information you can look at http://www.netlib.org/netlib/utk/people/JackDongarra/PAPERS/grpc.pdf.

Credit

The NetSolve Web site requires that we add the following text when referencing NetSolve in a publication:

    @TECHREPORT{nug,
      AUTHOR = {Arnold, D. and Agrawal, S. and Blackford, S. and Dongarra, J. and Miller, M. and Seymour, K. and Sagi, K. and Shi, Z. and Vadhiyar, S.},
      TITLE = {{U}sers' {G}uide to {N}et{S}olve {V}1.4.1},
      INSTITUTION = {University of Tennessee},
      YEAR = {2002},
      TYPE = {{I}nnovative {C}omputing {D}ept. {T}echnical {R}eport},
      MONTH = {June},
      ADDRESS = {Knoxville, TN},
      NUMBER = {ICL-UT-02-05}
    }

PageBox Grid

High Throughput Computing

The PageBox Grid must be easy to install and must not require a porting effort. The PageBox Grid doesn't aim to compete with low-level, high performance environments. The PageBox Grid should provide high throughput rather than high performance. Compute-intensive jobs are not necessarily its main application. The PageBox Grid should rather allow retrieving data and performing searches, maintenance and monitoring tasks on large sets of computers in parallel. A typical PageBox Grid requestor doesn't ask the grid to perform a computation but rather to return information scattered on the grid. The PageBox Grid can also be used to keep memory caches in sync. The PageBox Grid is used in interactive mode. A computation is issued either when a user makes a request on an HTML form or when a Web service is invoked.

VM based

The PageBox Grid should be available in Java and C#. Using pre-compiled code running on virtual machines has several advantages:
Protocol

The PageBox Grid uses SOAP, SMTP and UDP. The C# version of the PageBox Grid uses the ASP.NET SOAP implementation. The Java version of the PageBox Grid uses JAX-RPC.

Active Web Service Naming

With PageBox, a Web Service provider can also provide a Control, which is a GUI snippet that knows how to call the Web service. The Web service user includes this control in its portal or WebTop. Because the Web Service provider publishes a new version of the control when it updates its Web service, the Web service user has nothing to do when the Web service changes. Because PageBox allows distributing any kind of Web application, it allows distributing the Web services onto a large set of locations. If the provider deploys its Web services on a large but variable number of computers, it needs a way to route requests:
Principle

Let's take the example of a Web service proxy generated by wsdl.exe in an ASP.NET environment. The coding is quite similar in Java:
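The proxy could look like the following sketch; the QuoteService class, its GetQuote method, the MyActiveNaming implementation and the repository and archive names are illustrative assumptions:

    using System;
    using System.Web.Services.Protocols;

    public class QuoteService : SoapHttpClientProtocol
    {
        public QuoteService()
        {
            // wsdl.exe normally hard-codes the URL here;
            // instead we resolve it through Active Naming.
            this.Url = MyActiveNaming();
        }

        private string MyActiveNaming()
        {
            // Ask the PageBox ActiveNaming Web service for the deployed
            // instances and pick the most appropriate one (here: the first).
            ActiveNaming an = new ActiveNaming();
            Candidate[] candidates = an.GetCandidate(
                "http://repository/pageboxLib/Repository.asmx", // repository_URL (assumed)
                "quote.zip",                                    // WebService_archive (assumed)
                "QuoteService");                                // WebService_name
            return candidates[0].URL;
        }

        [SoapDocumentMethod]
        public double GetQuote(string symbol)
        {
            object[] results = this.Invoke("GetQuote", new object[] { symbol });
            return (double)results[0];
        }
    }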
The Web Service provider sets the Web service URL with a MyActiveNaming method. MyActiveNaming uses the Active Naming API to retrieve the available Web service instances and to select the most appropriate one.

Active Naming API

The Active Naming API is implemented in an ActiveNaming Web service defined in the PageBox namespace and hosted in the PageBox Web application. This Web service provides a method with this signature: ActiveNaming.Candidate[] GetCandidate(String repository_URL, String WebService_archive, String WebService_name); where
GetCandidate returns an array of Candidate objects. The Candidate class has this definition:
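A plausible sketch, with fields inferred from the description below and from the SetKeyRange method (field names are assumptions):

    using System;

    public class Candidate
    {
        public String URL;       // URL of the PageBox that deployed the Web service
        public String keyRange;  // routing data returned by GetKeyRange (see SetKeyRange)
    }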
URL is the URL of the PageBox that deployed the Web service. The implementer uses it
The ActiveNaming Web service provides another method, SetKeyRange: void SetKeyRange(String keyRange); keyRange is typically used to set a key range (data dependent) but it can be any data that the implementer may need to route requests to the most appropriate instance.

Implementation notes

The ActiveNaming service is implemented in the same way as the Query Web service. GetCandidate uses the RepoQuery Web service of repository_URL to retrieve the list of the PageBoxes subscribing to WebService_archive. Then it calls a GetKeyRange method of the ActiveNaming Web service for each PageBox in the list to retrieve the candidate keyRange. If the repository doesn't answer, GetCandidate uses a cached list, which is the list returned by the previous successful RepoQuery invocation. If a PageBox doesn't answer the GetKeyRange request, the candidate instance is considered inactive and removed from the list. GetCandidate doesn't actually poll the candidate PageBoxes each time it is invoked; it caches the list in memory.

API principle

The API is inspired by MPI. The PageBox Grid API methods are defined in a Grid class defined in a PageBoxGrid namespace. Our first idea was to define communicator, process and process group like this:
Because we are in a multithreaded environment we chose to represent a process by a Grid object. The Grid constructor takes a communicator parameter. A communicator is uniquely defined by (repository_URL, communicator). Each Grid instance has a unique sequence number in its Web application instance, itself uniquely defined by its IP address, port and mail address. A process is therefore uniquely defined by (repository, communicator, IP address, port, mail_address, sequence number). We actually expect this programming style:
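For instance (a sketch; the communicator name and destination rank are illustrative):

    using System;
    using System.Threading;
    using PageBoxGrid;

    public class Worker
    {
        public void Run()
        {
            // The Grid object represents this process; it must be
            // instantiated in the created thread (see below).
            Grid grid = new Grid("myCommunicator");
            grid.Send("hello", 1, Transport.UDP);  // send to rank 1 in the group
        }
    }

    public class Program
    {
        public static void Main()
        {
            Worker w = new Worker();
            new Thread(new ThreadStart(w.Run)).Start();
        }
    }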
You must instantiate Grid in the created thread. The following table summarizes the previous discussion.
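    Entity                      Uniquely defined by
    ------                      -------------------
    Communicator                (repository_URL, communicator)
    Web application instance    (IP address, port, mail address)
    Process (Grid instance)     (repository, communicator, IP address, port, mail_address, sequence number)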
A process retrieves the base URL of the other processes in its group using the Query Web service of PageBox. It can call the Web services offered by the other Web application instances. The process knows these Web Service URIs because:
This mechanism is useful
SOAP has two weaknesses:
The core of the PageBox Grid API is designed to address these limitations. The typical MPI verb is MPI_SEND. It has this signature: MPI_SEND(buffer, count, datatype, dest, tag, comm); where buffer is the initial address of the send buffer, count is the number of elements in the send buffer, datatype is the type of each element in the send buffer, dest is the rank of the target process in its process group, tag is the message tag and comm is the communicator. You call the PageBox Grid API with: Grid.Send(object, dest, transport_mode); dest is the rank of the target Web application in the list returned by the Query Web service. transport_mode can be Transport.SMTP or Transport.UDP. Send serializes the object into a binary stream and sends it with the defined transport mode, then it returns (non-blocking mode). In UDP mode the API doesn't check for transmission loss.
In the case of PageBox Grids, processes are created by Repositories and PageBoxes. In a Web application using VM managed code:
For these reasons:
API verbs

Send

Signature: int Send(Object o, int dest, int transport_mode); where
Send returns -1 in case of failure. Implementation note: Send actually serializes an object containing the user object and the group id.

Receive

In case there is no message available, Receive waits for a message up to a time defined in the Grid constructor. Signature: Grid.Message Receive(int transport_mode); Receive returns a Message structure, or null in case of failure.
Grid.Message has this definition: public struct Message { Object recv; int from; int transport_mode; } where
ReceiveNonBlock

In case there is no message available, ReceiveNonBlock returns null. Signature: Grid.Message ReceiveNonBlock(int transport_mode); ReceiveNonBlock returns a Message structure, or null in case of failure.
Grid.Message has the definition given above: public struct Message { Object recv; int from; int transport_mode; } where
Scatter

Scatter sends an object to all other processes in the group. Signature: int[] Scatter(Object[] os, int transport_mode); where
If there are fewer input objects than processes in the group, Scatter distributes the objects to the first processes. If there are more input objects than processes in the group, Scatter sends more than one object per process. Scatter returns an array of integers whose size is the number of destinations. For a given destination the array entry contains:
Here is the Scatter pseudo code:
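The sketch below is written as it could appear inside the Grid class; the round-robin distribution and the GroupSize helper are assumptions consistent with the behavior described above:

    public int[] Scatter(Object[] os, int transport_mode)
    {
        int groupSize = GroupSize();       // number of other processes (assumed helper)
        int[] result = new int[groupSize]; // one entry per destination
        for (int i = 0; i < os.Length; i++)
        {
            int dest = i % groupSize;      // first objects go to the first processes;
                                           // extra objects wrap around
            if (result[dest] == -1) continue;  // this destination already failed
            if (Send(os[i], dest, transport_mode) == -1)
                result[dest] = -1;         // failure for this destination
            else
                result[dest]++;            // messages sent to this destination
        }
        return result;
    }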
Gather

Gather waits for a set of objects coming from other processes. Signature: Grid.Message[] Gather(int[] dests, int transport_mode); dests is typically an array returned by a former Scatter. Its size is the number of destinations. When an entry in dests contains 0 or -1, no message is expected. Otherwise Gather receives the number of messages defined in the entry. Gather returns an array of Message structures even if the timeout expires; the user can check the returned array to detect that case. The user can also check the type of the returned objects.

Subscribe

Subscribe registers a message handler. A message handler is an alternative to the use of the Receive and ReceiveNonBlock verbs. Signature: void Subscribe(GridCallback gc); The message handler must implement a GridCallback interface defined like this:
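A plausible sketch; the method name and its parameters, which mirror the Grid.Message fields, are assumptions:

    public interface GridCallback
    {
        // Invoked once per incoming message.
        void Callback(Object recv, int from, int transport_mode);
    }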
where
Notes:
Implementation notes

Scatter should use UDP multicast where applicable. The Grid objects are facades to the core implementation, typically managed by a singleton. The implementation typically involves:
When the destination is in the same OS process, the reference to the message object is just queued on the target Grid object (memory communication). With transport modes such as UDP and SMTP we cannot control the order of arrival. Therefore the listener:
We can compare this model with the modes supported by MPICH:
Implication on PageBox

Layered architecture

We move to a layered, modular architecture.
At the bottom, the PageBox core installs the published Web applications. The Query API allows Web application instances to locate their clones deployed by the same repository. The ActiveNaming API uses the Query API to retrieve the Web service URLs and routing information. The Grid API uses the ActiveNaming API to implement communicators and message passing. The Router component uses the ActiveNaming API. It supports:
The Replicator component allows keeping cached information such as session data in sync across Web application instances. The Replicator component uses the Grid API.

Active Naming

Active Naming allows implementing scenarios like this:
The browser page uses the Router wrapped in an applet or ActiveX control. The client-based Router uses information from the PageBox Active Naming to balance requests, select the closest Web Application instance and support fault tolerance. The server page calls a Web Service. It uses the Router to retrieve the Web Service URL. The server-based Router uses information from the PageBox Active Naming to balance requests, to select the closest Web Application instance or a Web application that has the data, and to support fault tolerance.

Grid API

Cache synchronization
When a server page modifies the cache on one Web application instance, it invokes a Replicator object. The Replicator uses the PageBox Grid API to scatter the cache change to the other Replicator instances. The remote Replicator instances update their cache. Because all Web server instances then have the same cache state, you can balance requests between Web server instances, as shown in the sketch below.
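A sketch of such a Replicator built on the Grid API; the CacheChange class, the GroupSize helper and the method names are assumptions:

    using System;
    using System.Collections;
    using PageBoxGrid;

    [Serializable]
    public class CacheChange
    {
        public string key;
        public Object value;
        public CacheChange(string k, Object v) { key = k; value = v; }
    }

    public class Replicator : GridCallback
    {
        private Grid grid;
        private Hashtable cache = new Hashtable();   // local cache copy

        public Replicator(string communicator)
        {
            grid = new Grid(communicator);
            grid.Subscribe(this);   // receive the changes scattered by the peers
        }

        // Called by a server page after a local cache update.
        public void Update(string key, Object value)
        {
            lock (cache) { cache[key] = value; }
            // Send one copy of the change per other process so that
            // Scatter reaches the whole group.
            Object[] os = new Object[grid.GroupSize()];
            for (int i = 0; i < os.Length; i++) os[i] = new CacheChange(key, value);
            grid.Scatter(os, Transport.UDP);
        }

        // GridCallback implementation: apply a remote change locally.
        public void Callback(Object recv, int from, int transport_mode)
        {
            CacheChange change = (CacheChange)recv;
            lock (cache) { cache[change.key] = change.value; }
        }
    }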
Data replication

When a server page updates the database, it invokes a Replicator object. The Replicator uses the PageBox Grid API to scatter the update to the other Replicator instances. The remote Replicator instances update their database instances. Updates are guaranteed: the Replicator sends the update message to an SMTP server. If that SMTP server is not running it can fall back to another SMTP server. If the target Web application is not running when the update message is sent, it first reads its mail and processes its update requests when it starts again.

Goal

In the long term, the combination of the PageBox deployment and of the Grid for data and processing distribution should allow a Web infrastructure to act as a brain: its Web application instances hosted on inexpensive computers would be less powerful and reliable than super servers, just as a neuron is less powerful than any computer. However, because each instance knows many ways to serve the user, the aggregated power and reliability of the Web brain should far exceed the performance of any computer and hide network failures.
Contact: support@pagebox.net