Web archive deployment framework on Internet

PageBox

PageBox framework

Executive summary

I present here a new approach to software deployment, tentatively named PageBox. It leverages existing Java standards (Web Archives, JavaServer Pages, servlets) and technologies, especially:

Class Loaders
Sandboxes and Java 2 security

PageBox aims at

Providing sub-second response time to browser based applications
Reducing the bandwidth needs of ISP infrastructure

Its core concept is to allow Application Servers to handle Web Archives like browsers handle applets. It is implemented today as a servlet package that can run in free (Tomcat) or inexpensive (Resin) Application Servers and run on a large range of devices. It could be integrated in these products. It could also be embedded in appliances.

It is designed to be operated by ISPs. It conforms to Internet rules: no central administration and almost unlimited scalability through the use of well-defined protocols.

Though this approach implies deploying PageBox and Web Archives on a large number of computers, I show that PageBox can be securely administered and efficiently troubleshot. I also pay special attention to security issues: Web Archives can be published only by identified entities and network traffic can be protected against tampering and eavesdropping.

Its standardization should interest companies with remote locations, customers and partners that need a sub-second response time and would like to benefit from the advantages of Web applications: shortened time to market, simplicity and low development cost.

It should also allow ISPs that host Web Applications to increase their revenues and software companies to develop a large range of new applications. Vendors could also sell PageBox appliances.

Table of contents

1 Problem statement
1.1 Web application
1.2 Graphical front end
1.3 A third way
2 Solution based on Java and J2EE
2.1 Implementation
2.2 Administration
2.3 Security
2.4 Analysis
3 ISP solution
3.1 Principles
3.2 Actors
3.3 Analysis
3.3.1 Web Caching
3.3.2 PageBox integration
3.3.2.1 Principle
3.3.2.2 Session handling
3.3.3 Protocols and security
3.3.3.1 Client/server protocol
3.3.3.2 End user security
3.3.4 Archive publication and distribution
3.3.4.1 Archive distribution
3.3.4.2 Archive publication
3.3.4.3 Charging model
3.3.4.4 Legal aspects
3.3.5 Reference data
3.3.5.1 Serialized objects
3.3.5.2 JMS
3.3.6 Local data update
3.3.7 Life cycle
3.3.8 Troubleshooting
3.3.9 PageBox API
3.4 Advantage analysis
3.4.1 Traffic
3.4.2 DSL and Cable network
3.4.3 Markets
4 Possible standards
4.1 PageBox
4.2 ICP
4.3 Publication protocol
4.4 Summary
5 Author biography

Table of Figures

Figure 1: Intranet solution
Figure 2: class diagram
Figure 3: administration
Figure 4: Actors
Figure 5: Web caching
Figure 6: Areas
Figure 7: Client/server security
Figure 8: PageBox distribution
Figure 9: life cycle
Figure 10: PageBox log display
Figure 11: PageBox Statistics
Figure 12: Protocol comparison
Figure 13: Multiple ISP deployment

Problem statement

Today to offer a Graphical User Interface a company must either:

Write a Web application or
Write a graphical front end

Web application
Advantages:
- A Web application is easier to write and to maintain than a graphical front end. It also requires less skill.
- A Web application is a central application, so it is easy to deploy and update.
Drawbacks:
- Being a central application also means that all application parts (presentation, business logic, data caching and data access) run on a small set of servers. Resources (memory, CPU and disks) are more expensive on large servers than on small computers
- Browsers display Web application pages, and these pages are downloaded as HTML or XML over HTTP. The main drawback is that presentation is downloaded together with data. As a consequence, Web applications require more bandwidth than applications invoked by graphical front ends
Web Applications are successful and serve the End Consumer market well, where availability and response time requirements are lower. The End Consumer neither pays nor is paid to use the application, but I think the major point is that she or he is an occasional user. Compared to a Professional User, she or he is still a beginner and therefore slower.

Graphical front end
Advantages:
- From a communication point of view, a graphical front end is the client part of a client/server application. It can use client/server protocols such as EJB over RMI/IIOP, which carry only data and require less bandwidth than Web applications
- It runs presentation on the client and requires fewer resources on the server, where they are expensive
Drawbacks:
- A graphical front end is harder to develop and to maintain. It is more demanding in project management and developer skills. This complexity is hard to reconcile with time-to-market and frequent-change constraints
- A graphical front end is hard and expensive to deploy and update on a large number of devices

A third way

Neither solution satisfies Professional Users.

A third solution has been successfully deployed on Intranets.

Figure 1: Intranet solution
Here the Web Application is split into a presentation part and an application (business logic + data access) part. The presentation part is deployed at every site. The application part remains on the central site. The presentation part calls the application part using a client/server protocol such as EJB over RMI/IIOP.

This solution combines advantages of both Web Applications and Graphical Front End:
- It has the development simplicity of a Web Application
- It uses HTML/XML over HTTP only on remote location LANs. On the WAN it uses a client/server protocol with minimal bandwidth requirements
- It spares resources on the application server, where resources are expensive. A presentation server handles only the users in the same remote location and can be hosted on an inexpensive machine of the same type as the users’ workstations
- It simplifies deployment. Only presentation servers have to be deployed and maintained
The solution has two drawbacks:
1. It is only suitable for Intranets. Supervising and maintaining presentation servers requires manpower and skills, and raises security issues
2. It is not cost-efficient when a remote location contains fewer than five to ten workstations

Solution based on Java and J2EE

Implementation
We developed a simple implementation, released under the GNU Lesser General Public License 2.1, that addresses the first shortcoming of the third solution.

Its principle is to allow the presentation server to act as a browser and download presentation archives as a browser downloads applets.

Figure 2: class diagram

It contains a service servlet JSPservlet, invoked by the servlet container.

Depending on the request path, it looks for an archive. If it doesn’t find the archive, it creates a class loader, JSPloader, which downloads the archive from a remote server. Then a ClassEntry class instantiates the requested servlet or JSP using that class loader.

PageBox is packaged in a war file, whose init-parameters specify parameters such as the Certificate Authority and Certificate Revocation List URL.
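
The lookup-then-download step can be sketched as follows. This is a minimal illustration and not the actual JSPservlet or JSPloader code: the class name ArchiveRegistry, the "/archive/resource" path layout, the ".jar" suffix and the repository URL are all assumptions made for the example, and a plain URLClassLoader stands in for the custom class loader.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: one class loader per archive, created on first use. */
class ArchiveRegistry {
    // Assumed base URL of the remote server that holds the archives.
    private final String repositoryBase;
    private final Map<String, ClassLoader> loaders = new ConcurrentHashMap<>();

    ArchiveRegistry(String repositoryBase) {
        this.repositoryBase = repositoryBase;
    }

    /** Assumes request paths look like "/<archive>/<resource>". */
    static String archiveName(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        int slash = trimmed.indexOf('/');
        return slash < 0 ? trimmed : trimmed.substring(0, slash);
    }

    /** Returns the loader of the archive, creating it if the archive is not known yet. */
    ClassLoader loaderFor(String path) {
        return loaders.computeIfAbsent(archiveName(path), name -> {
            try {
                URL url = URI.create(repositoryBase + name + ".jar").toURL();
                // The archive is fetched lazily, when a class is first requested.
                return new URLClassLoader(new URL[] { url },
                        ArchiveRegistry.class.getClassLoader());
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException(e);
            }
        });
    }

    /** Instantiates the requested class with the archive's own class loader. */
    Object instantiate(String path, String className) throws ReflectiveOperationException {
        return loaderFor(path).loadClass(className).getDeclaredConstructor().newInstance();
    }
}
```

Keeping one loader per archive means that an update can be handled by dropping the map entry, so that the next request creates a fresh loader.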

Administration
PageBox is administered through a servlet.

Figure 3: administration

This servlet allows:

Adding or changing managed archives at the bottom of the screen

Listing managed archives and triggering their download, update or deletion

The servlet uses GET mode, so it is easy to issue administrative requests with batch commands.

Security
PageBox can use JSSE to download archives using SSL.

It also supports signed archives and security using the Sun JKS key store and policy files.

When JSPloader loads an archive class:

It retrieves the certificate chain the class has been signed with

It checks the validity of the certificate chain with a Certificate Authority

It checks the certificate has not been revoked

It defines the class in a protection domain whose permissions are the permissions associated with this certificate and code source

When the class is instantiated, it runs in a sandbox and is allowed only the permissions it was granted in the policy file.
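
The steps above can be sketched with the standard java.security classes. This is a simplified illustration, not the JSPloader source: the class SandboxLoader and its method names are hypothetical, the chain check only covers validity dates and revocation (a full validation against a Certificate Authority would typically go through java.security.cert.CertPathValidator), and the granted permissions are passed in rather than read from a policy file.

```java
import java.net.URL;
import java.security.CodeSource;
import java.security.PermissionCollection;
import java.security.ProtectionDomain;
import java.security.cert.Certificate;
import java.security.cert.CertificateException;
import java.security.cert.X509CRL;
import java.security.cert.X509Certificate;

/** Hypothetical sketch of a loader that defines classes in a restricted protection domain. */
class SandboxLoader extends ClassLoader {
    SandboxLoader(ClassLoader parent) {
        super(parent);
    }

    /** Simplified check: each certificate must be within its validity period and not revoked. */
    static void checkChain(Certificate[] chain, X509CRL crl) throws CertificateException {
        for (Certificate c : chain) {
            if (c instanceof X509Certificate) {
                ((X509Certificate) c).checkValidity();
            }
            if (crl != null && crl.isRevoked(c)) {
                throw new CertificateException("revoked certificate in chain");
            }
        }
    }

    /** Builds the protection domain for a code source (location + signers) and its permissions. */
    static ProtectionDomain domainFor(URL location, Certificate[] chain,
                                      PermissionCollection granted) {
        return new ProtectionDomain(new CodeSource(location, chain), granted);
    }

    /** Defines the class so that it only holds the permissions of the given domain. */
    Class<?> defineSandboxed(String name, byte[] bytes, ProtectionDomain domain) {
        return defineClass(name, bytes, 0, bytes.length, domain);
    }
}
```

With the two-argument ProtectionDomain constructor the permissions are static: the domain implies exactly what was put in the collection.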

Analysis
PageBox is a reasonable technical answer to the problem. It is:

A simpler way to manage Intranet applications

An appealing solution for B2B communication: suppose company A has written a Web Application. It hosts the business logic and data access part on its site. Its customer, company B, downloads the presentation part, which acts as a smart proxy for its internal users. It is a win-win situation: company A runs its application on a smaller farm and company B gets a better response time thanks to a smaller bandwidth requirement.

But it fails to fully address the shortcomings of the third way.

It still requires the installation of a Java Server on each remote location

It still cannot efficiently address the needs of smaller remote locations (one to three workstations)

The problem is that a software solution deployed at the Internet end points cannot be optimal in terms of resource use. The only way to address this issue is to ask ISPs and ASPs to host PageBox.

ISP solution
1. Principles
  - PageBox is a new service the ISP charges to the publisher
  - Presentation (Web Archive) hosting is a commodity like routing or proxies
  - PageBox instances host Web Archives from different publishers
  - ISPs host PageBox instances where they are needed on their network. Archive deployment on these PageBox instances can depend on archive use. A highly used archive will be deployed on more PageBox instances than a less used one
  - PageBox uses existing standards, especially Web Archives and Java 2 Security
2. Actors
  Figure 4: Actors
  - Web user. She or he invokes an application using a URL
  - Publisher
  - ISP
3. Analysis
  Figure 4 arrows show the different issues that the ISP solution must address:
  1. A browser at any location must be able to invoke PageBoxed presentations with the same URL. The address part www.PBserver.com on Figure 1 must be the existing address of the publisher
  1. Web Caching
    PageBoxes will act with dynamic content like Web Caches act with static content.
    
    Figure 5: Web caching
    
    The ISP deploys one or many Entry Web caches and upper layer parent cache(s).
    
    Entry Web Caches are neighbors. For an Entry Web cache other Entry Web Caches act as sibling Web caches.
    
    When a browser doesn’t hold a page, it requests it from a local cache (if one is defined) or from the ISP Entry cache.
    
    If the local cache doesn’t hold the page, it requests it from the closest ISP Entry cache. If the ISP Entry cache doesn’t hold the page, it asks its neighbors whether they have the page in cache using a standard protocol, ICP, defined by RFC 2186 and RFC 2187. If no neighbor has the page, the cache forwards the request to the parent cache.
    
    If the parent cache doesn’t hold the page it retrieves it from the provider site.
    
    Direct retrievals happen when the content cannot be cached.
  2. PageBox integration
    1. Principle
      The ISP divides its network in logical areas.
      
      An area must contain:
      
      Two or more PageBoxes for fault tolerance
      
      Two or more Web Caches
      
      A PageBox-enabled Web Cache processes cacheable pages as described in the Web Caching section. But if the page is not cacheable, instead of issuing a direct retrieval it looks in a table to see whether the URL is handled by neighbor or upper layer PageBoxes.
      
      If neighbor PageBoxes can handle the URL, it selects one of them with a round robin algorithm; otherwise it selects an upper layer PageBox. If no PageBox can handle the request, it issues a direct retrieval.
      
      This table can be built and updated by multicasting standard ICP_OP_QUERY messages to PageBoxes. They answer ICP_OP_HIT if they can handle the URL or ICP_OP_MISS if they cannot.
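
The table and the selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the class PageBoxRouter, its method names and the choice of matching URLs by prefix are hypothetical, and the ICP exchange itself is reduced to a recordHit call made when a PageBox answers ICP_OP_HIT.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the table a PageBox-enabled Web Cache could keep. */
class PageBoxRouter {
    private final Map<String, List<String>> neighbors = new ConcurrentHashMap<>();
    private final Map<String, List<String>> upperLayer = new ConcurrentHashMap<>();
    private final AtomicInteger next = new AtomicInteger();

    /** Records that a PageBox answered ICP_OP_HIT for URLs starting with urlPrefix. */
    void recordHit(String urlPrefix, String pageBox, boolean neighbor) {
        (neighbor ? neighbors : upperLayer)
                .computeIfAbsent(urlPrefix, k -> new CopyOnWriteArrayList<>())
                .add(pageBox);
    }

    /** Neighbors first (round robin), then upper layer; empty means direct retrieval. */
    Optional<String> select(String url) {
        List<String> candidates = lookup(neighbors, url);
        if (candidates.isEmpty()) {
            candidates = lookup(upperLayer, url);
        }
        if (candidates.isEmpty()) {
            return Optional.empty();
        }
        int index = Math.floorMod(next.getAndIncrement(), candidates.size());
        return Optional.of(candidates.get(index));
    }

    private static List<String> lookup(Map<String, List<String>> table, String url) {
        for (Map.Entry<String, List<String>> e : table.entrySet()) {
            if (url.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return List.of();
    }
}
```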
      
      Depending on its size, the ISP can operate PageBoxes at different levels: region, sub-region and area in Figure 6.
      Figure 6: Areas
      
      Suppose archive 1 has a high quality of service requirement or is heavily used. The ISP deploys it on areas.
      
      If it has a low quality of service requirement, the ISP deploys it on regions. In intermediate cases, the ISP deploys it on sub-regions.
  3. Protocols and security
    1. Client/server protocol
      To support client/server requests issued by Published Web Archives,
      
      The publisher must accept non-HTTP traffic coming from its ISP and carried over IPSec
      
      The ISP must send this traffic using IPSec
      
      The PageBoxes and the ISP internal network are considered secure.
      
      Figure 7: Client/server security
      
      Publishers are connected to Entry Point gateways that establish an IPSec tunnel with Publisher hosts or IPSec gateways. Creating this tunnel requires establishing a security association (SA), and therefore involves the use of Internet Key Exchange (IKE) between the gateway and the publisher.
      
      The IKE authentication is performed using RSA public key encryption.
      
      The repository HTTP entry point described in the coming 3.3.4.2 Archive publication section can automatically configure the gateway.
    2. End user security
      The ISP can provide an HTTP over SSL access to End-Users.
      
      In this case, the publisher must specify:
      
      Whether it requires SSL.
      
      Whether it needs to authenticate the server. It cannot assume the ISP uses the same server certificate chain in different PageBoxes. However, all PageBox certificate chains must include the certificate of a CA the publisher trusts.
      
      Whether it needs to authenticate clients using certificates.
      
      In the last case, the publisher is responsible for implementing client certificate checking in its archive.

Possible standards

PageBox
Today PageBox is implemented as a regular Web Archive.

This implies that it replicates some functions of Application Servers:
- class loader
- static resource handling
It should also replicate web.xml parsing (I confess it doesn’t handle it today).

Though this open-source implementation has value in an interim phase, vendors could support it better by implementing its functions in their Application Servers, mainly:
- Archive download
- Sandboxes
It would be useful to standardize PageBox customization.

ICP
ICP is defined by RFC 2186 and RFC 2187.

In the PageBox integration section, we saw that the existing ICP_OP_QUERY, ICP_OP_HIT and ICP_OP_MISS messages address the primary need.
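
For reference, here is a minimal sketch of how an ICP_OP_QUERY datagram can be built, following the ICP version 2 layout of RFC 2186: a 20-byte header (opcode, version, message length, request number, options, option data, sender host address) followed, for a query, by a 4-byte requester host address and the null-terminated URL. The class name IcpQuery is hypothetical, and the address fields are simply left at zero in this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that encodes an ICP version 2 query. */
class IcpQuery {
    static final byte ICP_OP_QUERY = 1;
    static final byte ICP_OP_HIT = 2;
    static final byte ICP_OP_MISS = 3;
    static final byte ICP_VERSION = 2;
    static final int HEADER_LENGTH = 20;

    static byte[] encode(int requestNumber, String url) {
        byte[] urlBytes = url.getBytes(StandardCharsets.US_ASCII);
        // Header + requester host address + URL + terminating null byte.
        int length = HEADER_LENGTH + 4 + urlBytes.length + 1;
        ByteBuffer buf = ByteBuffer.allocate(length); // big-endian, i.e. network byte order
        buf.put(ICP_OP_QUERY).put(ICP_VERSION).putShort((short) length);
        buf.putInt(requestNumber);
        buf.putInt(0); // options
        buf.putInt(0); // option data
        buf.putInt(0); // sender host address (left at zero in this sketch)
        buf.putInt(0); // requester host address (left at zero in this sketch)
        buf.put(urlBytes).put((byte) 0);
        return buf.array();
    }

    /** The opcode is the first byte of any ICP message. */
    static byte opcode(byte[] datagram) {
        return datagram[0];
    }
}
```

The 16-bit message length field bounds the size of any payload, which matters for the bulk transfers and list payloads discussed in this section.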

The definition of new messages or the modification of existing messages would however allow PageBoxes to return:
- A load factor that the Web Cache could use for load balancing
- A bulk transfer of supported URLs. The message would remain small and the Web Cache would have far fewer ICP messages to send
We need new ICP messages to handle sessions:
- ICP_OP_SESSION_ADD issued by PageBoxes. Its payload contains one or many Session identifiers separated by commas.
- ICP_OP_SESSION_ADD_ACK issued by Web Caches to acknowledge an ICP_OP_SESSION_ADD
- ICP_OP_SESSION_DEL issued by PageBoxes. Its payload contains one or many Session identifiers separated by commas.
- ICP_OP_SESSION_DEL_ACK issued by Web Caches to acknowledge an ICP_OP_SESSION_DEL
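
On the Web Cache side, handling these payloads could look like the sketch below. It assumes that the Web Cache uses the session identifiers to remember which PageBox owns a session; the class SessionTable and its methods are hypothetical, and only the comma-separated payload format comes from the proposal above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical session table kept by a Web Cache. */
class SessionTable {
    private final Map<String, String> ownerBySession = new ConcurrentHashMap<>();

    /** Handles an ICP_OP_SESSION_ADD payload such as "s1,s2" sent by a PageBox. */
    void onSessionAdd(String pageBox, String payload) {
        for (String id : payload.split(",")) {
            String trimmed = id.trim();
            if (!trimmed.isEmpty()) {
                ownerBySession.put(trimmed, pageBox);
            }
        }
    }

    /** Handles an ICP_OP_SESSION_DEL payload. */
    void onSessionDel(String payload) {
        for (String id : payload.split(",")) {
            ownerBySession.remove(id.trim());
        }
    }

    /** Returns the PageBox that owns the session, or null if the session is unknown. */
    String pageBoxFor(String sessionId) {
        return ownerBySession.get(sessionId);
    }
}
```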
As ICP is used to drive PageBoxes, it can also be used to install, update or delete PageBox archives or to get their status:
- ICP_OP_ARCHIVE_UPDATE issued by distribution tool. Its payload is made of lines containing archive name and location separated by commas. When a PageBox receives this message, it updates its property file and downloads the archive(s). If the archive was already installed, it is an update. Otherwise it is an installation.
- ICP_OP_ARCHIVE_UPDATE_ACK issued by PageBoxes to acknowledge an ICP_OP_ARCHIVE_UPDATE
- ICP_OP_ARCHIVE_DEL issued by distribution tool. Its payload is made of lines containing archive name and location separated by commas. When a PageBox receives this message, it updates its property file and removes the archive(s)
- ICP_OP_ARCHIVE_DEL_ACK issued by PageBoxes to acknowledge an ICP_OP_ARCHIVE_DEL
- ICP_OP_ARCHIVE_QUERY issued by distribution tool
- ICP_OP_ARCHIVE_STATUS issued by PageBoxes to answer an ICP_OP_ARCHIVE_QUERY. It contains the list of installed and loaded archives and of their locations
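
The payload of ICP_OP_ARCHIVE_UPDATE and ICP_OP_ARCHIVE_DEL (one "name,location" pair per line) could be parsed as in the following sketch. The class ArchivePayload is hypothetical, and so is the decision to skip blank or malformed lines silently; no opcode values are shown because the proposal does not assign any.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the proposed archive payload. */
class ArchivePayload {
    /** Returns archive name to location, in payload order. */
    static Map<String, String> parse(String payload) {
        Map<String, String> archives = new LinkedHashMap<>();
        for (String line : payload.split("\\R")) {
            int comma = line.indexOf(',');
            if (comma <= 0) {
                continue; // blank line or missing archive name
            }
            String name = line.substring(0, comma).trim();
            String location = line.substring(comma + 1).trim();
            archives.put(name, location);
        }
        return archives;
    }
}
```

Splitting on the first comma only keeps locations that themselves contain commas intact.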

Publication protocol
The value of PageBox is augmented if a company can publish a Web Archive to a single ISP and get it deployed by many other ISPs.

Figure 13: Multiple ISP deployment

For a multiple ISP deployment, we use exactly the same means as for a single ISP.

The first ISP, the one the publisher has published to and presumably subscribed to, acts as a publisher for the other ISPs. Let’s call it the editor.
Once it has received a subscription request, checked the publisher's credentials and run the Web application in quarantine, it deploys the application internally. It also publishes it to other ISPs.

Each other ISP checks the editor's credentials, finds that it is a trusted ISP and deploys the Web Application. This deployment involves the same steps as for the editor:
- Archive distribution
- Gateway configuration to open a tunnel between the ISP gateway and the publisher gateway or host
As a consequence, it would be useful to standardize the publication protocol.

Summary
We would need to standardize:
- The PageBox customization
- The involved protocols (ICP, publication)

Author biography

I used to be the main designer and lead developer of an Intranet solution.

At that time I was working for BEA and the customer was a large French bank. It had 2000 agencies and 23000 personal computers, and the solution was designed in 1998, so there are differences from Figure 1.

Presentation servers were IIS. They invoked a local Tuxedo, which was used to deliver local services and to invoke central location Tuxedo services. It meant we had to manage and maintain 2000 servers running both IIS and Tuxedo.

I started to work seriously on Application Servers in fall 1999 after moving to Amadeus. During summer 2000, I submitted an article proposal about class loaders to Java Developer Journal, which accepted the outline. I came back to the design above to illustrate the article, and the concept turned out to be more exciting than I expected. The article is now divided into three parts: the first illustrates presentation hosting, the second administration and the third security.


     Client                                     Server
 1.  Client hello                    ----->
 2.                                  <-----     Server hello
 3.                                  <-----     Certificate
 4.                                  <-----     Server key exchange (Optional)
 5.                                  <-----     Certificate request
 6.                                  <-----     Server hello done
 7.  Certificate                     ----->
 8.  Client key exchange             ----->
 9.  Certificate verify (Optional)   ----->
10.  Change cipher spec              ----->
11.  Finished                        ----->
12.                                  <-----     Change cipher spec
13.                                  <-----     Finished
14.                                  <-----     Encrypted data