
Pushing real time data using HTML5 Web Sockets

Nikolai Qveflander

August 17, 2010
Master’s Thesis in Computing Science, 30 credits

Supervisor at CS-UmU: Mikael Rannar
Examiner: Fredrik Georgsson

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN

Abstract

The current browser landscape has no real support for server initiated push. Existing technologies such as Comet and AJAX emulate server push using “long-polling” and rely on maintaining two connections between client and server for streaming. The latest HTML standard, HTML5, is introducing elements which will integrate web front-ends much tighter with server back-ends. Most importantly, web sockets are now being introduced, thereby allowing browser applications to receive asynchronous updates from the server side, so called server push. Web sockets define a full-duplex communication channel that operates over a single socket using HTML5 compliant browsers. Web sockets allow for true low latency applications and put less strain on the server.

This paper is an attempt to create a web socket server according to the HTML5 web socket standard. The server will integrate with TickCapture [18], which is an existing system for back-testing and trading of algorithms written in various languages. TickCapture also allows for low latency presentation of real time market data and this data will be pushed through the web socket server to the clients using regular HTML5 compliant browsers.

The finished system must provide good scalability, so an in-depth study of scalability and load balancing techniques will be carried out to identify different solutions for my implementation.

Contents

1 Problem Description
  1.1 Problem Statement
  1.2 Goals
  1.3 Methods
    1.3.1 In-depth study

2 In-depth study: Scalability and load balancing
  2.1 Scaling up
  2.2 Scaling out
  2.3 Designing for scalability
    2.3.1 Identifying application environment
    2.3.2 Modular code and abstraction
    2.3.3 Parallel processing
    2.3.4 Distributed computing
  2.4 Load balancing
    2.4.1 Load balancing using Round Robin DNS
    2.4.2 Transport layer load balancing
    2.4.3 Application layer load balancing
    2.4.4 Server selection
    2.4.5 Availability

3 Design and Implementation
  3.1 Preparation
    3.1.1 Theoretical study
    3.1.2 Integration with TickCapture
    3.1.3 Examining the Websocket API
    3.1.4 Client side design
    3.1.5 System overview
  3.2 Selection
    3.2.1 Programming languages
    3.2.2 Libraries
    3.2.3 Tools
  3.3 Execution
    3.3.1 The Websocket protocol
    3.3.2 The TickCapture API
    3.3.3 Proof of concept
    3.3.4 Parallelization
    3.3.5 Threads in the system
    3.3.6 Message route
    3.3.7 Messages
    3.3.8 Scaling out
    3.3.9 Client

4 Results
  4.1 Testing platform
  4.2 Testing setup
  4.3 Tests
    4.3.1 Scaling of a single websocket server
    4.3.2 Scaling and load balancing with four websocket servers

5 Conclusions
  5.1 Analyzing performance
  5.2 Improving performance
    5.2.1 Numerical representation instead of strings
    5.2.2 Eliminating message duplication
    5.2.3 Multiplexing
    5.2.4 Fine tuning server selection
  5.3 Future Work

6 Acknowledgements

References

Appendices
  A Message Types

Chapter 1

Problem Description

In this section I will try to describe the purpose of this project, the problems that it attempts to solve, the goals and the methods used to achieve them.

1.1 Problem Statement

The existing technologies that enable pushing of data from a server to a subscribing client are not using true asynchronous communication. Instead they emulate this using long polling where the client polls the server for data. If no data exists for the client, the server does not immediately respond but rather waits for data to be available and then sends it to the client. This technique of delayed responses relies on always having an outstanding client request that the server can respond to. What might look like a server initiated communication actually relies on the client making requests to the server.
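The delayed-response pattern can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: there is no real HTTP involved, a `queue.Queue` stands in for the network, and the names `LongPollServer`, `publish`, `poll` and `client_loop` are hypothetical.

```python
import queue


class LongPollServer:
    """Toy stand-in for a server that answers a poll only when data exists."""

    def __init__(self):
        self._pending = queue.Queue()

    def publish(self, message):
        # New data becomes available on the server side.
        self._pending.put(message)

    def poll(self, timeout):
        # The "request" blocks until data arrives or the timeout expires.
        # Returning None models an empty response that forces a new poll.
        try:
            return self._pending.get(timeout=timeout)
        except queue.Empty:
            return None


def client_loop(server, wanted, timeout=0.5):
    """Collect `wanted` messages and count the requests that were needed."""
    received, requests_made = [], 0
    while len(received) < wanted:
        requests_made += 1
        message = server.poll(timeout)
        if message is not None:
            received.append(message)
    return received, requests_made
```

In this model every delivered message costs at least one client request, which is the dependency on client requests described above.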

1.2 Goals

The goal of this paper is to implement the HTML5 web socket protocol and successfully integrate the implementation with an existing system for presenting real time market data. Market data will be pushed to the client without any explicit request from the client, allowing for true asynchronous updates from the server. To achieve this, a number of subgoals were defined:

– Implement a web socket server taking into account the restrictions on scalability and licenses. The implementation must adhere to the specification of the Web Socket Protocol found at IETF [13].

– Enable integration between TickCapture and the web socket servers. TickCapture provides a C/C++ API that will allow for third party solutions to communicate with the system. Market data provided by TickCapture should be forwarded to clients through web sockets and presented in a browser.

– Any number of web socket servers should be able to run. Clients should be assigned in such a way that load is distributed evenly across available resources.

– A simple protocol should be implemented on top of the web socket for successful communication. It must be sufficiently light-weight to avoid slow client processing. It should also support replication between clients, i.e. update messages should not contain any client-specific information. The protocol must be able to handle client reconnections in case of a disconnect.

– Build a web based interface allowing clients to receive market data from the server. The interface should be compatible with all web socket enabled browsers. The client side should use the Web Socket API [12] to ensure maximum portability between browsers.

1.3 Methods

TickCapture provides a C/C++ API, so the web socket server will also be implemented using C/C++. The code will be written using standard libraries available in both Windows and Linux for maximum portability. Third party libraries may only be used if the licensing is permissive. GPL licenses and closed source third-party solutions are not acceptable.

Software development will be conducted in an iterative way. Work will follow a basic plan and allow for changes during the implementation if additional features are required or if performance issues with the selected design surface.

1.3.1 In-depth study

In order to fully understand the requirements and get ideas for building a scalable system, an in-depth study will be conducted. The subject that I have chosen is load balancing and scalability. Scalability is key as the system must provide ways to be extended with additional resources as the load grows. Finding suitable techniques for load balancing is equally important to ensure that the allocated resources are used in an optimal way.

The study will contain information about available techniques and analysis of possible implementations that are suitable for this project.

Chapter 2

In-depth study: Scalability and load balancing

One of the most important parts of developing an application or service that will cater to many users is to make sure that the implementation does not suffer as the user base begins to grow.

It is fairly trivial to create a simple server that will allow users to connect and interact with some service provided by that server, but as the load increases and the server begins to struggle with servicing the clients, the response times will be longer. In a real time scenario where latency is of importance, this is not acceptable. Such a scenario is the pushing of real time market data to a subscribing client. In order to make correct decisions based on the data provided by the server, the client must rely on it being current.

If latency were not an important factor, the server could be allowed to delay message passing when the load increases and then catch up as the load goes down and send the remaining messages. For a client expecting a steady stream of messages this poses another problem. The sudden flushing of messages from the server could flood the client, causing unexpected behaviour, even a disconnect. If the client gets overwhelmed it cannot receive more messages, resulting in the server having to buffer and send them when the client is ready again. Without proper throttling by the server this could lead to yet another flooding of the client, and essentially this would continue until the load has normalized in both client and server.

The first natural step of increasing the throughput of a popular service where the load has reached close to maximum is usually to scale up.

2.1 Scaling up

Scaling vertically or scaling up simply means that the current infrastructure housing the service is upgraded. A single server running the service can be upgraded by replacing the CPU with a faster one, adding more or faster RAM, bigger storage devices and faster network adapters. If the server is running a single service that is threaded and parallelizable, it could benefit from being upgraded with additional processors.

The next step is to identify services running on the server that do not necessarily need to run on the same computer. If they are using some form of message based communication they could easily be assigned to unique servers without the need of any major reconfiguration. An example would be a single computer in Figure 2.1 running a web server, application server and a database server where these three services are then assigned to their own computers (Figure 2.2).

Figure 2.1: A server running three services.

Figure 2.2: Scaling vertically (scaling up)

One obvious reason for using the approach of scaling vertically is that it is easy to implement. No changes in the software implementation are required to benefit from the faster hardware (provided that the hardware architecture is the same). However, if it is to benefit from a multi core upgrade, the software has to be parallelized.

The downside to this type of scaling is that it is expensive. The first step of assigning each service a unique server is relatively easy, but once you have divided up everything that can be divided up, you will have to upgrade each server to increase the capacity. Once you begin to push the limits of the current hardware generation the cost of upgrading rises steeply. A top of the line CPU that offers 10% higher performance might be many times more expensive. The price/performance graph would not be linear. Similarly, a memory upgrade could require a complete replacement of the current memory modules with higher density ones, resulting in an expensive upgrade.

Scaling up does nothing to increase availability or uptime of the service. You still have a monolithic server and thus a single point of failure.

2.2 Scaling out

Another approach to introduce scaling into an implementation is scaling horizontally or scaling out. Instead of upgrading a single server with faster hardware, more nodes are added to the system, in essence duplicating the existing services. An example would be to go from a single web server to three web servers, creating a cluster of web servers, and simply adding more servers as the demand grows. Figure 2.3 illustrates the horizontal scaling of web servers. A load balancer has been introduced to make the connection transparent for the clients and to evenly distribute the load among the web servers. If there is a need for additional application servers, a load balancer could be introduced between the web servers and the application servers, making the connection between them completely transparent.

Introducing more database servers is a bit more complex. The database is a shared resource and distributing the data requires that the replication among the servers is done correctly.

Figure 2.3: Scaling horizontally. The system has been expanded by two web servers and a load balancer.

One big incentive for selecting the horizontal approach is that the cost of hardware can be reduced. You no longer need to acquire the fastest and most expensive hardware to expand your system. At the very least, the expenditure should be more linear than the vertical approach.

If the system is well designed and distributable to benefit from being spread out on multiple hardware instances, the cost of expanding a system horizontally can be far less than the cost of replacing hardware in a single server as by the vertical approach.

Compared to vertical scaling, horizontal scaling is very hard to implement. The possible savings you make by buying cheaper hardware could very well be covering some of the expenses for engineering time required to make the horizontal scaling work seamlessly. Which one is actually more cost effective? One could argue that modifying software to suit horizontal scaling is a one time expense and that continually replacing hardware in a single server is a recurring expense.

2.3 Designing for scalability

Designing software and systems for scalability is not trivial. Scalability is not the same as processing time. To simply lower the processing time in the system you can replace hardware with higher capacity hardware like a faster CPU. This will often speed up even the most poorly designed systems. However, replacing a CPU with a faster one does not always result in better system performance. There are many possible bottlenecks in a computer that could make the upgrade worthless. A memory intensive application such as multimedia or 3D imaging software could have exhausted all available memory bandwidth before even using 50% of the total processor capacity. In that case not even more memory would help.

Optimizing the code and algorithms can address some of the bottlenecks. If the memory can’t serve the CPU fast enough, perhaps a rewrite of that part of the code will utilize the CPU more efficiently. Eventually, you will run out of options. You can only optimize so much of the code and there is a limit on how fast hardware is available for you to buy. The system simply needs more bandwidth to increase throughput.

Instead of thinking about how fast the system will be you need to think about how many requests it will handle. Think of it as a road. You should not try to make the cars go as fast as possible, instead you should focus on making it possible to add more lanes to increase throughput.

Adding more resources will not always result in better throughput. There will eventually be bottlenecks in all systems. Anticipating what part of a system will become the bottleneck first comes down to how well the application environment has been analyzed.

2.3.1 Identifying application environment

One of the first steps in designing software is to identify the application environment [7]. For existing systems it is important to classify all components and really understand how they relate to each other. What are the requirements, and what can and cannot be altered? It is also important to identify the type and volume of transactions in the system and which ones your component will come in contact with.

What are the workload conditions that the system will operate under? What parameters will affect system load, and how is the performance going to be measured? Implementing the web socket server will require me to take into account the following variables:

– Number of users

– Transaction volume (messages per second)

– Data volume (bytes per second)

– Message processing latency

Probably the most important performance indicator will be latency. Latency will be measured as the time injected by the web socket server in the passing of messages from TickCapture to the web browser. If the web socket server becomes overwhelmed it cannot dispatch messages fast enough to clients, which will result in higher latency.

2.3.2 Modular code and abstraction

Modular code will make it possible to swap code in the system with new code without the worries of breaking other parts of the system in the process. It also provides for easy optimization of individual parts of the system, but you should never sacrifice modularity for performance and never spend time fine tuning code that is rarely executed.

Abstraction is important. The higher levels in your code should abstract things like database access. There should be no low level access to a database, only simple functions that get and set data in the database without knowing the underlying structure. In the first versions of your system everything might be housed on the same computer and you write your code specifically for this scenario. But what happens when the system needs to scale out and the database has to move to a separate server or even be exchanged for another type of database with different access syntax? This means that you have to rewrite all your code that deals with the database access, possibly breaking something in the process.

Figure 2.4: Abstracting the database access allows scaling of the system. Database access has been moved from the main application to a database handler.

By abstracting the database access with a database handler like in Figure 2.4 you can scale the system more easily. If the location and type of the database changes, no change is required in the main program to deal with this. Instead, the database handler is modified to handle the access to the databases correctly. Even adding more database servers is completely transparent to the client. The database handler can handle all the partitioning of the data.
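The idea of a database handler can be sketched as follows. This is an illustration only, not the handler from any real system: the class name `DatabaseHandler`, the use of plain dictionaries as stand-ins for database servers, and the CRC32-based partitioning are all assumptions made for the example.

```python
import zlib


class DatabaseHandler:
    """Hypothetical handler that hides where and how data is stored."""

    def __init__(self, backends):
        # Each backend only needs dict-like semantics here; in a real
        # system these would be connections to database servers.
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)

    def _backend_for(self, key):
        # Simple partitioning: a stable checksum of the key picks the backend.
        index = zlib.crc32(key.encode("utf-8")) % len(self._backends)
        return self._backends[index]

    def get(self, key, default=None):
        return self._backend_for(key).get(key, default)

    def set(self, key, value):
        self._backend_for(key)[key] = value
```

The calling code is the same whether one or several backends are configured:

```python
for handler in (DatabaseHandler([{}]), DatabaseHandler([{}, {}])):
    handler.set("STOCK_A", 101.5)   # "STOCK_A" is a made-up key
    print(handler.get("STOCK_A"))
```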

Well defined abstraction layers will aid the deployment of the modules and allow for good scalability.

2.3.3 Parallel processing

Identify which parts of the code can be run in parallel and split data processing into smaller parts that can be assigned to pools of threads doing the work. Differentiating the tasks that an application will have will help in parallelizing. In the case of the web socket server receiving update messages from TickCapture, one thread could do the work of receiving the messages, a pool of worker threads would then process the messages, sending them to an output queue, and finally a thread would dispatch the messages to all the subscribing clients.
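The receive, process and dispatch stages described above can be sketched with threads and queues. This is a minimal illustration, not the thesis implementation: `run_pipeline`, the `STOP` sentinel and the `parse` callback are hypothetical names, and a Python list stands in for both the TickCapture feed and the subscribing clients.

```python
import queue
import threading

STOP = object()  # sentinel used to shut the pipeline down


def run_pipeline(raw_messages, parse, num_workers=4):
    """Receive -> parse in a worker pool -> dispatch; returns dispatched items."""
    inbox, outbox = queue.Queue(), queue.Queue()
    dispatched = []

    def receiver():
        # Stands in for the thread that reads updates from the data source.
        for raw in raw_messages:
            inbox.put(raw)
        for _ in range(num_workers):
            inbox.put(STOP)  # one sentinel per worker

    def worker():
        while True:
            raw = inbox.get()
            if raw is STOP:
                outbox.put(STOP)  # tell the dispatcher this worker is done
                return
            outbox.put(parse(raw))

    def dispatcher():
        # Stands in for the thread that sends messages to subscribed clients.
        finished = 0
        while finished < num_workers:
            item = outbox.get()
            if item is STOP:
                finished += 1
            else:
                dispatched.append(item)

    threads = [threading.Thread(target=receiver),
               threading.Thread(target=dispatcher)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dispatched
```

Because several workers run concurrently, the order of the dispatched messages is not guaranteed to match the input order. A real server that needs per-stock ordering would have to add sequencing on top of this.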

The maximum expected improvement to a system from parallelizing a part of it can be calculated using Amdahl’s law [6], also known as Amdahl’s argument. When adding multiple processors to a system to increase performance, the speedup of a program is limited by the time spent in the sequential parts of the program.

If P is the portion of a program that can be parallelized and (1 - P) is the portion that remains serial, then the maximum speedup that can be achieved using N processors is

$$\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}} \tag{2.1}$$

Figure 2.5: Program execution. p is the parallelizable portion of the program and 1-p is the sequential portion.

If 70% of the program can be run in parallel (P = 0.7) the maximum expected speedup with 4 processors would be

$$\text{Speedup} = \frac{1}{(1 - 0.7) + \frac{0.7}{4}} = 2.105 \tag{2.2}$$

One can see that $P/N$ will approach 0 when N tends to infinity, so the maximum speedup will move to $1/(1 - P)$ when an infinite number of processors is used. In the case of P = 0.7, this number will be ~3.33.
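Equation 2.1 and its limit are easy to check numerically. The helper names `amdahl_speedup` and `amdahl_limit` below are introduced only for this illustration.

```python
def amdahl_speedup(p, n):
    """Maximum speedup for parallel fraction p on n processors (Eq. 2.1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p):
    """Upper bound on the speedup as the number of processors grows."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return 1.0 / (1.0 - p)


print(round(amdahl_speedup(0.7, 4), 3))  # 2.105, as in Eq. 2.2
print(round(amdahl_limit(0.7), 2))       # 3.33
```

Evaluating `amdahl_speedup(0.7, n)` for growing `n` shows the speedup flattening out below the limit, which is why adding processors to a program with a fixed sequential part gives diminishing returns.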

I mentioned earlier that when talking scalability we should not focus on how fast we can process requests but rather how many. Amdahl’s law focuses on the speedup of a fixed size problem; it does not scale the availability of computing power when the number of processors or machines increases. It implies that the sequential part of a program does not change when the number of processors increases. The parallel part is evenly distributed across the processors.

Gustafson’s law [10] is an attempt to redirect the focus from a fixed size problem. Instead it proposes a fixed time concept to calculate the speedup. The sequential portion of the problem is assumed to be constant but the overall problem size grows proportionally to the number of processor cores.

If n is a measurement of a specific problem’s size and the execution of a program on a parallel computer is defined as

$$1 = a(n) + b(n) \tag{2.3}$$

where a is the sequential portion of the program and b is the parallel portion. The parallel part is evenly distributed among the processors.

On a sequential computer, the relative time would be

$$a(n) + p \cdot b(n) \tag{2.4}$$

where p is the number of processors when running the program on a parallel computer. If the parallel computer had 4 processors, the sequential computer would need 4 times longer to execute the parallel part of the program. The speedup S(p) is defined as the sequential execution time relative to the parallel execution time.

$$S(p) = \frac{\text{sequential}}{\text{parallel}} = \frac{a(n) + p \cdot b(n)}{a(n) + b(n)} = \frac{a(n) + p \cdot b(n)}{1} \tag{2.5}$$

and substituting b(n) with (1 - a(n)) we get

$$S(p) = a(n) + p \cdot (1 - a(n)) \tag{2.6}$$

Gustafson’s law assumes that the sequential portion diminishes with the increasing problem size n, so that S(p) will approach p as n approaches infinity.
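Equation 2.6 can be evaluated directly. The function name `gustafson_speedup` and the sample sequential fractions are chosen only for this illustration.

```python
def gustafson_speedup(a, p):
    """Scaled speedup for sequential fraction a on p processors (Eq. 2.6)."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return a + p * (1.0 - a)


# As the sequential fraction a shrinks, the speedup on 4 processors
# moves towards 4.
for a in (0.3, 0.1, 0.01):
    print(a, gustafson_speedup(a, 4))
```

With a sequential fraction of 0.3 and 4 processors the scaled speedup is about 3.1, compared with 2.105 from Amdahl's fixed size formula for the same fractions. The difference comes from the assumption that the problem grows with the number of processors.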

The assumption that problem sizes grow with the number of processors has its limitations. Some problems do not have datasets that can grow that much. A problem doing some calculations for each human being in the world would only increase the problem size by a few percent each year as the population grows.

Both Amdahl and Gustafson state the theoretical performance in perfect world scenarios, not taking into account real world limitations that affect scalability. Threading an application is not trivial. It adds complexity and overhead. Both Amdahl’s and Gustafson’s laws should add this variable to their formulas, subtracting for the overhead introduced by threads. The overhead is made up of several parts:

– Creation, switching and destruction of threads. These actions require clock cycles and if used incorrectly, can introduce a performance penalty. Reusing threads is a good way to avoid unnecessary creation and deletion.

– Synchronization. Shared objects are protected by critical sections and if unique locks are used, the program execution can become more sequential as the threads have to wait for each other.

– Load imbalance. Not all threads have the same work load. One thread could be doing some heavy message parsing and another thread simply dispatches the resulting messages to waiting clients. The dispatching thread would probably be more idle than the first, resulting in lower application efficiency. Finding good ways to distribute load among threads is an important factor.
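The first point in the list, reusing threads instead of creating one per task, can be sketched with a fixed-size pool. The function names `parse` and `parse_all` are hypothetical, and `str.upper` is only a placeholder for real message parsing.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parse(message):
    # Placeholder for real message parsing; also reports which thread ran it.
    return message.upper(), threading.get_ident()


def parse_all(messages, pool_size=3):
    """Parse messages on a fixed pool whose threads are created once and reused."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(parse, messages))
    parsed = [text for text, _ in results]
    threads_used = {ident for _, ident in results}
    return parsed, threads_used
```

However many messages are submitted, the set of thread identifiers never grows beyond `pool_size`, so the cost of thread creation and destruction is paid a bounded number of times. Synchronization and load imbalance are not addressed by this sketch.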

Figure 2.6: Overhead caused by threading of serial implementation.

To illustrate the various components of overhead, consider Figure 2.6. The top figure shows the serial implementation. The middle figure is the threaded implementation including overhead and the bottom figure shows a perfect world scenario without overhead. The darker areas indicate overhead. Between initialization and reading of data there is overhead caused by thread creation. After processing the data the threads need to write the results to a shared resource, and overhead here consists of synchronization between the threads and threads waiting for the other threads to complete their writes so that final results can be displayed. Also note the load imbalance between threads in the middle figure. Thread 3 completes its tasks before threads 1 and 2.

To utilize additional processor cores, more threads have to be introduced in the application. With every thread the overhead increases and can become a significant part of execution time.

2.3.4 Distributed computing

When the program has been parallelized the next step is to scale out. You want multiple instances of the program to run concurrently on separate servers, a so called server farm. The server farm approach requires the developer to design applications that can be deployed independently. If the application is stateless, the deployment is simple. No transactional state is saved so every instance can handle requests independently. Whenever you need more capacity you just add more servers and deploy the application.

The complexity increases when the application you are deploying is stateful, that is, when it keeps some sort of session data to store the state between the client and server. For the web socket server, this state is a table of the latest stock values sent to the client and also authentication data. To avoid this complexity, the responsibility of keeping track of state can be put on the client. This exposes some problems.

– Exposing internal data. One must make sure that nothing sensitive is stored on the client side and if the data can be changed, the server must detect any malicious changes. Consider a cookie containing shopping cart data with item prices included. A malicious client could attempt to change the price of an item before making the purchase. The server must never rely on data provided by the client. Encryption and checksumming of session data can be used to make it harder to modify the information but this calls for additional data integrity and sanity checks in the server.

– Data size. Storing large amounts of data on the client side is both impractical and even impossible due to protocol limitations. Transferring session data from the client also causes overhead and is not efficient.
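The checksumming mentioned in the first point above can be sketched with a keyed hash. This example uses HMAC-SHA256 from the Python standard library as one possible checksum; the cookie layout (JSON payload, a `|` separator, then the hex tag) and the function names are assumptions made for the illustration. It detects modification but does not hide the data, so it is not a substitute for encryption.

```python
import hashlib
import hmac
import json


def sign_session(data, secret):
    """Serialize session data and append an HMAC-SHA256 tag."""
    payload = json.dumps(data, sort_keys=True)
    tag = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return payload + "|" + tag


def verify_session(cookie, secret):
    """Return the session data, or None if the cookie fails the check."""
    payload, separator, tag = cookie.rpartition("|")
    if not separator:
        return None  # no tag present at all
    expected = hmac.new(secret, payload.encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison of the received and the expected tag.
    if not hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(payload)
```

A client that changes the price in the shopping cart example cannot produce a matching tag without the server's secret, so `verify_session` returns `None` for the altered cookie. The server still has to perform the additional sanity checks mentioned above.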

For the above reasons I decided to store all client state data on the server side in my implementation of the web socket server.

Storing client state data on the server side in a distributed system is not trivial. With a single server, keeping track of the state is easy. All clients reconnect to the same load-balancer-assigned web socket server and the client state is available at that server. The distributed computing model calls for multiple servers, so how do we make sure that the client state stays consistent when the client is communicating with a cluster of servers?

– Session server (Figure 2.7). A single point could be used to store the client state. Every server would then have access to the same client state data. The disadvantages of this solution are that it introduces a single point of failure and that, with many requests, the single session server acting as a shared resource could quickly become overloaded.

– Broadcasting (Figure 2.8). The servers serving the clients could use broadcasting to exchange client state data. This means sending update messages across the network to the different servers and in a big cluster, the number of messages could quickly introduce significant overhead and inefficiency.

– Distributed cache (Figure 2.9). Memcached [8] is a popular solution that creates a large cache across several machines. The cache uses partitioning to store data and is placed in global space so that the cache is accessible from all clients. Consistent hashing is used to determine which server houses a certain cache entry. The hash code is always mapped to the same cache server by the client. This type of distributed cache could be used to store client state data and it would ensure that all instances of the server application have access to that data. If one server containing a partition of the cache fails, the other servers keep working so there is no single point of failure. The cached data in that server is obviously lost, but that can be solved by introducing replication so that the data is stored in several locations to provide a level of redundancy.
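The consistent hashing used by the distributed cache option can be sketched with a hash ring. This is a generic illustration and not the algorithm of any particular Memcached client: the class name `ConsistentHashRing`, the use of SHA-256 and the number of virtual points per server are all assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to cache servers using a hash ring with virtual points."""

    def __init__(self, servers, replicas=100):
        self._replicas = replicas
        self._ring = []  # sorted list of (hash, server) points on the ring
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        # Several points per server spread its share evenly around the ring.
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [point for point in self._ring if point[1] != server]

    def server_for(self, key):
        if not self._ring:
            raise LookupError("no cache servers available")
        # First ring point at or after the key's hash, wrapping around.
        index = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[index % len(self._ring)][1]
```

The same key always maps to the same server, and when a server is removed only the keys that lived on it are remapped. Keys on the remaining servers keep their location, which is what makes the failure of one cache partition a local loss rather than a reshuffle of the whole cache.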

All of the above solutions have one common disadvantage: They all introduce additional network traffic. This is often not a big problem as the systems are many times housed in a single location with a fast internal network with low latencies. But if the size of stored client data increases, latency can become a problem, even more so if the servers are located at geographically different locations.

Figure 2.7: Storing session data using a session server.

Figure 2.8: Broadcasting session data across servers.

To eliminate the problem of sending large amounts of session data between servers, we could make sure that every client gets mapped to the same server every time it makes requests. This is called client affinity. If the client uses the same server every time, we can store all client state data on that server, thus eliminating the need for replication among the servers.

Figure 2.10 shows what happens when a client connects to the web socket system that I have implemented in this work. First, it connects to the load balancer. The load balancer uses some internal algorithm for server selection and sends a health request to the selected server. If the server responds, it then proceeds to send a load balancer generated cookie to the server. The load balancer then waits for the server to acknowledge the cookie and then sends the server information along with the cookie to the client.

The client then uses this information to connect to the assigned server and authenticates itself with the cookie that is now stored in the server, as it is anticipating a connection from this client. If everything checks out, the server begins to push market data to the client.
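The connection sequence described above can be sketched as follows. This is an illustrative Python model and not the C++ implementation of this work; the class and method names (`AppServer`, `LoadBalancer`, `register_cookie` and so on) are hypothetical, and plain round robin stands in for the unspecified internal selection algorithm.

```python
import secrets


class AppServer:
    """Hypothetical web socket server that expects clients announced by the balancer."""

    def __init__(self, name, address, healthy=True):
        self.name = name
        self.address = address
        self.healthy = healthy
        self.expected = {}  # cookie -> client id announced by the load balancer

    def check_health(self):
        return self.healthy

    def register_cookie(self, cookie, client_id):
        self.expected[cookie] = client_id
        return True  # the cookie ACK sent back to the load balancer

    def authenticate(self, client_id, cookie):
        return self.expected.get(cookie) == client_id


class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self._next = 0

    def connect(self, client_id):
        """Return (server, cookie) for a new client, skipping unhealthy servers."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if not server.check_health():
                continue
            cookie = secrets.token_hex(4).upper()  # 8 hex characters
            if server.register_cookie(cookie, client_id):
                return server, cookie
        raise RuntimeError("no healthy server available")
```

A client would call `connect`, open its web socket to the returned server and present the cookie, after which the load balancer is no longer involved.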

Figure 2.9: Using a distributed cache with hashing to store session data.

If the client disconnects from the server for an arbitrary reason, the cookie along with the session state will remain in the server for a selected period of time until it is discarded. This allows the client to reconnect to the server and resume the previous session from the last known state. The state that the web socket server stores contains the last values sent to the client for each stock. When the client reconnects, the server calculates a delta, the difference between the current market state and the last values sent to the client, and only sends those values that have changed.
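The delta calculation on reconnect can be sketched in a few lines. The example is illustrative Python with made-up values; `compute_delta` is a hypothetical helper name, and the session state is assumed to be a simple mapping from stock name to last sent value.

```python
_MISSING = object()  # marks a stock the client has never received


def compute_delta(last_sent, current):
    """Return only the entries of `current` that differ from `last_sent`."""
    return {
        symbol: value
        for symbol, value in current.items()
        if last_sent.get(symbol, _MISSING) != value
    }


last_sent = {"Stock 1": 10, "Stock 2": 25}               # stored session state
current = {"Stock 1": 10, "Stock 2": 27, "Stock 3": 5}   # market state at reconnect

delta = compute_delta(last_sent, current)  # only Stock 2 and Stock 3 are sent
last_sent.update(delta)                    # the session state now matches the client
```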

Figure 2.10: Client connecting to system for the first time.

When the client reconnects in Figure 2.11, it bypasses the load balancer and connects to the server directly using the old cookie. If the cookie has not expired in the server, the client is accepted and the session is resumed. If the server fails to respond, the client will attempt to reconnect to the load balancer. Obviously, if the server is down, the session state is lost and the client has to be authenticated by the load balancer.
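The cookie expiry on the server side can be sketched as a small session store. This is an illustrative Python model under the assumption that expiry is measured from the moment of disconnect; the name `SessionStore` and the injectable `clock` parameter are hypothetical and exist only to make the example testable.

```python
import time


class SessionStore:
    """Keeps session state of disconnected clients for a limited time."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._sessions = {}  # cookie -> (time of disconnect, last values sent)

    def on_disconnect(self, cookie, last_values):
        self._sessions[cookie] = (self.clock(), last_values)

    def resume(self, cookie):
        """Return the stored last values, or None if the cookie is unknown or expired."""
        entry = self._sessions.pop(cookie, None)
        if entry is None:
            return None
        disconnected_at, last_values = entry
        if self.clock() - disconnected_at > self.ttl:
            return None  # expired: the client must go back to the load balancer
        return last_values
```

A `None` result corresponds to the case where the client has to return to the load balancer and start a new session.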

I have selected client affinity in my implementation to ensure that session states are handled correctly. It leaves out a lot of the complexity that a distributed cache implies. It also does not limit scalability the way a single session server would. Broadcasting session states between the servers would mean that messaging between servers would grow quadratically as the number of servers increases, since every server has to update every other server.

Figure 2.11: Client reconnects to the system bypassing the load balancer.

My approach to the problem removes some of the load on the load balancer, as it is not involved in any client reconnects. The drawback is that there is a slight potential that the load can be unevenly distributed when the load balancer does not have the chance to redirect a reconnecting client to a possibly less loaded server. The data sent from the server to the client does not pass the load balancer, so it cannot store any client session state. It could, of course, remember which server the client got assigned to and ask that server to provide the client session data to the load balancer, which then relays it to the new server. This would mean additional complexity and, more importantly, extra load on the balancer and the servers. It would also delay the client reconnect.

2.4 Load balancing

When the system has been distributed across the server farm a new problem arises. How should the system distribute the clients evenly across the available resources? A very basic approach would be to provide the client with a list of available servers. The client would then choose a server to connect to at random.

Allowing the client to decide what server to use is often not a good idea. A serious imbalance in server utilization may occur. One server could end up with all clients and the rest of the servers could be unused. Some sort of load balancing is therefore required.

The OSI model [19] is an abstract description for computer network protocol design. It specifies seven layers that in turn are grouped into host and media layers. The different layers contain protocols and services that provide similar functions. Instances in a layer provide services to the instances in the layer above and request service from the layer below. Each layer provides encapsulation for the layer above, and the higher you get in the layer hierarchy, the more information is available about a packet.

Performing load balancing on a lower level in the OSI model is less demanding. Load balancing in the transport layer is less complicated and resource demanding than doing it in the application layer. The transport layer only contains information about connections between hosts, and no information about the actual applications that are communicating or the content of the messages.

The OSI Model

              Data unit   Level   Layer          Protocols
Host layers   Data        7       Application    NNTP, SIP, SSI
              Data        6       Presentation   MIME, XDR
              Data        5       Session        Pipes, NetBIOS, SAP
              Segments    4       Transport      TCP, UDP, SCTP, SSL, TLS
Media layers  Packet      3       Network        IP, ICMP, IPSec, IPX
              Frame       2       Data link      ARP, CSLIP, SLIP, Ethernet, PPP, PPTP
              Bit         1       Physical       USB, Bluetooth, 802.11a/b/g/n, RS-232

Figure 2.12: The 7 layer OSI Model.

2.4.1 Load balancing using Round Robin DNS

This simple load balancing technique is performed in the application layer; DNS is an application layer protocol. The approach is to use DNS A records with round robin access to a number of configured servers. One good example using this technique is Google. If you type host -t a www.google.se in a command prompt you get a list of all A records associated with that address. Each time you run the command and query the domain name server, the list gets rotated so the next time another IP address will be returned by the query.

server:~> host -t a www.google.se

www.google.se is an alias for www.google.com.

www.google.com is an alias for www.l.google.com.

www.l.google.com has address 74.125.43.106






The DNS based load balancing method is simple but it has limitations. The technique has no knowledge of server availability, and if one server goes down it keeps on sending clients to that server. Also, it does not account for DNS caching done by other name servers or even the client. The IP returned by the request can be cached by other name servers, so the requests may not even reach the load balancing DNS server and the clients will get the same IP address with every lookup.
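Both the rotation and the caching limitation can be shown with a toy model. This is illustrative Python and not a real DNS implementation; the classes `RoundRobinDNS` and `CachingResolver` are hypothetical, the addresses are placeholders from a documentation range, and the cache deliberately ignores record lifetimes to keep the sketch short.

```python
from collections import deque


class RoundRobinDNS:
    """Toy name server that rotates its list of A records on every query."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def query(self):
        answer = list(self._records)
        self._records.rotate(-1)  # the next query starts with the next address
        return answer


class CachingResolver:
    """Toy intermediate resolver that keeps the first answer it receives."""

    def __init__(self, upstream):
        self._upstream = upstream
        self._cached = None

    def resolve(self):
        if self._cached is None:
            self._cached = self._upstream.query()
        return self._cached[0]  # every client behind this cache gets the same server


dns = RoundRobinDNS(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
first_answers = [dns.query()[0] for _ in range(4)]
```

Clients that query the rotating server directly are spread over the three addresses, while all clients behind the caching resolver end up on one server.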

2.4.2 Transport layer load balancing

Layer 4 in the OSI model shown in Figure 2.12 is the transport layer. Typical examples of layer 4 protocols are TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). This is where many hardware based load balancers work. They are highly specialized ASICs (Application Specific Integrated Circuits) that can perform a limited number of tasks (rewriting network packets) very fast compared to a general purpose system. The drawback of these kinds of hardware solutions tends to be the cost.


IPVS (IP Virtual Server) [3] is a software based load balancer using layer 4 switching. It offers several approaches to load balancing inside the Linux kernel. The machine running IPVS acts as a load balancer coordinating a number of servers doing the actual work. The clients only know one IP address (the VIP) and make all requests to it, not knowing about the servers in the cluster.

NAT

Network Address Translation enables the use of a single external address that is assigned to the load balancer. The address that the clients use to access the system is called the Virtual IP (VIP). The servers sit on a private subnet behind the load balancer and are not directly accessible from the outside world.

Figure 2.13: Load balancing using NAT.

Incoming request packets from the outside world are rewritten by the load balancer with the destination IP of the selected server and dispatched on the private subnet as shown in Figure 2.13. The load balancer acts like a firewall/gateway between the two networks. Indeed, the servers must set the load balancer as their default gateway. Responses from the servers are sent through the load balancer, which sets the source address of the packet to its own external address (the VIP). Characteristics of NAT based load balancing:

– Provides good security as all packets must pass through the load balancer.

– Relatively easy to implement: the servers can run any type of operating system without special protocols or modification. Only one public IP address is needed for the load balancer and the servers can use private IP addresses.

– Both request and response packets have to be rewritten by the load balancer, increasing the load.

– Load balancer must maintain a session table of all connections.

– All machines must reside in one logical network, preferably with non routable private IP addresses.
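The address rewriting and the session table can be sketched as below. This is a simplified Python model and not IPVS code: the `Packet` and `NatBalancer` names are hypothetical, the session table is keyed on the client address only (a real balancer would also track ports), and round robin is assumed for server selection. The addresses are the ones used in Figure 2.13.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


class NatBalancer:
    def __init__(self, vip, servers):
        self.vip = vip
        self.servers = servers
        self._next = 0
        self.sessions = {}  # client address -> selected server

    def inbound(self, packet):
        """Rewrite a request addressed to the VIP so it reaches a real server."""
        server = self.sessions.get(packet.src)
        if server is None:
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            self.sessions[packet.src] = server
        return replace(packet, dst=server)

    def outbound(self, packet):
        """Rewrite a server response so it appears to come from the VIP."""
        return replace(packet, src=self.vip)


nat = NatBalancer("130.239.10.1", ["192.168.0.2", "192.168.0.3", "192.168.0.4"])
request = nat.inbound(Packet(src="81.172.168.23", dst="130.239.10.1"))
response = nat.outbound(Packet(src=request.dst, dst=request.src))
```

Both directions pass through the balancer, which is the source of the extra load noted in the list above.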


IP Tunneling

The tunneling method requires that the servers have set up tunneling devices with the same VIP. The load balancer has the same VIP. The tunneling devices must not respond to ARP requests, to ensure that the MAC address of the load balancer's interface is the one associated with the VIP in the ARP table of the main router connecting the network to the Internet. Otherwise the router would see multiple interfaces sharing the same VIP.

Figure 2.14: Load balancing using IP tunneling.

Figure 2.14 shows how packets sent to the VIP arrive at the load balancer. They are encapsulated in an IP packet with the real IP address of a server set as the destination address. The server then decapsulates the packet and processes the request. The server can now send the response back to the client without the load balancer having to process the packet, so called direct return. It sets the source address of the packet to the VIP. Characteristics of IP tunneling:

– Harder to implement. Every server has to be configured with a tunneling device.

– The servers can have any real IP and be geographically distributed.

– The load balancer only has to process request packets.
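The encapsulation step and the direct return can be modelled in a few lines. This is an illustrative Python sketch, not a real tunneling implementation: `IpPacket` and the three helper functions are hypothetical names, and the outer source address is assumed to be the load balancer's own address. The other addresses follow Figure 2.14.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IpPacket:
    src: str
    dst: str
    inner: Optional["IpPacket"] = None  # set when this packet wraps another one


def encapsulate(request, balancer_ip, server_ip):
    """Load balancer: wrap the untouched request in a packet for the real server."""
    return IpPacket(src=balancer_ip, dst=server_ip, inner=request)


def decapsulate(outer):
    """Server: unwrap the original request, which is still addressed to the VIP."""
    if outer.inner is None:
        raise ValueError("packet is not encapsulated")
    return outer.inner


def direct_response(request):
    """Server: answer the client directly, using the VIP as source address."""
    return IpPacket(src=request.dst, dst=request.src)


request = IpPacket(src="81.172.159.23", dst="130.239.10.2")
tunneled = encapsulate(request, "130.239.10.1", "182.133.159.12")
response = direct_response(decapsulate(tunneled))
```

The response never touches the load balancer, which is why it only has to process request packets.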

Direct routing

Direct routing is similar to tunneling. Every server shares the same VIP with alias devices, and the alias devices are configured not to answer any ARP requests. As with tunneling, the MAC of the load balancer is associated with the VIP in the main router so all packets from the clients are sent to the load balancer.

Instead of encapsulating the request in a new IP packet, the load balancer simply rewrites the MAC address of the Ethernet frame and sets it to the MAC address of the server. The packet is sent to the server. Any response can then be sent directly to the client without passing the load balancer (Figure 2.15). Characteristics of direct routing:

– Every server has to be configured with an alias device.

– The servers must be in the same physical network.

– The load balancer only has to process request packets.

– No modification has to be made on IP-level, the load balancer rewrites the MAC address of every request.

– Less overhead since no packets have to be encapsulated.

Direct routing provides the best performance of the three techniques, mainly because of the small overhead introduced when routing packets.

Figure 2.15: Load balancing using direct routing.
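The difference from tunneling is that only the link layer destination changes. The following Python sketch illustrates this; `EthernetFrame` and `forward_to_server` are hypothetical names, and the MAC addresses are made-up placeholders, since the figures do not give any.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EthernetFrame:
    dst_mac: str
    src_ip: str
    dst_ip: str


def forward_to_server(frame, server_mac):
    """Rewrite only the destination MAC; the IP header stays untouched."""
    return replace(frame, dst_mac=server_mac)


incoming = EthernetFrame(
    dst_mac="00:00:5e:00:53:01",  # placeholder MAC of the load balancer
    src_ip="81.172.159.23",
    dst_ip="130.239.10.2",        # the VIP shared by all servers
)
forwarded = forward_to_server(incoming, "00:00:5e:00:53:05")
```

Because the destination IP is still the VIP, the selected server accepts the packet on its alias device and can answer the client directly.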

2.4.3 Application layer load balancing

Transport layer load balancing works without any knowledge of application data. It simply distributes TCP connections over a range of servers without taking application-specific conditions into account. With an application having to deal with client session data and client affinity, efficient load balancing on the application layer becomes important.

Application layer load balancers can be used together with transport layer load balancers. Introducing them in between the servers and the transport layer load balancers allows spreading of connections across the application layer load balancers, which then decide what server to redirect the request to.

The overhead of parsing requests in the application layer is high, which limits scalability compared to load balancing in the transport layer.

2.4.4 Server selection

Transport layer

Server selection by a transport layer load balancer can be performed with various algorithms. In contrast to application layer load balancing, the load balancer selects a server based on known network parameters. Some of the algorithms that IPVS has implemented:

– Round Robin Scheduling. This is a simple algorithm that assigns jobs to the servers equally and sequentially. The mathematical form for selecting a server is i = (i + 1) mod n, where n is the number of servers. IPVS has implemented an extra condition of removing a server from the selection if it is not available.

– Weighted Round Robin Scheduling. It adds a weight to each available server to handle the differences in processing capabilities that can exist in a cluster of servers. Servers with higher weights will be assigned clients before those that have lower weights.

– Least-Connection Scheduling. The load balancer counts the number of connections to each server. In contrast to Round Robin, this is a dynamic scheduling algorithm, as it has to count the number of connections that each server has to estimate the load at a given time.

– Weighted Least-Connection Scheduling. Just like Weighted Round Robin, each server can be assigned a weight depending on its capacity.

– Shortest Expected Delay Scheduling. This algorithm assigns clients to servers with the shortest expected delay. The formula is ED = (Ci + 1) / Ui, where ED is the expected delay that the job will experience, Ci is the number of connections on the ith server and Ui is the weight of the ith server.

– Never Queue Scheduling. When there is an idle server available, always send the job to that server instead of waiting for a faster one. If there is no idle server, Shortest Expected Delay Scheduling is used.
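Some of these selection rules can be written down directly from the formulas above. The sketch below is illustrative Python and not the IPVS source code; the `Server` record and the function names are hypothetical, and the weighted variants are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int = 1
    connections: int = 0
    available: bool = True


def round_robin(servers, last_index):
    """Apply i = (i + 1) mod n, skipping servers that are not available."""
    n = len(servers)
    i = last_index
    for _ in range(n):
        i = (i + 1) % n
        if servers[i].available:
            return i
    raise RuntimeError("no server available")


def least_connection(servers):
    """Pick the available server with the fewest current connections."""
    return min((s for s in servers if s.available), key=lambda s: s.connections)


def shortest_expected_delay(servers):
    """Pick the available server with the lowest ED = (Ci + 1) / Ui."""
    return min(
        (s for s in servers if s.available),
        key=lambda s: (s.connections + 1) / s.weight,
    )


def never_queue(servers):
    """Use an idle server if one exists, otherwise fall back to shortest delay."""
    for server in servers:
        if server.available and server.connections == 0:
            return server
    return shortest_expected_delay(servers)
```

With a server a (weight 1, 2 connections) and a server b (weight 4, 5 connections), least connection picks a, whereas shortest expected delay picks b because 6/4 is smaller than 3/1.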

Application layer

Neither Round Robin DNS load balancing nor transport layer load balancing takes the contents of the request into account; they are not content-aware. No load balancing decisions are made based on the content of the request. In addition to the techniques for selecting a server that exist for the transport layer, inspecting the application layer adds the following possibilities.

– Server specialization. Servers have different capabilities in terms of CPU, I/O throughput and storage. Some servers are more suited for transactions and others can act as massive storage systems. You can redirect requests for static content to less capable servers and requests for dynamic content to more powerful servers.

– Business load balancing. Some customers might be more valuable than others and you want to make sure that they are redirected to the fastest servers available.

– Anticipating load. By inspecting the request you can anticipate the amount of load the request will generate. If it is a request that will take a long time to complete, you might want to redirect it to your most powerful server, depending on how important the client is.
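A content-aware selection rule combining the first two possibilities could look like the sketch below. It is illustrative Python only; the pool names, the `/static/` path prefix and the customer tiers are hypothetical and would depend entirely on the application.

```python
def select_pool(path, customer_tier="standard"):
    """Choose a server pool from the content of the request."""
    if customer_tier == "premium":
        return "fast"      # business load balancing: valuable customers first
    if path.startswith("/static/"):
        return "static"    # server specialization: less capable servers suffice
    return "dynamic"       # everything else goes to the general purpose servers


pool = select_pool("/static/logo.png")
```

A transport layer balancer cannot make this decision, since it never sees the request path or the identity of the customer.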

Server agent

So far all the server selection techniques have relied on the load balancer keeping track of the number of connections made to a server and, in the case of application layer load balancing, inspecting the packets and analyzing the client requests. These are all variables known to the load balancer from the fact that it does packet inspection and rewriting; it does not request this information from the servers.


Feedback and status updates from the servers will allow the load balancer to make more educated server selections. A server agent is monitoring software running on the servers that collects performance and load data. The server agent then sends this data to the load balancer to inform it about the real time status of the server. Using a server agent is important when direct return is used, where response packets from the servers to the client do not pass through the load balancer, as in the case of IP tunneling (Figure 2.14) and direct routing (Figure 2.15). The load balancer then has no way of knowing the bandwidth consumed by the server response to the client.

Making server selection based on information provided by the server agent must be carefully thought through, and weighting of load parameters is crucial to have a good indication of the aggregated server load. Some of the parameters to consider:

– CPU usage

– Bandwidth usage

– Memory usage

– IOPS (Input/Output Operations Per Second)

The server agent could also monitor the services running on the server and report which services are up and running and which ones seem to have failed. The server agent thereby provides basic health monitoring to the load balancer.
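One possible way to aggregate the reported parameters is a weighted average, sketched below in Python. The weights and metric values are invented for the example and each metric is assumed to be normalized to the range 0 to 1; `load_score` and `pick_server` are hypothetical helper names. A server whose agent has stopped reporting is represented by `None` and is skipped.

```python
def load_score(metrics, weights):
    """Weighted average of utilization figures, each between 0 and 1."""
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total


def pick_server(reports, weights):
    """Return the name of the reporting server with the lowest aggregated load."""
    alive = {name: m for name, m in reports.items() if m is not None}
    return min(alive, key=lambda name: load_score(alive[name], weights))


weights = {"cpu": 0.4, "bandwidth": 0.3, "memory": 0.2, "iops": 0.1}
reports = {
    "app1": {"cpu": 0.5, "bandwidth": 0.5, "memory": 0.5, "iops": 0.5},
    "app2": {"cpu": 0.9, "bandwidth": 0.1, "memory": 0.1, "iops": 0.1},
    "app3": None,  # agent silent: treated as failed
}
selected = pick_server(reports, weights)
```

Here app2 is selected despite its high CPU usage, which shows how strongly the outcome depends on the chosen weights.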

Figure 2.16: Server agent running on the servers and reporting back server status and load to the load balancer.

2.4.5 Availability

A single load balancer introduces a single point of failure. If the load balancer fails, clients cannot reach the servers. That is why a hot spare is used as a backup. It often sits unused until it is needed when the master balancer fails.

Both active and spare load balancers monitor each other's health with VRRP (Virtual Router Redundancy Protocol). The monitoring can be performed over a separate serial cable or over the network. If the active load balancer fails to respond to heartbeat messages, the spare machine will take over the VIP, assuring availability of the system.
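The takeover logic on the spare machine boils down to a heartbeat timeout. The Python sketch below shows only that idea and is not an implementation of VRRP; the class name `BackupBalancer`, the timeout value and the injectable `clock` are assumptions made for the example.

```python
import time


class BackupBalancer:
    """Spare load balancer that claims the VIP when heartbeats stop arriving."""

    def __init__(self, vip, timeout_seconds, clock=time.monotonic):
        self.vip = vip
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_heartbeat = clock()
        self.owns_vip = False

    def on_heartbeat(self):
        """Called whenever the active load balancer answers a heartbeat."""
        self.last_heartbeat = self.clock()

    def poll(self):
        """Called periodically; returns True once this machine owns the VIP."""
        silent_for = self.clock() - self.last_heartbeat
        if not self.owns_vip and silent_for > self.timeout:
            self.owns_vip = True  # a real system would now claim the VIP
        return self.owns_vip
```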


Figure 2.17: A backup load balancer is sitting unused, ready to take over the VIP of the active load balancer if it stops responding to heartbeat messages.

Monitoring the servers can be done by sending simple requests to them. In the case of transport layer load balancing, the load balancer could use ICMP ping or connect to an open socket on the server to see if it is alive. If the ping request times out, the server is assumed offline.

A simple ICMP ping will however say nothing about the application status. Even if the server is responding to pings, the application could have crashed and the load balancer would wrongly assume that the service is still up. In the case of a web server, the load balancer could make a connection to the web server port and issue an HTTP GET request for a known page. If the server responds with the correct information, the application is functioning.

One server does not necessarily run only one service, so just checking one service to see if the server is functioning can sometimes not be enough. If one of two services running on the server is still functioning and the services are independent, the server could still be used. This is why advanced scripts are often used to monitor the uptime of services.
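An application level health check of the kind described for a web server could be sketched as follows. This is illustrative Python; `http_get` and `is_healthy` are hypothetical helper names, and the expected page content is an assumption that each service would define for itself. The fetch step is passed in as a function so that several independent services on one server can be checked the same way.

```python
import urllib.request


def http_get(url, timeout=2.0):
    """Fetch a known page and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def is_healthy(fetch, expected_text):
    """A service is healthy if the page is served and has the expected content."""
    try:
        status, body = fetch()
    except OSError:
        return False  # refused connection, timeout or HTTP error
    return status == 200 and expected_text in body
```

A load balancer could, for instance, call `is_healthy(lambda: http_get(url), "OK")` once per monitored service, where `url` and the expected text are chosen per service.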

The goal of this work was not to ensure 100% availability of the web socket server. Considering the complexity of that problem, it would almost need a report by itself. My current solution has a single load balancer which accepts client connections and assigns the clients to available servers. The servers are configured to connect to a load balancer on a given address and, in the case of a disconnect, the servers will try to reconnect forever. If the load balancer fails, some external network logic could reassign the IP address of the failed load balancer to a backup load balancer. The servers would then reconnect to this new load balancer and register themselves as available. The clients would also connect to the backup load balancer and the system would be up and running again. This reassignment of the IP address would be completely transparent to the servers and clients and requires no extra configuration for them to work. As stated, such an implementation is beyond the scope of this work.

Chapter 3

Design and Implementation

In this chapter I will explain how the work was conducted, what preparations were made and the motivation for the various choices throughout the design. I will begin by describing the preliminary preparations, and the subsequent sections will describe the actual implementation of the system.

3.1 Preparation

Preparation, I have often said, is rightly two-thirds of any venture.

- Amelia Earhart

3.1.1 Theoretical study

The integral requirements for this project stated that the final implementation would provide good scalability and performance. In order for the system to be easily extended with additional resources, the implementation would have to be designed in a modular way so that it could be distributed among any number of physical computers. The complexity of the underlying architecture should be hidden from the clients, therefore a way of distributing the clients and load among the available resources would also have to be implemented.

I chose load balancing and scalability as the subject for my in-depth study. Reading about and analyzing the various methods of load balancing would give me ideas on how to design that part of the system. Designing for scalability should provide the necessary information on how to implement the actual application that services the clients.

Scalability

Analyzing the work that the application actually has to perform and identifying the working condition is very important to achieve good scalability. Before expanding the system onto additional computers, the application has to make use of the available resources of the current computer. To achieve this, I would use the following approaches:

– Thread the application

– Allow for multiple instances on the same computer

Threading the application into n parts, where n is the number of cores in the computer, will allow the operating system to run the threads on individual cores. Naturally, having fewer threads than the number of cores will result in inefficient use of the computing resources by the application.

Allowing multiple instances of the application to run on the same computer would also utilize available resources better, and would allow the programmer to simplify the data processing into a single thread if desired. Duplicating the entire application is, however, a waste of resources, albeit a less serious one: instead of duplicating just the data processing threads, you duplicate the entire application, with a bigger memory footprint.

To provide maximum efficiency I made the decision to have the best of both worlds. The application would be threaded and it would be possible to run multiple instances of it on the same computer.

Load balancing

The possibility for multiple instances of the application and the requirement that the architecture would be invisible to the client added another complexity to the system. Distributing (and in the end balancing) the load among the instances had to be implemented. The in-depth study brought up the possibility of having a load balancer that would do all the decision making and also provide a single point of entry for the clients.

This single point of entry could quickly become a bottleneck for the system, so I had to analyze and carefully plan what traffic would pass through the load balancer. If all traffic should be routed through the load balancer, the amount of data being pushed from TickCapture would increase the load on the load balancer linearly with every client. Also, the geographical distribution of clients, servers and the load balancer could mean inefficient traffic routes and increase message latency. Imagine having the load balancer in Australia and the server and client in Sweden. All traffic between client and server would then have to be routed from Sweden to Australia and back.

A more efficient way to implement the load balancer would be to only allow for traffic between client and load balancer in the initial stage. The client would connect to the load balancer and receive connection details for the selected websocket server and then connect to the service. The subsequent client-server interaction would then proceed by direct communication, never routing traffic through the load balancer.

I considered the possibility to use a decentralized system, in essence a peer-to-peer implementation where the clients would have a list of available servers and then through an internal algorithm they would connect to a server in the system. Immediately there are two problems with this approach. How would the list of available servers be updated in the clients? What would happen with a rogue client that completely ignores the server selection algorithm? The benefit of having a decentralized system is that there is no real single point of failure.

The requirements of this project do not include uninterrupted system uptime. Having a single load balancer implies a single point of failure. An offline load balancer means that no clients can reach the available servers. The in-depth study identified ways to counter this problem by having backup load balancers and, in the case of failure, network logic would take care of routing traffic from the primary load balancer to the backup. This is beyond the scope of this work and no effort was made into implementing such a backup system.

3.1.2 Integration with TickCapture

The result of this project will integrate with TickCapture, the existing system for processing market data. The initial integration depends on receiving market updates from TickCapture and also sending messages to TickCapture. An API exists for communicating with the various parts of TickCapture and it is available for C++. This influenced my choice of language as I wanted the integration to be as seamless as possible.

The subsystem that the websocket implementation would communicate with is the RealTimeManager, which provides access to real time market data. There were many examples provided that allowed me to see how the interaction with the system was done. I will describe the actual steps to integrate with RealTimeManager later in this chapter.

The API defines update messages passed by RealTimeManager as containing the name of the subscribed subject (bond) and a string representation of the fields and their values. TickCapture does provide decoding using field enumerations, but I realized this too late, so the string representation was used. Performance-wise this is the weaker choice because of all the extra string parsing that has to be done. This parsing had to be implemented in the websocket server. The update messages could very well have been routed through the websocket server and directly to the clients without any parsing done by the server, but the requirement that clients would be able to disconnect, reconnect and even throttle messages meant that parsing the update messages was essential in order to keep client states.

3.1.3 Examining the Websocket API

The Websocket API and protocol are part of the HTML5 standard but the specification is still in the draft stage. It evolves and changes on a daily basis but the fundamental parts of the specification have remained the same so far. In essence, a websocket is a regular socket with an additional layer of handshaking and message passing. The protocol defines the handshaking process and the framing of messages sent through the socket.

As the protocol is still in draft I decided that I will only implement the very basics of the protocol. This will allow a regular browser with HTML5 support to connect to my websocket server and send/receive messages. At the time of writing, the browser of choice is Google Chrome. Firefox and several other browsers are expected to get full websocket support very soon.

Since the protocol is new there aren’t many examples for me to look at, so I really had to dig in and truly understand the protocol standard. In coming chapters I will describe the Websocket protocol.

3.1.4 Client side design

The client used to present the real time market data is a simple web browser with HTML5 compliance viewing a web page. The browser that I am using in my development is Google Chrome. Since the websocket API and protocol are being developed by Ian Hickson at Google Inc, one would anticipate that their browser would have good support for websockets. The important features of the client side are:

– Be able to create and maintain a connection to a websocket server.

– Present market updates immediately and in real time as they are pushed from the server to the client.

– Be able to subscribe to any number of subjects. Tables and other means of data presentation are not limited to a specific number of subjects.

– Allow for user interaction. Client must be able to connect and disconnect at any time and throttle the rate of messages being pushed by the server.

– Present real time server statistics like CPU usage, RAM usage and bandwidth allocation.

Presenting real time data in a browser is challenging. I had to choose a suitable client side language that could handle all the above and these additional requirements:

– Supported by the most commonly used web browsers.

– Provide easy integration with web sockets.

– Fast enough to be able to process the high number of update messages received from the server.

– Be able to dynamically create, manipulate and present the web page.

I am by no means a web site designer when it comes to the actual looks and appeal of a web site. However, I do think that I can provide good usability and will stick to a very simple client page that fulfills all the presented requirements. Designing an extensive and customizable page is beyond the scope of this project.

3.1.5 System overview

The basic system structure is shown in figure 3.1. It illustrates how four clients use a web page constructed in JavaScript/jQuery to connect to the system through a load balancing proxy. The system has three physical servers each running three web socket servers for a total of nine. Each web socket server connects to TickCapture which is located at another physical server.

3.2 Selection

After having studied the goals and requirements of the various parts of this project, the next step was to choose what languages, libraries, protocols and development environments would be used. Knowing what the requirements are significantly aids in the choice.

3.2.1 Programming languages

There are a multitude of available programming languages, each with their own benefits. My goal was to choose a language with widespread support and extensive open source libraries under permissive licenses. Not reinventing the wheel by using existing libraries both helps in speeding up development and minimizes the risk of major bugs surfacing later on in the process.

[Figure 3.1 diagram: four clients, each viewing the web page, connect through a proxy (load balancer) to three servers, each running three websocket servers; all websocket servers connect to TickCapture on a separate server.]

Figure 3.1: System overview.

Websocket server and proxy

There was no explicit requirement that the finished product would have to run in both Windows and Linux. TickCapture does run in both environments, so I decided early on that my system would do the same. My past experience includes a lot of C/C++ programming so it was the first language that came to mind. It would provide a fast and portable platform provided that portable libraries were used. I settled for C/C++ for both the websocket server and the proxy. The fact that the API for TickCapture was available in C++ influenced my decision.

Client

I have been using PHP for most of my web pages in the past and it has provided a robust and easy platform to work with. While I was browsing for information about web sockets I found some example code showing how to create and connect to a websocket in JavaScript [16]. I had very little experience with JavaScript but I am more than happy to learn new programming languages, and after a bit more research I decided that JavaScript would provide the necessary tools for my client side web page. JavaScript is very similar to C in syntax and its features include:

– Functional programming and structured programming syntax.

– Dynamic and object based.

– Run time evaluation. Relies on a run time environment, e.g. a web browser.

– Supports regular expressions.

Please note that when I am talking about JavaScript I am actually referring to client side JavaScript. All the code is executed by the browser.


3.2.2 Libraries

Having programmed in C/C++ before, I had a good idea of suitable libraries that would provide the functionality I needed. On the client side it was somewhat more difficult. I had to rely on information from other users to make my choice. Ultimately, practical testing and licensing would decide if a library proved useful or not. The requirement was that the licensing would have to be permissive.

Websocket server and proxy

I knew that the application had to be threaded and that there would be a lot of string parsing. Working with threads in Windows and Linux differs enough to make it worthwhile to use a library that abstracts the actual implementation. Powerful string parsing can be achieved by using regular expressions [9]. Regular expressions are basically special text strings for describing search patterns. They provide very powerful ways to match strings and to extract data.

The Boost C++ library [4] provides a large list of libraries for many areas of use. Boost uses a permissive license and the community consists of several thousand people with many very active members. There are two libraries in Boost that are of special interest to me:

– Boost::Thread - Portable C++ multi-threading which abstracts the usage of threads. In Windows one would use Windows threads and in Linux pthreads. The Boost library hides this and provides a simple interface that works in both environments.

– Boost::Regex - Regular expression library. Provides powerful string parsing and matching.

As the Boost libraries are available in both Windows and Linux, porting from Windows to Linux would be less complicated. In addition to the Boost libraries I naturally used the provided TickCapture libraries to integrate with the system.

Client

Using JavaScript immediately makes a vast number of libraries available. JavaScript is widely used and recognized and one of the most useful libraries is jQuery [15]. It is a fast and concise JavaScript library that simplifies the development of web pages. Notable features include:

– HTML document traversing and DOM (Document Object Model) manipulation.

– Event handling.

– CSS manipulation.

– Animation support.

The ability to dynamically create and manipulate DOM elements will allow me to present the data in almost any form and allow the user to interact with the page without having to reload the page when major changes occur. The websocket API allows me to register event handlers that are triggered in the client. For instance, the onmessage event in the websocket interface is triggered every time a message arrives from the server. This completely eliminates the need to poll the server connection for new data.

Support for animation will provide the tools for making the user aware of subject updates. For instance, when a field for a subject changes, the value will flash red if the change was negative and green for a positive change.

jQuery can be licensed under either the MIT License or the GPL. The MIT license is permissive so it falls within the requirements.

3.2.3 Tools

The development was mainly done in Windows, simply because my personal workstation was running Windows. Regular builds in Linux ensure that the code compiles in both environments.

Windows

I chose to work in Visual Studio 2008. I have been using Visual Studio for most of my projects and so the IDE is familiar to me.

Linux

Editing code in Linux was mainly done with editors such as Emacs or remotely from Windows using UltraEdit. Compiling was then done with a makefile using g++ as the compiler. The makefile ensures that compilation is easy enough for anyone who wants to build the software themselves and also defines all necessary dependencies to make incremental building possible.

Source files were transferred back and forth during development to ensure that everything compiled on both platforms and that nothing was broken.

3.3 Execution

After discussions with the author of TickCapture, Hans Erik Thrane, I had a good idea of what was needed by my websocket implementation and what aspects were of special interest. The first step was to investigate if the websocket protocol and API would work for this project. The protocol is new and still in the draft stage so there were not many examples to dissect. Luckily, the protocol and API were very well documented and the specifications [13] were easy to follow.

3.3.1 The Websocket protocol

The websocket protocol specifies a procedure for setting up a connection between a server and a client which includes a series of handshakes.

Handshaking

The simplified algorithm presented here is the version available at the time of writing. The protocol is still not finalized so it may change in future versions. The algorithm below contains only the very basic procedure needed to set up a connection. Visit the specification for the websocket protocol for a complete algorithm.

1. Client initiates a socket connection given a websocket URL. The URL is defined as ws://example.com/resource.


2. Server accepts connection and waits for client to send handshake.

3. Client sends handshake to the server. The handshake is specified as:

GET /resource HTTP/1.1

Host: example.com

Connection: Upgrade

Upgrade: WebSocket

Origin: http://example.com

End by sending the following byte sequence: 0x0d, 0x0a, 0x0d, 0x0a

4. Server reads from client until the byte sequence 0x0d, 0x0a, 0x0d, 0x0a is found. Parse handshake and verify validity of request.

5. If client handshake is sane, the server sends a handshake response to the client:

HTTP/1.1 101 Web Socket Protocol Handshake

Upgrade: WebSocket

Connection: Upgrade

WebSocket-Origin: http://example.com

WebSocket-Location: ws://example.com/resource

End every line with the byte sequence 0x0d, 0x0a and finally send an extra 0x0d, 0x0a to end the handshake.

6. Client reads from server until the byte sequence 0x0d, 0x0a, 0x0d, 0x0a (an empty line) is found. Parse handshake and verify validity.

7. If server handshake is sane, the connection is ready to transmit data.
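The server side of steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration and not the code of the actual server: the function names handshakeComplete, headerValue and buildServerHandshake are hypothetical, header lookup is case sensitive for simplicity, and the WebSocket-Location value is written with the ws:// scheme, which is an assumption based on the draft specification.

```cpp
#include <cstddef>
#include <string>

// Step 4: the client handshake is complete once the buffer contains
// the terminating byte sequence 0x0d 0x0a 0x0d 0x0a.
bool handshakeComplete(const std::string& buffer) {
    return buffer.find("\r\n\r\n") != std::string::npos;
}

// Returns the value of a header line such as "Host" or "Origin",
// or an empty string if the header is not present.
std::string headerValue(const std::string& handshake, const std::string& name) {
    const std::string key = "\r\n" + name + ": ";
    const std::size_t start = handshake.find(key);
    if (start == std::string::npos) return "";
    const std::size_t valueStart = start + key.size();
    const std::size_t end = handshake.find("\r\n", valueStart);
    return handshake.substr(valueStart, end - valueStart);
}

// Step 5: every line ends with 0x0d 0x0a and an extra 0x0d 0x0a
// terminates the handshake.
std::string buildServerHandshake(const std::string& origin,
                                 const std::string& host,
                                 const std::string& resource) {
    return "HTTP/1.1 101 Web Socket Protocol Handshake\r\n"
           "Upgrade: WebSocket\r\n"
           "Connection: Upgrade\r\n"
           "WebSocket-Origin: " + origin + "\r\n"
           "WebSocket-Location: ws://" + host + resource + "\r\n"
           "\r\n";
}
```

A server would typically append received bytes to a buffer, call handshakeComplete after each read, and then feed the Host and Origin values into buildServerHandshake.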

Data framing

Once the handshaking procedure is complete and successful, both client and server can begin transmitting data. Data is sent in the form of messages delimited by special characters. Messages must always be encoded in UTF-8. Basically a message is defined as:

0x00 + [UTF-8 encoded string] + 0xff

The algorithm for reading a message:

1. Try to read one byte from the socket.

2. If the byte equals 0x00, do the following:

(a) Let data be an empty byte array.

(b) Read a byte. Let b be that byte.

(c) If b is not 0xff, append b to data and return to step (b).

(d) Interpret data as a UTF-8 string.

The corresponding algorithm for sending a message:

1. Send a 0x00 byte.

2. Send message encoded as UTF-8.

3. Send a 0xff byte.

Any data not matching the algorithm must be discarded.
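The two framing algorithms above can be sketched as follows. The function names frameMessage and extractMessages are hypothetical and the payload is assumed to be UTF-8 encoded already; the sketch works on a receive buffer instead of reading the socket byte by byte, but follows the same rules: bytes outside a frame are discarded and an incomplete frame is left in the buffer until more data arrives.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sending: 0x00 + [UTF-8 encoded string] + 0xff.
std::string frameMessage(const std::string& utf8Payload) {
    std::string frame;
    frame.reserve(utf8Payload.size() + 2);
    frame.push_back('\x00');
    frame += utf8Payload;
    frame.push_back('\xff');
    return frame;
}

// Reading: extracts all complete messages from the buffer and removes
// the consumed bytes. A trailing incomplete frame stays in the buffer.
std::vector<std::string> extractMessages(std::string& buffer) {
    std::vector<std::string> messages;
    std::size_t pos = 0;
    while (pos < buffer.size()) {
        if (buffer[pos] != '\x00') {  // data not matching the algorithm
            ++pos;                    // is discarded
            continue;
        }
        const std::size_t end = buffer.find('\xff', pos + 1);
        if (end == std::string::npos) break;  // frame not complete yet
        messages.push_back(buffer.substr(pos + 1, end - pos - 1));
        pos = end + 1;
    }
    buffer.erase(0, pos);
    return messages;
}
```

Using 0xff as the terminator is safe because that byte never occurs in valid UTF-8 text.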

Establishing connection

The following example shows the basic procedure to establish a websocket connection using jQuery. The websocket is created and then three event handlers are registered with the websocket. OnOpen is called when the websocket connection has been established, OnClose when the connection has been closed and finally OnMessage is triggered when a message is received. These functions are completely asynchronous.

var websocket;

$(document).ready(function() {
    if ("WebSocket" in window) {
        websocket = new WebSocket("ws://example.com/resource");
        websocket.onopen = OnOpen;
        websocket.onmessage = OnMessage;
        websocket.onclose = OnClose;
    } else {
        alert("Your browser does not support Websockets");
    }

    function OnOpen() {
        // do initialization here
    }

    function OnClose() {
        // do cleanup here
    }

    function OnMessage(evt) {
        // do processing of message
        ProcessMessage(evt.data);
    }

    function SendMessage(message) {
        return websocket.send(message);
    }
});

The function SendMessage is used to send messages to the server.

3.3.2 The TickCapture API

TickCapture comes with an existing API for C++ and libraries compiled for Windows and Linux. There were great examples on how to connect to TickCapture and start subscribing to a feed. For testing purposes it is possible to set up a simulation environment that replays a recorded feed instead of a real time live feed. This made it very easy to develop and test the system.

Only three services were required to run to connect to the historical feed: the QuoteManager, Replay and the RealTimeManager services. For a real system subscribing to a real service, several other services are required, e.g. DatabaseManager, MarketManager and the OrderRouter.

Establishing connection

To subscribe to the feed, a Dispatcher object is used. It connects to the RealTimeManager and calls handler functions when given events occur. We derive our Controller struct from Dispatcher and override the functions that are called for those events.

namespace tcap {

struct Controller :
    public tcap::realtime::Dispatcher,
    public tcap::realtime::Dispatcher::Handler,
    public tcap::realtime::Record::Handler
{
    Controller(const TCAP_RealTime_Dispatcher_Options options);

protected:
    void onEvent(const TCAP_RealTime_Connect connect) const;
    void onEvent(const TCAP_RealTime_Disconnect disconnect) const;
    void onEvent(const TCAP_RealTime_Heartbeat heartbeat) const;
    void onEvent(const TCAP_RealTime_SubscribeAck subscribe_ack) const;
    void onEvent(const TCAP_RealTime_UnsubscribeAck unsubscribe_ack) const;
    void onEvent(const TCAP_RealTime_PublishAck publish_ack) const;
    void onEvent(const TCAP_RealTime_RequestAck request_ack) const;
    void onEvent(const TCAP_RealTime_Update update) const;
    void onEvent(const TCAP_RealTime_Request request) const;
    void onValue(const TCAP_String name, const TCAP_Variant value) const;

private:
    mutable std::ostringstream _buffer;
};

}

The most interesting onEvent functions are:

onEvent(const TCAP_RealTime_Connect connect)

Called when a connection to the RealTimeManager is established.

onEvent(const TCAP_RealTime_SubscribeAck subscribe_ack)

Confirms a subscription. subscribe_ack contains information about the subject that has been subscribed to and a full image of the fields and their values.

onEvent(const TCAP_RealTime_Update update)

Called for every update message. update contains information about what subject has been updated and a string containing the fields with their updated values. The message contains only those fields that have changed since the last update, a so called delta update.

To save bandwidth and processing power, TickCapture uses delta messages to update subscriptions. A full image is only sent when the subscription is made.

To actually initiate the connection to RealTimeManager, you create a Controller and pass options to it.

TCAP_RealTime_Dispatcher_Options_t options = {
    TCAP_LOGGER_DEFAULT, // logging options
    "example.com",       // RealTimeManager host
    4004                 // RealTimeManager port
};

Controller controller(&options); // create instance
controller.dispatch(controller); // dispatch

Message format

The subscription confirmations that contain the full images and the update messages containing the updated fields have the same structure, shown below. The only difference is that the full image contains all fields and their values and the update messages only contain the fields that have changed since the previous message. The update messages are the main type of messages routed through the system.

Variable subject contains the subject name and message contains the fields and their values.

SUBJECT BOND1

MESSAGE

SEQNO=1096199

TIMESTAMP=10:03:29.881

BIDSIZE=76

ASK=97.82

An example of the contents of the message variable in a subscription confirmation can be seen below. The fields are in the format field=value:

SEQNO=1074228 TIMESTAMP=28-MAY-2010 10:01:06.287 INSTRUMENT=12345 \

CLASS=CLASS TYPE=TYPE NAME=BOND1 DESCRIPTION=Example feed 1 \

CURRENCY=XXX FIRST_TRADE= LAST_TRADE= TICKSIZE=0.005 TICKVALUE=0 \

BID=116.555 BIDSIZE=4 ASK=116.56 ASKSIZE=38 LAST=116.555 LASTVOL=1 \

TIME=11:01:03.000 VOLUME=107815

Below is a sequence of update messages. Only the content of message is shown.

BOND1: SEQNO=1096199 TIMESTAMP=10:03:29.881 BIDSIZE=76

BOND1: SEQNO=1096200 TIMESTAMP=10:03:29.881 ASK=97.82

BOND1: SEQNO=1096202 TIMESTAMP=10:03:29.912 LASTVOL=1 VOLUME=50786

BOND1: SEQNO=1096204 TIMESTAMP=10:03:29.943 BID=124.375 BIDSIZE=64

BOND1: SEQNO=1096212 TIMESTAMP=10:03:29.990 ASK=1.277 ASKSIZE=1

BOND1: SEQNO=1096215 TIMESTAMP=10:03:30.053 ASKSIZE=3


Parsing messages

The Boost::Regex library provides functions for manipulating and matching strings with regular expressions. I am using it exclusively to process the messages received from TickCapture. Regular expressions provide an easy way to match strings and fields where field lengths and content are dynamic.

A regular expression is a special string used for string matching. For example, the regular expression used to match each field in a subscription confirmation message or an update message looks like this:

regexField = "([A-Z_]+)=([^=]+)(?:$| )"

Breaking up the regular expression and analyzing each individual part:

([A-Z_]+)=

This matches the field name, which is everything before the = character. The name can consist of the letters A-Z and the underscore (_) character.

([^=]+)

The ^= enclosed in brackets means that this matches any character except a =. The + sign after the bracket expression means that it must match at least one character.

(?:$| )

Finally, the last part defines the end of a field. The expression matches either the end of the string ($) or a space. The ?: marks the parentheses as a non-capturing group, so they only group the alternatives without storing what was matched, and the | character separates the alternatives.

Any fields not matching the regular expression are simply ignored. Partial matching is not permitted; the expression must match as a whole. This means that incomplete matches do not get processed and cannot cause problems.

Extracting all fields from a string is done simply by looping the regular expression over the string until it does not return a match.
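The loop over the string can be sketched like this. Note that the sketch uses std::regex from the C++ standard library as a stand-in for Boost::Regex, and that the function name parseFields is hypothetical; only the regular expression itself is taken from the text above.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns all field=value pairs found in a message string, in order.
// Group 1 is the field name and group 2 the value.
std::vector<std::pair<std::string, std::string>>
parseFields(const std::string& message) {
    static const std::regex fieldPattern("([A-Z_]+)=([^=]+)(?:$| )");
    std::vector<std::pair<std::string, std::string>> fields;
    // The iterator repeats the search from the end of the previous match
    // until no further match is found.
    for (std::sregex_iterator it(message.begin(), message.end(), fieldPattern), end;
         it != end; ++it) {
        fields.emplace_back((*it)[1].str(), (*it)[2].str());
    }
    return fields;
}
```

Because the value part is greedy, the engine first runs up to the next = and then backtracks to the last space before the following field name. This is what lets a value such as "Example feed 1" keep its internal spaces.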

3.3.3 Proof of concept

I decided to implement a very basic system to show that the integration between browser, websocket server and TickCapture worked as intended. The basic implementation would include a very simple client written in HTML/JavaScript extended by jQuery and a basic websocket server that allows clients to connect and pushes messages from TickCapture directly to the clients without processing.

Simple websocket server

The very basic websocket server would receive update messages from TickCapture and then just push them on to any connected clients. The first implementation of the server processed the messages sequentially and no threading of work was introduced at this stage.

Simple client

The client would do all the parsing and create a dynamic table with one row for each subject and update the fields in the table as the messages arrive. jQuery makes manipulation of DOM elements very easy. Let us examine how the content of elements can be altered, e.g. if you want to change the value of a cell in a table. Consider the following simple table which consists of two rows with three columns.

<div id="TABLE1" class="table">

<div id="BOND1" class="tablerow">

<div id="ID" class="tablecol">BOND1</div>

<div id="LAST" class="tablecol">38.5</div>

<div id="VOLUME" class="tablecol">234345</div>

</div>

<div id="BOND2" class="tablerow">

<div id="ID" class="tablecol">BOND2</div>

<div id="LAST" class="tablecol">222</div>

<div id="VOLUME" class="tablecol">8813</div>

</div>

</div>

If we want to change the VOLUME of BOND2 from 8813 to 9000 in the second row we can update the value of the element like this:

$("#TABLE1 #BOND2 #VOLUME").html("9000");

To address a div you must include all the parent elements by their id. The value passed to html() can be anything you like, even HTML code containing new elements. If the value is

<div id="VAL1">0</div><div id="VAL2">1</div>

the new row will look like this:

...

<div id="VOLUME" class="tablecol">

<div id="VAL1">0</div>

<div id="VAL2">1</div>

</div>

...

Any future referencing of the new element VAL1 is done by:

value1 = $("#TABLE1 #BOND2 #VOLUME #VAL1").html();

With these techniques I can create, update and remove table rows as I want. I do not have to create the rows in advance, nor do I have to discard bonds that do not yet have a row allocated in the table.

Simple system

Once the first implementation or proof of concept was up and running I was very happy to see that the system provided the functionality that was required for further development. I could now connect to the system through a web page using the Google Chrome browser, receive messages in real time from the websocket server and present them in a table on the web page. The connection would persist until I chose to abort it.


3.3.4 Parallelization

The first simple websocket server processed messages serially. Since this version did no actual parsing of the messages, the processing was fast and could very well have stayed serial. Introducing client states requires parsing of messages and storing current images of the subjects. This part of the code would have to be parallelized.

JavaScript does not support threading. JavaScript supports delayed execution and asynchronous callbacks but these are all run in the same thread and will halt all other execution until the function has completed. This means that I could do nothing about the client in terms of distributing load.

My focus was finding tasks in the websocket server that could be executed in parallel. Figure 3.2 shows the serial flow of the websocket server.

[Figure 3.2 diagram: inside the websocket server, messages are received from the TickCapture RealTimeManager, processed and then sent to clients 1, 2 and 3, one step after the other.]

Figure 3.2: The serial flow of the Websocket server.

My first thought was to parallelize the processing and parsing of the messages. I would have a message queue with incoming messages and then have a pool of threads processing these messages as in Figure 3.3.

[Figure 3.3 diagram: messages received from the TickCapture RealTimeManager are placed in a message queue, processed by a pool of threads (Thread1, Thread2, Thread3) and then sent to clients 1, 2 and 3.]

Figure 3.3: Parallelizing the processing of messages.

The messages from TickCapture are serially dependent as each update message only contains the fields and values that have changed since the last message was sent. This introduces a serious problem. How do you parallelize something that is inherently sequential? Consider the following messages:



Let two threads process messages. Thread 1 processes the first message and thread 2 the second message. After processing a message, the current image (list of fields and their values) for BOND1 in the websocket server must be updated and then the update message is forwarded to the subscribing clients.

If thread 1 processes the message, updates the image and sends the message to the clients before thread 2, everything is fine. But if thread 2 completes before thread 1, we have a problem. There will be inconsistency in the current image. The correct value of the field BIDSIZE is 57 but instead it will read 64, which is incorrect. This scenario is very much possible without synchronization between the threads since threads can have different processing times.

This can be remedied by synchronization: each thread is blocked from updating the image and the clients until the SEQNO (message sequence number) of its message is the lowest number being processed. This introduces additional complexity and the requirement for synchronization. For each message, the thread would have to check and wait to see if it can proceed. With possibly hundreds of messages per second, the synchronization can become expensive.

Instead of parallelizing the processing of messages I decided to adopt a different approach. I identified the various tasks that the websocket server would have to perform on the messages and introduced queuing to avoid having the different stages doing unnecessary waiting. The fundamental tasks were receiving messages, processing messages and finally sending the messages to clients.

Receiving messages

The object interacting with RealTimeManager and receiving the messages is restricted to one thread. For each message arriving, an onEvent function is executed in a sequential manner. The function is not called for the next message until the function has returned for the first one. This means that I cannot do heavy processing in the function, as that would delay the execution for the next message. The message should be dispatched to the next stage as soon as possible. I chose to implement a queue that would receive all new messages. The onEvent function would only have to insert the new message into the queue and then return.

Processing messages

One of the requirements was that clients should be able to disconnect and reconnect to the system at any time but also throttle the bandwidth used to pass messages without compromising the consistency of the values presented in the browser. This means that some sort of internal client state will have to be stored in the websocket server to keep track of the latest values sent to the client.

A current image for each subscription must be kept. It is constantly updated by the websocket server as the update messages arrive from TickCapture. When new clients connect to the websocket server, this image is sent to the client to ensure that the client receives the most up to date values.

If a client is connected and receiving messages, update messages from TickCapture are passed on to the client without modification. The initial full image sent allows the client to process the update messages in the same way as the server updates the internal image.

When a client is disconnected or messages are being throttled, data must be stored to allow a client to reconnect and still receive the current values for each subscription. Simply storing the messages in the outgoing client queue and then sending them all when the client reconnects is not realistic. The queue would quickly fill up and put the server under unnecessary load. Sending several thousand messages at once when the client reconnects can flood the connection and cause yet another disconnection.

A better approach is to merge messages belonging to the same subscription. Consider these messages:



These two messages can be joined into one:


For a client having 20 subscriptions, joining messages in this way would limit the outgoing message queue to a maximum of 20 messages, one for each subscription. The single message would be expanded by any new fields and existing fields would be updated. Should the following message arrive:

BOND1: SEQNO=1096206 TIMESTAMP=10:03:29.982 ASKSIZE=44

Then the previously merged message would become

BOND1: SEQNO=1096206 TIMESTAMP=10:03:29.982 BID=124.375 BIDSIZE=57 ASKSIZE=44

Each client has its own outgoing message queue. The processing thread inserts the processed messages into the client queues and merges the messages if the client is disconnected.
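The merging step can be sketched as below, assuming that each message has already been parsed into field/value pairs. The type names FieldMap and PendingQueue and the functions mergeUpdate and enqueueMerged are hypothetical and only illustrate the idea of keeping at most one pending message per subscription.

```cpp
#include <map>
#include <string>

// Fields of one message, keyed by field name.
using FieldMap = std::map<std::string, std::string>;

// Pending messages of a disconnected or throttled client, keyed by
// subject, so the queue never holds more than one entry per subscription.
using PendingQueue = std::map<std::string, FieldMap>;

// Merges a newer delta update into a pending one: new fields are added
// and existing fields (such as SEQNO and TIMESTAMP) are overwritten.
void mergeUpdate(FieldMap& pending, const FieldMap& update) {
    for (const auto& field : update) {
        pending[field.first] = field.second;
    }
}

// Adds an update for a subject to the client's pending queue, merging
// it with any message already waiting for that subject.
void enqueueMerged(PendingQueue& queue, const std::string& subject,
                   const FieldMap& update) {
    mergeUpdate(queue[subject], update);
}
```

Merging the ASKSIZE=44 update from the text into a pending BOND1 message holding BID=124.375 and BIDSIZE=57 leaves those two fields untouched, adds ASKSIZE and replaces SEQNO and TIMESTAMP with the newer values.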

Sending messages

When messages finally arrive in the client queue, the sending is performed by any number of threads. For a low number of clients, one thread per client would be optimal but as the number of clients grows, there would just be too many threads. Instead, a configured number of threads will process the list of clients and send messages found in their queues. Having several threads doing the sending has the benefit of minimizing the risk that a client can delay sending of messages to other clients by being slow. If one thread gets blocked, the other threads will continue servicing the other clients.

Concurrency and synchronization

With several message queues and threads concurrently accessing these queues, synchronization is needed to ensure consistency. Synchronization is expensive, especially with mutexes, so the number of synchronization points in the program should be kept to an absolute minimum. There are cases where you simply cannot avoid having a shared resource protected by a synchronization variable. One example is a message queue where one thread (the feeder) inserts new messages into the queue and another thread (the consumer) removes messages from the queue. With a mutex protecting the queue, it would be locked with every message insert and removal.

Sometimes a unique lock on a shared resource is not needed. If you have a single thread updating a shared variable and several threads only reading the value of that variable, it would be unnecessary for the reading threads to acquire a unique lock on the shared resource, effectively blocking the other reading threads from accessing it.

This is where the Boost::Thread shared_lock comes in handy. It defines a lock that allows several threads to access the resource. The shared lock can be upgraded to a unique lock by any thread, and only then will it acquire unique access to the shared resource. In the single writer/multiple reader scenario, the reader threads would use a shared lock to access the variable and every time the writer thread wants to modify the variable, it acquires a unique lock which then blocks all reading threads.

An example of this scenario is the list of clients connected to the websocket server. This list is protected by a mutex. Whenever a new client connects or is removed, a unique lock is acquired on the mutex. For all threads accessing the list to read information stored about a client, a shared lock is enough. As long as no information is modified, consistency is guaranteed.

In the case of a queue where items are inserted and removed, a unique lock is the only way to protect the shared queue. The normal implementation is to lock the queue each time a new element is inserted (push) and also lock it each time an element is removed (pop) from the queue. Figure 3.4 illustrates the basic implementation with one process inserting items and another process removing and processing items.

[Figure 3.4 diagram: the receiving step pushes messages onto a message queue and the processing step pops them; a queue mutex guards both operations.]

Figure 3.4: Queue protected by a mutex. A unique lock is acquired each time an item is inserted and removed from the queue.

It is easy to see that the number of locks required is the same as total insertions + total removals. For 1000 items moving through the queue, 2000 locks on the mutex are made.

I tried to come up with a way to minimize the number of locks and recognized the possibility to have two queues instead of one. The clients inserting and removing items would be completely oblivious to this fact by the use of pointers. They only see one queue. There would be one pointer pointing to the queue where items are to be inserted and another pointer pointing to the queue where items are removed and processed. The pointers would swap queues by a given algorithm but they would never point at the same queue. Figure 3.5 shows the internal state of the queue hidden from the outside.

Let P1 be a pointer to the queue where items are inserted and P2 a pointer to the queue where items are removed. Let Q1 and Q2 be two queues and M a mutex protecting the queues. Initially, P1 points to Q1 and P2 to Q2. The pseudo code for inserting an item:

SharedLock(M);

P1.Insert(element);

Unlock(M);

The mutex gets locked, the element is inserted and the lock is then freed. The removal of items from the queue is a bit more special:

if (P2.IsEmpty()) {
    UniqueLock(M);
    SwapPointers(P1, P2); // P1 now points to Q2 and P2 -> Q1.
    Unlock(M);
}

while (!P2.IsEmpty()) {
    element = P2.Pop();
    Process(element);
}

[Figure 3.5 diagram: two states of the double queue. The receiving step pushes through P1 and the processing step pops through P2. After a unique lock on the queue mutex, P1 and P2 have swapped between Message Queue 1 and Message Queue 2.]

Figure 3.5: Mutex protecting two internal queues. Two pointers available for queue access are swapped by an algorithm so that the pointers never point to the same queue.

If the queue pointed by P2 is empty, the thread acquires a unique lock on M, swapspointers P1 and P2 so that P2 now points to the queue previously used to insert elements(Q1) and P1 now points to the empty queue Q2. After the swap, elements in P2 (whichnow contains elements inserted by the other thread) are processed until the queue pointedby P2 is empty. The inserting thread is blocked by the unique lock on M and when the lockis released, P1 points to Q2.

Analyzing the code above, the worst case scenario occurs when the queue never containsmore than one element. The algorithm removing items will then perform a lock, swap thequeues and process the single element. Next time the algorithm runs, queues are swappedagain and a new lock is performed and so we have one lock per element removed.

The real benefit of this algorithm comes when the queues grow. Let T1 be the threadinserting elements into the queue pointed by P1 and T2 the thread processing the elementspointed by P2. Let T1 insert ten elements into P1 before T2 discovers that P2 is empty.T2 will lock the mutex, swap the queues so that P2 now contains ten elements. T2 will nowloop through the ten elements without locking the mutex. The number of locks required byT2 to process ten elements has now decreased from ten to one. The only lock required isfor the pointer swap.

T2 never has to lock the mutex when it accesses P2 since T1 is inserting elements into P1. There will not be any concurrency problems. It is important that the swapping of pointers is done in the reading thread to prevent any removal of elements from the queue while the swap is taking place. Figure 3.6 shows program flow with pointer swaps.

The scenario in Figure 3.6 is very common in my queues, especially in the queue storing the messages as they arrive from the RealTimeManager. Receiving and inserting messages into the queue is generally much faster than removing and processing them. This means that the queues can grow beyond a single element and my double queue implementation should have a positive impact on the total number of locks required.

    Insertion thread T1          Processing thread T2
    SharedLock(M);
    P1.Insert(element);
    P1.Insert(element);
    P1.Insert(element);
    Unlock(M);
                                 UniqueLock(M);
                                 SwapPointers(P1, P2);
                                 Unlock(M);
    SharedLock(M);               element = P2.Pop();
    P1.Insert(element);          Process(element);
    P1.Insert(element);          element = P2.Pop();
    Unlock(M);                   Process(element);
                                 element = P2.Pop();
                                 Process(element);
                                 ...

Figure 3.6: Thread 1 inserts three elements before Thread 2 is available to process them. After the pointer swap, T1 can continue inserting elements and T2 can process the previously inserted elements without locking the mutex.
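The double queue described in this section can be sketched in C++ as follows. This is a minimal illustration rather than the code used in the websocket server: the class name DoubleQueue, the method names and the swap counter are hypothetical, a plain std::mutex stands in for the shared/unique lock pair, and exactly one inserting thread and one processing thread are assumed.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical sketch of the two-queue scheme. Assumes one inserting thread
// (T1) and one processing thread (T2).
template <typename T>
class DoubleQueue {
public:
    // T1: lock, insert into the queue behind P1, unlock.
    void Push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        write_->push(std::move(value));
    }

    // T2: swap the pointers under the lock only when the read queue is
    // empty, then process every element without touching the mutex.
    // Returns the number of elements processed.
    template <typename Fn>
    std::size_t Drain(Fn process) {
        if (read_->empty()) {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(write_, read_);
            ++swaps_;
        }
        std::size_t processed = 0;
        while (!read_->empty()) {
            process(read_->front());
            read_->pop();
            ++processed;
        }
        return processed;
    }

    // Number of locks taken by the processing thread so far.
    std::size_t Swaps() const { return swaps_; }

private:
    std::mutex mutex_;
    std::queue<T> q1_, q2_;
    std::queue<T>* write_ = &q1_;  // P1
    std::queue<T>* read_ = &q2_;   // P2
    std::size_t swaps_ = 0;
};
```

With ten elements inserted before the first call to Drain, the processing side takes the lock once, which matches the ten-to-one reduction described above.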

3.3.5 Threads in the system

As the messages passing through the system are serially dependent I decided to focus on splitting up the various tasks into threads in an effort to make processing as efficient as possible. There are several threads running in the system and I will try to describe what they do.

TickCapture Handler

Initiates the RealTime object. Every message arriving from TickCapture is handled by this thread. It simply adds every received message into a TickCaptureMessageQueue.

TickCaptureMessageQueue Handler

Reads any messages from the TickCaptureMessageQueue and processes them. Depending on the message type, the processing includes:

– Subscribe and unsubscribe messages - Subscribes and unsubscribes to feeds.

– Update messages - Parses the message, updates the current subscription image and adds the message to the ServerToClientMessageQueue.

– Latency test messages - Performs timing to measure the internal queuing time.

ServerToClientMessageQueue Handler

Reads messages from the ServerToClientMessageQueue. Depending on the client destination it adds the message to the corresponding client message queue. If the message is a broadcast message, the message is added to all clients. If a client is marked as throttled or disconnected, this thread is responsible for merging messages in the client message queue until the client can receive messages again.

ClientMessageQueue Handler

This can be any number of threads. These threads scan the clients for any outgoing messages in their queues and then process the outgoing messages by dispatching them.

ClientListener

Main thread for accepting client connections. It listens for connections to a socket and for each connecting client, a ClientHandler thread is spawned to handle the client.

ClientHandler

Handles new clients. Every connecting client is assigned a thread which takes care of handshaking and establishing a websocket connection between client and server. This thread also reads incoming messages from the clients.

ClientThrottler

Monitors the current bandwidth utilized by every client. If a client has requested throttling and the current bandwidth used by the client is higher than the requested bandwidth, the client is marked as throttled and no update messages can be sent until the ClientThrottler unmarks the client.

ProxyHandler

Handles all communication with the load balancing proxy. The proxy will send cookie information for new clients being redirected to this websocket server and the websocket server has to store these cookies to authenticate the clients when they connect.

ResourceMonitor

Monitors all measurable aspects of the websocket server. It measures bandwidth, CPU and RAM usage, the number of clients connected and the internal latency. This information is sent to the load balancer on a regular basis to ensure that the load balancer can make correct client redirections depending on the current websocket load.

Clean up thread

This thread monitors the client list and removes expired clients. Clients are considered expired when they have been disconnected for a period of time. At this point, the expired client and all information tied to it is removed. This means that the client can no longer reconnect to the websocket and authenticate using the previous cookie. It has to reconnect to the load balancer and receive new connection details.


3.3.6 Message route

There are several stages and queues in the message processing and the general schema for a TickCapture update message is presented in figure 3.7. The processing has several steps:

1. Message arrives from TickCapture. It is immediately inserted into the TickCaptureMessageQueue. No processing is done in this stage.

2. The message is removed from the TickCaptureMessageQueue. If it is a subscribe or unsubscribe message from TickCapture, the feed is subscribed to or unsubscribed from. The message is then deleted and not processed further. If the message is an update message, it is parsed and the subscription image that this message belongs to is updated. The message is then added to the ServerToClientMessageQueue.

3. The message is removed from the ServerToClientMessageQueue. If the message is a broadcast message, it is duplicated so that each client gets one copy. The message is then inserted into the outgoing message queue of each selected client. If the client is disconnected or throttled, the message is merged with existing messages for that subscription.

4. The ClientMessageQueueHandler threads process messages found in the client outgoing message queues and dispatch them to the clients. If sending is successful, the message is deleted. If the sending fails, the client is marked as disconnected and the message is not deleted.

[Flowchart: the TickCaptureHandler pushes messages to the TickCaptureMessageQueue; the TickCaptureMessageQueue Handler pops them and pushes to the ServerToClientMessageQueue; the ServerToClientMessageQueue Handler pops each message, duplicates it if it is a broadcast message and, per client, either inserts it into the client message queue or merges it if the client is disconnected or throttled; the ClientMessageQueue Handlers pop and send.]

Figure 3.7: A message from TickCapture is routed through the websocket server to the client.


3.3.7 Messages

The messages routed through the websocket server are limited to a few in this implementation. The message format allows the system to be extended with new types if needed.

Format

All messages sent between the websocket server and client are encapsulated in tags and are defined as:

<MSG TYPE="type">contents</MSG>

Type defines the kind of message and contents is the payload of the message. A complete list of messages can be found in Appendix A (Message Types).
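As an illustration of the format, the following C++ sketch wraps and unwraps a message. It is not taken from the websocket server: the names Message, Encode and Decode are hypothetical, and the sketch assumes that the type never contains a double quote and that each frame holds exactly one message.

```cpp
#include <cstddef>
#include <string>

// Hypothetical in-memory form of one message.
struct Message {
    std::string type;
    std::string contents;
};

// Wraps a message in the <MSG TYPE="type">contents</MSG> envelope.
std::string Encode(const Message& m) {
    return "<MSG TYPE=\"" + m.type + "\">" + m.contents + "</MSG>";
}

// Unwraps one frame into *out. Returns false if the frame does not follow
// the envelope format; *out is left untouched in that case.
bool Decode(const std::string& frame, Message* out) {
    const std::string open = "<MSG TYPE=\"";
    const std::string close = "</MSG>";
    if (frame.size() < open.size() + close.size()) return false;
    if (frame.compare(0, open.size(), open) != 0) return false;
    if (frame.compare(frame.size() - close.size(), close.size(), close) != 0)
        return false;

    // The type ends at the first "> after the opening tag prefix.
    const std::size_t type_end = frame.find("\">", open.size());
    if (type_end == std::string::npos) return false;

    const std::size_t body_begin = type_end + 2;
    const std::size_t body_end = frame.size() - close.size();
    if (body_begin > body_end) return false;

    out->type = frame.substr(open.size(), type_end - open.size());
    out->contents = frame.substr(body_begin, body_end - body_begin);
    return true;
}
```

Because the type is carried as an attribute, a receiver can dispatch on it without inspecting the contents, which is what makes adding new message types straightforward.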

3.3.8 Scaling out

Threading the application is a great way to use the resources in a single machine. Another way is to allow multiple instances of the application to run on the same server or on multiple physical servers.

The websocket server can easily be run as multiple instances. Connection to RealTimeManager can be done through named pipes or through sockets. The socket makes the system remotely available. Each websocket server runs independently and has no real knowledge of any other websocket servers also running.

The previous discussion about scaling out concluded that there must be some point of entry to the system for the connecting clients. It is not realistic that each client should have knowledge of every websocket server in the system. I decided to implement a load balancer or proxy which will provide a system abstraction to the clients.

The load balancer

Having already implemented the websocket server with successful handshaking and data framing I decided to reuse that code for the proxy. I also chose to have all client authentication in the proxy. The proxy decides whether a client should be allowed to connect to the websocket servers or not. However, I still had to implement some sort of basic authentication in the websocket server to stop clients from bypassing the proxy and connecting directly to a known websocket server. The procedure for a client connecting to the system can be seen in figure 2.10. The connection procedure is as follows:

1. Client connects to Proxy.

(a) Client establishes connection to proxy by websocket protocol.

(b) Client sends authentication information (user and password) to the proxy.

(c) Proxy authenticates client.

i. If authentication fails, disconnect client immediately.

(d) Proxy must now allocate resources for client.

i. If there are no available websocket servers, fail the allocation immediately.

ii. Using known performance data provided by the websocket servers connected to the proxy, select the least loaded websocket server.

iii. Generate a unique cookie. Currently a 64 byte string containing random characters.

iv. Send the generated cookie to the selected websocket server.

v. Wait for a confirmation from the websocket server. This confirms that the cookie has been accepted and that the websocket server is ready to accept a connection from the client.

A. If no confirmation is received within the timeout period, perform a new websocket server selection and return to step iv.

(e) Proxy sends the websocket connection information to the client. This information contains the host and port where the websocket is located but also the cookie that must be used for authentication when connecting to the websocket server.

(f) Proxy closes the client connection.

2. Client connects to websocket server.

(a) Client establishes a connection to the websocket server by websocket protocol.

(b) Client sends the cookie information to the websocket server.

i. If the cookie does not exist in the websocket server, disconnect the client immediately.

(c) Websocket server starts sending the feed to the client.

The client is allowed to disconnect and then reconnect within a given time period. The websocket server stores all client information before discarding it when the selected time period has elapsed. A client wishing to reconnect has to follow this procedure (figure 2.11):

1. Client establishes connection to the websocket server that it was disconnected from.

2. Client sends the cookie information to the websocket server.

3. Websocket server authenticates client.

(a) If the cookie exists in the websocket server and the client associated with the cookie has been marked as disconnected, the client is allowed to connect to this websocket server.

(b) If the cookie does not exist or if the client associated with the cookie is marked as connected, disconnect the client immediately. This prevents reconnecting clients from using cookies already associated with other clients.

4. If authentication was successful, the websocket server sends the queued up messagesto the client and resumes normal operation.

5. If the client gets disconnected, the previously allocated resources have expired and the client has to connect through the proxy to be assigned to a new websocket server.
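The cookie handling in the two procedures above can be sketched as follows. This is an illustrative C++ sketch, not the server's actual code: GenerateCookie, CookieRegistry and ClientState are hypothetical names, the alphabet is an assumption, and std::random_device is used only for brevity since it is not guaranteed to be cryptographically strong. The sketch is single-threaded; the real server would need to protect the registry with a lock.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <string>

// Generates a cookie of `length` random alphanumeric characters.
// Assumption: a production system would use a cryptographic generator.
std::string GenerateCookie(std::size_t length = 64) {
    static const char kAlphabet[] =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    std::random_device rd;
    std::uniform_int_distribution<unsigned int> pick(
        0, static_cast<unsigned int>(sizeof(kAlphabet) - 2));
    std::string cookie;
    cookie.reserve(length);
    for (std::size_t i = 0; i < length; ++i) cookie.push_back(kAlphabet[pick(rd)]);
    return cookie;
}

enum class ClientState { Pending, Connected, Disconnected };

// Hypothetical cookie store kept by one websocket server.
class CookieRegistry {
public:
    // The proxy announced a cookie for a client that is about to connect.
    void Register(const std::string& cookie) {
        clients_[cookie] = ClientState::Pending;
    }

    // First connection or reconnect: the cookie must be known and must not
    // belong to a client that is currently connected.
    bool Authenticate(const std::string& cookie) {
        std::map<std::string, ClientState>::iterator it = clients_.find(cookie);
        if (it == clients_.end() || it->second == ClientState::Connected)
            return false;
        it->second = ClientState::Connected;
        return true;
    }

    void MarkDisconnected(const std::string& cookie) {
        std::map<std::string, ClientState>::iterator it = clients_.find(cookie);
        if (it != clients_.end()) it->second = ClientState::Disconnected;
    }

    // Called by the clean up thread once the reconnect period has elapsed.
    void Expire(const std::string& cookie) { clients_.erase(cookie); }

private:
    std::map<std::string, ClientState> clients_;
};
```

Rejecting a cookie whose client is still marked as connected is what stops a second client from reusing a cookie that is already in use.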

Server selection

A simple load balancer would assign clients to the websocket servers in a simple round robin fashion. I have implemented a resource monitor in the websocket servers that collects performance data. The resource monitor samples the CPU usage, RAM usage, bandwidth and connected clients on a regular basis. This information is then sent to the proxy through the persistent connection at a given interval.

The proxy receives the resource updates from all connected websocket servers and maintains a database of the current values for each websocket server. When a client connects to the proxy requesting redirection to a websocket server, a websocket server selection is performed.

Designing an algorithm that makes a correct decision based on the performance indicators gathered from the websocket servers is not trivial. Appropriate weighting of the variables is essential. An example formula for calculating the load W of a websocket server, used to select the least loaded one, is:

$$W = \left(\frac{C_{connected}}{C_{max}}\right)^{2} + (CPU_{total})^{4} + (CPU_{process})^{2} + (RAM_{total})^{2} + (BW_{total})^{2} \tag{3.1}$$

where $C_{connected}$ is the number of clients connected to the websocket server and $C_{disconnected}$ the number of clients marked as disconnected. $CPU_{total}$ is the CPU usage for the entire system and $CPU_{process}$ for the websocket process. $RAM_{total}$ represents the current memory utilization for the entire system and $BW_{total}$ is the bandwidth used by the websocket server.

All the variables except the number of clients connected and disconnected are values between 0 and 1 where 1 represents 100% resource usage. The exponent symbolizes the weight. The higher the exponent, the lower the weight. If you compare the weights for CPU usage you see that the CPU usage for the websocket process is more important than the total CPU usage for the system. This is motivated by the fact that many of the servers that the system is tested on are running computations in the background which consume a lot of CPU. These computations often run with low priority so "stealing" CPU time for the websocket server is not a problem. This is why the CPU usage for the websocket process is of more interest. To illustrate the weighting, consider 90% CPU usage reported by the entire system compared to the same CPU usage reported by the websocket process.

$$CPU_{total}(0.9) = 0.9^{4} = 0.6561 \tag{3.2}$$

$$CPU_{process}(0.9) = 0.9^{2} = 0.81 \tag{3.3}$$

Raising the variables to a power has the characteristic that a variable contributes little while it is small and increasingly more as it approaches 1. The closer a variable is to 1 (100%), the bigger the impact it has on the total load formula. This is why a system with 20% CPU load and 90% bandwidth allocation

$$CPU_{process}(0.2) + BW_{total}(0.9) = 0.2^{2} + 0.9^{2} = 0.85 \tag{3.4}$$

is rated as having higher load than a system with 60% CPU load and 60% bandwidth allocation:

$$CPU_{process}(0.6) + BW_{total}(0.6) = 0.6^{2} + 0.6^{2} = 0.72 \tag{3.5}$$

If any of the variables in the equation are close to their max, they will have a bigger impact on the total load value. A system that has exhausted 90% of its bandwidth but only has a 20% CPU load should be considered to be heavily loaded.

The tests will show if the websocket server selection algorithm works well.
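The example formula (3.1) and the selection of the least loaded server can be sketched in C++ as follows. The struct ServerStats and the functions LoadScore and SelectLeastLoaded are hypothetical names introduced for this illustration, and the interpretation of C_max as a configured maximum number of clients is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Resource snapshot for one websocket server. All fields except the client
// counts are fractions between 0 and 1 (1 = 100% usage).
struct ServerStats {
    double clients_connected;
    double clients_max;  // assumed meaning of C_max; must be greater than 0
    double cpu_total;
    double cpu_process;
    double ram_total;
    double bw_total;
};

static double Square(double x) { return x * x; }

// Load value W according to formula (3.1). Lower means less loaded.
double LoadScore(const ServerStats& s) {
    return Square(s.clients_connected / s.clients_max)
         + Square(Square(s.cpu_total))  // fourth power: lowest weight
         + Square(s.cpu_process)
         + Square(s.ram_total)
         + Square(s.bw_total);
}

// Returns the index of the server with the lowest load value,
// or -1 if no websocket server is available.
int SelectLeastLoaded(const std::vector<ServerStats>& servers) {
    int best = -1;
    double best_score = 0.0;
    for (std::size_t i = 0; i < servers.size(); ++i) {
        const double score = LoadScore(servers[i]);
        if (best < 0 || score < best_score) {
            best = static_cast<int>(i);
            best_score = score;
        }
    }
    return best;
}
```

With all other variables at zero, a server at 20% process CPU and 90% bandwidth scores about 0.85 and one at 60% and 60% scores about 0.72, so the second server is selected, in line with equations (3.4) and (3.5).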


Running multiple instances

Configuring multiple instances of the websocket server is relatively easy. The server will automatically detect the IP address and port that it is listening on and if the ClientPort is set to 0 in the websocket configuration, the server will automatically select a free port to listen on. This means that you can start several websocket servers with the same configuration file. Each websocket server will choose a different port to listen on.

The most important setting is the IP address and port of the load balancing proxy. The ProxyPort in the websocket configuration file (figure 3.8) must match the WebSocketPort setting in the proxy configuration file (figure 3.9). Each websocket server establishes and maintains a connection to the proxy using this setting. Whenever this connection fails, the proxy will mark the websocket server as unavailable.

<Config>
    <ClientPort>0</ClientPort>
    <NumServerThreads>2</NumServerThreads>
    <ProxyHost>localhost</ProxyHost>
    <ProxyPort>51000</ProxyPort>
    <ResourceMonitorInterval>2000</ResourceMonitorInterval>
    <ResourceMonitorPeriod>5000</ResourceMonitorPeriod>
    <MaximumBandwidthKBPS>10000</MaximumBandwidthKBPS>
</Config>

Figure 3.8: An example configuration file for the websocket server.

<Config>
    <ClientPort>50000</ClientPort>
    <WebSocketPort>51000</WebSocketPort>
</Config>

Figure 3.9: An example configuration file for the proxy.

Future versions might have a backup proxy configured in the websocket server but for the time being, replacing a failed proxy is left to network logic. Whenever the connection to the proxy fails, the websocket server will keep trying to reconnect indefinitely.

It would also be possible to run several proxies. However, only one proxy server can be assigned to each websocket server at any time. An example system could have websocket servers 1-10 assigned to proxy 1 and websocket servers 11-20 to proxy 2. These systems will be totally independent of each other. Merging them would require the websocket servers to connect to both proxies so that all 20 servers are registered in the two proxies.

3.3.9 Client

The client was implemented using HTML and JavaScript extended with the jQuery library. This means that any browser supporting these languages and the Websocket protocol can act as a client. Currently, popular browsers such as Internet Explorer and Mozilla Firefox do not have support for websockets. Firefox is scheduled to have support by version 4.0 (current version is 3.6) but I have no information on Internet Explorer.


All development and testing has been done with Google Chrome which has had support since version 4.0.249.0 [1]. As the websocket protocol is not final, the current implementation of my websocket server could possibly break with a future version of the protocol. The latest versions that have been verified to work are:

– Google Chrome [11]: v5.0.375.55.

– The Websocket API [12]: 1 June 2010.

– The Websocket protocol [13]: 23 May 2010.

I have described how to establish a websocket connection and how to manipulate DOM elements in a previous chapter. The client is very simple. The action is driven by the callback functions registered to the websocket connection and a main program timer. The onMessage callback registered with the websocket connection is called whenever a message from the websocket connection arrives. The callback performs the following tasks:

1. Parse message using regular expressions.

(a) Update message: Update the appropriate row in the subscription table with new values and flash the fields that have changed.

(b) Resource message: Contains the resource usage variables for the current websocket server.

(c) Throttle message: Notify the user that throttling has been set by the websocket server.

(d) Pong message: Measure the round trip time client - websocket server - client.

The callback performs all the asynchronous tasks. The main timer provides synchronous updating of the client. The timer performs the following tasks at a given interval.

1. Measure round trip. Sends a PING message to the websocket server containing a timestamp which is then returned as a PONG message by the server to allow the client to measure the time elapsed.

2. Update resource fields. Prints the current resource usage variables.

3. Update graph. Displays a graph with the last values for each subscription.

User interaction is possible by defining a proxy host and port other than the default values to connect to. When the websocket connection has been established, a button for disconnecting/connecting to the websocket server is added. The client also has the option to set bandwidth throttling and disable the graph.

The client interface is separated into various fields as shown in figure 3.10.

Server stats

Resource usage values are displayed here.

– CPU Usage Total - Total CPU usage in percent for the system that the websocket server is running on.

– CPU Usage Process - CPU Usage in percent for the websocket server application.

– RAM Usage Total - Total RAM usage in percent for the system that the websocket server is running on.

– Roundtrip latency - The time it takes for a PING message to be sent from the client and returned by the server. In milliseconds.

– Internal latency - The queuing time for messages in the websocket server. Measured in milliseconds as the time from when a message arrives from TickCapture until it has been processed and is ready to be sent to the client.

– Clients connected - The total number of clients connected to the websocket server.

– Message throughput - The total number of messages per second sent by the websocket server to all connected clients.

– Bandwidth usage - The total bandwidth used to send messages to all connected clients. Measured in KByte/s.

– Bandwidth usage - The total bandwidth used expressed in percent. The maximum available bandwidth for the websocket server is set in the websocket server configuration file.

Figure 3.10: The client interface.


Graph

The graph displays the evolution of the subscriptions and is updated every second. The user has the option to disable the graph to reduce load on the client system.

Legend

The legend is simply a description of what subscriptions are displayed in the graph. Future versions of the client could allow fields to be positioned any way the user wants, which is one reason for the legend being placed in a separate field.

Tick table

Contains all the subscriptions provided by the websocket server. As messages for new subscriptions are pushed from the server, they are automatically added to the table. When update messages arrive, the row containing the target of the update is updated and the fields that have changed are flashed. A green flash indicates that the value has increased and a red flash means the value has decreased.

Control

The client can define a load balancer (proxy) to connect to. A default proxy is provided. The WS host field shows the current websocket server allocated and once the connection has been established, the client can use a button to disconnect and connect to the websocket server. The Throttle field allows the client to throttle the bandwidth. Values are entered in KByte/s. The server will confirm throttling by sending a message which is then displayed in the Status field.

Status

A simple log which displays the common events and any messages from the server. Major failures are colored in red.

Chapter 4

Results

To verify that the system works as expected and to test the performance, the proxy was set up in such a way that the websocket servers would send performance data every second to the proxy. The proxy would then output this data into an XML formatted file which is easily imported into Microsoft Excel where diagrams were generated.

4.1 Testing platform

The system is spread out over a number of servers running Debian GNU/Linux 5.0. Each server has one Intel Core 2 Quad 2.5 GHz CPU and 4 GB RAM. The network connection is 100 Mbit Ethernet.

4.2 Testing setup

The setup consists of a number of servers running different parts of the system.

– One server running the TickCapture system. Necessary components enabled for the tests: QuoteManager, Replay and RealTimeManager.

– Four servers running one instance of the websocket server each.

– One server acting as proxy (load balancer).

– Four servers running the test client application. Each spawns any number of clients that connect to the system.

All in all, 10 servers are utilized for the tests. They are all running the same hardware configuration and are located on the same subnet.

4.3 Tests

A number of tests were performed and performance data was collected.


4.3.1 Scaling of a single websocket server

One websocket server is set up together with a proxy. A test client will then spawn virtual clients that connect to the system with a delay of about 0.5 seconds. The clients will stay connected for 150 seconds after which they will disconnect and the test will end. Data is collected and I will primarily measure CPU utilization throughout the test. Testing will be done with 50, 100, 200 and 400 clients.

[Plot: number of clients (left axis) and CPU % (right axis) against time in seconds. Series: clients, disconnected, cpu usage process, cpu average.]

Figure 4.1: 50 clients connecting to a single websocket server.

The CPU utilization in figure 4.5 for the three tests below shows that the average utilization scales more or less linearly with the number of clients up to 200 clients. Beyond that point the pattern breaks and the average CPU utilization actually decreases when using 400 clients. Why this is so will be discussed later on.

The clients stay connected for 150 seconds. In the graph, the CPU utilization rises sharply when the clients start disconnecting. This is due to the merging of messages that takes place for each disconnected client that I have described earlier. The process of merging messages is quite CPU intensive and the tests show an extreme case where many clients disconnect from the server within a limited time period.

4.3.2 Scaling and load balancing with four websocket servers

Three websocket servers are added to the system for a total of four. Connections to the websocket servers are managed by a proxy. Four test clients will spawn 50 connections each that connect to the system with a delay of about 0.5 seconds. The clients will stay connected for 150 seconds after which they will disconnect and the test will end.

[Three plots with the same layout as figure 4.1: number of clients (left axis) and CPU % (right axis) against time in seconds, with client axes reaching 120, 250 and 450. Series: clients, disconnected, cpu usage process.]

    Clients    CPU average    CPU max
    50         4.08%          20%
    100        7.14%          24%
    200        15.62%         54%
    400        12.02%         29%

Figure 4.5: CPU utilization.

Figure 4.6 shows the results from the test. Each websocket server has been configured with the same parameters so the websocket server selection algorithm should distribute the clients very evenly. Reading the results from the test, the number of clients per websocket server is very similar and the CPU utilization is therefore also very similar for the four servers.

Again, the CPU utilization rises sharply as the clients disconnect at the end of the test.

What happens when we configure a single websocket server with a lower maximum available bandwidth? The configuration parameter MaximumBandwidthKBPS is set to half of that of the three other websocket servers. The setup is similar to the previous test with the difference that clients will now randomly disconnect and reconnect to the system. If a client fails to reconnect to the websocket server within a given time period, the associated cookie will be cleared from the websocket server and the reconnection will fail. The client will then automatically connect to the proxy instead and be assigned a new websocket server according to the current load of each server. This is why the number of clients per websocket server varies over time.

Figure 4.7 clearly shows that websocket server 4 gets allocated about half the number of clients compared to the other three servers that have more available bandwidth.

[Plot: number of clients (left axis) and CPU % (right axis) against time in seconds. Series: clients 1 to 4 and cpu usage process 1 to 4.]

Figure 4.6: 200 clients connecting to four websocket servers.

Configuring all websocket servers with different available bandwidth should result in a distribution of clients according to this setting. Websocket server 1 has a bandwidth of 1000 KB/s, websocket server 2 has 750 KB/s, websocket server 3 has 500 KB/s and finally websocket server 4 has been assigned 250 KB/s. Figure 4.8 displays the distribution of clients with this configuration. The websocket server selection algorithm seems to perform very well.


[Plot: number of clients against time in seconds. Series: clients 1 to 4 and disconnected 1 to 4.]

Figure 4.7: 200 clients connecting to four websocket servers where one is configured with half the available bandwidth.

[Plot: number of clients against time in seconds. Series: clients 1 to 4 and disconnected 1 to 4.]

Figure 4.8: 200 clients connecting to four websocket servers where each one is configured with different available bandwidth.

Chapter 5

Conclusions

The previous chapter reveals both positive and negative things. The positive is that the websocket server selection algorithm performs well and distributes the clients according to the server load. The negative is that going past 200 clients in the tests, the linear scaling with the number of clients fails. Obviously, there is some bottleneck in the system preventing maximum CPU utilization.

Monitoring the server CPU utilization in real time reveals that it very rarely goes past 25% on a quad core system, which corresponds to one CPU core being fully used. The sequential nature of message processing is a probable reason for this. The system can never be faster than the initial processing of the messages (which is sequential). The sending of the messages to the clients is not very CPU intensive. Only when a client is disconnected and the messages have to be merged in the outgoing client queue does it show in the CPU utilization. This is why we can see a clear increase in CPU utilization when clients begin to disconnect in the final sequence of the tests.

5.1 Analyzing performance

The current implementation uses std::map [2] for storing the fields and values for the parsed update messages arriving from TickCapture. Storing is done in a (key, value) format where both key and value are strings. Building a map is expensive because the map keeps its elements sorted by key, so every insertion has to find its position among the existing elements.

The cost of accessing elements in a map (which often is implemented as a balanced binary tree) is O(log N) where N is the number of elements in the map. However, using strings as keys adds complexity as the length L of the strings has to be considered when comparing the keys. The cost then becomes O(L) * O(log N) or O(L*log N).

Clearly, an implementation using integers as keys would yield better performance but it would require some sort of mapping of the key (the field name) to an integer value when parsing the messages and vice versa when sending the values to the clients in a string format.

Parsing of the messages and splitting up the string into a std::map is limited to one thread so this task will be restricted by the performance of a single CPU core. The messages must be kept in the order of arrival and using several threads for parsing would add synchronization complexity. Obviously, any slowdown in the parsing of messages will limit the throughput for the entire system so it is essential that messages are processed quickly.

Next, the messages are duplicated and sent to the outgoing queues of the clients. The messages contain the original string but also the map containing the fields and their values. Copying is done using the copy constructor and it is unclear what penalty the copying of a map introduces. This can potentially be a significant resource hog as both key and value in the map are stored as strings.

The final step in the processing is sending the messages to the clients. If the client isconnected, the message is simply sent and then deallocated. If the client is disconnected, themessage is merged with existing messages in the queue to prevent the message queue fromgrowing indefinitely. The cost of merging is high. The messages are stored as a stl::queuewhere searching for an element has the cost of O(N) where N is the number of elements.First, the algorithm has to find the message containing the corresponding subscription in thequeue. Identification is done by string comparison. This adds O(L) to the cost where L is thelength of the subscription string. When the correct message has been found, the messagesmust be merged. As previously discussed, accessing elements a the stl::map with strings askeys has the cost O(L*log N). Summarizing the cost for merging a message containing 5fields with a message queue already containing 20 messages:

– Finding the corresponding message containing the subscription to be merged: O(L * N), where N is the number of messages in the queue and L the length of the subscription name.

– For each field in the message:

• Finding the corresponding field to be updated in the map: O(L * log N), where N is the number of elements (fields) in the map and L the length of the field name.

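The number of comparisons grows very fast, and it is easy to see how merging messages is costly performance-wise. In the example above, the merge needs up to 20 subscription-name comparisons to find the right message, followed by 5 map lookups of about log2 N field-name comparisons each.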
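The merge steps listed above can be sketched as follows. This is a minimal illustration and not the server's actual code: the Message layout and the function name merge_into_queue are assumptions, and a std::deque (the default container underneath std::queue) is used because std::queue itself cannot be iterated.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <string>

// Hypothetical message layout: a subscription name plus its field map.
struct Message {
    std::string subscription;
    std::map<std::string, std::string> fields;
};

// Merges `update` into `pending`. Returns the number of queue entries that
// were inspected, which makes the O(N) scan visible to the caller.
std::size_t merge_into_queue(std::deque<Message>& pending, const Message& update) {
    std::size_t inspected = 0;
    for (Message& queued : pending) {
        ++inspected;
        if (queued.subscription == update.subscription) {   // O(L) string compare
            for (const auto& field : update.fields) {
                queued.fields[field.first] = field.second;  // O(L * log N) per field
            }
            return inspected;
        }
    }
    pending.push_back(update);  // no queued message for this subscription yet
    return inspected;
}
```

In the worst case the matching message sits at the end of the queue, so every queued subscription name is compared before any field is updated.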

5.2 Improving performance

5.2.1 Numerical representation instead of strings

The first step to improving the performance is to eliminate the string representation of fields and their values in the internal processing of the websocket server. Matching field names and copying values as strings is slow in comparison to using integers or doubles. Mapping the field name ASKSIZE to the integer (or byte) value 1 and then converting the string value 23.441 to a double representation would not only conserve memory but also speed up all matching. The string ASKSIZE=23.441, consisting of 7 (ASKSIZE) + 6 (23.441) = 13 bytes, could be converted to 1 byte for the field name and one double (typically 8 bytes) for the value, 9 bytes in total.

Mapping the name would be done one time by the process receiving messages from TickCapture, and subsequent operations would be performed on the numerical values. The current client implementation expects a string representation of fields and values, so the values would then have to be converted back to strings before sending the data to the clients. Using an enumeration to map the byte representation of the field to a string would cost O(1), using the byte representation as the index.

Receiving messages from TickCapture in double format without the need for conversion would be the optimal implementation performance-wise, but it would still require conversion from numerical to string values by the client or in the server before sending the messages.
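A minimal sketch of such a numerical representation is given below. Only the mapping ASKSIZE = 1 comes from the text; the other enumerators, the NumericField struct and the function names encode and field_name are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <unordered_map>

// Hypothetical field identifiers; only ASKSIZE = 1 is taken from the text.
enum FieldId : std::uint8_t { UNKNOWN = 0, ASKSIZE = 1, BIDSIZE = 2, FIELD_COUNT };

// Name table indexed by field id, used when converting back to strings.
const char* const kFieldNames[FIELD_COUNT] = {"UNKNOWN", "ASKSIZE", "BIDSIZE"};

struct NumericField {
    FieldId id;
    double value;
};

// Performed once, when a message arrives from TickCapture.
NumericField encode(const std::string& name, const std::string& value) {
    static const std::unordered_map<std::string, FieldId> ids = {
        {"ASKSIZE", ASKSIZE}, {"BIDSIZE", BIDSIZE}};
    const auto it = ids.find(name);
    const FieldId id = (it == ids.end()) ? UNKNOWN : it->second;
    return NumericField{id, std::strtod(value.c_str(), nullptr)};
}

// Performed just before sending, because the client expects strings.
// The lookup is a plain array index, hence O(1).
const char* field_name(FieldId id) {
    assert(id < FIELD_COUNT);
    return kFieldNames[id];
}
```

After encoding, all internal matching compares one-byte identifiers instead of strings; the string form is only rebuilt at the edge of the system.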


5.2.2 Eliminating message duplication

If clients were not allowed to disconnect and reconnect, the system would not have to maintain a state for disconnected clients. Disconnected clients must be guaranteed to receive a consistent state once they reconnect, which is why messages are stored and merged in the outgoing client queue until the client has reconnected or expired. Messages that are successfully sent to clients are immediately deleted, but failed messages cannot be deleted until the client has received the information. Duplicating the messages so that each client gets a copy makes it easy to control message handling, and implementation-wise it is very straightforward.

The problem is that it introduces unnecessary resource usage, and duplicating messages can be very expensive, especially if strings are involved.

A future solution would be to only duplicate messages for clients that are disconnected. For a system with 100 connected clients and 20 disconnected clients, this change would mean that the number of copies of a message would decrease from 120 (100 + 20) to 21 (1 + 20), a reduction of 82.5%, which is a significant improvement.
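One way this could be done is sketched below, assuming reference-counted sharing through std::shared_ptr. The Client struct, its members and the function distribute are hypothetical, and the vector named sent merely stands in for writing to a socket.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <vector>

struct Message {
    std::string payload;
};

// Hypothetical client record.
struct Client {
    bool connected;
    std::vector<std::shared_ptr<const Message>> sent;  // stands in for a socket write
    std::deque<Message> pending;                       // private copies that can be merged
};

// Hands `incoming` to every client and returns how many Message objects
// had to be created in total.
std::size_t distribute(const Message& incoming, std::vector<Client>& clients) {
    // One shared, read-only copy serves all connected clients.
    std::shared_ptr<const Message> shared = std::make_shared<Message>(incoming);
    std::size_t copies = 1;
    for (Client& client : clients) {
        if (client.connected) {
            client.sent.push_back(shared);       // no copy, only a reference count bump
        } else {
            client.pending.push_back(incoming);  // private copy kept for later merging
            ++copies;
        }
    }
    return copies;
}
```

With 100 connected and 20 disconnected clients, distribute creates 21 message objects, matching the figure above. The shared message is released automatically once the last connected client has dropped its reference.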

5.2.3 Multiplexing

Currently, each client is assigned an individual thread in the system. The thread listens to the client socket and acts on the received data. The threads are completely independent and can read data simultaneously. At the moment, the data mainly consists of throttle and ping messages, which are easy to process. These threads do not have much to do, but with many clients connected the number of threads rapidly increases. Switching between threads (saving and restoring the CPU registers on every context switch) is costly, so having hundreds of them can introduce a serious waste of CPU cycles in the system.

Figure 5.1: Managing sockets with one thread per socket. (Diagram: Threads 1 to 4, each with its own socket, calling read() and handle_data().)

A solution to this problem would be to have a single thread handle all client connections by multiplexing the sockets. This means using a function, usually select(), which listens on a number of file descriptors (sockets) for a given time period. Whenever a file descriptor reports that data is available, or a timeout occurs, select() returns. The thread then loops through the file descriptors and acts on those that have a flag set indicating that data is available. When the handling is complete, the thread returns to listening on the file descriptors with select(). Figure 5.2 illustrates this behaviour.
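The waiting step can be sketched with the POSIX select() call as shown below. The wrapper name wait_readable and its millisecond timeout parameter are assumptions; the sketch only reports which descriptors are readable and leaves the reading and handling to the caller.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>
#include <algorithm>
#include <vector>

// Waits up to timeout_ms for any descriptor in fds to become readable and
// returns the readable ones. An empty result means timeout or error.
std::vector<int> wait_readable(const std::vector<int>& fds, int timeout_ms) {
    std::vector<int> ready;
    if (fds.empty()) return ready;

    fd_set read_set;
    FD_ZERO(&read_set);
    int max_fd = -1;
    for (int fd : fds) {
        assert(fd >= 0 && fd < FD_SETSIZE);  // select() cannot watch larger descriptors
        FD_SET(fd, &read_set);
        max_fd = std::max(max_fd, fd);
    }

    timeval timeout;
    timeout.tv_sec = timeout_ms / 1000;
    timeout.tv_usec = (timeout_ms % 1000) * 1000;

    // select() rewrites read_set so that only readable descriptors remain set.
    if (select(max_fd + 1, &read_set, nullptr, nullptr, &timeout) <= 0) return ready;

    for (int fd : fds) {
        if (FD_ISSET(fd, &read_set)) ready.push_back(fd);
    }
    return ready;
}
```

A server loop would call wait_readable repeatedly, read from each returned descriptor and then go back to waiting. Note that select() is limited to descriptors below FD_SETSIZE, so a server expecting a very large number of clients may need poll() or a platform-specific alternative instead.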


This way you can have a single thread monitoring hundreds of client sockets for data instead of having one thread per socket as in Figure 5.1. As always, it is up to the implementation to handle the data from the client. It is not guaranteed that the entire message can be read from the socket at once, so some sort of buffer or state must be maintained for each socket to store data until it contains a complete message.
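A per-socket buffer of this kind could look like the sketch below. It assumes the text framing used by the early Web Socket protocol drafts, where each frame starts with a 0x00 byte and ends with a 0xFF byte; the names SocketBuffer and feed are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Per-socket receive state: bytes that have arrived but do not yet form a frame.
struct SocketBuffer {
    std::string pending;
};

// Appends freshly read bytes and returns the payload of every complete frame.
// Incomplete trailing data stays in the buffer until the next read.
std::vector<std::string> feed(SocketBuffer& buf, const char* data, std::size_t len) {
    buf.pending.append(data, len);
    std::vector<std::string> frames;
    for (;;) {
        const std::size_t start = buf.pending.find('\x00');
        if (start == std::string::npos) {
            buf.pending.clear();          // no frame start at all: discard the bytes
            break;
        }
        const std::size_t end = buf.pending.find('\xFF', start + 1);
        if (end == std::string::npos) {
            buf.pending.erase(0, start);  // keep the partial frame, wait for more data
            break;
        }
        frames.push_back(buf.pending.substr(start + 1, end - start - 1));
        buf.pending.erase(0, end + 1);
    }
    return frames;
}
```

One SocketBuffer would be kept per client socket, and feed would be called with whatever each read() returned, so a message split across several reads is only handled once its closing byte has arrived.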

Using select() and a single thread for handling sockets means that the data processing is done sequentially on the sockets. It is important that time-consuming tasks do not block the handling of the other sockets. This can be solved by launching heavy tasks in new threads and then immediately handling the next socket.

Figure 5.2: Multiplexing sockets with one thread. (Diagram: Thread 1 calls select() on a group of sockets and then handle_data() on them.)

5.2.4 Fine tuning server selection

The current server selection algorithm does not take into account the differences in server hardware configurations. Ignoring the other performance parameters, a server having 10% CPU utilization is always considered to be a better choice than a server having 20% utilization. The algorithm does not know that the less loaded server has a 100 MHz CPU while the one with higher load is equipped with a quad core 3.2 GHz CPU, and that the latter would probably provide faster service than the less loaded server.

Introducing some sort of server capacity index would remedy this and allow the algorithm to always prefer faster servers in its selection.
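As a rough illustration of how such an index might enter the selection, the sketch below weights the unused share of the CPU by a capacity value. The index definition (number of cores times clock frequency in GHz) and the names Server, spare_capacity and select_server are assumptions, not part of the current algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Server {
    double capacity_index;   // hypothetical: number of cores * clock frequency in GHz
    double cpu_utilization;  // 0.0 to 1.0, as reported by the server
};

// Capacity that is currently not in use.
double spare_capacity(const Server& s) {
    return s.capacity_index * (1.0 - s.cpu_utilization);
}

// Returns the index of the server with the most spare capacity.
std::size_t select_server(const std::vector<Server>& servers) {
    assert(!servers.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < servers.size(); ++i) {
        if (spare_capacity(servers[i]) > spare_capacity(servers[best])) best = i;
    }
    return best;
}
```

For the two servers described above, the 100 MHz machine at 10% load has a spare capacity of 0.1 * 0.9 = 0.09, while the quad core 3.2 GHz machine at 20% load has 12.8 * 0.8 = 10.24, so the more heavily loaded but far faster server is selected.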

Another important factor in the selection would be the geographical location of the server and the client. Latency is of great interest when presenting real time market data, so assigning a client in Sweden to a server in Japan only adds unnecessary latency. Knowing the location of both parties allows the algorithm to select the closest server, provided that it is not overloaded.

5.3 Future Work

Analyzing the performance and identifying areas that would benefit from optimization provides a list for future changes that no doubt will increase capabilities of the single websocket server. I could spend an infinite amount of time optimizing the software but it was never the main goal of this work.

The current system provides scaling by introducing additional websocket servers and having a proxy coordinating the assigning of clients. The system can be dynamically expanded by just launching new servers, which will automatically be registered and used by the proxy. Similarly, a less loaded system can shut down unused websocket servers when they are not needed.

In hindsight, a closer development / performance analysis cycle would have been better. Maybe then I would not have opted for the one thread per client design or regular expressions for parsing the messages. Both were relatively easy to implement but perhaps not the best choices for optimal performance.

Chapter 6

Acknowledgements

I would like to thank Hans Erik Thrane for giving me the opportunity to do this project and also for his invaluable help during the development, not only providing technical expertise but also giving me ideas on alternate approaches for solving problems.

A big thank you also to my internal supervisor at Umea University, Mikael Rannar, for the support and guidance he has provided throughout the work.

Not to be underestimated, the moral support provided by persons close to me has greatly contributed to this work. You know who you are.



Appendix A

Message Types

The following is a list of message types currently available in the system.

Type: AUTH
Source: Websocket server
Destination: Proxy
Description: The websocket server authenticates itself with the proxy and sends the connection details specified in the SERVERURL tag.
Example:

<MSG TYPE="AUTH">
  <USER>Websocket</USER>
  <PASS>12345</PASS>
  <SERVERURL>ws://192.168.0.2:4009</SERVERURL>
</MSG>

Type: PROXY_AUTH
Source: Client
Destination: Proxy
Description: The client authenticates itself with the proxy.
Example:

<MSG TYPE="PROXY_AUTH">
  <USER>Nikolai</USER>
  <PASS>12345</PASS>
</MSG>


Type: COOKIE_NEW
Source: Proxy
Destination: Websocket server
Description: Contains a proxy generated cookie that will be used by the client when connecting to the websocket server.
Example:

<MSG TYPE="COOKIE_NEW">
  <COOKIE>
    EIloRVY259rNbfILPsvY369WLCfimPSw03663QcGJ...
  </COOKIE>
</MSG>

Type: COOKIE_ACK
Source: Websocket server
Destination: Proxy
Description: The websocket server acknowledges the cookie sent by the proxy. It is now ready to accept a connection from a client using this cookie.
Example:

<MSG TYPE="COOKIE_ACK">
  <COOKIE>
  </COOKIE>
</MSG>

Type: COOKIE_AUTH
Source: Client
Destination: Websocket server
Description: The client authenticates itself with the cookie assigned by the proxy.
Example:

<MSG TYPE="COOKIE_AUTH">
  <COOKIE>
  </COOKIE>
</MSG>

Type: CONNECTION_DETAILS
Source: Proxy
Destination: Client
Description: Contains the connection details that the client will use to connect to the selected websocket server.
Example:

<MSG TYPE="CONNECTION_DETAILS">
  <COOKIE>
  </COOKIE>
  <SERVERURL>ws://192.168.0.2:4009</SERVERURL>
</MSG>


Type: QUOTE
Source: Websocket server
Destination: Client
Description: Contains fields and values for a feed. The message contains only the fields that have changed. The first QUOTE message sent to the client always contains all available fields.
Example:

<MSG TYPE="QUOTE">
  <ID>BOND1</ID>
  <ASKSIZE>476</ASKSIZE>
  <BIDSIZE>409</BIDSIZE>
  <SEQNO>1000029454838</SEQNO>
  <TIME>10:18:14.000</TIME>
  <TIMESTAMP>03-JUN-2010 09:18:14.878</TIMESTAMP>
</MSG>
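The rule that a QUOTE message carries only changed fields can be illustrated with the sketch below. It is an assumption about how such a delta could be computed, not the server's actual code; the alias Fields and the function changed_fields are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>

using Fields = std::map<std::string, std::string>;

// Returns the fields of `current` that are new or differ from `last_sent`.
// With an empty `last_sent` (the first QUOTE) every field is returned.
Fields changed_fields(const Fields& last_sent, const Fields& current) {
    Fields delta;
    for (const auto& field : current) {
        const auto it = last_sent.find(field.first);
        if (it == last_sent.end() || it->second != field.second) {
            delta.insert(field);
        }
    }
    return delta;
}
```

For example, if only ASKSIZE changes from 476 to 480 between two updates, the second QUOTE message would contain ASKSIZE alone.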

Type: THROTTLE_BYTES
Source: Client & websocket server
Destination: Client & websocket server
Description: The client sends this message when it wants the websocket server to throttle the bandwidth between the client and websocket server. To confirm the throttle set, the websocket server sends back the same message to the client.
Example:

<MSG TYPE="THROTTLE_BYTES">10240</MSG>

Type: DISCONNECT
Source: Client
Destination: Websocket server
Description: The client sends this message to the websocket server to indicate a clean disconnect. The websocket server will then exclude this client from further processing and will not keep updating the client state in anticipation of a client return.
Example:

<MSG TYPE="DISCONNECT"></MSG>

Type: PING & PONG
Source: Client & websocket server
Destination: Client & websocket server
Description: The PING is sent by the client to measure round trip time to the websocket server. The websocket server responds with a PONG containing the same time stamp as in the PING message.
Example:

<MSG TYPE="PING">1275557445979</MSG>
<MSG TYPE="PONG">1275557445979</MSG>
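The server side of this exchange can be sketched as a function that echoes the time stamp unchanged. The function name make_pong and the plain string matching are assumptions for illustration; the actual server may parse messages differently.

```cpp
#include <cassert>
#include <string>

// Returns the PONG reply for a PING message, echoing its time stamp unchanged,
// or an empty string if the input is not a PING message.
std::string make_pong(const std::string& ping) {
    const std::string open = "<MSG TYPE=\"PING\">";
    const std::string tail = "</MSG>";
    if (ping.size() < open.size() + tail.size()) return "";
    if (ping.compare(0, open.size(), open) != 0) return "";
    if (ping.compare(ping.size() - tail.size(), tail.size(), tail) != 0) return "";
    const std::string stamp =
        ping.substr(open.size(), ping.size() - open.size() - tail.size());
    return "<MSG TYPE=\"PONG\">" + stamp + tail;
}
```

Because the time stamp is returned untouched, the client can compute the round trip time by subtracting it from its own clock when the PONG arrives, without any clock synchronization with the server.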


Type: RESOURCES
Source: Websocket server
Destination: Proxy
Description: The message contains performance indicators for the websocket server. These values are then used by the proxy in the server selection algorithm.
Example:

<MSG TYPE="RESOURCES">
  <RESOURCE NAME="CL_C" ID="3">
    <MIN>0</MIN><MAX>0</MAX><VAL>6</VAL>
  </RESOURCE>
  <RESOURCE NAME="CL_D" ID="4">
  </RESOURCE>
  <RESOURCE NAME="MPS" ID="0">
  </RESOURCE>
  <RESOURCE NAME="BPS" ID="1">
  </RESOURCE>
  <RESOURCE NAME="BW" ID="2">
  </RESOURCE>
  <RESOURCE NAME="CPU" ID="5">
  </RESOURCE>
  <RESOURCE NAME="CPU_P" ID="6">
  </RESOURCE>
  <RESOURCE NAME="RAM" ID="7">
  </RESOURCE>
  <RESOURCE NAME="LATENCY" ID="8">
  </RESOURCE>
</MSG>
