CIO Interview about Flopsar APM - Application Performance Management



Galaxy

or the escape from illusion

Michał Zabiełło

A new way to visualize system performance developed by a Polish company has been gaining recognition. The solution is already used by several dozen Polish companies and resolutely cuts through the well-known weaknesses of APM solutions.

One of the elements that can bring rational savings in IT is the group of tools for application performance management (APM). Large corporations are investing in APM tools. The providers of such solutions implement tens of dashboards and hundreds of graphs and flow diagrams. They define thousands of alerts and inundate the mailboxes of relevant recipients with messages about the “health check” of business processes. All this is meant to convince everyone that the scattered IT infrastructure is under control. It all works until a serious malfunction occurs. Then IT specialists trying to identify the cause of the problem must analyze millions of out-of-date, unnecessary or erroneous pieces of information coming from the implemented tools.

Bombarded by alerts

The tools to diagnose or monitor applications are of key importance. Good tools are expensive – they require many laboratory checks, tests, and a precise manufacturing process. Good and expensive tools are, in turn, complicated.

It is worth noting that such products have a specific methodology connected with performance management: we install a tool, configure the scope of reported metrics and build a complicated “health check” application to warn us about problems occurring in the monitored applications. In practice, the system warns us about a problem that has occurred – but the cost of using, maintaining and developing the application is often higher than planned.

Dashboards have become, paradoxically, the Achilles’ heel of those tools. Every monitored application has to have a set of hierarchical dashboards, and each piece of information presented on them requires a set of defined SLA parameters that determine the result of the “health check”, which is signaled by the colors green, yellow, or red. This signaling is ambiguous: it is not clear whether it means a failure of the system or just a slowdown, or whether the problem concerns a single function or a whole set.

The tools are bombarding the administrators with information. The command center has its hands full with sifting and separating false alarms from those responsible for disruptions in data processing. The implementation specialists responsible for tools are constantly working on updating and adapting the dashboards to frequently changing applications or requirements concerning notifications about application problems.

That is how APM operates.

In search of an intuitive APM

In 2012 a group of programmers experienced in implementing and administering APM solutions formed a company. Its goal was to create a solution which would overcome the weaknesses and limitations of monitoring systems and increase the performance of applications. “Our point of departure in creating the system was a fundamental question: Do data from monitored systems, alerts and trends have to be represented in a way which requires huge outlays?” – says Grzegorz Pawluk, CTO and one of the co-founders of Flopsar Technology.

Perhaps it is possible to show in a simple, intuitive manner what is the most important for IT services:

that a malfunction has just occurred;

that the users may complain about the system working inefficiently;

that the provider implemented a badly written application which cannot function in an overloaded environment;

that the application is using up too much of the power of the expensive equipment.

These common-sense assumptions are behind Flopsar (Flop Search and Rescue). The creators of Flopsar Suite asked themselves one more question: “What is really important in the tangle of information reported from the monitored system?” They formulated the following answers:

1. Simple implementation and no need for an advanced configuration: Plug-and-play.

2. No need to train people who benefit from the tool.

3. SIMPLE, intuitive interface (preferably one window).

4. Maximum productivity - to discover a problem and to find its cause, the user should not need to perform more than three operations.

5. No “early warning systems” based on labor-intensive development.

Flopsar Galaxy

Flopsar does not aggregate data. It does not show averages, medians or quartiles. With unstable systems the sample is too large and therefore not credible. The galaxy shows EVERY single operation performed within the monitored system. Each time a transfer is performed or someone logs into an application, a dot appears, positioned by the time of the event (X axis) and the response time (Y axis). The majority of “correct” times (the ones with sufficient processing quality) are concentrated in the lower regions of the galaxy, where the dots form a multicolored plane. If an application or one of its functions slows down or malfunctions, the dots migrate into the upper regions of the galaxy and form various concentration patterns. The appearance of such concentrations is the reason for further investigation. The concentrations are detected automatically by a system based on artificial intelligence algorithms, or they may be marked manually in order to identify the reason for their occurrence. After marking, the user receives a precise diagnosis of what is not working correctly in the system and why.
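The galaxy described above is, in essence, a scatter plot with one point per operation. The Python sketch below is only an illustration under stated assumptions: the `Operation` record, the `slow_threshold_ms` cutoff and the fixed time-bucket rule are hypothetical, and they stand in for Flopsar's own detection algorithms, which the article does not describe.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """One monitored operation: a single dot in the galaxy (hypothetical record)."""
    name: str
    start_s: float       # X axis: when the operation happened
    response_ms: float   # Y axis: how long it took


def find_concentrations(ops, slow_threshold_ms=1000.0, bucket_s=60, min_points=3):
    """Return start times of time buckets holding at least `min_points` slow dots.

    A naive stand-in for concentration detection: the raw points are kept
    as they are, and only the dots in the upper part of the plot are counted.
    """
    slow_per_bucket = Counter(
        int(op.start_s // bucket_s) * bucket_s
        for op in ops
        if op.response_ms >= slow_threshold_ms
    )
    return sorted(start for start, count in slow_per_bucket.items() if count >= min_points)


# Made-up sample: two fast operations, then three slow ones within one minute.
ops = [
    Operation("login", 5, 120),
    Operation("transfer", 20, 180),
    Operation("transfer", 65, 2400),
    Operation("transfer", 80, 2600),
    Operation("login", 110, 3100),
    Operation("login", 130, 150),
]
print(find_concentrations(ops))  # [60]
```

With the sample data, the three slow operations between seconds 60 and 120 form the only concentration, so an administrator would start the investigation there.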

After several days of working with the Flopsar system, administrators begin to feel that they understand what they see. Based on events observed in the past and on concentrations they have already interpreted, they may say “the queue system got disconnected again,” or “web service is not working again”, or even ignore the pattern as something natural.

The system works without configuration – there is no need to construct dashboards, to define static SLA for selected methods, to provide expensive system maintenance. Once the monitoring system has been switched on, the application server processes data, the monitor starts showing concentrations and the administrator starts looking for unnatural and disturbed concentration patterns.

Flopsar in UFG: production monitoring of critical applications

Reduction of production problems related to application performance

Code optimization – shorter response times

Reduced use of hardware infrastructure

How quickly can one learn to draw conclusions from the Flopsar visualization?

“We collect millions of records on policies, drivers and road events. It is critical to ensure the reliability and quality of operation of the IT systems which perform our statutory tasks. We selected the Flopsar Suite because of its intuitiveness and functionality. The tool was implemented within a few hours and its effective operation by the team of administrators started immediately after the implementation. The factors in favor of choosing Flopsar also included costs, the level of after-sales service, flexibility and the range of additional services offered by the provider. The monitoring data indicate unequivocally where the problem has occurred and, therefore, who is responsible for servicing or repairing it. Today, we often use the information obtained from Flopsar software as an argument in negotiations with our IT service providers” – says Grzegorz Rymarski, IT Department Director, The Insurance Guarantee Fund (UFG).


Innovation through going back to the roots

Is the “galactic” way of showing data innovative and unique? The scatter plot is used in statistics to visualize data. Grzegorz Pawluk explains: “Flopsar reports every transaction performed in the monitored system separately. It connects stack frames into stack traces and then reports the aggregated duration of the transaction as one point (with full access to all the remaining data). In this type of service, the volume of data which needs to be recorded in the monitoring database is gigantic. Therefore, it is the database infrastructure (data persistence), and not the data-generating agent, which is the ‘heart’ of the Flopsar system.”
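The step Pawluk describes, connecting stack frames into traces and reducing each transaction to one point, can be pictured with a short Python sketch. It is an illustration only: the `Frame` record, its field names and the rule that the transaction time is the sum of per-frame self times are assumptions made for this example, not a description of the Flopsar agent or its protocol.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Hypothetical stack-frame report sent by a monitoring agent."""
    transaction_id: str
    depth: int          # 0 = entry point of the transaction
    method: str
    self_ms: float      # time spent in this frame only (assumed)


def to_points(frames):
    """Group frames into traces and reduce each trace to one (id, total_ms, trace) point."""
    traces = defaultdict(list)
    for frame in frames:
        traces[frame.transaction_id].append(frame)
    points = []
    for tx_id, tx_frames in traces.items():
        # Simplification: a linear call chain ordered by depth, not a full call tree.
        ordered = sorted(tx_frames, key=lambda f: f.depth)
        total_ms = sum(f.self_ms for f in ordered)
        # The full trace stays attached, so the detail behind the dot is not lost.
        points.append((tx_id, total_ms, [f.method for f in ordered]))
    return points


# Made-up frames arriving out of order from two transactions.
frames = [
    Frame("tx-1", 1, "PolicyDao.find", 40.0),
    Frame("tx-1", 0, "PolicyServlet.doGet", 10.0),
    Frame("tx-2", 0, "LoginServlet.doPost", 25.0),
]
for point in to_points(frames):
    print(point)
```

Each resulting tuple corresponds to one dot in the galaxy, while the attached method list plays the role of the “remaining data” available on demand.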

Innovation, or perhaps rather a return to healthy roots, can be seen in the approach to the project. The Flopsar project started with designing the infrastructure: messages, protocols, engines, data structures, and mechanisms for load balancing and bypassing malfunctions. The entire infrastructure was programmed in C, chosen for its efficiency. The code, which has 5,000,000 lines, was written from scratch and entirely without external (e.g. open-source) libraries. The Flopsar engineers and support team are responsible for 100% of the solution. Tests and production implementations show that Flopsar can process around 40,000 metrics per second, or a cumulative load of 200 MB/sec, for a single database instance in 24/7/365 mode.
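To give a sense of scale for the throughput figures quoted above, the arithmetic can be worked through in a few lines of Python. The two figures are treated as separate limits, as in the text, and the conversion assumes 1 MB = 10**6 bytes and 1 TB = 10**12 bytes, which the article does not specify.

```python
SECONDS_PER_DAY = 24 * 60 * 60

metrics_per_second = 40_000            # figure quoted in the article
load_bytes_per_second = 200 * 10**6    # 200 MB/sec, assuming 1 MB = 10**6 bytes

# Sustained for a full day on a single database instance.
metrics_per_day = metrics_per_second * SECONDS_PER_DAY
terabytes_per_day = load_bytes_per_second * SECONDS_PER_DAY / 10**12

print(f"{metrics_per_day:,} metrics per day")    # 3,456,000,000 metrics per day
print(f"{terabytes_per_day:.2f} TB per day")     # 17.28 TB per day
```

Under these assumptions, a single instance running at the quoted limits around the clock would take in roughly 3.5 billion metrics, or about 17 TB of data, per day.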

In 2013 Flopsar Technology implemented its solution, as the only APM software provider, on approximately 100 production application servers in the Polish market and, in cooperation with strategic business partners, carried out several dozen projects to optimize critical systems. During the same period, its competitors recorded a few individual license sales in Poland. At present the company, together with a number of partners, is running a few proof-of-concept projects. “We estimate that by the end of 2014 the number of implementations will exceed 300 monitored application servers in mission-critical systems. This will make Flopsar Technology an unrivalled market leader in the field of monitoring and managing the performance of critical applications based on Java servers” – says Grzegorz Pawluk. The boxes show examples of Flopsar in use at UFG and Generali, together with comments from their top IT managers.

CIO Magazine asked Michał Zaremba, IT Infrastructure Project Manager, IT Department Support and Infrastructure Section, Generali Group, to comment on detailed changes related to the Generali Group APM solution implementation.

The Generali Group:

Sales management system production monitoring


Complete detection of all production issues (failures, delays, defects)

Full control over IT system production version acceptance – early issue detection, application code optimization suggestions, architecture and performance issue consulting

Code refactoring – processing optimization (performance increase)

Capacity requirement estimation for increased data processing periods

Flopsar Suite – Who should manage quality and efficiency?

Until recently Flopsar Suite was utilized by the Generali Group only for early detection of performance issues in production systems. It was handled by the team responsible for IT system and service monitoring. During performance testing developers were using it to discover inefficient methods and queries. Further experiences with the Flopsar Suite helped develop a different, more effective application performance monitoring model.

If you take a closer look at the tool, it is difficult to decide whether this is an advanced application server performance monitoring system or a reporting system designed for analyzing IT system operation performance. In the first case Flopsar may be perceived as just another monitoring system utilized in maintenance activities, and in the second case as an additional system for supporting application development and service transition from the development to the maintenance stage. However, one must realize that in order to provide our customers with top value and performance, a very deep synergy of these areas is required. This also opens up extensive process optimization capabilities by eliminating unnecessary IT resource consumers which provide no value to service recipients.

Department structure transformation and the transition to a DevOps concept enabled Flopsar Suite to finally end up in a spot where its full capabilities may be utilized: in the hands of a team responsible for IT applications and services, covering both their development and operational activities. The important fact is that system utilization in both areas is very similar, and therefore requires no changes in team work style or mode, or any additional training.

Theoretical conclusions and diagnosis are supposedly delivered by Flopsar very quickly. How quickly, and have you been successful in transforming them into IT process and product optimization?

The use of Flopsar enables us to greatly improve the speed of handling incidents in a production environment. The time between an anomaly appearing in a production system and corrective actions being launched by the team is close to zero. In the past, if an end user had a subjective feeling that the system was not performing well, such information had to pass through multiple levels of the IT organization. Now this information is visible to an expert precisely when the user begins to feel the system becoming less responsive. All in all, the user reports problems to the service desk like before, but the service desk already knows about the faulty system operation and that an intervention is underway. This greatly cuts down the time required to resolve incidents, because the problem-causing method, service, or query can be found quickly and intuitively.


Application development and test processes have also been optimized. Thanks to monitoring applications in development and test environments, we are able to discover operations with execution time beyond acceptable limits.

By analyzing the number of particular calls in a given period of time we are able to define business activity patterns, and as a result, properly manage IT service capacity, performance, and demands. This also enables us to properly schedule change management processes, including planned maintenance outages.
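The analysis of call counts per period mentioned above can be sketched in Python as follows. This is a hedged illustration rather than Generali's actual procedure: the function names are hypothetical, the timestamps are made up, and a real analysis would read them from the monitoring database.

```python
from collections import Counter
from datetime import datetime


def calls_per_hour(call_times):
    """Count calls per hour of day (0-23) from an iterable of datetimes."""
    counts = Counter(t.hour for t in call_times)
    return [counts.get(hour, 0) for hour in range(24)]


def busiest_hour(hourly_counts):
    """Return the hour with the most calls; ties go to the earliest hour."""
    return max(range(24), key=lambda hour: hourly_counts[hour])


def idle_hours(hourly_counts):
    """Return the hours with no calls at all: candidates for maintenance windows."""
    return [hour for hour in range(24) if hourly_counts[hour] == 0]


# Made-up call timestamps for a single day.
calls = [
    datetime(2014, 3, 3, 9, 5), datetime(2014, 3, 3, 9, 40),
    datetime(2014, 3, 3, 10, 15), datetime(2014, 3, 3, 14, 1),
    datetime(2014, 3, 3, 14, 20), datetime(2014, 3, 3, 14, 55),
]
hourly = calls_per_hour(calls)
print(busiest_hour(hourly))     # 14
print(idle_hours(hourly)[:3])   # [0, 1, 2]
```

The busiest hours indicate where capacity is needed, and the idle hours are natural candidates for planned maintenance outages.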

Based on those patterns and query statistics, is it possible to optimize other organizational processes and activities? Can the solution become a source of other innovations?

If a business process is performed in an IT system which is covered by Flopsar analysis, all system operations are registered and may be analyzed. The specific data visualization enables us to identify business process activities which are performed inefficiently.

Usually a business process performed in an IT system is treated by a business user as an operation with a definite start and end. In reality, this process includes multiple operations which reach beyond the application, towards the integration architecture, the database, and other systems. Advanced BPM systems feature a Business Activity Monitoring (BAM) component, which may be utilized to optimize business processes. However, if applications are developed in-house, a business process monitoring tool supported by the particular applications should also be provided. If the owner decides not to implement such functionality in the developed application, deduction based on database data, which the Flopsar system can provide, may be helpful.

Has capacity demand forecast accuracy improved? Has this led to optimizing infrastructure usage?

In terms of infrastructure optimization for application performance Generali relies on three base techniques: monitoring technical parameters of infrastructure components (using SNMP, WMI, etc.), optimizing load balancing, and application performance monitoring using the Flopsar Suite.

The first and second techniques are known and used by many organizations, but only an analysis of correlations between all three provides a complete picture for capacity forecasting. This may be done by relating the technical parameters of infrastructure components to the execution time of an operation in a monitored application.
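One simple way to relate an infrastructure parameter to operation execution time is a correlation coefficient. The Python sketch below computes the Pearson coefficient by hand. It is an assumption about how such an analysis might start, not the method used at Generali, and both data series are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up samples taken over the same five intervals.
cpu_percent = [20, 35, 50, 65, 80]        # e.g. a value polled over SNMP
response_ms = [110, 140, 170, 200, 230]   # execution time of one operation

print(round(pearson(cpu_percent, response_ms), 3))  # 1.0
```

A coefficient close to 1 or -1 suggests that the parameter is worth including in a capacity forecast, while a value near 0 suggests looking at other components first.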

The character of recent Generali marketing activities required a temporary multi-fold capacity increase in Merkury 2.0, the primary sales system utilized by Generali. At first, we considered linear scaling of server infrastructure components. When testing the solution with Flopsar, it turned out that there are multiple factors which may greatly influence performance and may be modified in order to increase system capacity. We noticed that standard load balancing techniques may have an adverse effect on the time required to perform operations by a single user. Conditioning load balancing on infrastructure and system parameters enabled us to provide a solution which featured the same efficiency for every user. Curiously, the tests have shown that the impact of Flopsar Suite on environment load falls below 1–2%. Finally, after completing several optimizations, we reached a state where the system load increase could be handled without modifying the server infrastructure at all. After completing this marketing activity we were able to reduce that infrastructure.


How did the transition to the new method of observing sales efficiency go, especially in case of interpreting event distribution visualizations? Did the users easily reach a new deduction process?

Flopsar Suite is an intuitive package. The system is currently used by the IT department, but we are seriously considering sharing its data with business users, who might then use it to optimize business processes.

However, you have to consider the fact that business users often require numerical data, not graphical presentations, in order to perform data analysis. If Flopsar were to be used for sales efficiency analysis, it would be good if it had an option to provide results in a numerical format. For example, departments responsible for sales care not only about how system performance influences product sales, but also about how product search operations are distributed over particular hours, within given months or within the year.

The fact that Generali reached such an advanced level of tool use proves that the system is easy to handle. We also noticed that the tool may be used in an even more optimized fashion if additional expertise is gained in its operation: analysis, result interpretation, as well as building report extensions. It is worth mentioning that all the data collected in the Flopsar database are available to our developers through a dedicated API.

Are process and factor complexity considered limitations for the application performance visualization method proposed by Flopsar? If so, how can this be circumvented?

Most probably everyone who was ever responsible for IT system performance optimization has faced uncertainty about whether the system operates the same way between measurements as during measurements. This is typical for systems where performance is measured at fixed intervals. Flopsar analyzes every operation within the system. If we do not filter particular calls in the so-called galaxy, every point represents one system call. If the processes performed are of high complexity, we are forced to operate on a large number of geometrically correlated points. In such a case, data analysis requires verifying particular calls among a larger number of those measured and presented. This might become a limitation because of the speed at which an expert can analyze the data. It may also adversely impact the application server load due to Flopsar collecting data. This can be circumvented if we use techniques to exclude particular calls which are outside our interest. This can be done at the system administration level, which enables monitoring to be tailored individually to every application. Another method to reduce the data which do not require analysis is an option to filter by minimum and maximum operation time in the analyzed system. Finally, in the case of systems working on several application servers, we are able to change the point colors depending on the server. I believe that it would be useful if there were an option to define item colors in a custom fashion, e.g. based on the type of system operation or on the execution time.
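The reduction techniques listed above (excluding calls of no interest, filtering by minimum and maximum operation time, and coloring points by server) can be illustrated with a small Python sketch. The `Call` record, the function names and the color palette are hypothetical and are not part of the Flopsar interface or its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical record for one call shown as a point in the galaxy."""
    operation: str
    server: str
    duration_ms: float


def select_points(calls, min_ms=0.0, max_ms=float("inf"), excluded_ops=()):
    """Keep calls within [min_ms, max_ms] whose operation is not excluded."""
    excluded = set(excluded_ops)
    return [
        c for c in calls
        if min_ms <= c.duration_ms <= max_ms and c.operation not in excluded
    ]


def color_by_server(calls, palette=("red", "blue", "green", "orange")):
    """Assign each server a palette color in order of first appearance."""
    colors = {}
    for c in calls:
        if c.server not in colors:
            colors[c.server] = palette[len(colors) % len(palette)]
    return colors


# Made-up calls from two application servers.
calls = [
    Call("search", "app1", 90),
    Call("healthcheck", "app1", 5),
    Call("quote", "app2", 1800),
    Call("search", "app2", 12000),
]
visible = select_points(calls, min_ms=50, max_ms=5000, excluded_ops=["healthcheck"])
print([c.operation for c in visible])  # ['search', 'quote']
print(color_by_server(visible))        # {'app1': 'red', 'app2': 'blue'}
```

The custom coloring suggested at the end of the answer would amount to replacing `c.server` with another key, such as the operation type or a duration band.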