Trace is a lightweight tool that offers visualization and sophisticated, automated techniques to analyze system performance. ESI (TNO) explains the tool’s underlying concepts and describes two example applications at Océ.
Understanding, analyzing and ultimately solving performance issues in high-tech systems is a notoriously difficult task. Performance is a cross-cutting concern often affected by many system components. One simple yet effective way of gaining insight into performance is to capture execution traces that involve the most significant components and visualize them in a Gantt chart. Trace is a lightweight tool that offers this visualization. It also provides sophisticated, automated analysis techniques tailored to answering typical questions such as ‘What’s the bottleneck?’, ‘Are the throughput and latency requirements met?’ and ‘How can we improve performance?’.
One of ESI’s long-term ambitions is to provide methods and tools for achieving performance ‘by construction’, meaning that performance properties are guaranteed on the left-hand side of the well-known V-model of development. This improves on the current state of practice, in which performance properties often emerge late in the development process (on the right-hand side of the V-model) and are therefore very expensive to fix if they’re not satisfactory. Research by ESI and many others has demonstrated over the past years that executable models (virtual prototypes and digital twins) and their analysis will play an important role in realizing this ambition. Trace contributes here, as it can visualize and analyze both simulation traces from models and execution traces from real systems. This gives engineers greater insight into their designs’ inner workings with respect to performance in the early phases of development.
Trace is available as an Eclipse plug-in. The input is a plain text file describing a set of what are called claims. A claim has a start time and an end time, a part of a resource that is claimed during this time and a number of user-defined attributes such as the name of the component making the claim. For example: ‘Component A claims 2 CPU cores from 0.020 ms to 0.040 ms’. The modelling of an execution is motivated by the Y-chart decomposition of systems into an application that’s mapped to a platform (a set of resources). Generating this kind of input can be straightforward. Adding a few lines of logging to the software may suffice.
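To make the notion of a claim concrete, the following is a minimal Python sketch of a claim record and of a logging helper that turns it into one line of text. The `Claim` class, its field names and the line layout produced by `format_claim` are illustrative assumptions; they aren't Trace’s actual input syntax, which the tool’s own documentation defines.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One claim: an amount of a resource held over a time interval."""
    start_ms: float                 # start time in milliseconds
    end_ms: float                   # end time in milliseconds
    resource: str                   # name of the claimed resource
    amount: float                   # part of the resource that is claimed
    attributes: dict = field(default_factory=dict)  # user-defined attributes


def format_claim(claim):
    """Render a claim as one text line (hypothetical layout, not Trace's)."""
    attrs = ",".join(f"{key}={value}"
                     for key, value in sorted(claim.attributes.items()))
    return (f"{claim.start_ms:.3f} {claim.end_ms:.3f} "
            f"{claim.resource} {claim.amount:g} {attrs}")


# 'Component A claims 2 CPU cores from 0.020 ms to 0.040 ms'
example = Claim(0.020, 0.040, "CPU", 2, {"name": "A"})
line = format_claim(example)
print(line)
```

In instrumented software, a helper like this would typically be called once per completed activity, which is why a few lines of logging can be enough to produce usable input.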
Trace visualizes each claim as a coloured box. Figure 1 shows an activity view depicting an image processing software simulation trace in a prototype Océ printer. The x-axis shows time; the y-axis shows claim groupings, in this case according to software task name. The colours indicate the page being processed (the colours repeat, eg in the ‘IP0’ row, because the colour palette is limited). In addition to this activity view, Trace also offers a resource view (to complete the Y-chart decomposition), which displays resources on the y-axis.
In our collaboration with Océ, we explored the scalability of a component in prototype image processing software. This component processes the print job’s page data in three sequential stages: IP1, IP2 and IP3, each running on a quad-core CPU. The software allows multiple component instances to enable parallel page processing with the ultimate goal of increasing throughput at the cost of using more memory. We observed, however, that the scalability heavily depended on the input data. For some print jobs, the throughput scaled almost linearly up to three parallel instances, whereas for other print jobs the throughput was almost constant. To investigate this puzzling issue, we first added log points to the source code to generate Trace input. Next, we selected two representative examples of the observed extremes, ran them through the software configured with three parallel component instances and loaded their executions in Trace.
The resulting resource view showed that the non-scalable job loaded the CPU more heavily. We then applied Trace’s built-in resource-usage analysis and saw significant differences: the scalable job had more than four threads running on the CPU for only 6 per cent of the time, whereas the non-scalable job had more than four threads running for about 90 per cent of the time. This was caused by the relative processing times of IP1, IP2 and IP3, which produce more or less pipelining depending on their actual, data-dependent values. Figure 2 shows the resource view and the resource-usage analysis for the scalable job. Trace enabled us to quickly understand this scalability issue.
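The kind of figure reported by such a resource-usage analysis, the share of time during which more than a given number of threads are running, can be computed with a simple sweep over claim start and end times. The sketch below assumes claims are given as `(start, end, amount)` tuples on a single resource; the function name `fraction_above` and the example data are made up for illustration and don’t reflect how Trace implements its analysis.

```python
def fraction_above(claims, threshold):
    """Return the fraction of the trace span in which the total claimed
    amount exceeds `threshold`. `claims` is a list of (start, end, amount)."""
    events = []
    for start, end, amount in claims:
        events.append((start, amount))    # resource is taken
        events.append((end, -amount))     # resource is released
    if not events:
        return 0.0
    events.sort()
    span = events[-1][0] - events[0][0]
    if span <= 0:
        return 0.0

    load = 0          # amount claimed just before the current event
    above = 0.0       # accumulated time with load > threshold
    previous = events[0][0]
    for time, delta in events:
        if load > threshold:
            above += time - previous
        load += delta
        previous = time
    return above / span


# Hypothetical trace: three claims overlap between t=2 and t=4 only.
demo_claims = [(0, 10, 1), (0, 4, 1), (2, 4, 1), (8, 10, 1)]
share = fraction_above(demo_claims, threshold=2)
print(f"{share:.0%} of the time more than 2 units are claimed")
```

With a threshold of four and one claim per running thread, the same computation yields the share of time in which a quad-core CPU is oversubscribed.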
In another prototyping project at Océ, we investigated the bottleneck in the software that processes images from a scanner. In several future scenarios, this software had to be faster. The question was how to achieve this. The software had six processing steps communicating data via buffers. We modelled this system in a discrete-event simulation model to explore the consequences of changing the steps’ key properties (such as their nominal speed). We visualized simulation traces with Trace (figure 1). We ran the tool’s built-in critical-path analysis, which estimates causal relationships between claims and calculates critical tasks that determine performance.
Figure 3 shows the critical-path visualization by Trace. The bright red colour indicates that the corresponding tasks are blocked because of resource unavailability. IP2 is waiting for buffer space. Increasing buffer size thus potentially improves performance. Experimentation with the model confirmed that increasing the buffer capacity improves throughput by 30 per cent. The Trace critical-path analysis enabled us to quickly find and resolve a system bottleneck.
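The core idea of a critical path can be sketched as a longest-path computation over a dependency graph of tasks. Note the simplifying assumption: Trace estimates the causal relationships from the trace itself, whereas this sketch takes them as explicit input and ignores resource blocking. The task names, durations and the function `critical_path` are hypothetical.

```python
def critical_path(durations, predecessors):
    """Return (length, path) of the longest chain of dependent tasks.

    durations:    {task: duration}
    predecessors: {task: [tasks that must finish first]} (must be acyclic)
    """
    finish = {}      # earliest finish time per task
    critical = {}    # predecessor that determines each task's start

    def visit(task):
        if task in finish:
            return finish[task]
        start = 0.0
        critical[task] = None
        for pred in predecessors.get(task, []):
            pred_finish = visit(pred)
            if pred_finish > start:
                start = pred_finish
                critical[task] = pred
        finish[task] = start + durations[task]
        return finish[task]

    for task in durations:
        visit(task)

    # Walk back from the task that finishes last.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = critical[node]
    path.reverse()
    return length, path


# Hypothetical four-step system: B and C both depend on A, D on both.
demo_durations = {"A": 2, "B": 3, "C": 5, "D": 1}
demo_predecessors = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
length, path = critical_path(demo_durations, demo_predecessors)
print(length, path)
```

Tasks on the returned path are the ones whose speed-up would shorten the overall execution; speeding up B in this example would change nothing, because C determines when D can start.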
Trace and its underlying concepts are easy to learn and domain independent; it’s often straightforward to obtain the necessary data and the tool’s application potentially has great benefits. The visualization alone is often very useful as it displays the parallel activities and their respective durations. The human brain can quickly identify patterns and anomalies in such views.
As traces grow in size, however, automated methods are needed to support the analysis. In addition to the resource-usage analysis and critical-path analysis, Trace has a feature to compare traces with each other. This can be useful to calibrate a model with a trace from the system that’s being modelled. The tool further supports the formal specification and verification of properties, such as an upper bound on latency or a lower bound on throughput, using temporal logic techniques. Trace’s analysis support distinguishes it from other Gantt chart viewers.
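What such properties amount to can be illustrated with a direct check on a trace. The sketch below isn’t temporal logic and doesn’t use Trace’s specification language; it simply assumes that each processed item (for instance a page) has a known start and end time, and the names `max_latency`, `throughput` and `meets_requirements` are made up for illustration.

```python
def max_latency(items):
    """Largest end-to-end time of any item; items is {id: (start, end)}."""
    return max(end - start for start, end in items.values())


def throughput(items):
    """Items completed per time unit over the whole trace span."""
    first_start = min(start for start, _ in items.values())
    last_end = max(end for _, end in items.values())
    return len(items) / (last_end - first_start)


def meets_requirements(items, latency_bound, throughput_bound):
    """Check an upper bound on latency and a lower bound on throughput."""
    return (max_latency(items) <= latency_bound
            and throughput(items) >= throughput_bound)


# Hypothetical trace of three pages, times in seconds.
demo_pages = {1: (0.0, 2.0), 2: (1.0, 3.5), 3: (2.0, 4.0)}
print(max_latency(demo_pages), throughput(demo_pages))
print(meets_requirements(demo_pages, latency_bound=3.0, throughput_bound=0.5))
```

A formal specification has the advantage that the same property can be stated once and then verified automatically against every new trace, whether it comes from a model or from the real system.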
Trace has been developed in collaboration with ASML and Océ. The tool is (or has been) applied in several ESI research projects at Océ, Thales, Daf and Philips Lighting (Signify). ASML has adopted it in one part of its design workflow.