# Arm® Streamline CLI Tools User Guide

_Non-confidential. Copyright © 2024 Arm Limited._

## 1 Introduction

Arm Neoverse™ CPUs provide cloud and server workloads with an energy efficient
computing platform. These systems give high application performance and an
excellent operational price-performance ratio. To maximize performance, you can
tune your software for the underlying hardware. To do this effectively, you
need high quality performance data from the hardware, and performance analysis
tooling to capture and interpret it.

The Streamline CLI tools are native command-line tools that are designed to run
directly on an Arm server running Linux. The tools provide a software profiling
methodology that gives you clear and actionable performance data. You can use
this data to guide the optimization of the heavily used functions in your
software. Profiling with these tools is a three-step process:

* Capture the raw sampled data for the profile.
* Analyze the raw sampled data to create a set of pre-processed
  function-attributed performance counters and metrics.
* Format the pre-processed metrics into a pretty-printed human-readable form.

[Section 2](#concepts) of this guide introduces the fundamental concepts that
you need to successfully use the top-down methodology that the tools provide.

[Section 3](#using-tools) of this guide explains how you can use the tools to
capture, analyze, and format profiling data.

[Section 4](#custom-formats) of this guide explains how you create custom
format definitions that are used for custom pretty-printed data visualization.

[Section 5](#troubleshooting) of this guide provides troubleshooting advice for
commonly encountered deployment issues.

[Section 6](#resources) of this guide provides links to further reading for
Neoverse performance analysis, performance counters, and software optimization
guides.

### 1.1 Audience

This document is written for application developers who want to profile and
optimize software for Arm Neoverse-based systems. It assumes that you are
familiar with general computer architecture concepts, but does not assume any
detailed knowledge of Arm Neoverse CPUs.

Detailed profiling and counter guides exist for low-level developers, such as
compiler engineers, who are interested in tuning code generation for a specific
product microarchitecture.

<a id="concepts"></a>

## 2 Concepts

This section explains the essential concepts that you need to understand to
optimize software for a Neoverse CPU, and some useful background on the
analysis approach used by the Streamline CLI Tools.

### 2.1 Performance analysis

A simple formula for understanding the performance of a software application
is:

> Delivered performance = Utilization × Efficiency × Effectiveness

_Utilization_ measures the proportion of the total processor execution capacity
that is spent processing instructions. This is a measure of the hardware
performance.

_Efficiency_ measures the proportion of the used processor execution capacity
that is spent processing useful instructions, and not instructions that are
speculatively executed and then discarded. This is a measure of the hardware
performance.

_Effectiveness_ measures the implementation efficiency of the software
algorithm, compared to a hypothetical optimal implementation. This is a measure
of the software performance.

To get the best performance you must implement an effective software algorithm,
and then achieve high processor utilization and execution efficiency when
running it.
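
As a worked illustration of the formula above, the following Python sketch multiplies the three factors together. The function name and the sample values are hypothetical, and are chosen only to show how the terms combine.

```python
def delivered_performance(utilization: float, efficiency: float,
                          effectiveness: float) -> float:
    """Return the delivered fraction of peak performance.

    Each argument is a fraction between 0.0 and 1.0.
    """
    for name, value in (("utilization", utilization),
                        ("efficiency", efficiency),
                        ("effectiveness", effectiveness)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return utilization * efficiency * effectiveness


# Hypothetical values: the processor is busy 80% of the time, 90% of that
# work is useful, and the algorithm does twice the work of an optimal one.
print(f"Delivered: {delivered_performance(0.8, 0.9, 0.5):.0%}")
```

Because the terms multiply, a weakness in any one of them limits the overall result, even when the other two are high.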

### 2.2 Abstract CPU model

The processing core of a modern Arm CPU is represented in this methodology as
an abstract model consisting of 3 major phases.

![An abstract CPU block diagram](docs_images/abstract-cpu-pipeline.svg)

We define the available performance of the core using the maximum number of
micro-operations (micro-ops) that can be issued to the backend each clock
cycle. This execution width is known as the issue slot count of the processor.

This simple model does not include a lot of detail. For the purposes of
optimizing software, most of the low-level microarchitecture is not that
important because software has little control over code execution at that
level.

#### 2.2.1 Frontend

The frontend phase represents instruction fetch, decode, and dispatch. This
phase handles fetching instructions from the instruction cache, decoding those
instructions, and adding the resulting micro-ops to the backend execution
queues.

Each CPU frontend microarchitecture exposes a fixed number of decode slots that
can decode instructions into micro-ops each cycle. The main goal of the
frontend is to keep these decode slots busy decoding instructions, unless
there is back-pressure from the backend queues because the backend is unable
to accept new micro-ops.

The frontend also implements support for branch prediction and speculative
execution. Predicting where program control flow goes next allows the frontend
to keep the backend queues filled with work when execution is uncertain.
However, incorrect predictions cause the cancellation of issued micro-ops on
the wrong path, and the pipeline might take time to refill with new micro-ops.

#### 2.2.2 Backend

The backend phase represents the execution of micro-ops by the processing
pipelines inside the core. There are multiple pipeline types, each of which can
process a subset of the instruction set and is fed by its own issue queue.

An application will have uneven loading on the backend queues. Queue load
depends on the instruction mix in the part of the code that is currently
running. When optimizing your application, try to prioritize changes that will
relieve pressure on the most heavily loaded queue.

#### 2.2.3 Retire

The retire phase represents the resolution of micro-ops that are
architecturally complete. Measuring the number of retired instructions gives a
metric showing the amount of useful work completed by the processor.

Not all issued instructions will retire. Speculatively issued micro-ops will be
cancelled if they are shown to be on the wrong code path and are therefore not
required.

### 2.3 Profiling goals

Using the abstract CPU model defined in the previous section, we can define
some optimization goals, and associate these with behaviors in the three
pipeline stages.

The main goal of our performance analysis methodology is to attribute unused or
unneeded slot issues to specific causes, to give you feedback about what is
causing your software to run slowly.

#### 2.3.1 Retiring performance

The _Retiring_ metric defines the percentage of the slot capacity that is doing
useful work. This metric provides the hardware-centric “Utilization ×
Efficiency” measure that we proposed earlier, representing two of the three
aspects of software performance. Your goal when optimizing for the hardware is
to make this number as high as possible, which indicates the best usage of the
available processing resources.

A high retiring metric means only that you are using the available hardware
efficiently. There might still be optimization opportunities in the software
that improve the effectiveness of your algorithms, and you can use other
hardware metrics to guide this work.

Optimizations to consider for software that has a high retiring rate include:

* Reducing redundant processing in the algorithm.
* Reducing redundant data movement in the algorithm.
* Vectorizing heavily used functions that have not been vectorized.

#### 2.3.2 Frontend performance

The role of the frontend is to issue micro-ops fast enough to keep the backend
queues filled with work. Software is described as frontend bound when the
frontend cannot issue a micro-op when there is free space in the backend queue
to accept one. The _Frontend bound_ metric defines the percentage of slot
capacity lost to frontend stalls.

Consider the following optimizations for software that is frontend bound:

* Reducing code size.
* Improving the memory locality of instruction accesses.

#### 2.3.3 Bad speculation

In addition to instruction decode stalls, some percentage of the available
issue capacity is wasted on cycles that are used either recovering from
mispredicted branches, or executing speculative micro-ops that were
subsequently cancelled. The _Bad speculation_ metric defines the percentage of
slot capacity lost to these effects.


Consider the following optimizations for software that is suffering from bad
speculation:

* Improving predictability of branches.
* Converting unpredictable branches into conditional select instructions.

#### 2.3.4 Backend performance

Backend pipelines can stall, making the issue queue unable to accept a new
micro-op. This occurs due to the presence of a slow multi-cycle operation,
or a stalling effect such as a cache miss. Software is described as backend
bound when the backend queue cannot accept a micro-op when the frontend has one
ready to issue. The _Backend bound_ metric defines the percentage of slot
capacity lost to these effects.

Consider the following optimizations for software that is backend bound:

* Reducing the size of application data structures and data types.
* Improving the memory locality of data accesses.
* Reducing use of slow multi-cycle instructions.
* Swapping instructions to move work away from issue queues that are under the
  most load.
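
The four metrics described in this section divide the total slot capacity between them. The following Python sketch shows that accounting under a simplifying assumption: that you already have a slot count for each category. The function name and the slot counts are hypothetical. The tools derive the real metrics from hardware performance counters, which this sketch does not model.

```python
def topdown_breakdown(retiring: int, frontend_bound: int,
                      bad_speculation: int, backend_bound: int) -> dict:
    """Convert per-category slot counts into percentages of slot capacity."""
    slots = {
        "Retiring": retiring,
        "Frontend bound": frontend_bound,
        "Bad speculation": bad_speculation,
        "Backend bound": backend_bound,
    }
    total = sum(slots.values())
    if total <= 0:
        raise ValueError("total slot count must be positive")
    return {name: 100.0 * count / total for name, count in slots.items()}


# Hypothetical sample: 8000 slots observed in total.
for name, percent in topdown_breakdown(3200, 1600, 800, 2400).items():
    print(f"{name:16s} {percent:5.1f}%")
```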

### 2.4 Performance counters

Arm CPUs include a Performance Monitoring Unit (PMU) that measures instances of
low-level execution events occurring in the hardware. These measurements are
useful for multiple purposes:

* Counting instructions or cycles is useful for sizing a workload.
* Counting SIMD vector instructions is useful for identifying whether a
  workload is taking advantage of the available hardware acceleration.
* Counting branch mispredictions or cache misses is useful for identifying
  whether a workload is triggering specific performance pathologies.

To make performance analysis easier, Arm has defined a standardized performance
analysis methodology for the Neoverse CPUs. This methodology defines a common
set of hardware performance counters, and how to use them to derive the
higher-level metrics that enable you to optimize your applications.

#### 2.4.1 The top-down methodology

The top-down methodology provides a systematic way to use performance counter
data to identify performance problems in an application.

The methodology describes performance using a simple hierarchical tree of
performance metrics. The basic metrics described for the abstract model provide
the root nodes of the tree. Additional levels of hierarchy below each node
provide a more detailed breakdown for causal analysis.

![The major top-down metrics](docs_images/topdown-tree.svg)

This hierarchical approach, with clear causal metrics, provides an intuitive
way to find and understand the microarchitecture-sensitive performance issues
that your software is triggering. Using this information, you can target the
problem with specific corrective actions to improve the performance.

One of the major usability benefits of the top-down methodology for software
developers is that the first few levels of the top-down tree do not require any
knowledge of the specific CPU you are running on. You can profile on any of the
Neoverse CPUs and get the same metrics, despite differences in the underlying
hardware design. This lets you focus on your software and improving its
performance, instead of worrying about which event to capture on a specific
CPU.

The deeper levels of the tree become increasingly hardware specific, which is
useful for developers who want to optimize very deeply for a specific
microarchitecture. For most common software optimizations these levels are not
necessary.

#### 2.4.2 Stall metrics

The most common causes of stalls are cache misses and branch mispredictions. To
make it easier to understand the impact of stalls, two forms of miss metrics
are given:

* _Miss rate_ metrics tell you the percentage of misses for that specific
  operation type. These metrics tell you how effectively a particular cache
  or prediction unit is performing.
* _Misses per thousand instructions_ (MPKI) metrics tell you how many misses of
  that type occurred, on average, when running 1000 instructions of any type.
  These metrics tell you how significant the impact of a particular type of
  miss is, given the instruction makeup of the program.

For example, you measure a _Branch mispredict rate_ of 45% when profiling,
which tells you that 45% of branches are mispredicted. This is a clear sign
that the branch predictor is struggling, so improving branches can be an
optimization candidate. However, when you check the _Branch MPKI_ metric you
see that you only have 0.8 mispredictions for every 1000 instructions in the
sample. Even though branches are not predicting well, optimizing will not bring
significant improvements because branches are only a small proportion of the
instruction mix.
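
The two forms of metric can be sketched in Python as follows. The function names and the raw event counts are hypothetical, and are chosen to reproduce the 45% mispredict rate and 0.8 MPKI used in the example above.

```python
def miss_rate(misses: int, operations: int) -> float:
    """Percentage of operations of one type that missed."""
    return 100.0 * misses / operations


def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand instructions of any type."""
    return 1000.0 * misses / instructions


# Hypothetical counts for one function in a profile.
instructions = 2_250_000
branches = 4_000
mispredicts = 1_800

print(f"Branch mispredict rate: {miss_rate(mispredicts, branches):.1f}%")
print(f"Branch MPKI: {mpki(mispredicts, instructions):.2f}")
```

In this sample, fewer than 0.2% of the instructions are branches, which is why a poor mispredict rate still produces a low MPKI.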

#### 2.4.3 Function attribution

The top-down metrics provide a systematic approach to identifying performance
problems in your software, but this is only actionable feedback if the metrics
are associated with a specific location in the running program.

The Streamline CLI Tools implement function-attributed metrics by measuring the
performance counters over a small sample window of just a few hundred cycles.
This allows the tool to see the useful function-frequency signals in the
performance counter data that are lost with traditional 1 ms periodic sampling.

To reduce the volume of data produced, our approach uses a strobing sampling
pattern with an uneven mark-space ratio. For example, we capture data for a 200
cycle window, but only do so once every 2 million cycles. This approach gives
us the high frequency data visibility that we need for function-attribution,
while keeping a low probe-effect on the running application and a manageable
profile data size.
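
To make the mark-space ratio concrete, the following Python sketch computes the fraction of cycles that are sampled for the 200-cycle window and 2-million-cycle period used in the example above. The function names are hypothetical, and the 2.5 GHz clock frequency is an assumed value used only for illustration.

```python
def strobe_duty_cycle(window_cycles: int, period_cycles: int) -> float:
    """Fraction of all cycles that fall inside a sample window."""
    return window_cycles / period_cycles


def windows_per_second(clock_hz: float, period_cycles: int) -> float:
    """Number of sample windows opened per second on one core."""
    return clock_hz / period_cycles


duty = strobe_duty_cycle(200, 2_000_000)
rate = windows_per_second(2.5e9, 2_000_000)  # assumed 2.5 GHz core clock
print(f"Sampled fraction of cycles: {duty:.4%}")
print(f"Sample windows per second: {rate:.0f}")
```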

**Note:** Support for strobing counter sample windows is a new capability for
the Linux Perf kernel driver, which is not yet available upstream. A kernel
patch is provided to implement this functionality.

### 2.5 Arm Statistical Profiling Extension

Arm CPUs can support the Statistical Profiling Extension (SPE), which adds
support for hardware-based instruction sampling.

When using SPE, the hardware triggers a sample after a configurable number of
micro-ops. It writes the sample data directly into a memory buffer without any
software involvement. This sampling is not invasive to the running program,
until software is needed to process a full memory buffer.

Each sample contains the program counter (PC) of the sampled operation, and
additional operation-specific event data. This event data provides additional
feedback about the execution of that operation. For example:

* For branch samples, the event data indicates if the branch was mispredicted.
* For load samples, the event data indicates which cache returned the data.

SPE provides a complementary technology to the traditional performance
counters, and the best results can be achieved by using both together.

<a id="using-tools"></a>

## 3 Using the Streamline CLI tools

The Streamline CLI tools are native command-line tools that are designed to run
directly on an Arm server running Linux.

Profiling with these tools is a three-step process:

* Use `sl-record` to capture the raw sampled data for the profile.
* Use `sl-analyze` to pre-process the raw sampled data to create a set of
  function-attributed counters and metrics.
* Use `sl-format.py` to pretty-print the function-attributed metrics in a more
  human-readable form.

![Streamline CLI tools workflow](docs_images/streamline-cli-workflow.svg)

### 3.1 Checking system compatibility

Before you begin, you can use the Arm Sysreport utility to determine whether
your system configuration supports hardware-assisted profiling.

Follow the instructions in this [Learning Path tutorial][1] to discover how to
download and run this utility.

[1]: https://learn.arm.com/learning-paths/servers-and-cloud-computing/sysreport/

The `perf counters` entry in the generated report will indicate how many CPU
counters are available. The `perf sampling` entry will indicate if SPE is
available.

You will achieve the best profiles in systems with at least 6 available CPU
counters and SPE.

The Streamline CLI tools can be used in systems with no CPU counters, but will
only be able to return a basic hot-spot profile based on time-based sampling.
No top-down methodology metrics will be available.

The Streamline CLI tools can give top-down metrics in systems with as few as 3
available CPU counters. The effective sample rate for each metric will be
lower, because we will need to time-slice the counters to capture all of the
requested metrics, so you will need to run your application for longer to get
the same number of samples for each metric. Metrics that require more input
counters than are available cannot be captured.
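
As a rough planning aid, the following Python sketch estimates how much longer you might need to run when counters are time-sliced. It assumes that the requested counters are split into groups that take equal turns on the hardware. This is a simplification for illustration, not a description of how the tools schedule counters, and the function name and counter totals are hypothetical.

```python
import math


def capture_time_multiplier(counters_needed: int,
                            counters_available: int) -> int:
    """Estimate how many times longer to run when counters are time-sliced.

    Assumes equal-sized counter groups sharing the hardware in turn.
    """
    if counters_available <= 0:
        raise ValueError("no CPU counters available")
    if counters_needed <= 0:
        raise ValueError("at least one counter must be requested")
    return math.ceil(counters_needed / counters_available)


# Hypothetical configuration needing 12 counters in total.
print(capture_time_multiplier(12, 6))  # two groups
print(capture_time_multiplier(12, 3))  # four groups
```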

The Streamline CLI tools can be used without SPE. Load operation data source
metrics will not be available, and branch mispredict metrics may be less
accurate.

### 3.2 Installing the tools

The Streamline CLI tools are available as a standalone download to enable
easier integration into server workflows. To download the latest version of
the tool and extract it to the current working directory you can use our
download utility script:

```sh
wget https://artifacts.tools.arm.com/arm-performance-studio/Streamline_CLI_Tools/get-streamline-cli.py
python3 get-streamline-cli.py install
```

The script can also be used to download a specific version, or install to a
user-specified directory:

```sh
# To list all available versions
python3 get-streamline-cli.py list

# To download, but not install, a specific version
python3 get-streamline-cli.py download --tool-version <version>

# To download and install a specific version
python3 get-streamline-cli.py install --tool-version <version>

# To download and install to a specific directory
python3 get-streamline-cli.py install --install-dir <path>
```

For manual download, you can find all available releases here:

```sh
https://artifacts.tools.arm.com/arm-performance-studio/Streamline_CLI_Tools/
```

The `sl-format.py` Python script requires Python 3.8 or later, and depends on
several third-party modules. We recommend creating a Python virtual environment
containing these modules to run the tools. For example:

```sh
# From Bash
python3 -m venv sl-venv
source ./sl-venv/bin/activate

# From inside the virtual environment
python3 -m pip install -r ./<install>/bin/requirements.txt
```

**Note:** The instructions below assume you have added the `<install>/bin/`
directory to your `PATH` environment variable, and that you run all Python
commands from inside the virtual environment.

### 3.3 Applying the kernel patch

For best results we provide a Linux kernel patch that modifies the behavior of
Linux perf to improve support for capturing function-attributed top-down
metrics on Arm systems. This patch provides two new capabilities:

* It allows a new thread to inherit the perf counter group configuration of
  its parent.
* It decouples the perf event-based sampling window size from the overall
  sample rate. This allows strobed mark-space sampling patterns where the tool
  can capture a small window without using a high sample rate.

Without the patch it is possible to capture profiles. However, not all capture
options are available and capturing top-down metrics will rely on high
frequency sampling. The following options are available:

* System-wide profile with top-down metrics.
* Single threaded application profile with top-down metrics.
* Multi-process/thread application profile without top-down metrics.

With the patch applied it is possible to collect the following profiles:

* System-wide profile with top-down metrics.
* Single threaded application profile with top-down metrics.
* Multi-process/thread application profile with top-down metrics.

The following instructions show how to install the patch on Amazon Linux 2023.
You may need to adapt slightly to other Linux distributions.

#### 3.3.1 Manual application to the source tree

To apply the patch to the latest 6.7 kernel, you can use `git`:

```sh
git apply v6.7-combined.patch
```

or `patch`:

```sh
patch -p 1 -i v6.7-combined.patch
```

#### 3.3.2 Manual application to an RPM-based distribution

Follow these steps to integrate these patches into an RPM-based distribution's kernel:

* Install the RPM build tools:

  `sudo yum install rpm-build rpmdevtools`

* Remove any existing `rpmbuild` directory (rename as appropriate):

  `rm -fr rpmbuild`

* Fetch the kernel sources:

  `yum download --source kernel`

* Install the source RPM:

  `rpm -i kernel-<VERSION>.src.rpm`

* Enter the `rpmbuild` directory that is created:

  `cd rpmbuild`

* Copy the patch into the correct location. Replace the 9999 patch number with
  the next available patch number in the sequence:

  `cp vX.Y-combined.patch SOURCES/9999-strobing-patch.patch`

* Open the specs file in your preferred editor:

  `nano SPECS/kernel.spec`

* Search for the list of patches starting with `Patch0001` and append the line
  for the new patch to the end of the list. Replace 9999 with the patch number
  used earlier:

  `Patch9999: 9999-strobing-patch.patch`

* Search for the list of patch apply steps starting with `ApplyPatch` and
  append the line for the new patch to the end of the list. Replace 9999 with
  the patch number used earlier:

  `ApplyPatch 9999-strobing-patch.patch`

* Save the changes and exit the editor.

* Install the build dependencies:

  `sudo dnf builddep SPECS/kernel.spec`

* Build the kernel and other RPMs:

  `rpmbuild -ba SPECS/kernel.spec`

* Install the built packages:

  `sudo rpm -ivh --force RPMS/aarch64/*.rpm`

* Reboot the system:

  `sudo reboot`

* Validate that the patch applied correctly:

  `ls -l /sys/bus/event_source/devices/*/format/strobe_period`

  This should list at least one CPU PMU device supporting the strobing
  features, for example:

  `/sys/bus/event_source/devices/armv8_pmuv3_0/format/strobe_period`.
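
If you want to run the same validation from a script, the check could be sketched in Python as follows. The function name is hypothetical. The sysfs path is the one used by the `ls` command above.

```python
from pathlib import Path


def find_strobing_pmus(sysfs_root: str = "/sys/bus/event_source/devices") -> list:
    """Return the names of PMU devices that expose a strobe_period entry."""
    root = Path(sysfs_root)
    matches = root.glob("*/format/strobe_period")
    # Each match is <device>/format/strobe_period, so step up two levels.
    return sorted(path.parent.parent.name for path in matches)


pmus = find_strobing_pmus()
if pmus:
    print("Strobing supported by:", ", ".join(pmus))
else:
    print("No PMU device with strobing support was found.")
```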

### 3.4 Building your application

Before you can capture a software profile you must build your application with
debug information. This enables the profiler to map instruction addresses back
to specific functions in your source code. For C and C++ you do this by passing
the `-g` option to the compiler.

Arm recommends that you profile an optimized release build of your application,
as this ensures you are profiling a realistic code workload. For C and C++ you
do this by passing the `-O2` or `-O3` option to the compiler. However, we also
recommend that you disable invasive optimization techniques, such as link-time
optimization (LTO), because they heavily restructure the code and make the
profile difficult to understand.

### 3.5 Capturing a profile

Use `sl-record` to capture a raw profile of your application and save the data
to a directory on the filesystem.

Arm recommends making a profile of at least 20 seconds in duration, which
ensures that the profiler can capture a statistically significant number of
samples for all of the metrics.

```sh
sl-record -C workflow_topdown_basic -o <profile.apc> -A <your app command-line>
```

This command uses the following options:

* The `-C` option provides a comma-separated list of counters and metrics to
  capture. The workflow-prefixed options in the counter list select a
  predefined group of counters and metrics, making it easier to select
  everything you need for a standard configuration. Using
  `workflow_topdown_basic` is a good baseline option to start with.

  To list all of the available counters and metrics for the current machine,
  use the command `sl-record --print counters`.
* The `-o` option provides the output directory for the capture data. The
  directory must not already exist because it is created by the tool when
  profiling starts.
* The `-A` option provides the command-line for the user application. This
  option must be the last option provided to `sl-record` because all subsequent
  arguments are passed to the user application.

Optionally, to enable SPE add the `-X workflow_spe` option. Enabling SPE
significantly increases the amount of data captured and the `sl-analyze`
processing time.

Captures are highly customizable, with many different options that allow you to
choose how to profile your application. Use the `--help` option to see the
full list of options for customizing your captures.
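
When you drive captures from a script, the command line described above could be assembled as in the following Python sketch. The helper name and the `./my_app` command are hypothetical. The sketch uses only the `-C`, `-o`, `-X`, and `-A` options described in this section, and it builds the argument list without running it.

```python
import shlex
from pathlib import Path


def build_record_command(output_dir, app_command,
                         counters=("workflow_topdown_basic",),
                         enable_spe=False):
    """Build an sl-record argument list for an application profile."""
    if Path(output_dir).exists():
        # sl-record creates the output directory, so it must not exist yet.
        raise FileExistsError(f"{output_dir} already exists")
    if not app_command:
        raise ValueError("app_command must not be empty")
    command = ["sl-record", "-C", ",".join(counters), "-o", str(output_dir)]
    if enable_spe:
        command += ["-X", "workflow_spe"]
    # -A must be last: everything after it is passed to the application.
    command += ["-A", *app_command]
    return command


# Hypothetical application command line.
command = build_record_command("profile.apc", ["./my_app", "--size", "1000"])
print(shlex.join(command))
```

You can pass the resulting list to `subprocess.run()` to start the capture.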



#### 3.5.1 Capturing a system-wide profile

To capture a system-wide profile, which captures all processes and threads,
run with the `-S yes` option and omit the `-A ...` application-specific
option and following arguments.

In systems without the kernel patches, system-wide profiles can capture the
top-down metrics. To keep the captures to a usable size, it may be necessary
to limit the duration of the profiles to less than 5 minutes.

#### 3.5.2 Capturing top-down metrics without the kernel patches

To capture top-down metrics in a system without the kernel patches there are
three options available.

To capture a system-wide profile, which captures all processes and threads, run
with the `-S yes` option and omit the `-A ...` application-specific option and
following arguments. To keep the captures to a usable size, it may be necessary
to limit the duration of the profiles to less than 5 minutes.

To reliably capture a single-threaded application profile, add the `--inherit no`
option to the command line. However, in this mode metrics are only captured for
the first thread in the application process and any child threads or processes
are ignored.

For multi-threaded applications, the tool provides an experimental option,
`--inherit poll`, which uses polling to spot new child threads and inject the
instrumentation. This allows metrics to be captured for a multi-threaded
application, but has some limitations:

* Short-lived threads may not be detected by the polling.
* Attaching perf to new threads without inherit support requires many new
  file descriptors to be created per thread. This can result in the application
  failing to open files because the process reaches its open file descriptor
  limit.

#### 3.5.3 Minimizing profiling application impact

The `sl-record` application requires some portion of the available processor
time to capture the data and prepare it for storage. When profiling a system
with a high number of CPU cores, Arm recommends that you leave a small number
of cores free so that the profiler can run in parallel without impacting the
application. You can achieve this in two different ways:

* Running an application with fewer threads than the number of cores available.
* Running the application under `taskset` to limit the number of cores that the
  application can use. You must only `taskset` the application, not
  `sl-record`, for example:

```sh
sl-record -C … -o … -A taskset <core_mask> <your app command-line>
```
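
The `<core_mask>` value is a bitmask with one bit per core. The following Python sketch computes a hexadecimal mask that keeps the application on the lowest-numbered cores and leaves the highest-numbered cores free for the profiler. The function name, the choice of which cores to reserve, and the core counts are all assumptions for illustration.

```python
def application_core_mask(total_cores: int, reserved_cores: int) -> str:
    """Return a hex affinity mask that leaves the top cores unused."""
    if not 0 <= reserved_cores < total_cores:
        raise ValueError("reserved_cores must be less than total_cores")
    usable = total_cores - reserved_cores
    # Set one bit for each of the lowest-numbered usable cores.
    return hex((1 << usable) - 1)


# Hypothetical 64-core server, keeping 4 cores free for sl-record.
print(application_core_mask(64, 4))
```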

**Note:** The number of samples made is independent of the number of counters
and metrics that you enable. Enabling more counters reduces the effective
sample rate per counter, and does not significantly increase the performance
impact that capturing has on the running application.

### 3.6 Analyzing a profile

Use `sl-analyze` to process the raw profile of your application and save the
analysis output as several CSV files on the filesystem.

```sh
sl-analyze --collect-images -o <output_dir> <profile.apc>
```

This command uses the following arguments:

* The `--collect-images` option instructs the tool to assemble all of the
  referenced binaries and split debug files required for analysis. The files
  are copied and stored inside the .apc directory, making them ready for
  analysis.
* The `-o` option provides the output directory for the generated CSV files.
* The positional argument is the raw profile directory created by `sl-record`.

Several CSV files are generated by this analysis:

* Files that start `functions-`: A flat list of functions, sorted by cost,
  showing per-function metrics.
* Files that start `callpaths-`: A hierarchical list of function call paths in
  the application, showing per-function metrics for each function per call path
  location.
* Files that end `-bt.csv`: Results from the analysis of the software-sampled
  performance counter data, which can include back-traces for each sample.
* Files that end `-spe.csv`: Results from the analysis of the hardware-sampled
  Statistical Profiling Extension (SPE) data. SPE data does not include call
  back-trace information.
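
If you script the workflow, you can locate the function reports by using the naming patterns listed above. The following Python sketch is one way to do this. The function name and the `analysis_out` directory are hypothetical.

```python
from pathlib import Path


def find_function_reports(output_dir):
    """Return (bt_files, spe_files) found in an sl-analyze output directory."""
    out = Path(output_dir)
    bt_files = sorted(out.glob("functions-*-bt.csv"))
    spe_files = sorted(out.glob("functions-*-spe.csv"))
    return bt_files, spe_files


bt_files, spe_files = find_function_reports("analysis_out")
print(f"Found {len(bt_files)} counter report(s) and "
      f"{len(spe_files)} SPE report(s)")
```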

### 3.7 Formatting a function profile

The function profile CSV files generated by `sl-analyze` contain all the
enabled events and metrics, for all functions that were sampled in the profile.

Use `sl-format.py` to generate a simpler pretty-printed XLSX spreadsheet that
is suitable for human consumption.

```sh
python3 sl-format.py -o <output.xlsx> --bt-file <functions-*-bt.csv> [--spe-file <functions-*-spe.csv>]
```

This command uses the following arguments:

* The `-o` option provides the output file path to save the XLSX file to.
* [optional] The `--bt-file` argument is the `functions-*-bt.csv` file created by
  `sl-analyze`.
* [optional] The `--spe-file` argument is the `functions-*-spe.csv` file created by
  `sl-analyze`.
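
The following Python sketch assembles this command line from the arguments described above. The helper name and the file names are hypothetical, and the rule that at least one input file must be given is an assumption of this sketch rather than documented tool behavior.

```python
import shlex


def build_format_command(output_xlsx, bt_file=None, spe_file=None):
    """Build an sl-format.py argument list without running it."""
    if bt_file is None and spe_file is None:
        # Assumption: formatting needs at least one input CSV file.
        raise ValueError("provide bt_file, spe_file, or both")
    command = ["python3", "sl-format.py", "-o", str(output_xlsx)]
    if bt_file is not None:
        command += ["--bt-file", str(bt_file)]
    if spe_file is not None:
        command += ["--spe-file", str(spe_file)]
    return command


# Hypothetical file names.
command = build_format_command("report.xlsx",
                               bt_file="analysis_out/functions-app-bt.csv")
print(shlex.join(command))
```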

This formatter has several basic capabilities:

* Selecting and ordering the desired metrics columns.
* Filtering out low-value function rows by absolute or relative significance.
* Formatting metrics columns using short names for compact presentation.
* Formatting metrics cell colors using threshold rules to spotlight bad values.
* Emitting the data as an XLSX data table, allowing interactive column sorting
  and row filtering when opened in OpenOffice or Microsoft Excel.

[Section 4](#custom-formats) of this guide explains how you create and specify
custom format definitions that are used to change the pretty-printed data
visualization.

### 3.8 Using a formatted function profile

There is no right way to profile and optimize, but the top-down data
presentation gives you a systematic way to find optimization opportunities.

Here is our optimization checklist:

Check that the compiler did a good job:

* Disassemble your most significant functions.
* Verify that the generated code looks efficient.

Check the functions that are the most frontend bound:

* If you see high instruction cache miss rate, apply profile-guided
  optimization to reduce the code size of less important functions. This frees
  up more instruction cache space for the important hot-functions.
* If you see high instruction TLB misses, apply code layout optimization,
  using tools such as [Bolt][2]. This improves locality of code accesses,
  reducing the number of TLB misses.

[2]: https://learn.arm.com/learning-paths/servers-and-cloud-computing/bolt/overview/

Check the functions that have the highest bad speculation rate:

* If you see high branch mispredict rates, use a more predictable branching
  pattern, or change the software to avoid branches by using conditional
  selects.

Check the functions that are the most backend bound:

* If you see high data cache misses, reduce data size, reduce data copies and
  moves, and improve access locality.
* If you see high pipeline congestion on a specific issue queue, alter your
  software to move load to a different queue. For example, convert run-time
  computation to a lookup table if your program is arithmetic limited.

Check the most retiring bound functions:

* Apply SIMD vectorization to process more work per clock.
* Look for higher-level algorithmic improvements.
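
The checklist can also be expressed as a simple lookup, as in the following Python sketch. It picks the largest of the four top-down metrics for a function and returns the matching suggestions, paraphrased from the checklist above. The names and the sample percentages are hypothetical, and choosing only the single largest metric is a simplification.

```python
# Suggestions paraphrased from the checklist, keyed by top-down metric.
CHECKLIST = {
    "Frontend bound": ["Reduce code size of less important functions",
                       "Improve code layout to reduce instruction TLB misses"],
    "Bad speculation": ["Use more predictable branching patterns",
                        "Replace branches with conditional selects"],
    "Backend bound": ["Reduce data size, copies, and moves",
                      "Improve data access locality",
                      "Move load away from congested issue queues"],
    "Retiring": ["Apply SIMD vectorization",
                 "Look for higher-level algorithmic improvements"],
}


def dominant_metric(metrics: dict) -> str:
    """Return the name of the top-down metric with the largest percentage."""
    return max(CHECKLIST, key=lambda name: metrics.get(name, 0.0))


# Hypothetical top-down percentages for one function.
sample = {"Retiring": 25.0, "Frontend bound": 15.0,
          "Bad speculation": 10.0, "Backend bound": 50.0}
focus = dominant_metric(sample)
print(focus)
for suggestion in CHECKLIST[focus]:
    print(" -", suggestion)
```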

### 3.9 Data caveats

The Streamline CLI tools provide you with function-attributed performance
metrics. To implement this using the Arm PMU, we take an interrupt at the start
of the sample window to zero the counters, and at the end of the sample
window to capture the counters.

This pair of context switches has an overhead on the running software. The
absolute value of some metrics can differ from the value that would be reported
if our sampling were non-invasive. However, functions will rank correctly, and
trends are directionally accurate when showing the impact of an optimization.

Our methodology has three known side-effects that impact the metrics:

* At the start of the sample window, it takes some cycles to refill the
  pipeline when returning from the context switch. This means we retire fewer
  instructions in the sample window than normal steady-state execution.
* At the end of the sample window, issued instructions that are queued in the
  issue queue are cancelled to reduce the context switch latency. This means we
  see a much higher number of instructions that are speculatively issued but
  not retired than normal steady-state execution.
* The kernel code run at the start of the sample window places higher pressure
  on caches and other cache-like structures. However, for most software the
  impact of this is minor.

#### 3.9.1 Impacted metrics

We are aware of the following impact on the default top-down metrics shown in
the formatted report:

| Top-down metric | Impact                      |
| --------------- | --------------------------- |
| Retiring        | Reports lower than normal   |
| Frontend bound  | Reports higher than normal  |
| Bad speculation | Reports higher than normal  |
| Backend bound   | Reports lower than normal   |

<a id="custom-formats"></a>

## 4 Custom sl-format.py reports

The Streamline CLI analysis tool, `sl-analyze`, outputs a raw CSV file
containing all of the profiling metrics that were generated. For a complex
application, this is large and difficult to use for manual review.

The `sl-format.py` script provides a method to extract a filtered subset of the
data to an XLSX spreadsheet, including generation of interactive tables and
custom cell formatting. The presentation format is specified using a YAML
configuration file, allowing easy reconfiguration of the visualization.

The script requires a backtrace-based profile, an SPE-based profile, or both.
If both profiles are used, the script combines their results in the output
file.

### 4.1 Passing a custom configuration

Optionally, you can pass a custom configuration file using the `--config`
argument.

```sh
python3 sl-format.py -o <out.xlsx> --bt-file <functions-*-bt.csv> [--config <conf.yaml>]
```

If no configuration is specified, a default presentation suitable for a profile
recorded using `-C workflow_topdown_basic` is used.

### 4.2 Configuration syntax

The configuration file is a YAML file containing an ordered list of metrics.
Each metric is presented as a column in the output table, with each identified
application function in the source application as a row.

#### 4.2.1 Basic syntax

Each metric must specify the data `src_name`, which is the column title in the
input CSV file. The metrics can optionally specify the `dst_name`, which is the
column name to use in the XLSX output. If no `dst_name` is specified, the
`src_name` is used.

```yaml
---
- series:
    src_name: symbol
    dst_name: Function
- series:
    src_name: "Metrics: Sample Count"
    dst_name: Samples
...
```
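
The mapping from `src_name` to `dst_name` can be pictured with a short Python
sketch. This is an illustration only, not the `sl-format.py` implementation.
The `CONFIG` list mirrors the YAML structure as plain Python data, and the
`select_columns` helper and the sample CSV text are hypothetical. The second
series deliberately omits `dst_name` to show the fallback to `src_name`.

```python
import csv
import io

# Hypothetical in-memory equivalent of a YAML configuration. The second
# series omits dst_name to show the fallback to src_name.
CONFIG = [
    {"series": {"src_name": "symbol", "dst_name": "Function"}},
    {"series": {"src_name": "Metrics: Sample Count"}},
]


def select_columns(csv_text, config):
    """Return (header, rows) holding only the configured columns, in order."""
    series = [entry["series"] for entry in config]
    # Use dst_name for the output title when present, else reuse src_name.
    header = [s.get("dst_name", s["src_name"]) for s in series]
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [[row[s["src_name"]] for s in series] for row in reader]
    return header, rows


# Made-up input data, for illustration only.
SAMPLE_CSV = "symbol,Metrics: Sample Count,Other\nmain,120,7\nwork(int),880,3\n"
header, rows = select_columns(SAMPLE_CSV, CONFIG)
print(header)
print(rows)
```

Columns that are not listed in the configuration, such as `Other` in the
made-up input, do not appear in the result.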

#### 4.2.2 Symbol name series

The raw function names (`src_name: symbol`) in the CSV include full parameter
lists, which can help to disambiguate functions in software that makes heavy
use of operator overloading. In many applications this is not necessary and
simply clutters the visualization. You can set `strip_params: true` for the
`symbol` source column to discard parameters.
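
As a rough picture of the effect of `strip_params: true`, the following Python
sketch removes the trailing parameter list from a demangled symbol name. The
`strip_params` function here is a hypothetical helper written for this
example. The exact rules that `sl-format.py` applies may differ.

```python
def strip_params(symbol):
    """Remove the trailing parameter list from a demangled symbol name."""
    end = symbol.rfind(")")
    if end == -1:
        return symbol  # No parameter list present, for example a C symbol.
    # Walk backwards from the last ')' to find its matching '('.
    depth = 0
    for i in range(end, -1, -1):
        if symbol[i] == ")":
            depth += 1
        elif symbol[i] == "(":
            depth -= 1
            if depth == 0:
                # Drop the parameter list and anything after it.
                return symbol[:i]
    return symbol  # Unbalanced parentheses, so leave the name unchanged.


for name in ("main", "ns::Foo::bar(int, char const*)", "ns::f(void (*)(int)) const"):
    print(name, "->", strip_params(name))
```

This sketch also drops trailing qualifiers such as `const`, so two overloads
that differ only in their parameters or qualifiers collapse to the same name.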

Arm recommends that the series for the function name, as well as other
string-like columns, include the `dtype: str` property. This property stops
empty cells in these columns being interpreted as a floating point NaN value.

#### 4.2.3 Symbol filtering

The raw data includes all symbols that were sampled during the profile. Many
symbols are often of low significance, with few samples compared to the overall
sample count. The formatted data can discard low significance rows to make the
data easier to use.

To enable filtering for a series, you can add the following options:

* `filter: significance`, and
* `min_row_significance: <val>` with a significance value between 0 and 1.

For example, a minimum significance of 0.01 in the "samples" column indicates
that any function with fewer than 1% of the total samples should be discarded.
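
The filtering rule can be sketched in Python as follows. This assumes that the
significance of a row is its value divided by the column total, which matches
the 1% example above but is not taken from the `sl-format.py` source. The
`filter_by_significance` helper and the sample rows are hypothetical.

```python
def filter_by_significance(rows, column, min_row_significance):
    """Keep rows whose share of the column total meets the threshold."""
    total = sum(row[column] for row in rows)
    if total == 0:
        return []  # Nothing is significant in an empty or all-zero column.
    return [row for row in rows if row[column] / total >= min_row_significance]


# Made-up profile rows, for illustration only.
ROWS = [
    {"Function": "hot_loop", "Samples": 900},
    {"Function": "helper", "Samples": 95},
    {"Function": "rare_path", "Samples": 5},
]
kept = filter_by_significance(ROWS, "Samples", 0.01)
print([row["Function"] for row in kept])
```

With a total of 1000 samples, `rare_path` holds 0.5% of the samples and is
discarded, while `helper` holds 9.5% and is kept.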

#### 4.2.4 Column highlight styles

Data columns can include basic styling rules to help highlight cells with
values that are worth investigating.

##### `style: absolute_ramp_up`

```yaml
- series:
    style: absolute_ramp_up
    min_ramp: 2
    max_ramp: 4
```

This style is based on the absolute value of each cell. It increases from no
highlight for a value below `min_ramp` to a maximum intensity highlight for a
value above `max_ramp`.

##### `style: absolute_ramp_down`

```yaml
- series:
    style: absolute_ramp_down
    min_ramp: 2
    max_ramp: 4
```

This style is based on the absolute value of each cell. It increases from no
highlight for a value above `max_ramp` to a maximum intensity highlight for a
value below `min_ramp`.
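
The two absolute styles can be sketched in Python as a highlight intensity
between 0.0 (no highlight) and 1.0 (maximum intensity). This sketch assumes
that the intensity changes linearly between the two thresholds, which the
description above does not state. The function names are hypothetical.

```python
def absolute_ramp_up(value, min_ramp, max_ramp):
    """Intensity rises from 0.0 at min_ramp to 1.0 at max_ramp."""
    position = (value - min_ramp) / (max_ramp - min_ramp)
    return min(1.0, max(0.0, position))  # Clamp outside the ramp range.


def absolute_ramp_down(value, min_ramp, max_ramp):
    """Intensity rises from 0.0 at max_ramp to 1.0 at min_ramp."""
    return 1.0 - absolute_ramp_up(value, min_ramp, max_ramp)


for value in (1, 3, 5):
    print(value,
          absolute_ramp_up(value, min_ramp=2, max_ramp=4),
          absolute_ramp_down(value, min_ramp=2, max_ramp=4))
```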

##### `style: relative_ramp_up`

```yaml
- series:
    style: relative_ramp_up
    min_ramp: 0.9
    max_ramp: 1.0
```

This style is based on the value of a cell relative to the min/max range of the
column, where a threshold of 0.0 indicates the minimum value of the column and
1.0 indicates the maximum value of the column. The example above highlights
cells in the top 10% of the column range.

It increases from no highlight for a relative value below `min_ramp` to a
maximum intensity highlight for a relative value above `max_ramp`.

##### `style: relative_ramp_down`

```yaml
- series:
    style: relative_ramp_down
    min_ramp: 0.0
    max_ramp: 0.1
```

This style is based on the value of a cell relative to the min/max range of the
column, where a threshold of 0.0 indicates the minimum value of the column and
1.0 indicates the maximum value of the column. The example above highlights
cells in the bottom 10% of the column range.

It increases from no highlight for a relative value above `max_ramp` to a
maximum intensity highlight for a relative value below `min_ramp`.
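
The relative styles first map each cell onto the 0.0 to 1.0 range of its
column, and then apply the ramp to that relative value. The following Python
sketch shows one way to picture this, assuming a linear ramp between the
thresholds. The function names and the sample column are hypothetical.

```python
def relative_value(value, column):
    """Map value onto 0.0..1.0 across the min/max range of the column."""
    low, high = min(column), max(column)
    if high == low:
        return 0.0  # A constant column has no range to rank against.
    return (value - low) / (high - low)


def clamp_ramp(position, min_ramp, max_ramp):
    """Linear ramp from 0.0 at min_ramp to 1.0 at max_ramp, clamped."""
    return min(1.0, max(0.0, (position - min_ramp) / (max_ramp - min_ramp)))


def relative_ramp_up(value, column, min_ramp, max_ramp):
    return clamp_ramp(relative_value(value, column), min_ramp, max_ramp)


def relative_ramp_down(value, column, min_ramp, max_ramp):
    return 1.0 - clamp_ramp(relative_value(value, column), min_ramp, max_ramp)


COLUMN = [0, 5, 50, 95, 100]  # Made-up metric values.
for value in COLUMN:
    print(value,
          round(relative_ramp_up(value, COLUMN, 0.9, 1.0), 3),
          round(relative_ramp_down(value, COLUMN, 0.0, 0.1), 3))
```

Because the ramp is relative to the column range, the same cell value can be
highlighted in one report and not in another.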

##### `style: stdev_ramp_up`

```yaml
- series:
    style: stdev_ramp_up
    min_ramp: 0.5
    max_ramp: 1.0
```

This style is based on the value of a cell relative to the number of standard
deviations from the mean. A threshold of N indicates a value that is N standard
deviations from the mean. The example above starts highlighting cells that are
0.5 standard deviations higher than the mean, ramping to full intensity for
cells that are 1 standard deviation higher than the mean.

It increases from no highlight for a value below `min_ramp` standard
deviations, to a maximum intensity highlight for a value above `max_ramp`
standard deviations.

##### `style: stdev_ramp_down`

```yaml
- series:
    style: stdev_ramp_down
    min_ramp: -1.0
    max_ramp: -0.5
```

This style is based on the value of a cell relative to the number of standard
deviations from the mean. A threshold of N indicates a value that is N standard
deviations from the mean. The example above starts highlighting cells that are
0.5 standard deviations lower than the mean, ramping to full intensity for
cells that are 1 standard deviation lower than the mean.

It increases from no highlight for a value above `max_ramp` standard
deviations, to a maximum intensity highlight for a value below `min_ramp`
standard deviations.
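
The standard deviation styles can be pictured with the following Python
sketch, which converts each cell to a distance from the column mean measured
in standard deviations, and then applies a ramp. The sketch assumes a linear
ramp and the population standard deviation. Neither assumption is confirmed
by the description above, and the function names are hypothetical.

```python
import statistics


def stdev_distance(value, column):
    """Return how many standard deviations value lies from the column mean."""
    spread = statistics.pstdev(column)  # Assumption: population stdev.
    if spread == 0:
        return 0.0  # A constant column has no deviation to measure.
    return (value - statistics.mean(column)) / spread


def clamp_ramp(position, min_ramp, max_ramp):
    """Linear ramp from 0.0 at min_ramp to 1.0 at max_ramp, clamped."""
    return min(1.0, max(0.0, (position - min_ramp) / (max_ramp - min_ramp)))


def stdev_ramp_up(value, column, min_ramp, max_ramp):
    return clamp_ramp(stdev_distance(value, column), min_ramp, max_ramp)


def stdev_ramp_down(value, column, min_ramp, max_ramp):
    return 1.0 - clamp_ramp(stdev_distance(value, column), min_ramp, max_ramp)


COLUMN = [2, 4, 4, 4, 5, 5, 7, 9]  # Made-up values: mean 5, stdev 2.
print(stdev_ramp_up(7, COLUMN, 0.5, 1.0))
print(stdev_ramp_down(3.5, COLUMN, -1.0, -0.5))
```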

<a id="troubleshooting"></a>

## 5 Troubleshooting

This section outlines some common problems that can be encountered when you
are deploying the tools, and their solutions.

### 5.1 Capture data size using -I poll is very large

Captured data size can be very large when capturing with `-I poll`. This
occurs because the polling mode is provided as a fallback for systems without
the kernel patches. To capture data for function-attribution without the
patches, the tool must use a very high sample rate, which increases the
captured data size.

To avoid this issue, apply the kernel patches, and run `sl-record` without the
`-I` option. The kernel patch implements strobed sampling, allowing us to
alternate between a fast "mark" window that is captured and a slow "space"
window that is skipped.

### 5.2 Capture data size using SPE is very large

SPE uses hardware sampling that allows a high sample rate. Captured data size
scales as "core count * sample rate * capture duration", which can result in
very large SPE data sets in systems with many CPU cores, using a high sample
rate, or with a long application duration.

When profiling long running applications with SPE, Arm recommends increasing
the size of the SPE sample window. Use the `-F` option to set the number of
micro-ops between samples (default 2500000).
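
Because the data size scales with the sample rate, and the sample rate is
inversely proportional to the `-F` value, you can estimate a value that keeps
the data size roughly constant as the capture duration grows. The Python
sketch below shows the arithmetic. It assumes a steady micro-op throughput
over the whole run, and the `scaled_sample_period` helper and the example
durations are hypothetical.

```python
DEFAULT_UOPS_BETWEEN_SAMPLES = 2_500_000  # Default -F value stated above.


def scaled_sample_period(base_period, base_duration_s, new_duration_s):
    """Scale the micro-ops between samples in proportion to the duration."""
    return round(base_period * new_duration_s / base_duration_s)


# Example: a capture that runs ten times longer than a 60 second baseline.
period = scaled_sample_period(DEFAULT_UOPS_BETWEEN_SAMPLES, 60, 600)
print(f"-F {period}")
```

A larger sample window collects fewer samples per second, so treat the result
as a starting point and check that the profile still has enough samples in
the functions that you care about.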

### 5.3 Capture reports file descriptor exhaustion

The application can run out of file descriptors when capturing with `-I poll`.

This occurs because the polling mode is provided as a fallback for systems
without the kernel patches. To attach Perf counter groups to each new
application thread created, we must open a new file descriptor for each counter
group for every core in the system. This operation has O(counters * cores *
threads) complexity, and requires a very large number of file descriptors on
Arm Neoverse systems which can have both high core count and high thread count.

To avoid this, apply the kernel patches, and run `sl-record` without the `-I`
option, which is equivalent to `-I inherit`. The kernel patch allows new child
threads to inherit Perf counter groups, avoiding the need to open new file
descriptors.

If it is not possible to install the kernel patches, you can increase the
number of allowed file descriptors per process:

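```sh
# ulimit is a shell builtin, so it cannot be run using sudo. Use prlimit to
# raise the soft and hard limits of the current shell (process $$) instead.
sudo prlimit --pid $$ --nofile=$((64*1024*1024)):$((64*1024*1024))
```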
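
To judge whether the current limit is likely to be a problem in polling mode,
you can compare it with the descriptor count implied by the
"counters * cores * threads" relationship described above. The Python sketch
below does this using the standard `resource` module. The
`estimate_poll_fds` helper and the example counts are hypothetical, and the
estimate ignores any descriptors that the application itself opens.

```python
import resource


def estimate_poll_fds(counter_groups, cores, threads):
    """One descriptor per counter group, per core, per application thread."""
    return counter_groups * cores * threads


# Made-up system shape, for illustration only.
needed = estimate_poll_fds(counter_groups=6, cores=128, threads=200)

soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"estimated descriptors: {needed}, current soft limit: {soft_limit}")
if soft_limit != resource.RLIM_INFINITY and needed > soft_limit:
    print("The soft limit is likely too low for polling mode.")
```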

### 5.4 Capture reports SPE Aux data missing in Streamline GUI

The Streamline GUI contains "SPE Aux data missing" markers in the Timeline view
when opening the capture. This occurs because a sample buffer overflowed before
`sl-record` was able to store the data it contained.

There are three changes you can make to mitigate this problem:

* You can reduce the SPE sample rate.
* You can increase the size of the Perf sample buffer.
* You can increase the size of the internal `sl-record` sample buffer.

To reduce the SPE sample rate, use the `-F` option to increase the number of
micro-ops between samples (default is 2500000).

To increase the size of the Perf sample buffer, you can either manually
increase the Perf mlock limit (value must be a multiple of 4):

```sh
echo <num_kb> | sudo tee /proc/sys/kernel/perf_event_mlock_kb
```

... or run `sl-record` as root, which bypasses the mlock limit.

To increase the size of the `sl-record` internal sample buffer, use the
`-Z <num_pages>` option to specify the number of 4K pages to use. A buffer of
this size is allocated per core in the system, so ensure that you leave enough
memory for your application to run on systems with many CPU cores.
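
The total memory that a given `-Z` value implies can be estimated before you
start the capture. The Python sketch below multiplies the page count by the
4K page size and the core count, as described above. The
`sl_record_buffer_bytes` helper and the example page count are hypothetical.

```python
import os

PAGE_SIZE = 4096  # The -Z option counts 4K pages.


def sl_record_buffer_bytes(num_pages, cores):
    """Total buffer memory when one buffer is allocated per core."""
    return num_pages * PAGE_SIZE * cores


cores = os.cpu_count() or 1  # Fall back to 1 if the count is unavailable.
total = sl_record_buffer_bytes(num_pages=1024, cores=cores)
print(f"-Z 1024 on {cores} cores uses about {total / (1024 * 1024):.0f} MiB")
```

For example, 1024 pages on a 64 core system is 256 MiB in total.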

### 5.5 Capture impacts application performance

Application performance is impacted by running under `sl-record` on a high core
count system.

This occurs because capturing and storing the profiling data has an overhead on
the running system, especially when multiplexing counters with Perf.

To mitigate this issue, limit the running application to a subset of the CPU
cores, leaving a small number of cores free for `sl-record`. For example, on a
64 core system Arm recommends limiting the application to 60 cores.

<a id="resources"></a>

## 6 Additional resources

The following links provide additional information about optimizing for
Arm Neoverse CPUs.

The full specifications for the Arm top-down methodology:

* [Arm Telemetry Solution Top-down Methodology Specification][TSTMS]
* [Arm Neoverse N1 Performance Analysis Methodology][TSTMSN1]
* [Arm Neoverse V1 Performance Analysis Methodology][TSTMSV1]

[TSTMS]:   https://developer.arm.com/documentation/109542/0100/?lang=en
[TSTMSN1]: https://developer.arm.com/documentation/109198/0100/?lang=en
[TSTMSV1]: https://developer.arm.com/documentation/109199/0100/?lang=en

The performance counter guides for specific Neoverse products:

* [Arm Neoverse N1 PMU Guide][PMUGN1]
* [Arm Neoverse N2 PMU Guide][PMUGN2]
* [Arm Neoverse V1 PMU Guide][PMUGV1]
* [Arm Neoverse V2 PMU Guide][PMUGV2]

[PMUGN1]: https://developer.arm.com/documentation/PJDOC-466751330-547673/r4p1/?lang=en
[PMUGN2]: https://developer.arm.com/documentation/109710/r0p3/?lang=en
[PMUGV1]: https://developer.arm.com/documentation/109708/r1p2/?lang=en
[PMUGV2]: https://developer.arm.com/documentation/109709/r0p2/?lang=en

The software optimization guides for specific Neoverse products:

* [Arm Neoverse N1 Software Optimization Guide][SWOGN1]
* [Arm Neoverse N2 Software Optimization Guide][SWOGN2]
* [Arm Neoverse N3 Software Optimization Guide][SWOGN3]
* [Arm Neoverse V1 Software Optimization Guide][SWOGV1]
* [Arm Neoverse V2 Software Optimization Guide][SWOGV2]
* [Arm Neoverse V3 Software Optimization Guide][SWOGV3]
* [Arm Neoverse V3AE Software Optimization Guide][SWOGV3AE]
* [Arm Neoverse E1 Software Optimization Guide][SWOGE1]

[SWOGN1]:   https://developer.arm.com/documentation/PJDOC-466751330-9707/r4p1/?lang=en
[SWOGN2]:   https://developer.arm.com/documentation/PJDOC-466751330-18256/0003/?lang=en
[SWOGN3]:   https://developer.arm.com/documentation/109637/0000/?lang=en
[SWOGV1]:   https://developer.arm.com/documentation/pjdoc466751330-9685/6-0/?lang=en
[SWOGV2]:   https://developer.arm.com/documentation/PJDOC-466751330-593177/r0p2/?lang=en
[SWOGV3]:   https://developer.arm.com/documentation/PJDOC1505342170661452/r0p0/?lang=en
[SWOGV3AE]: https://developer.arm.com/documentation/PJDOC1505342170661453/0200/?lang=en
[SWOGE1]:   https://developer.arm.com/documentation/swog466751/a/?lang=en
