Introduction
============

This directory contains a set of prototype tools for inspecting and analyzing Streamline captures from the command-line. In particular the tool `analyze_capture` is provided which can be used to process Streamline captures and generate reports for the sampled functions and call paths. These tools are based on a new processing implementation that provides faster analysis than the current Streamline UI. They provide new features such as the ability to render per-function metrics and improved SPE derived metrics.

What are function attributed metrics?
-------------------------------------

Traditionally Streamline has supported two modes of sample-based profiling: periodic sampling and event-based sampling. Periodic sampling provides a view of the profiled application, showing where time is commonly spent, but does not provide a means to identify why time is spent there or whether that time is being utilized efficiently. In contrast, event-based sampling allows the identification of regions of code that commonly trigger some event, but often lacks the context necessary to judge the impact of that event on the overall performance of the application. Consider, for example, a function that is identified using event-based sampling as having a high number of cache misses: is this a cause for concern, and should effort be spent optimizing that function? On the one hand, poor cache utilization is a common cause of performance bottlenecks; on the other hand, the application may be compute bound, with only a small fraction of its overall time spent in this function. Equally, this could be the core loop of the application and, though the number of cache misses is high, the overall miss rate is low when compared to the number of cache accesses.

Function attributed metrics provide a means to combine periodic sampling with a set of performance metrics that can be attributed to individual functions. This approach allows the identification of the most commonly used functions within an application, and for each provides a range of metrics that can be used to identify different CPU performance bottlenecks. Given the previous example, it is possible to identify both the percentage of overall time spent in the function and the significance of the cache miss (and many other) events.
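
As a rough illustration of the reasoning above, the following sketch combines a function's share of the periodic samples with its cache miss rate. The function name `summarize` and all of the numbers are hypothetical; they are not taken from a real capture or from the output format of the tools.

```python
def summarize(function_samples, total_samples, cache_misses, cache_accesses):
    """Return (percent of all samples, cache miss rate in percent)."""
    time_share = 100.0 * function_samples / total_samples
    miss_rate = 100.0 * cache_misses / cache_accesses if cache_accesses else 0.0
    return time_share, miss_rate


# Hypothetical core loop: many misses in absolute terms, but a low miss rate.
loop_share, loop_miss_rate = summarize(6000, 10000, 50000, 5000000)

# Hypothetical helper: a high miss rate, but it is rarely sampled.
helper_share, helper_miss_rate = summarize(100, 10000, 4000, 10000)

print(f"core loop: {loop_share:.1f}% of samples, {loop_miss_rate:.1f}% miss rate")
print(f"helper:    {helper_share:.1f}% of samples, {helper_miss_rate:.1f}% miss rate")
```

Neither number is sufficient on its own: the first function dominates the run time yet uses the cache well, while the second misses often but contributes little to the overall time.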


Prerequisites
=============

For best results, it is recommended to use the provided Linux kernel patches, which modify the behavior of the Linux perf API to better support collecting function attributed metrics with the Arm PMU on 64-bit Arm architectures. Without these patches it is still possible to collect metrics, but with greater overhead and increased performance impact on the target application.

With the full set of patches applied it is possible to collect performance metrics for:

  * The whole system (system-wide mode).
  * Single threaded applications.
  * Multi-threaded applications and applications that spawn child processes.

Without the patches the following issues may be encountered:

  * Profiling relies on high frequency sampling which significantly increases capture size, particularly for system-wide captures.
    High frequency sampling requires switching from user to kernel mode and consequently has a significant impact on branch prediction and cache locality. Typically this manifests as related metrics being scaled such that bad values look worse.
  * It is not possible to profile multi-threaded or multi-process applications other than by using system-wide mode, or by using the `--inherit poll` option of the collection agent (`gatord`).
    When polling is used, it is possible for short-lived threads/processes to be missed. There is also an increased chance that `gatord` will hit its file-descriptor resource limit causing the capture to be aborted.
    Using system-wide mode is recommended over `--inherit poll`, which should only be used for small applications with a limited number of threads.


Capturing Metrics with Streamline
=================================

For full details of how to connect to, configure and then capture from an Arm-based Linux or Android target, refer to the [Streamline User Guide] as well as the [Android Target Setup Guide] and [Linux Target Setup Guide].

Function attributed metrics are configurable alongside other performance counters in the Counter Configuration dialog within the UI. Refer to [Custom Counter Configuration] for specific details. These metrics will *not* be exposed for selection unless an appropriate mode of operation is selected; specifically they are not supported in application profiling mode when `--inherit yes` (which is the default) is specified to `gatord`.

To configure the set of metrics and other performance counters, first start `gatord` on the target device according to the preferred mode of operation. For example:

  * To make a system-wide capture:

    `./gatord -S yes`

  * To profile a single threaded application:

    `./gatord -I no -A some-application app-args...`

  * To profile a multi-threaded or multi-process application using the provided kernel patches:

    `./gatord -I experimental -A some-application app-args...`

  * Or, without the provided kernel patches, using polling mode:

    `./gatord -I poll -A some-application app-args...`
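
If the agent is launched from a script, the choice of mode can be reduced to a small lookup. This is only a sketch: the mode labels and the helper `gatord_command` are hypothetical names, and only the flags shown in the examples above are used.

```python
import shlex

# Flags for each mode, copied from the examples above.
# The dictionary keys are hypothetical labels, not gatord options.
MODE_FLAGS = {
    "system-wide": ["-S", "yes"],
    "single-threaded": ["-I", "no"],
    "multi-patched": ["-I", "experimental"],  # needs the kernel patches
    "multi-polling": ["-I", "poll"],          # fallback without the patches
}


def gatord_command(mode, app=None, app_args=()):
    """Build the gatord argument list for one of the modes above."""
    argv = ["./gatord", *MODE_FLAGS[mode]]
    if mode != "system-wide":
        if app is None:
            raise ValueError("application modes need an application to launch")
        argv += ["-A", app, *app_args]
    return argv


print(shlex.join(gatord_command("multi-patched", "some-application", ["app-args"])))
```

The returned list can be passed to `subprocess.run` on the target unchanged.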

*Note:* `gatord` supports many different options. Refer to `./gatord --help` or the appropriate section in the Target Setup guides for a full list of configuration options.

Once the capture agent is started you may connect to it using the UI. Select the `TCP` option on the `Start` view. If the device is an Android device, connected by USB, and the `adb` daemon is correctly configured or available in the PATH, then the device should appear automatically in the device list. Likewise if the device is a Linux device and is available on the same subnet. Otherwise enter the device's host name or IP address (and optional `:port` suffix if modified from the default) in the provided text box and select the `Select counters` option to bring up the `Counter Configuration` dialog.

Use the provided filter to search for the word `metric` or scroll the list of available data sources and select the items you are interested in.

Full instructions can be found under [Starting a capture].

*Note:* CPU performance counters as used by the `Timeline` view permanently consume one programmable counter slot in each CPU's PMU making them unavailable for use by the function attributed metrics. For best results avoid selecting any CPU PMU counters. By default, when first launched, `gatord` will pre-select a default set of PMU counters for the `Timeline` view. These should be removed from the selected counter list.

Once the set of metrics is configured, the capture can be started using the `Start capture` button.

Whilst the capture is running the `Live` view will be shown. Unless other performance counters were selected it is expected that this will show an empty timeline, which steadily increases as capture time passes. See [Live overview] for details.

During this time it is *recommended* you tick the box at the bottom of the `Live` view that is labeled `Download process images from target`.

When profiling a single application, rather than system-wide, the capture will stop automatically when the process finishes. Otherwise stop the capture manually using the appropriate option in the UI.

If the `Download process images from target` option was enabled, you will be prompted to fetch any program images from the target. Select all relevant images; they will be copied off the target using `scp` or `adb pull` as appropriate.

Once the capture stops it is possible to amend the list of program images (executables, shared libraries, APK files, and split debug files) using the Analyze menu option. See [Re-analyze stored capture] for details.

*Note:* The Streamline UI can be used to analyze a capture with metrics, and it will show a list of metrics in the Call Paths and Functions views. The performance of the current Streamline data processing model is much worse than the new prototype tools included in this directory. As such it is not recommended to use this feature for large captures. This will be improved in a future release.

Analysis of a capture can be stopped by clicking on the square stop icon that appears on a capture during analysis in the list of captures shown in the `Streamline Data` view.


Capturing Metrics from the Command Line
=======================================

It is possible to collect a capture using `gatord` without any involvement of the UI. This is referred to as `Local capture mode`.

To determine what counters and metrics are available on the target device run:

    `./gatord -S yes --print counters`

or

    `./gatord -I <yes|no|poll|experimental> --print counters`

The first form lists the set available in system-wide mode; the second lists the set available in application mode.

You may pass `--print detailed-counters` to see descriptions for each counter.

To configure the selected counters and metrics use the `-C` argument, for example:

    `./gatord -I experimental -C ARMv8_Neoverse_N1_metric_cpi,ARMv8_Neoverse_N1_metric_retired_insns_percent -o some-capture.apc -A some-application app-args...`

which will capture the `CPI` and `% Retired Instructions` metrics for `some-application` into the capture named `some-capture.apc`.
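
When the list of metrics is maintained in a script, the value passed to `-C` can be assembled from a list of names. The sketch below assumes the names come from the `--print counters` output; the helper `counter_argument` is hypothetical, and the two metric names are the ones used in the example above.

```python
import shlex


def counter_argument(counters):
    """Join counter and metric names into the comma-separated value for -C."""
    names = [name.strip() for name in counters]
    if not names or any(not n or "," in n or " " in n for n in names):
        raise ValueError("expected one or more names without commas or spaces")
    return ",".join(names)


metrics = [
    "ARMv8_Neoverse_N1_metric_cpi",
    "ARMv8_Neoverse_N1_metric_retired_insns_percent",
]

# Same command as the example above, expressed as an argument list.
argv = ["./gatord", "-I", "experimental", "-C", counter_argument(metrics),
        "-o", "some-capture.apc", "-A", "some-application", "app-args"]
print(shlex.join(argv))
```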

Refer to `gatord --help` for a full list of configuration options that can be used in local capture mode.

*Note:* CPU performance counters are used by the `Timeline` view in the Streamline UI and permanently consume one programmable counter slot in each CPU's PMU making them unavailable for use by the function attributed metrics. For best results avoid selecting any CPU PMU counters. By default, when first launched, `gatord` will pre-select a default set of PMU counters for the `Timeline` view. These should be removed from the selected counter list.

Collecting program images
-------------------------

Currently `gatord` does not collect program images (executables, shared libraries, APK files, and split debug files) from the target device.

  * You can use the `analyze_capture` command's `--collect-images` argument to collect images for a capture from the machine that the `analyze_capture` command is run on. You must run `gatord` and the linux/arm64 version of `analyze_capture` on the same machine.
  * Alternatively, you will need to manually copy relevant images into `<capture.apc>/images/`, and split debug files into `<capture.apc>/images/debug-files/`. You can use the `analyze_capture` command's `--list-images` argument which will output the associated program images.
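
The manual copy described above can be scripted once the images are available on the analysis machine. This is a minimal sketch that assumes the source files have already been fetched from the target; the helper `add_images` is a hypothetical name, and only the two destination directories named above are used.

```python
import shutil
from pathlib import Path


def add_images(capture_dir, images=(), debug_files=()):
    """Copy images into <capture.apc>/images/ and split debug files into
    <capture.apc>/images/debug-files/. Returns the destination paths."""
    images_dir = Path(capture_dir) / "images"
    debug_dir = images_dir / "debug-files"
    debug_dir.mkdir(parents=True, exist_ok=True)  # also creates images/

    copied = []
    for src in images:
        copied.append(Path(shutil.copy2(src, images_dir)))
    for src in debug_files:
        copied.append(Path(shutil.copy2(src, debug_dir)))
    return copied
```

For example, `add_images("some-capture.apc", images=["./some-application"])` would place a copy at `some-capture.apc/images/some-application`.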


Analyzing a capture
===================

To process a capture file and produce a set of call-paths and functions reports use the `analyze_capture` command appropriate to your target OS/Architecture.

Typically use the following command:

    `bin/<os>/<arch>/analyze_capture --all-images --no-print-results --output <output-path> <path-to-capture.apc>...`

which will process the capture(s) and produce a set of CSV files into `<output-path>`.

These CSV files can then be examined in your spreadsheet application of choice, or otherwise post-processed, or can further be processed by the `reportformatter` tool. Refer to [reportformatter/README.md] for further details.

The tool will produce one file containing the functions table called `functions-<capture-name>-bt.csv` per processed capture, and optionally one file called `functions-<capture-name>-spe.csv` containing data from SPE samples if that feature was used. Likewise, a file called `callpaths-<capture-name>-bt.csv` will be produced containing full call-stacks.
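
For scripted post-processing, the report files can be located by the naming pattern described above. The sketch below assumes that each CSV file starts with a header row, and makes no assumption about which columns are present; the helper names `report_paths` and `read_rows` are hypothetical.

```python
import csv
from pathlib import Path


def report_paths(output_dir, capture_name):
    """Return the report files that exist for one capture, keyed by kind."""
    out = Path(output_dir)
    candidates = {
        "functions": out / f"functions-{capture_name}-bt.csv",
        "functions_spe": out / f"functions-{capture_name}-spe.csv",  # only with SPE
        "callpaths": out / f"callpaths-{capture_name}-bt.csv",
    }
    return {kind: path for kind, path in candidates.items() if path.is_file()}


def read_rows(path):
    """Read one report into a list of dicts keyed by the header row
    (assumes the first row of the file is a header)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

The SPE entry is simply absent from the returned dictionary when the capture did not use SPE.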

The tool provides a range of options to filter the captured data. Refer to the subsequent section for a list of available options.


Provided tools
==============

analyze_capture
---------------

This tool takes a Streamline capture, and will analyze it and produce one or more CSV outputs containing the sampled functions and call stacks data within the capture. The tool will produce separate output for SPE samples and for samples collected by the perf API (such as periodic or event-based sampling, or for metrics).

The tool can be used as follows:

    `bin/<os>/<arch>/analyze_capture <arguments...> <path.apc(s)...>`

  * Output control:

    - `--print-results`
      Enable printing the results of the analysis to the console.
    - `-o, --output <path>` / `--no-output`
      Enable/disable CSV output (where `<path>` is the directory where the CSV files should be written).
    - `--use-inlines` / `--no-use-inlines`
      When enabled, and where a sample lands in an inlined symbol, use the inlined symbol rather than the symbol that it was inlined into.
      The default is `--no-use-inlines`.
    - `--all-images` / `--no-all-images`
      Enable/disable automatic use of all image files defined in the capture's `images` directory.
      When disabled, will only use images explicitly named in the capture's `session.xml` file, matching the behaviour of the Streamline UI.
      The default is `--all-images`.
    - `--all-jitdumps` / `--no-all-jitdumps`
      Enable/disable automatic use of all jitdump files defined in the capture's `jitdumps` directory.
      When disabled, will only use jitdumps explicitly named in the capture's `session.xml` file, matching the behaviour of the Streamline UI.
      The default is `--all-jitdumps`.
    - `--bottom-up`
      Generate call-paths starting from the leaf-function, rather than from the root function.
      The default is to output call-paths starting from the bottom of the stack (such as from `main`), descending outwards to each callee.
      When specified, the call-path instead starts with each leaf-function, descending outwards to each caller.
    - `--no-backtrace`
      Skip processing and output of samples collected from perf (periodic samples, EBS samples, metrics).
    - `--no-spe`
      Skip processing and output of samples collected from SPE.

  * Filtering:

    - `--pid <pid|name-regex>`
      Limit output to only processes matching the specified process ID or process name.
      Process IDs can be prefixed with `!` for example `--pid !123` to select everything other than PID 123.
      Can be specified multiple times.
    - `--tid <tid|name-regex>`
      As per `--pid` but can be used to filter by thread ID or name.
    - `--between <[from]-[to]>`
      Limit output to only include events in the specified time range (relative to the start of the capture).
      Time stamps are given in nanoseconds.
      Both `from` and `to` are optional; to select everything in the first second you may use `--between -1000000000`, likewise to select everything between the first second and the end of the capture use `--between 1000000000-`.
    - `--metric-nesting-threshold <n>`
      In order to improve the quality of metric data, `gatord` will, where possible, collect some additional PMU counters (specifically the branch-return event) and use that to filter metric samples where the collected PMU counters show some call into a nested function.
      A threshold is used to determine whether to filter a sample. The default value of `<n>` is 1.
      Increasing the number increases the likelihood of mis-attributing metric data, but a reduced value increases the likelihood that all samples for a given function are filtered, reducing the utility of the metrics.
      A larger value is typically useful where an application has many small leaf functions, where attributing their behaviour to the caller is unlikely to be problematic.

  * Other options

    - `--list-images`
      List the executable images used for samples in the capture.
      Use to identify executable images that should be copied off the target for use by analysis.
    - `--collect-images`
      Only available with the *linux/arm64* build of this tool. Will copy any images directly off the target into the capture.
      Must be run on the same machine as the original capture was made.
      The tool is unable to determine whether an executable image was modified between the time the capture was made and the subsequent analysis (for example by recompilation or system update). When using this option it is recommended to run the capture and analysis on the same target one after the other.
    - `-h, --help`
      Print the tool help page and exit.

  * Troubleshooting arguments:

    - `--clean`
      Delete any existing analysis data associated with the capture before re-analysis.
    - `--serialize`
      Do not run the capture post-processing and executable image post processing in parallel.
    - `--all-columns`
      By default the output will not include columns with all zero values, nor the raw PMU counters used to generate metrics.
      This option enables outputting those values.
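
Because the `--between` option described above takes nanosecond time stamps, a small helper can make time ranges given in seconds less error prone. This is a sketch only: the helper `between_argument` is a hypothetical name, and rejecting a range with both ends omitted is a choice made here rather than documented behaviour of the tool.

```python
NS_PER_SECOND = 1_000_000_000


def between_argument(start_s=None, end_s=None):
    """Format a --between value from optional start/end times in seconds."""
    if start_s is None and end_s is None:
        raise ValueError("give at least one of start_s or end_s")
    if start_s is not None and end_s is not None and end_s < start_s:
        raise ValueError("end_s must not be earlier than start_s")
    start = "" if start_s is None else str(round(start_s * NS_PER_SECOND))
    end = "" if end_s is None else str(round(end_s * NS_PER_SECOND))
    return f"{start}-{end}"


# The two examples given for --between, expressed in seconds.
print("--between", between_argument(end_s=1))    # everything in the first second
print("--between", between_argument(start_s=1))  # everything after the first second
```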


report_formatter.py
-------------------

This tool can be used to post-process the CSV output from `analyze_capture` to produce an easier to consume XLSX file with filtered and highlighted values according to user defined rules. Refer to [reportformatter/README.md] for further details.



[Streamline user guide]: https://developer.arm.com/documentation/101816/latest/
[Android target setup guide]: https://developer.arm.com/documentation/101813/latest/
[Linux target setup guide]: https://developer.arm.com/documentation/101814/latest/
[Custom counter configuration]: https://developer.arm.com/documentation/101816/latest/Capture-a-Streamline-profile/Counter-Configuration/Create-a-custom-counter-configuration
[Starting a capture]: https://developer.arm.com/documentation/101816/latest/Capture-a-Streamline-profile/Starting-a-capture.
[Live overview]: https://developer.arm.com/documentation/101816/latest/Capture-a-Streamline-profile/Live-view-overview
[Re-analyze stored capture]: https://developer.arm.com/documentation/101816/latest/Analyze-your-capture/Streamline-Data-view/Re-analyze-stored-capture-data


strobing-patches
----------------

This set of files contains the prototype event strobing kernel patches. These patches may be used by `gatord` to support efficient collection of per-function metrics.

To apply on top of v6.7, use

    `git apply v6.7-combined.patch`

or using patch:

    `patch -p 1 -i v6.7-combined.patch`

Backported versions for v5.15, v6.1, and v6.6 are provided in the respective files. Note that the v5.15 patch has only been compile tested.

To integrate these patches into an RPM based distribution's kernel the following steps can be followed:

  * Remove any existing rpmbuild directory (or rename as appropriate)

        `rm -fr rpmbuild`

  * Fetch the kernel sources

        `yum download --source kernel`

  * Install the source package

        `rpm -i kernel-<VERSION>.src.rpm`

  * Enter the rpmbuild directory that is created

        `cd rpmbuild`

  * Copy the patch into the correct location

        `cp vX.Y-combined.patch SOURCES/9999-strobing-patch.patch`

  * Update the spec file

        `nano SPECS/kernel.spec`      # Or use your preferred editor

    * Search for the list of patches defined starting with `Patch0001` and append the line:

        `Patch9999: 9999-strobing-patch.patch`

    * Search for the list of patches defined starting with `ApplyPatch` and append the line:

        `ApplyPatch 9999-strobing-patch.patch`

    * Save the changes and exit the editor

  * Build the kernel and other RPMs:

        `rpmbuild -ba SPECS/kernel.spec`

  * Install the built packages:

        `sudo rpm -ivh --force RPMS/aarch64/*.rpm`

  * Reboot the instance

        `sudo reboot`

  * Check the patches are in use

        `ls -l /sys/bus/event_source/devices/*/format/strobe_period`

    This should list at least one CPU PMU device supporting the strobing features, for example:

        `/sys/bus/event_source/devices/armv8_pmuv3_0/format/strobe_period`
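
The same check can be made from a script, for example before deciding which `--inherit` mode to pass to `gatord`. The sketch below looks only for the sysfs attribute shown above; the helper `strobing_pmus` is a hypothetical name.

```python
import glob
from pathlib import Path

# Same path as used by the `ls` check above.
STROBE_GLOB = "/sys/bus/event_source/devices/*/format/strobe_period"


def strobing_pmus(pattern=STROBE_GLOB):
    """Return the names of PMU devices that expose a strobe_period attribute."""
    # .../devices/<pmu-name>/format/strobe_period -> <pmu-name>
    return sorted(Path(path).parent.parent.name for path in glob.glob(pattern))


pmus = strobing_pmus()
if pmus:
    print("strobing available on:", ", ".join(pmus))
else:
    print("no PMU device exposes strobe_period; the patches do not appear to be in use")
```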

*Note:* These steps have been tested on Amazon Linux 2023; some variation may be required for other distros.
