AMD Instinct MI300X system optimization — ROCm Documentation (2025)

Applies to Linux

2024-10-18

20 min read time

This document covers essential system settings and management practices required to configure your system effectively. Ensuring that your system operates correctly is the first step before delving into advanced performance tuning.

The main topics of discussion in this document are:

  • System settings

    • System BIOS settings

    • GRUB settings

    • Operating system settings

  • System management

System settings#

This guide discusses system settings that are required to configure your system for AMD Instinct™ MI300X accelerators. It is important to ensure a system is functioning correctly before trying to improve its overall performance. The settings discussed in this section mostly ensure proper functionality of your Instinct-based system. Some of them are known to improve performance for most applications running on an MI300X system. See AMD Instinct MI300X workload optimization for how to improve performance for specific applications or workloads.

System BIOS settings#

AMD EPYC 9004-based systems#

For maximum MI300X GPU performance on systems with AMD EPYC™ 9004-series processors and AMI System BIOS, the following configuration of system BIOS settings has been validated. These settings must be used for the qualification process and should be set as default values in the system BIOS. Analogous settings can be configured similarly for other (non-AMI) system BIOS providers. For systems with Intel processors, some settings may not apply or may not be available as listed in the following table.

Each row in the table details a setting, but its specific location within the BIOS setup menus may differ, or the option may not be present.

| BIOS setting location | Parameter | Value | Comments |
|---|---|---|---|
| Advanced / PCI subsystem settings | Above 4G decoding | Enabled | GPU large BAR support. |
| Advanced / PCI subsystem settings | SR-IOV support | Enabled | Enable single root I/O virtualization. |
| AMD CBS / CPU common options | Global C-state control | Auto | Global C-states – do not disable this menu item. |
| AMD CBS / CPU common options | CCD/Core/Thread enablement | Accept | May be necessary to enable the SMT control menu. |
| AMD CBS / CPU common options / performance | SMT control | Disable | Set to Auto if the primary application is not compute-bound. |
| AMD CBS / DF common options / memory addressing | NUMA nodes per socket | Auto | Auto = NPS1. At this time, the other NUMA nodes per socket options should not be used. |
| AMD CBS / DF common options / memory addressing | Memory interleaving | Auto | Depends on the NUMA nodes per socket (NPS) setting. |
| AMD CBS / DF common options / link | 4-link xGMI max speed | 32 Gbps | Auto sets the speed to the lower of the max speed the motherboard is designed to support and the max speed of the CPU in use. |
| AMD CBS / NBIO common options | IOMMU | Enabled | |
| AMD CBS / NBIO common options | PCIe ten bit tag support | Auto | |
| AMD CBS / NBIO common options / SMU common options | Determinism control | Manual | |
| AMD CBS / NBIO common options / SMU common options | Determinism slider | Power | |
| AMD CBS / NBIO common options / SMU common options | cTDP control | Manual | Set cTDP to the maximum supported by the installed CPU. |
| AMD CBS / NBIO common options / SMU common options | cTDP | 400 | Value in watts. |
| AMD CBS / NBIO common options / SMU common options | Package power limit control | Manual | Set the package power limit to the maximum supported by the installed CPU. |
| AMD CBS / NBIO common options / SMU common options | Package power limit | 400 | Value in watts. |
| AMD CBS / NBIO common options / SMU common options | xGMI link width control | Manual | Enables manual control of the xGMI link width. |
| AMD CBS / NBIO common options / SMU common options | xGMI force width control | Force | |
| AMD CBS / NBIO common options / SMU common options | xGMI force link width | 2 | 0: force xGMI link width to x2; 1: force to x8; 2: force to x16. |
| AMD CBS / NBIO common options / SMU common options | xGMI max speed | Auto | Auto sets the speed to the lower of the max speed the motherboard is designed to support and the max speed of the CPU in use. |
| AMD CBS / NBIO common options / SMU common options | APBDIS | 1 | Disable DF (data fabric) P-states. |
| AMD CBS / NBIO common options / SMU common options | DF C-states | Auto | |
| AMD CBS / NBIO common options / SMU common options | Fixed SOC P-state | P0 | |
| AMD CBS / security | TSME | Disabled | Memory encryption. |

GRUB settings#

In any modern Linux distribution, the /etc/default/grub file is used to configure GRUB. In this file, the string assigned to GRUB_CMDLINE_LINUX contains the command-line parameters that Linux uses during boot.

Appending strings via Linux command line#

It is recommended to append the following strings to GRUB_CMDLINE_LINUX.

pci=realloc=off

With this setting, Linux can unambiguously detect all GPUs of the MI300X-based system because it disables the automatic reallocation of PCI resources. It's used when Single Root I/O Virtualization (SR-IOV) Base Address Registers (BARs) have not been allocated by the BIOS. This can help avoid potential issues with certain hardware configurations.

iommu=pt

The iommu=pt setting enables IOMMU pass-through mode. When in pass-through mode, the adapter does not need to use DMA translation to the memory, which can improve performance.

IOMMU is a system-specific I/O mapping mechanism that can be used for DMA mapping and isolation. This can be beneficial for virtualization and for device assignment to virtual machines. It is recommended to enable IOMMU support.

For a system that has AMD host CPUs, add this to GRUB_CMDLINE_LINUX:

iommu=pt

Otherwise, if the system has Intel host CPUs, add this instead to GRUB_CMDLINE_LINUX:

intel_iommu=on iommu=pt
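For illustration, the resulting line in /etc/default/grub on an AMD-CPU system might look like the following. This is a sketch: `quiet splash` stands in for whatever parameters your distribution already sets, which you should keep.

```shell
# /etc/default/grub (excerpt) -- hypothetical example. Keep your
# distribution's existing parameters and append the recommended ones.
GRUB_CMDLINE_LINUX="quiet splash pci=realloc=off iommu=pt"
```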

Update GRUB#

Update GRUB to use the modified configuration:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg

On some Debian systems, the grub2-mkconfig command may not be available. Instead, check for the presence of grub-mkconfig. Additionally, verify that you have the correct version by using the following command:

grub-mkconfig --version
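The generator's name differs across distributions. A small sketch to discover which one is present on your system (the three names below are the common ones; your distribution may provide others):

```shell
# Print the name of the GRUB config generator available on this system.
# grub2-mkconfig: RHEL/SLES-style name; update-grub: Debian/Ubuntu wrapper;
# grub-mkconfig: the underlying GNU GRUB tool.
if command -v grub2-mkconfig >/dev/null 2>&1; then
  echo "grub2-mkconfig"
elif command -v update-grub >/dev/null 2>&1; then
  echo "update-grub"
else
  echo "grub-mkconfig"
fi
```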

Operating system settings#

CPU core states (C-states)#

There are several core states (C-states) that an AMD EPYC CPU can idle within:

  • C0: active. This is the active state while running an application.

  • C1: idle. This state consumes less power compared to C0, but can quickly return to the active state (C0) with minimal latency.

  • C2: idle and power-gated. This is a deeper sleep state and will have greater latency when moving back to the active (C0) state compared to when the CPU is coming out of C1.

Disabling C2 is important for running with a high-performance, low-latency network. To disable the C2 state, install the cpupower tool using your Linux distribution's package manager. cpupower is not a base package in most Linux distributions, and the specific package to install varies per distribution.

Now, to disable power-gating (the C2 state) on all cores, run the following command:

cpupower idle-set -d 2
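To confirm which idle states the kernel exposes and whether C2 is now disabled, you can read the cpuidle sysfs entries directly. This is a read-only sketch: the state numbering and names depend on the platform, and the directory is absent on systems without cpuidle support.

```shell
# For CPU 0, print each idle state's name and whether it is disabled
# (disable=1 means the kernel will not enter that state).
for d in /sys/devices/system/cpu/cpu0/cpuidle/state*/; do
  [ -d "$d" ] || { echo "cpuidle sysfs entries not available"; break; }
  printf '%s: name=%s disable=%s\n' \
    "$(basename "$d")" "$(cat "$d/name")" "$(cat "$d/disable")"
done
```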

/proc and /sys file system settings#

Disable NUMA auto-balancing#

The NUMA balancing feature allows the OS to scan memory and attempt to migrate pages to a DIMM that is logically closer to the cores accessing them. This causes overhead because the OS is second-guessing your NUMA allocations, but it may be useful when NUMA access locality is very poor. In general, applications can benefit from disabling NUMA balancing; however, there are workloads where doing so is detrimental to performance. Test this setting by toggling the numa_balancing value and running the application; compare the performance of one run with the value set to 0 and another run with it set to 1.

Run the command cat /proc/sys/kernel/numa_balancing to check the current NUMA (non-uniform memory access) balancing setting. An output of 0 indicates that the feature is disabled. If there is no output or the output is 1, run the command sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' to disable it.
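The check can be wrapped into a small read-only status script. This is a sketch; it only reports the current state and does not require root:

```shell
# Report whether NUMA auto-balancing is currently enabled.
f=/proc/sys/kernel/numa_balancing
if [ -r "$f" ]; then
  case "$(cat "$f")" in
    0) echo "NUMA auto-balancing: disabled" ;;
    1) echo "NUMA auto-balancing: enabled" ;;
    *) echo "NUMA auto-balancing: unrecognized value" ;;
  esac
else
  echo "numa_balancing knob not present in this kernel"
fi
```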

For these settings, the env_check.sh script automates setting, resetting, and checking your environment. Find the script at ROCm/triton.

Run the script as follows to set or reset the settings:

./env_check.sh [set/reset/check]

Tip

Use ./env_check.sh -h for help info.

Automate disabling NUMA auto-balance using Cron#

The Disable NUMA auto-balancing section describes the command to disable NUMA auto-balancing. To automate the command with Cron, edit the crontab configuration file for the root user:

sudo crontab -e

  1. Add the following Cron entry to run the command at every boot:

    @reboot sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'

  2. Save the file and exit the text editor.

  3. Optionally, restart the system to apply changes by issuing sudo reboot.

  4. Verify your new configuration.

    cat /proc/sys/kernel/numa_balancing

    The /proc/sys/kernel/numa_balancing file controls NUMA balancing in the Linux kernel. If the value in this file is set to 0, NUMA balancing is disabled. If the value is set to 1, NUMA balancing is enabled.

Note

Disabling NUMA balancing should be done cautiously and for specific reasons, such as performance optimization or addressing particular issues. Always test the impact of disabling NUMA balancing in a controlled environment before applying changes to a production system.

Environment variables#

HIP provides the environment variable HIP_FORCE_DEV_KERNARG=1, which can place the arguments of HIP kernels directly in device memory to reduce the latency of accessing those kernel arguments. It can improve performance by 2 to 3 µs for some kernels.

It is recommended to set the following environment variable:

export HIP_FORCE_DEV_KERNARG=1

Note

This is the default option as of ROCm 6.2.

Change affinity of ROCm helper threads#

This change prevents internal ROCm threads from having their CPU core affinity mask set to all available CPU cores. With this setting, the threads instead inherit their parent's CPU core affinity mask. If you have any questions regarding this setting, contact your MI300A platform vendor.

IOMMU configuration – systems with 256 CPU threads#

For systems that have 256 logical CPU cores or more, setting the input-output memory management unit (IOMMU) configuration to disabled can limit the number of available logical cores to 255. The reason is that the Linux kernel disables X2APIC in this case and falls back to the Advanced Programmable Interrupt Controller (APIC), which can only enumerate a maximum of 255 (logical) cores.

If SMT is enabled by setting CCD/Core/Thread Enablement > SMT Control to enable, you can apply the following steps to the system to enable all (logical) cores of the system:

  1. In the server BIOS, set IOMMU to Enabled.

  2. When configuring the GRUB boot loader, add the following argument for the Linux kernel: iommu=pt.

  3. Update GRUB.

  4. Reboot the system.

  5. Verify IOMMU passthrough mode by inspecting the kernel log via dmesg:

    dmesg | grep iommu

    The output should include a line similar to:

    [    0.000000] Kernel command line: [...] iommu=pt [...]
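After the reboot, a quick sanity check is to count the logical CPUs the kernel enumerates and confirm the boot parameter took effect. This is a sketch; on a fully enabled system with more than 255 logical cores, the count reported by nproc should exceed 255.

```shell
# Count enumerated logical CPUs and check the kernel command line.
echo "logical CPUs: $(nproc)"
if grep -q 'iommu=pt' /proc/cmdline; then
  echo "iommu=pt is set"
else
  echo "iommu=pt is NOT set"
fi
```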

Once the system is properly configured, ROCm software can be installed.

System management#

To optimize system performance, it's essential to first understand the existing system configuration parameters and settings. ROCm offers several CLI tools that can provide system-level information, offering valuable insights for optimizing user applications.

For a complete guide on how to install, manage, or uninstall ROCm on Linux, refer to the Quick start installation guide. To verify that the installation was successful, refer to the Post-installation instructions. Should verification fail, consult System debugging.

Hardware verification with ROCm#

The ROCm platform provides tools to query the system structure. These include ROCm SMI and ROCm Bandwidth Test.

ROCm SMI#

To query your GPU hardware, use the rocm-smi command. ROCm SMI lists the GPUs available to your system – with their device IDs and their respective firmware (or VBIOS) versions.

The following screenshot shows that all eight GPUs of an MI300X system are recognized by ROCm. Application performance could otherwise be suboptimal if, for example, only five of the eight GPUs were recognized.

[Screenshot: rocm-smi output listing all eight MI300X GPUs with device IDs and firmware versions]
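A scripted variant of this check might count the GPU rows that rocm-smi reports and compare against the expected eight. This is a sketch: rocm-smi must be installed, and the exact output format (lines beginning with `GPU[`) can vary across ROCm releases.

```shell
# Count GPUs reported by rocm-smi; expect 8 on a healthy MI300X node.
if command -v rocm-smi >/dev/null 2>&1; then
  n=$(rocm-smi --showid 2>/dev/null | grep -c 'GPU\[')
  echo "GPUs detected: $n"
else
  echo "rocm-smi not found; is ROCm installed?"
fi
```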

To see the system structure, the localization of the GPUs in the system, and the fabric connections between the system components, use the command rocm-smi --showtopo.

[Screenshot: rocm-smi --showtopo output showing the system topology]

The first block of the output shows the distance between the GPUs, similar to what the numactl command outputs for the NUMA domains of a system. The weight is a qualitative measure of the "distance" data must travel to reach one GPU from another. While the values do not carry a special or "physical" meaning, the higher the value, the more hops are needed to reach the destination from the source GPU. This information has performance implications for a GPU-based application that moves data among GPUs. You can choose a minimum distance among GPUs to make the application more performant.

The second block has a matrix named Hops between two GPUs, where:

  • 1 means the two GPUs are directly connected with xGMI,

  • 2 means both GPUs are linked to the same CPU socket and GPU communications will go through the CPU, and

  • 3 means the GPUs are linked to different CPU sockets, so communications will go through both CPU sockets.

In this case, the number is 1 for all GPU pairs since they are all connected to each other through Infinity Fabric links.

The third block outputs the link types between the GPUs. This can either be XGMI for AMD Infinity Fabric links or PCIE for PCIe Gen5 links.

The fourth block reveals the localization of a GPU with respect to the NUMA organization of the shared memory of the AMD EPYC processors.

To query the compute capabilities of the GPU devices, use the rocminfo command. It lists specific details about the GPU devices, including but not limited to the number of compute units, the width of the SIMD pipelines, memory information, and the instruction set architecture (ISA). The following is the truncated output of the command:

[Screenshot: truncated rocminfo output showing GPU compute capabilities]

For a complete list of architectures (such as CDNA3) and LLVM target names (such as gfx942 for MI300X), refer to the Supported GPUs section of the System requirements for Linux page.

Deterministic clock#

Use the command rocm-smi --setperfdeterminism 1900 to set the max clock speed up to 1900 MHz instead of the default 2100 MHz. This can reduce the chance of a PCC event lowering the attainable GPU clocks. This setting will not be required for new IFWI releases with the production PRC feature. Restore this setting to its default value with the rocm-smi -r command.

ROCm Bandwidth Test#

The section Hardware verification with ROCm showed how the command rocm-smi --showtopo can be used to view the system structure and how the GPUs are connected. For more details on the link bandwidth, rocm-bandwidth-test can run benchmarks to show the effective link bandwidth between the components of the system.

You can install ROCm Bandwidth Test, which can test inter-device bandwidth, using one of the following package manager commands:

sudo apt install rocm-bandwidth-test      # Ubuntu / Debian

sudo yum install rocm-bandwidth-test      # RHEL

sudo zypper install rocm-bandwidth-test   # SLES

Alternatively, you can download the source code from ROCm/rocm_bandwidth_test and build it from source.

The output will list the available compute devices (CPUs and GPUs), including their device IDs and PCIe IDs. The following screenshot is an example of the beginning of the output of running rocm-bandwidth-test. It shows the devices present in the system.

[Screenshot: rocm-bandwidth-test output listing the compute devices in the system]

The output will also show a matrix that contains a 1 if a device can communicate with another device (CPU or GPU) of the system, and it will show the NUMA distance – similar to rocm-smi.

Inter-device distance:

[Screenshot: inter-device access matrix]

Inter-device NUMA distance:

[Screenshot: inter-device NUMA distance matrix]

The output also contains the measured bandwidth for unidirectional and bidirectional transfers between the devices (CPU and GPU):

Unidirectional bandwidth:

[Screenshot: unidirectional bandwidth matrix]

Bidirectional bandwidth:

[Screenshot: bidirectional bandwidth matrix]

Abbreviations#

AMI: American Megatrends International

APBDIS: Algorithmic Performance Boost Disable

ATS: Address Translation Services

BAR: Base Address Register

BIOS: Basic Input/Output System

CBS: Common BIOS Settings

CLI: Command Line Interface

CPU: Central Processing Unit

cTDP: Configurable Thermal Design Power

DDR5: Double Data Rate 5 DRAM

DF: Data Fabric

DIMM: Dual In-line Memory Module

DMA: Direct Memory Access

DPM: Dynamic Power Management

GPU: Graphics Processing Unit

GRUB: Grand Unified Bootloader

HPC: High Performance Computing

IOMMU: Input-Output Memory Management Unit

ISA: Instruction Set Architecture

LCLK: Link Clock Frequency

NBIO: North Bridge Input/Output

NUMA: Non-Uniform Memory Access

PCC: Power Consumption Control

PCI: Peripheral Component Interconnect

PCIe: PCI Express

POR: Power-On Reset

SIMD: Single Instruction, Multiple Data

SMT: Simultaneous Multi-threading

SMI: System Management Interface

SOC: System On Chip

SR-IOV: Single Root I/O Virtualization

TP: Tensor Parallelism

TSME: Transparent Secure Memory Encryption

X2APIC: Extended Advanced Programmable Interrupt Controller

xGMI: Inter-chip Global Memory Interconnect
