In the previous opus here we discussed using the powerd daemon to optimize power consumption of a TrueNAS system based on Intel Xeon E5-2600 v3 (Haswell) and earlier processors, covered the challenges of managing and scaling clocks from the OS, and still ended up with a somewhat unsatisfying solution.

However, today on the used market one can acquire a v4 (Broadwell) or newer processor for pocket change; and since mainboards that support v3 often support v4 as well, it's a no-brainer upgrade for your aging system: it brings Intel Speed Shift and redefines the approach to power efficiency.

Now, instead of the operating system actively managing CPU clocks and C-states, the hardware calls all the shots, while the OS can still provide hints and control the balance between performance and efficiency, even on a per-core basis.

Daemons, like powerd, are no longer necessary.

BIOS configuration

The items may be named differently on your system; I'll be using the Supermicro conventions, as that is the gold standard for TrueNAS systems.

Advanced Power Management Configuration

Head on to your system’s BIOS, CPU configuration → Advanced Power Management Configuration.

On this screen we want to set Power Technology to Custom, and also set Energy Performance Tuning to Enable. This allows the OS to specify the desired performance/power balance; when it is turned on, the BIAS setting is grayed out (N/A). Turn on Energy Efficient Turbo as well:

Advanced Power Management Configuration
--------------------------------------------------------
Power Technology                                [Custom]
Energy Performance Tuning                       [Enable]
Energy Performance BIAS setting                 [N/A]
Energy Efficient Turbo                          [Enable]

• CPU P State Control
• CPU HWPM State Control
• CPU C State Control
• CPU T State Control

CPU P State Control

This is where we will turn on EIST and set P-State coordination to HW_ALL – giving hardware full control:

CPU P State Control

EIST (P-States)                                 [Enable]
Turbo Mode                                      [Enable]
P-State Coordination                            [HW_ALL]

CPU HWPM State Control

In this section, we enable HWPM in Native mode. This allows the OS to influence the parameters, as opposed to the Out of Band control mode, where, depending on the version of the OS, the driver may not even load. We also enable autonomous C-states.

CPU HWPM State Control

Enable CPU HWPM                                 [HWPM NATIVE MODE]
CPU Autonomous Cstate                           [Enable]

CPU C State Control

In this section, enable all the C-states. All of them, down to the deepest one:

CPU C State Control

Package C State Limit                           [C6 (Retention) state]
CPU C3 Report                                   [Enable]
CPU C6 Report                                   [Enable]
Enhanced Halt State (C1E)                       [Enable]

CPU T State Control

Enable it too.

OS Configuration

Under System → Tunables → Add, add a LOADER tunable setting machdep.hwpstate_pkg_ctrl to 0. This enables per-core control. We may not need it right away, but it's a good and recommended default.
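After a reboot, you can read the tunable back to confirm it took effect:

```shell
# Read back the loader tunable; 0 means per-core (rather than per-package) control
sysctl machdep.hwpstate_pkg_ctrl
```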

Monitoring

First, ensure that the hwpstate_intel driver is active:

% sysctl dev.cpufreq.0.freq_driver
dev.cpufreq.0.freq_driver: hwpstate_intel0
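While we're at it, it can be worth confirming which C-states ACPI exposes to the OS and that the deepest one is allowed. A quick check; the exact list is system-dependent:

```shell
# List the C-states available on core 0 and the current system-wide floor
sysctl dev.cpu.0.cx_supported hw.acpi.cpu.cx_lowest
```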

The frequency levels and current frequencies may be somewhat bogus:

% sysctl dev.cpu.{0..7}.freq_levels dev.cpu.{0..7}.freq
dev.cpu.0.freq_levels: 3500/-1
dev.cpu.1.freq_levels: 3500/-1
...
dev.cpu.0.freq: 1197
dev.cpu.1.freq: 1197
...

But that’s expected: the OS has little visibility into clock control, since the hardware is doing all the work. To monitor it, we can use the Intel PCM tools.

Intel PCM tools

Below is a short guide on building the PCM tools from source, as the one installable with pkg is quite old, and we want all the fancy coloring and features.

Create a temporary jail with networking

sudo iocage create -r 13.3-RELEASE -n temp
sudo iocage set dhcp=1 temp
sudo iocage set bpf=1 temp
sudo iocage set vnet=1 temp
sudo iocage console -f temp

Install tools, fetch, and build

Install git and cmake, and fetch and build the PCM tools like so:

env IGNORE_OSVERSION=yes pkg install -y git cmake

git clone --recursive https://github.com/intel/pcm && cd pcm
mkdir -p build && cd build 
cmake ..
cmake --build . --parallel 8

Copy the built binary and dependencies

Once this completes, you will have the tools in the bin folder. Using ldd, check the dependencies of the pcm tool:

# ldd -f "%p\n" pcm
/usr/lib/libexecinfo.so.1
/usr/lib/libc++.so.1
/lib/libcxxrt.so.1
/lib/libm.so.5
/lib/libgcc_s.so.1
/lib/libthr.so.3
/lib/libc.so.7
/lib/libelf.so.2

and copy them, along with the contents of the bin folder, to some external dataset:

mkdir -p /mnt/tools/pcm
cp -r . /mnt/tools/pcm
cp `ldd -f "%p\n" pcm` /mnt/tools/pcm
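To double-check that nothing was missed, a small sanity loop helps. This is a sketch; it assumes the /mnt/tools/pcm destination used above and is run from the directory containing the pcm binary:

```shell
#!/bin/sh
# Report any shared library that pcm needs but that is missing from the destination
DEST=/mnt/tools/pcm
for lib in $(ldd -f "%p\n" pcm); do
    [ -e "$DEST/$(basename "$lib")" ] || echo "missing: $lib"
done
```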

To launch pcm from the host, which won’t have the required dependencies (hence the copies we made along with the tool), it’s helpful to create a helper alias in your ~/.zshrc:

alias pcm="sudo zsh -c 'pushd /mnt/tools/pcm; LD_LIBRARY_PATH=. ./pcm'"

Once done, exit the jail console, then stop and destroy the temporary jail:

sudo iocage stop temp
sudo iocage destroy temp

Launching the pcm tool presents an endless dump of the current state of the CPU:

 Core (SKT) | UTIL | IPC  | CFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI |   L3OCC |   LMB  |   RMB  | TEMP

   0    0     0.27   0.62    3.08    1518 K   4048 K    0.62    0.56  0.0029  0.0078      984      404        0     65
   1    0     0.26   0.69    3.14    1592 K   4022 K    0.60    0.57  0.0028  0.0071     3072      445        0     65
   2    0     0.26   0.73    2.83    1421 K   3896 K    0.64    0.57  0.0026  0.0071     1272      403        0     66
   3    0     0.25   0.74    2.88    1491 K   3888 K    0.62    0.57  0.0028  0.0072     3672      421        0     66
   4    0     0.28   0.65    2.97    1434 K   4530 K    0.68    0.53  0.0026  0.0083      936      392        0     66
   5    0     0.25   0.70    3.05    1529 K   4053 K    0.62    0.56  0.0029  0.0076     3192      417        0     66
   6    0     0.28   0.65    1.20     815 K   1897 K    0.57    0.52  0.0038  0.0087      480      128        0     67
   7    0     0.29   0.76    1.20     934 K   2127 K    0.56    0.52  0.0035  0.0080     1440      164        0     67
---------------------------------------------------------------------------------------------------------------
 SKT    0     0.27   0.69    2.52      10 M     28 M    0.62    0.56  0.0029  0.0076    15048     2774        0     61
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.27   0.69    2.52      10 M     28 M    0.62    0.56  0.0029  0.0076     N/A     N/A     N/A      N/A

 Instructions retired: 3729 M ; Active cycles: 5400 M ; Time (TSC): 3504 Mticks;

 Core C-state residencies: C0 (active,non-halted): 26.79 %; C1: 23.90 %; C3: 0.00 %; C6: 49.31 %; C7: 0.00 %;
 Package C-state residencies:  C0: 64.64 %; C2: 18.17 %; C3: 0.00 %; C6: 17.19 %; C7: 0.00 %;
                             ┌───────────────────────────────────────────────────────────────────────────────┐
 Core    C-state distribution│0000000000000000000001111111111111111111666666666666666666666666666666666666666│
                             └───────────────────────────────────────────────────────────────────────────────┘
                             ┌─────────────────────────────────────────────────────────────────────────────────┐
 Package C-state distribution│000000000000000000000000000000000000000000000000000022222222222222266666666666666│
                             └─────────────────────────────────────────────────────────────────────────────────┘
---------------------------------------------------------------------------------------------------------------

MEM (GB)->|  READ |  WRITE | LOCAL | CPU energy | DIMM energy | LLCRDMISSLAT (ns)| UncFREQ (Ghz)|
---------------------------------------------------------------------------------------------------------------
 SKT   0     1.42     1.03  100 %      17.91       9.58         164.08             1.18
---------------------------------------------------------------------------------------------------------------

Tuning

Of interest are CFREQ, CPU energy, DIMM energy, Core C-state residencies, and Package C-state residencies.

It’s useful to run some workload in a jail, such as stress-ng --cpu=N, where N is the number of cores’ worth of work to generate.

Then, by adjusting the epp parameter of the hwpstate_intel driver, you can shift the preference from performance (0) to power saving (100) and assess the impact. A handy shortcut to set the epp value for a range of cores:

sudo sysctl dev.hwpstate_intel.{0..7}.epp=80

Try various values (30-80 is a good starting range) and various workloads until you are satisfied with the performance/power tradeoff.
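The {0..7} range syntax above relies on zsh (or bash) brace expansion; on a plain POSIX sh, a loop does the same job. A sketch, assuming eight cores:

```shell
#!/bin/sh
# Apply one EPP value to every core: 0 favors performance, 100 favors power saving
EPP=80
NCORES=8
i=0
while [ "$i" -lt "$NCORES" ]; do
    sysctl "dev.hwpstate_intel.${i}.epp=${EPP}"
    i=$((i + 1))
done
```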

Note that the goal is low average power consumption while ensuring sufficient performance for the most common workloads, including single-threaded Samba. Therefore, lower clocks are not always better: sometimes it’s more beneficial to complete the job quickly and go to sleep than to stay awake longer while performing slower. Keep an eye on the C-state residency for both the package and the individual cores (you want the CPU to sleep as much as possible), or just focus on the power consumption, because that’s what you ultimately want to optimize.

References