
eGPUs under Linux: an advanced guide

With Linux 4.13 just released, I decided to write up my experiences getting its newly expanded Thunderbolt 3 support to work. Most news outlets have focused on the new TLS support in the kernel; however, recent changes to the Thunderbolt module allow it to work on non-Apple devices, letting many laptops attach high-performance rendering and I/O devices such as GPUs and high-speed network cards.

Thunderbolt 3 allows attaching USB devices as well as standard PCIe devices over a single cable. With the addition of a PCIe enclosure it is possible to attach a wide range of generic PCIe devices to your laptop without modification. Because Thunderbolt reuses approved USB Type-C cables, adding these devices is typically as easy as attaching a USB docking station, and the same cable provides both the new functionality and charging for your laptop.

Hardware

While none of the setup should be hardware dependent, I have listed what I used at my end in case anyone wishes to duplicate the setup and avoid as many issues as possible:

  • Lenovo X1 Carbon (gen 5/i7 7500U @ 2.7GHz, 3.5GHz turbo)
  • AORUS External gaming box (NVidia GTX 1070)
  • Debian Stretch with custom Linux kernel (4.13-rc3)

What is required to get a GPU working in Linux

  • Upgrade to a newer Linux kernel
  • Install the NVidia driver
  • Set up Thunderbolt in the BIOS to run in 'secure' mode
  • Set up bumblebee to run programs on the GPU

Compiling an updated kernel

Compiling Linux kernels is not as hard as many would think if you have a good config to start from. Luckily for us, many Linux distributions ship the config file used to build the running kernel under /boot, allowing us to duplicate the currently running kernel and extend it with the modules we need.

Note: the steps below can screw up your computer and are provided for reference only. Changes may be needed for your specific setup (see the lines marked 'change me'), and if you are unsure what these changes are then you may want to check with someone who knows.

# apt install build-essential bc # change me
$ mkdir -p tmp/linux
$ cd tmp/linux
$ wget https://git.kernel.org/torvalds/t/linux-4.13.tar.gz
$ tar xzf linux-4.13.tar.gz
$ cd linux-4.13
$ cp /boot/config-4.9.0-3-amd64 .config # change me
$ echo "CONFIG_THUNDERBOLT=m" >> .config
$ make olddefconfig # accept the defaults for options added since the copied config
$ make deb-pkg # change me

Should that all complete without issue, you will be left with multiple installable .deb packages in tmp/linux. The following deb files will set you up with a working Linux kernel and initrd, as well as ensure that the headers for compiling the NVidia source code are present:

  • linux-headers-4.13.0*.deb # required for NVidia driver to compile
  • linux-image-4.13.0_*.deb

Once these are installed a simple reboot should start the newly installed kernel. If your machine has issues during booting or you have no graphical output then you may need to select the old kernel via the boot menu on startup of your machine and seek additional help.
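After rebooting, it can be worth confirming that the kernel you are now running was really built with Thunderbolt support. The helper below is a minimal sketch: the function name is hypothetical, and it assumes the package installed its config as /boot/config-<release>, which is where Debian kernels normally keep it.

```shell
# Hypothetical helper: succeed if the given kernel config file enables
# Thunderbolt either built in (=y) or as a module (=m).
has_thunderbolt() {
    grep -Eq '^CONFIG_THUNDERBOLT=[ym]$' "$1"
}

# Usage (assumes the config lives at /boot/config-<release>):
#   has_thunderbolt "/boot/config-$(uname -r)" && echo "Thunderbolt enabled"
```

If the check fails, you are most likely still booted into the old kernel or the config line did not survive the build.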

NVidia Drivers

Installation of the NVidia drivers is a fairly straightforward procedure. Under Debian, compilation of the kernel driver is automated and requires no user input. If the driver has not been previously installed then the following command will set up everything you need:

$ apt install nvidia-driver mesa-utils

If this package or any related NVidia packages have already been installed then you will need to get the OS to rebuild the kernel modules for the updated kernel. This can be achieved with the following command:

$ dpkg-reconfigure nvidia-kernel-dkms

Authentication of devices

PCIe devices have access to a feature known as DMA, which lets a device read and write the host's memory directly (unless an IOMMU has been set up to restrict it) and can allow someone to hijack your PC. Attacks range from dumping encryption keys held in memory to simply crashing your lock screen and giving the attacker access to your session. To prevent this, Thunderbolt has several different modes of operation that can be specified in the BIOS and that trade off security against convenience:

  • none: Thunderbolt PCIe devices are automatically set up and visible.
  • user: Thunderbolt PCIe devices require a user to 'OK' the connection before appearing.
  • secure: Thunderbolt PCIe devices require a user to 'OK' the device on first connection, after this devices will be automatically added.
  • dponly: Thunderbolt PCIe mode disabled.

When in doubt, 'secure' mode is recommended. It is one of the more convenient modes and is only slightly less secure than 'user' mode.
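To check which mode the firmware actually handed to the kernel, the Thunderbolt driver reports it per domain in a 'security' file under sysfs. The sketch below uses a hypothetical function name and takes the devices directory as a parameter (normally /sys/bus/thunderbolt/devices) so it can be pointed elsewhere for testing.

```shell
# Hypothetical helper: print "<domain>: <security level>" for every
# Thunderbolt domain found under the given devices directory.
tb_security_levels() {
    for d in "$1"/domain*; do
        # Skip the unexpanded glob when no domain exists.
        [ -r "$d/security" ] || continue
        printf '%s: %s\n' "${d##*/}" "$(cat "$d/security")"
    done
}

# Usage:
#   tb_security_levels /sys/bus/thunderbolt/devices
```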

User mode setup

If you wish to allow a recently plugged in device to be used then it is simply a matter of 'authorising' the device by locating it under /sys/bus/thunderbolt and writing '1' to its authorized file, as below:

# echo 1 > /sys/bus/thunderbolt/devices/0-0/0-1/authorized

No further work will be required as Linux will detect the new device on the PCIe bus and load the relevant drivers for it, if available.
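If you are not sure which path a freshly plugged in device received, one option is to look for devices whose authorized file still reads 0. This is a rough sketch with a hypothetical function name; it assumes every device also appears directly under the devices directory (for example devices/0-1), alongside the nested path used above.

```shell
# Hypothetical helper: print the name of every Thunderbolt device under the
# given devices directory whose 'authorized' attribute is still 0.
tb_unauthorized() {
    for f in "$1"/*/authorized; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "0" ]; then
            dev=${f%/authorized}
            printf '%s\n' "${dev##*/}"
        fi
    done
}

# Usage:
#   tb_unauthorized /sys/bus/thunderbolt/devices
```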

Secure mode setup

Secure mode extends the 'user' mode above. Devices can still be authorized in the same manner, but secure mode also allows you to write a 'key' to the device that is checked each time the device is connected. If the key matches on the next connection then the device is automatically attached.

To write the key to the hardware device use the steps below. They generate a random key and save it in your current directory for later use:

$ openssl rand -hex 32 > keyfile
# cat keyfile > /sys/bus/thunderbolt/devices/0-0/0-1/key
# echo 1 > /sys/bus/thunderbolt/devices/0-0/0-1/authorized

The next time the device is plugged in, writing the key to the correct location and telling the device to verify the key is as simple as:

# cat keyfile > /sys/bus/thunderbolt/devices/0-0/0-1/key
# echo 2 > /sys/bus/thunderbolt/devices/0-0/0-1/authorized

If the device rejects authentication (due to an incorrect key) then it can be reauthorized by writing '1' instead of '2' (authorize the device rather than verify the key).
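The two cases above (first connection versus a later connection with a stored key) can be wrapped in one small function. This is only a sketch: the function name and argument order are my own invention, and it deliberately does not fall back to writing '1' when verification fails, so that a mismatched key is something you notice rather than something silently bypassed.

```shell
# Hypothetical helper: tb_authorize DEVICE_DIR [KEYFILE]
# With a key file: write the key, then ask the device to verify it (2).
# Without one: plain authorization (1), as in user mode.
tb_authorize() {
    dev=$1
    keyfile=${2:-}
    if [ -n "$keyfile" ]; then
        cat "$keyfile" > "$dev/key" || return 1
        echo 2 > "$dev/authorized"
    else
        echo 1 > "$dev/authorized"
    fi
}

# Usage (as root):
#   tb_authorize /sys/bus/thunderbolt/devices/0-0/0-1 keyfile
```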

It is possible to automate this process with udev rules. However, development is underway on a daemon that allows users to control and authorise Thunderbolt devices as well as perform firmware upgrades; see the end of this article for more info.

Bumblebee

Bumblebee is an application to handle dual GPUs in laptops under Linux. It is primarily intended to allow disabling/powering down a power hungry discrete GPU and making use of the integrated GPU instead when on battery power or for applications that are not 3D heavy.

Bumblebee treats the eGPU exactly like the discrete GPU in a dual GPU laptop and largely works out of the box. It does require some minor setup to ensure that the correct GLX implementation is selected and that GPU commands get sent to the correct graphics card. If this is not set up correctly then logging out and back in may cause your display manager to crash, in which case the commands below will need to be run from a text console (Ctrl + Alt + F3).

# apt install bumblebee bumblebee-nvidia
# update-alternatives --config glx

Select the '/usr/lib/nvidia/bumblebee' option and things should start working. The following command should then work without issue and refer to the internal graphics card.

$ glxinfo | grep "OpenGL vendor"
OpenGL vendor string: Intel Open Source Technology Center

While it may seem odd that this says Intel instead of NVidia this is exactly the behaviour we want as we have not told an application to use the eGPU specifically and as such it has fallen back to the default GPU.

To run applications on the external GPU there is still a bit more setup work. We will first need to enable the bumblebee service as this may not yet have started up.

# systemctl start bumblebeed

You will also need to allow your user account access to bumblebee. The second command below adds your account to the bumblebee group, which takes effect the next time you log in. The first command is a workaround that gives only your user access to the bumblebee socket until then, so you do not have to log out and back in straight away and incur no additional security risk by waiting. Replace 'myuser' with your own login name (running whoami from a root shell would return root, not your account).

# chown myuser /var/run/bumblebee.socket # change me
# adduser myuser bumblebee # change me

Once this has started we can run glxinfo on the eGPU with the help of the primusrun command as shown below

$ primusrun glxinfo | grep "OpenGL vendor"
OpenGL vendor string: NVIDIA Corporation

By default this second display server is :8 under bumblebee. This can be handy to know in case you want to use a monitor connected to the GPU and need to set it up, e.g.:

$ DISPLAY=:8 xrandr

In some cases I found that I needed to run glxgears in the background to keep the GPU X11 server open as the display server :8 is created on-demand. As a quick fix, glxgears was run in another terminal as shown below:

$ primusrun glxgears

Extending your Desktop

If you wish to extend your laptop's X11 display on to the outputs of the eGPU then the intel-virtual-output command can be used. This will create VIRTUAL0 to VIRTUAL6 outputs that map to the outputs on the GPU. Keep in mind that this involves traversing the PCIe bus multiple times, so it has terrible performance and is not recommended for graphically heavy programs or video.

High Performance Rendering from the eGPU

In eGPU setups, PCIe bandwidth is at a premium. Thunderbolt 3 carries a 4-lane PCIe 3.0 link at roughly 1GB/s per lane, or about 32Gbit/s in total. In addition, 10Gbit/s is reserved for USB and other peripheral traffic to prevent the PCIe card from swamping the bus on the Thunderbolt port and upstream of it.
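The PCIe link figure can be checked with a little arithmetic. The sketch below assumes the standard PCIe 3.0 parameters (8 GT/s per lane with 128b/130b encoding) and ignores protocol overhead and any Thunderbolt-specific limits, so treat the result as an upper bound. The function names are purely illustrative.

```shell
# Raw PCIe 3.0 payload rate in Gbit/s for N lanes:
# 8 GT/s per lane, with 128 payload bits in every 130 bits on the wire.
pcie3_gbit() {
    LC_ALL=C awk -v lanes="$1" 'BEGIN { printf "%.2f\n", lanes * 8 * 128 / 130 }'
}

# The same figure in GB/s (8 bits per byte).
pcie3_gbyte() {
    LC_ALL=C awk -v lanes="$1" 'BEGIN { printf "%.2f\n", lanes * 8 * 128 / 130 / 8 }'
}

# Usage:
#   pcie3_gbit 4    # prints 31.51
#   pcie3_gbyte 4   # prints 3.94
```

So a 4-lane link tops out at a little under 32Gbit/s, or just under 1GB/s per lane, before any real-world overhead.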

The default bumblebee setup sends the rendering commands to the GPU and then sends the rendered image back over the PCIe bus, so it is possible to rapidly run out of bandwidth.

To avoid this it is possible to attach an external display to the GPU as above but render directly to the connected display, avoiding copying the output back to the host computer and improving the amount of bandwidth available. In testing I have seen performance numbers go from 1400 to 2000, or roughly a 40% increase in performance.

To enable this it is recommended to set up bumblebee to keep its X server running so we do not need to run glxgears (and burn 10% of GPU performance) in the background. This can be done by editing the /etc/bumblebee/bumblebee.conf file and ensuring that the server is set to persist in the background. Running primusrun glxinfo | grep -i nvidia is a simple way to start the server and ensure that it is running as expected.
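As far as I am aware, the option that controls this is KeepUnusedXServer in the [bumblebeed] section, and on Debian the file is /etc/bumblebee/bumblebee.conf. Both are assumptions worth checking against your installed copy. The sketch below flips that one option with GNU sed; the function name is hypothetical.

```shell
# Hypothetical helper: set KeepUnusedXServer=true in a bumblebee config file
# so the secondary X server stays up when no primusrun client is running.
# Assumes GNU sed (for -i) and that the option is already present in the file.
keep_xserver() {
    sed -i 's/^KeepUnusedXServer=.*/KeepUnusedXServer=true/' "$1"
}

# Usage (as root), then restart the daemon to pick up the change:
#   keep_xserver /etc/bumblebee/bumblebee.conf
#   systemctl restart bumblebeed
```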

You may need to set up one of the outputs on the GPU. This can be done by forcing the screen setup program to run on the relevant display by prepending DISPLAY=:8 to it. The command line setup program for X is shown below; it prints a list of display information that can then be used to set up the outputs:

$ DISPLAY=:8 xrandr

There is one further complication: sharing the mouse and keyboard with the second X display. For testing I have used x2x, which is available under Linux, as shown below. Alternatively a solution like synergy may be useful, as x2x may have limitations.

$ x2x -to :8 -east -resurface

This will allow you to move the pointer off the right hand side of the screen and have the cursor appear on the second X11 server (in this case our attached screen).

There are further firmware limitations that halve the host to device bandwidth, which can constrain GPU performance by up to a further 40% and require a firmware patch to fix. How to flash this firmware is documented further below.

Upgrading the firmware under Linux

Thunderbolt has a standardised interface to update the firmware on chipsets at either end of the link. Luckily for us Linux exposes this interface directly under /sys allowing firmware updates with no special tools required. Care has been taken to allow recovery in the event of a corrupt firmware upgrade and firmware images are checksummed to ensure that only good firmware is booted, reducing the chances of causing permanent damage to your machine or devices.

Thunderbolt devices under Linux have two separate firmware areas: the active firmware area (nvm_active1) and the non-active firmware area (nvm_non_active1). Only the non-active area can have firmware uploaded to it. Once this area has been written to, the firmware can be authenticated, after which the device automatically reboots into the new verified firmware.

On my system the AORUS shows up at /sys/bus/thunderbolt/devices/0-0/0-1 and the Thunderbolt port at /sys/bus/thunderbolt/devices/0-0/. As you can see, the AORUS device is in a directory under the Thunderbolt host device, indicating where it is plugged in (this becomes helpful if you have multiple devices connected).

In these directories you can see information relating to the device such as its name, key and who built it (device_name, key and vendor*) as well as the nvm_* directories corresponding to the firmware.
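A quick way to dump the interesting attributes for a device is a loop like the one below. It is a sketch only: the function name is hypothetical and the attribute list is my own selection from what the driver exposes, so anything missing on your kernel is simply skipped.

```shell
# Hypothetical helper: print selected sysfs attributes of one Thunderbolt
# device directory, skipping any attribute that is absent or unreadable.
tb_describe() {
    for attr in vendor_name device_name unique_id nvm_version authorized; do
        if [ -r "$1/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$1/$attr")"
        fi
    done
}

# Usage:
#   tb_describe /sys/bus/thunderbolt/devices/0-0/0-1
```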

To perform the update procedure, grab the latest firmware version from reddit/Gigabyte. As I was interested in increased PCIe/GPU performance I selected the H2D firmware, which helps speed up transfers from the laptop to the device as per eGPU.io. I was glad to see that Gigabyte delivered a new firmware image in a significantly faster time frame than I anticipated. I was even more encouraged when presented with a zip file that contained the raw firmware without obfuscation or embedding, allowing me to directly copy the file with minimal fuss.

Verifying the currently running firmware

It is possible to determine which firmware you are running via the nvm_version file; however, in the case of the H2D vs non-H2D firmware above, both present the same version. Luckily the running firmware can be read back from the device and hashed, allowing us to determine which one is running. These hashes won't match those of the on-disk firmware files but will allow you to differentiate between running firmwares.

On my device the non-H2D version gave me the following

$ md5sum /sys/bus/thunderbolt/devices/0-0/0-1/nvm_active1/nvmem
5e761bb6ba0d555d0b7699a2292b7148

While the H2D version gave me the hash shown below

$ md5sum /sys/bus/thunderbolt/devices/0-0/0-1/nvm_active1/nvmem
a00484ac3176c72e2b8a8272c6e3ab0c
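To avoid comparing hashes by eye, the two values above can be put in a small lookup. This sketch only knows the two hashes I observed on my own AORUS box, so any other device or firmware revision will report 'unknown'; the function name is hypothetical.

```shell
# Hypothetical helper: map the md5 of the active NVM image to a label,
# using the two hashes observed on the author's AORUS box.
identify_firmware() {
    case "$1" in
        5e761bb6ba0d555d0b7699a2292b7148) echo "non-H2D" ;;
        a00484ac3176c72e2b8a8272c6e3ab0c) echo "H2D" ;;
        *) echo "unknown" ;;
    esac
}

# Usage:
#   nvm=/sys/bus/thunderbolt/devices/0-0/0-1/nvm_active1/nvmem
#   identify_firmware "$(md5sum "$nvm" | cut -d' ' -f1)"
```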

Flashing the firmware

Inside both zip files is a '.bin' file that contains the firmware (in my case AORUS_N1070IXEB_8GD_VER10H2D.bin, md5: 1ed8ee21f01595efee8914e40fe638ef). Upgrading was as simple as:

# cat AORUS_N1070IXEB_8GD_VER10H2D.bin > /sys/bus/thunderbolt/devices/0-0/0-1/nvm_non_active1/nvmem
# echo 1 > /sys/bus/thunderbolt/devices/0-0/0-1/nvm_authenticate

After this the device disconnected, rebooted and then became visible again (after going through the authentication step above).
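Since a bad download is the easiest way to end up writing a broken image, it may be worth checking the file's md5 before copying it across. The function below is a sketch with a hypothetical name and argument order; it only performs the copy into the non-active area and leaves the nvm_authenticate step to you.

```shell
# Hypothetical helper: tb_flash_checked IMAGE EXPECTED_MD5 TARGET
# Copies IMAGE to TARGET only if its md5 matches EXPECTED_MD5.
tb_flash_checked() {
    image=$1
    expected=$2
    target=$3
    actual=$(md5sum "$image" | cut -d' ' -f1)
    if [ "$actual" != "$expected" ]; then
        echo "md5 mismatch: got $actual, expected $expected" >&2
        return 1
    fi
    cat "$image" > "$target"
}

# Usage (as root), with the image and hash from this article:
#   tb_flash_checked AORUS_N1070IXEB_8GD_VER10H2D.bin \
#       1ed8ee21f01595efee8914e40fe638ef \
#       /sys/bus/thunderbolt/devices/0-0/0-1/nvm_non_active1/nvmem
```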

I then proceeded to verify the H2D bandwidth using my own H2D OpenCL benchmarking script but found that both firmwares provided similar numbers (2761MB/s). Because I did not benchmark before upgrading the firmware, I cannot confirm whether my script is broken or whether both firmwares contain the H2D fix and the H2D firmware merely contains some additional tuning, as per their notes on what each version is optimised for. If someone does benchmark before and after, please let me know so I can update this article.

Some hidden gems I have found

While investigating the box and getting it to work I did a lot of low level probing and scanning of the hardware and found a couple of things that are potential avenues to look into for further features/performance:

  • While investigating the Linux driver I found references to 'link credits'. These are normally used to constrain bandwidth in high speed/high performance devices and led me to believe that it may be possible to 'unreserve' the 10Gbit/s of bandwidth that is reserved for use by the USB devices connected to the Thunderbolt port.

  • Intel has an Ethernet over Thunderbolt driver operating at 10Gbit/s, allowing you to connect 2 laptops together for high speed data transfer. I believe this is related to patches for the Xeon Phi which has an Ethernet over driver in the Linux Kernel. As I have only just acquired a second machine with Thunderbolt I have yet to test this out, however the code is available on Github/LWN. Windows and Mac OSX also have support for this, which I suspect is provided by the official Intel drivers.

  • The AORUS exports a USB keyboard and mouse, which I believe is a configuration error in the hardware that is suppressed by the Gigabyte drivers for Windows. Due to Linux's use of a generic usb-hid driver these devices are visible. I believe they are control devices for changing the LED settings and potentially turning the LEDs off via normal HID commands.

Future work

I am currently working on an authentication stack for Linux that allows users to authorise devices and perform automatic firmware upgrades of devices. Work is coming along slowly but updates should be coming shortly; see Blitz_Works for more details.

I have not tested this with more than one device at a time. While I suspect it will work flawlessly I hope to acquire a second Thunderbolt 3 enclosure soon to test this and also do some testing into non-GPU devices.

Expect a review of the AORUS in a separate post in the next couple of days.