On March 6, 2024, Roon released version 2.0.30, which included build 1382 of Roon Server. In the day following the release our support team received reports of connectivity issues from a very small subset of the Roon user community. Upon investigation we discovered that a small, maintenance-focused change in Roon was leaving some devices unable to establish a connection with Roon’s cloud services as well as the cloud services of our streaming partners (TIDAL, Qobuz, and KKBOX). Due to the late release time on March 6, many of these user reports did not start appearing until the morning of March 7 (US Eastern Time).
Our support, QA, and engineering teams sprang into action to investigate the issue and restore connectivity to the affected users. While we were able to determine the cause of the failure early on in the process, its true nature made resolution quite difficult. Ultimately, the team was successful and by the evening of March 7 we had a workaround developed and ready for testing. By the morning of March 8 we received confirmation that the fix was successful and impacted systems were able to communicate again.
With instructions in hand our support team has been working to disseminate information on the workaround to the impacted users.
What happened?
Since August 2023 our engineering team has been revamping and improving the way in which Roon talks to our cloud services and streaming partners. This has been a long-term project that has touched every aspect of our infrastructure and has had the benefit of providing improved performance and more reliable connectivity. This change has been in place with some parts of our infrastructure for many months and the final stage of the project was to release it to our users in production, which happened in build 1382.
This type of maintenance work is always ongoing at Roon and any given release may contain dozens of these housekeeping changes. More impactful changes, such as this one, go through significant testing in order to ensure a smooth transition for our users. In this particular case the transition went as expected, but a very small group (less than 0.1% of our user base) reported a loss of connectivity that was not experienced in testing.
Background
Roon uses industry-standard protocols for network communication and we rely on the underlying operating system to manage this communication. This is no different than any other application running on the device. In other words, we’re doing the same thing as a web browser or mail client. Similar to any other application developer, we assume that the host computer is properly configured and reasonably up-to-date.
Our change involved moving to a newer and more modern toolset for managing network connections. There is nothing remarkable about this code and it is in use by countless other software packages on a daily basis.
What went wrong
In our development and testing we assumed that the devices running Roon Server would have operating systems that are reasonably up-to-date and at least have a somewhat recent set of what OS vendors call “critical and security updates” applied. What we discovered in this outage is that there is a subset of devices running Roon Server which are woefully out-of-date.
Please keep in mind that the deficiency in these systems will impact any application attempting to apply modern connectivity and security best practices, not just Roon.
The vast majority of these devices are purpose-built music servers, NAS devices, or custom Linux distributions. In all cases these are seen as appliances which typically run a limited suite of applications, and this limited use served to mask the deficiency in their underlying operating system code.
When the updated version of Roon Server started on these devices, the new networking code was not able to establish a connection to cloud services due to out-of-date components of the operating system. This resulted in symptoms ranging from update errors, to search failures, to lack of connectivity to streaming services.
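To illustrate how a failure of this kind can be narrowed down on a device, here is a minimal diagnostic sketch in Python. It is not Roon's code, and this post does not state which operating system component failed on any given device. The function names `classify_failure` and `probe`, the category labels, and the mapping from error type to likely cause are all assumptions made for illustration.

```python
import socket
import ssl


def classify_failure(exc):
    """Map a connection error to a rough category (hypothetical labels)."""
    # Certificate errors are checked first because SSLCertVerificationError
    # is a subclass of SSLError.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "certificates"  # often stale CA roots or a wrong system clock
    if isinstance(exc, ssl.SSLError):
        return "tls"  # handshake failed, e.g. no shared protocol or cipher
    if isinstance(exc, socket.gaierror):
        return "dns"  # the host name could not be resolved
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    return "network"  # any other OS-level connection error


def probe(host, port=443, timeout=5.0):
    """Attempt a verified TLS handshake using the OS/Python defaults."""
    context = ssl.create_default_context()  # uses the system trust store
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return f"ok ({tls.version()})"
    except OSError as exc:  # SSLError and gaierror are OSError subclasses
        return classify_failure(exc)
```

Calling `probe("example.com")`, where the host name is a placeholder, returns either `ok (...)` with the negotiated protocol version or one of the category labels. A `certificates` or `tls` result would point toward the encryption-related components of the operating system rather than the application itself.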
The fix
We are currently working with impacted users to implement a fix which will revert them to Roon’s prior network connectivity method. This is a simple process and our support team is working with users who need assistance with getting their systems patched.
Going forward
While it is reasonable for us to assume that all devices running Roon meet a baseline for connectivity and security, we do understand that many of our customers have significant investments in specialty hardware and we don’t want to abandon them or force them to abandon their hardware. The entire Roon team is currently working on a long-term solution to accomplish the following:
- We will be in contact with our partners who distribute server products to inform them of the deficiencies in their operating systems and the steps they will need to take to rectify them. We expect this to result in updates from the server manufacturers which will eliminate this issue entirely.
- Our engineering team is threading the needle on some code changes which will allow devices with these security and connectivity deficiencies to continue to function as long as possible.
- We are also investigating some other changes to Roon’s infrastructure to better address incidents like this in the future.
If you own an impacted device then you can also help yourself by reaching out to the manufacturer or operating system distributor for instructions on how to get the most up-to-date networking, security, and encryption updates for your product. This should include any updates to the following:
- Packages related to basic TCP/IP networking
- Web retrieval tools such as curl or wget
- samba (Windows file sharing)
- ntp (network time protocol)
- Name resolution tools
- Up-to-date SSL CA root certificates
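For readers who want a quick look at the encryption-related items on that list, the following is a minimal sketch using Python's standard `ssl` module. It assumes Python 3.7 or later is available on the device, which may not be true of appliance-style systems. The function name `tls_baseline_report` and the dictionary keys are hypothetical, and the output is only a rough indicator, not a substitute for the manufacturer's update instructions.

```python
import ssl


def tls_baseline_report():
    """Collect basic facts about the local TLS library and trust store."""
    context = ssl.create_default_context()  # loads the system default CAs
    paths = ssl.get_default_verify_paths()
    return {
        # Version string of the OpenSSL build that Python is linked against
        "openssl_version": ssl.OPENSSL_VERSION,
        # Whether that build supports TLS 1.3
        "supports_tls_1_3": ssl.HAS_TLSv1_3,
        # Where the library looks for CA root certificates (may be None)
        "ca_file": paths.cafile,
        "ca_dir": paths.capath,
        # CA certificates already loaded; certificates kept in a directory
        # are loaded lazily, so 0 here does not prove the store is empty
        "ca_certs_preloaded": context.cert_store_stats()["x509_ca"],
    }
```

Printing the returned dictionary shows the OpenSSL version string, which typically includes a build date and gives a rough idea of how old the library is.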
The devices that we noted during our investigation have operating system components that are at least 4 years out of date and, in some cases, much more than that. Given the nature of internet threats and security, those systems are dangerously out-of-date and could present a significant security risk to the networks to which they are connected.
Again, we do apologize for the inconvenience for those who were impacted. We strive to provide a 100% positive experience for all of our users and are sorry to have missed that goal in this case.