PROBLEM STATEMENT:
Currently, when Roon Server suffers what it believes to be an unacceptable comms link to an endpoint, its recovery strategy is to kill the source stream.
The User Experience of this is that Roon skips the current track that is playing and attempts to move to the next track in the album/playlist. This next track often fails and the process repeats, resulting in a cascade of failed track skips until the system sorts itself out. This is a poor UX - and one which, as I will explain below, is entirely unnecessary.
REASON I BELIEVE THIS TO BE INCORRECT AND UNNECESSARY:
- In this particular case, the cause of the error was Roon deciding that the link between the server and endpoint was unrecoverably poor. I won’t argue whether it was “unrecoverable” or not as it is not pertinent to this discussion. Let’s accept that, at least temporarily, the link was not viable.
- In response to this temporarily unviable link, Roon decided to kill the stream. This is shown in the logs.
- The result of killing the stream is that Roon moves to the next track and attempts to initiate playback of that track ACROSS THE SAME COMMS LINK ROON THINKS IS NOT VIABLE.
- If the link isn’t viable, this approach is bound to fail - and it will keep on failing until the quality of the link improves to the point Roon deems it “viable” again. Roon will keep killing the stream and moving on to the next track.
- This approach actually makes things worse, as every time the stream is killed, the endpoint will be required to discard some or all of the music it has buffered - and will have to ask the server to send it more data - over the same, non-viable link.
- I hope we can agree that the error recovery approach should deal with the actual source of the error. This at least seems uncontroversial.
- At present, the approach above does NOT deal with the source of the error. Again, I hope we can agree on this fact.
SUGGESTION FOR IMPROVEMENT:
- There are no doubt numerous approaches one could take. Here is one suggestion:
- If a link between the server and an endpoint is deemed non-viable, the endpoint should be allowed to continue playing until such point as it has exhausted its buffer. Once the buffer is empty, it can do nothing else but pause.
- Even after pausing, the endpoint should be allowed to still call for more data.
- The server should continue to attempt to send more data to the endpoint (as well as control signals, which I assume occupy the same comms link) for an extended period (from “a few” to “many” minutes) to allow the comms link to recover from its temporary “glitch”.
- If, after this time, the link is still “down”, the server and endpoint can back off: remain paused, but make much less frequent recovery attempts.
- If possible, endpoints should be permitted to create buffers as large as they wish. The bigger the buffer, the longer they can survive a poor link to the server before they run out of music.
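The pause-and-retry behaviour suggested above could be sketched roughly as follows. This is purely illustrative Python under my own assumptions - `endpoint.send_chunk()` is a hypothetical call returning success/failure, and none of this is Roon's actual RAAT API. The key property is that the stream position only advances when a chunk is actually delivered, so a bad link produces a pause rather than a skip:

```python
import time

def stream_with_recovery(endpoint, chunks,
                         fast_retry_secs=2, fast_retry_window=300,
                         slow_retry_secs=30):
    """Keep feeding the endpoint; on link failure, let it drain its
    buffer and pause on its own, while the server keeps retrying
    instead of killing the stream and skipping the track."""
    link_down_since = None
    i = 0
    while i < len(chunks):
        if endpoint.send_chunk(chunks[i]):   # hypothetical: True on success
            link_down_since = None
            i += 1                           # advance ONLY on delivery
            continue
        # Link looks bad: the endpoint plays on from its buffer and
        # pauses by itself when the buffer runs dry. We just retry.
        if link_down_since is None:
            link_down_since = time.monotonic()
        elapsed = time.monotonic() - link_down_since
        # Retry frequently for the first few minutes, then back off.
        time.sleep(fast_retry_secs if elapsed < fast_retry_window
                   else slow_retry_secs)
```

Because the chunk index never moves past an undelivered chunk, a bursty outage surfaces to the user as a pause followed by an automatic resume, never as a cascade of skipped tracks.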
THE USER EXPERIENCE I WOULD EXPECT AFTER THIS IMPROVEMENT
- We would not have these skips - or cascades of skips any more - at least in response to comms link issues.
- In response to comms link issues we would experience a pause in the music - and after some time, the music would re-start without any input from us.
REASONS NOT TO IMPLEMENT THE SUGGESTION:
- This approach would break the synchronisation between multiple endpoints that were being sync’d by RAAT, if those other endpoints were not experiencing the same poor comms link (for example if they were ethernet connected, not over WiFi)
- HOWEVER, one could either do the same pause on all synchronised devices, OR one could only implement this approach where we were NOT synchronising multiple endpoints
- Either way, the approach of killing the stream in response to a poor comms link between server and endpoint is objectively not a correct approach, as it does not address the source of the error - and probably makes it worse.
The flaw in your argument is that those failures are often caused by bursty packet loss. It’s not that the link is not viable in a persistent way, it’s that channel contention (in WiFi) causes random loss. In my experience, you can have effective operation for hours, and then sudden losses for a short period, maybe when your neighbor is running a torrent or something like that.
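This persistent-vs-bursty distinction is often illustrated (my illustration, not from this thread) with a two-state Gilbert-Elliott-style channel model: long stretches of clean delivery punctuated by short bursts of heavy loss, exactly the "fine for hours, then a bad minute" pattern described above. All parameter values here are made up for demonstration:

```python
import random

def gilbert_elliott(n, p_good_to_bad=0.01, p_bad_to_good=0.3,
                    loss_in_bad=0.5, seed=42):
    """Two-state burst-loss model: a 'good' state delivering everything,
    and a 'bad' state dropping packets, with occasional transitions."""
    rng = random.Random(seed)
    bad = False
    delivered = []
    for _ in range(n):
        # Stay/transition: rarely enter the bad state, leave it quickly.
        bad = (rng.random() < p_good_to_bad) if not bad \
              else (rng.random() >= p_bad_to_good)
        lost = bad and rng.random() < loss_in_bad
        delivered.append(not lost)
    return delivered

packets = gilbert_elliott(10_000)
print(f"overall delivery rate: {sum(packets) / len(packets):.1%}")
```

The average delivery rate comes out high, yet the losses cluster into bursts - which is why a link can look "non-viable" for a moment and be perfectly healthy seconds later.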
My instinct would be to fix the network.
Fernando - perhaps I’m misunderstanding what you’re saying - but I believe your point is exactly what I’m saying. Yes, it is temporary, bursty degradation of the network link - that IS my point. It WILL recover after a time. But the way to respond to it is certainly not - as RAAT does at present - to skip onto the next track in the playlist - as though the problem were with the source of the current track…
Henry,
Of course, ideally, you want a wired connection. Nobody would argue with that - certainly I wouldn’t.
But we live in a world where we do use WiFi increasingly. Roon cannot realistically tell its customers to only use Ethernet.
If we do use WiFi, we have to accept that the link will occasionally degrade, for all sorts of reasons. I for example, have a very capable WiFi network - but as Fernando pointed out, even such a WiFi network will experience occasional degradation.
Roon must be able to gracefully cope with such occasional degradation. The way it tries to do it at present - by killing the stream and skipping to the next track - is not a viable response to the issue and in fact makes the problem worse, as I explained in my initial post.
None of what I have proposed invalidates the fact that one is ALWAYS better off with a wired Ethernet connection.
I’ve tried a number of solutions as my core is upstairs in my listening space: WiFi repeaters, Ethernet over mains and, most recently, hanging everything off a WiFi 6 hub into a local switch. The last has been the most reliable solution and I can say with honesty I simply haven’t seen the issue since connecting this way. WiFi is half duplex, as I’m sure you know. It results in latency which can contribute to the problem you reference. My view is a dedicated WiFi 6 hop is good enough for Roon as a system (core, library and optionally DSP like HQPlayer). However multiple or shared hops have the potential to create problems that might require additional management.
If I’m honest, a symptom of HQPlayer which I hate is its dropouts when you reach the performance limits of the hardware in your system. I’m glad Roon deals with network hiccups differently, and that when it does play up there are some diagnostic tools that allow you to decide what to do. So I guess what you propose would be OK if Roon then told you what was wrong. Otherwise I’m not convinced it wouldn’t just create more complaints.
Henry,
Honestly, I don’t expect to get anywhere with what amounts to a request to change something as core as RAAT. But I felt I had to make the point.
From a practical perspective, the conclusion I’ve come to is that I need to run Ethernet to any Roon endpoints as it cannot be relied upon to deal with poor quality wifi in an acceptable manner.
This is a pity, as all the other streamers manage it. And with my Network Protocol Coder hat on (that I haven’t worn for decades…) it doesn’t appear very difficult to do…
The underlying issue is that the receiver is starved of audio packets as WiFi contention causes a burst of retransmits. So, the receiver cannot continue playing without discontinuity. In practice, that means playback glitches, which I’ve experienced when listening in my backyard over WiFi sometimes, but eventually the delay is too long, and Roon skips to the next track. It’s not clear what else it could do in this situation; it’s not as if the receiver can fill in the gaps with something reasonable.
The situation is different with lossy streaming (audio or video) where the link can drop down to lower bit rates to adapt to lower effective link performance. But that’s not Roon’s design point.
We’ll have to disagree on this. I still can’t see how Roon skipping onto the next track solves anything here.
The issue is NOT that there is a comms difficulty between the Roon server and the source of the track (which may be a local drive, or a remote streaming service over the internet). There is NO difficulty with sourcing the track data and hence no reason to skip that track.
The issue is getting the data the server has over to the end point. In this case, the logical tactic is for the endpoint to simply wait until the connection to the server improves - i.e. pause and restart when its buffer re-fills.
Interestingly when I have experienced this, it has always been when using Tidal in my present configuration (library on the same box) or when the library is on a NAS with a slow WiFi hop in the path. It is possible we are mixing symptoms and causes.
Actually, I think this error recovery strategy - of skipping to the next track - seems to be utilised in a number of (all?) error situations. Certainly, I have seen it happen when the server can’t contact Qobuz - which also doesn’t make sense. If Qobuz is gone, trying to skip to the next track (presumably also Qobuz) is not a sensible strategy either.