Playback instability when changing sample rates [dCS Network Bridge]

Link from most recent log folder:

Hey @stevebythebay,

@danny, @noris, and I have been looking over your issue and trying to get a clear sense of what’s going on here, and why it’s been going on for so long. I’ll be discussing this with our QA team on Monday as well, and I think your frustration about how long this has taken is valid.

There are a few things I want to be clear about here as we move forward and make sure we get this resolved for you.

The first is that we’re currently looking into some memory usage issues that can affect Roon on Linux. Obviously, we have a huge number of users running on Linux, so one major component of our investigation has been identifying why this seems to affect some users and not others.

To be clear, the tests we’ve been asking you to run are not about asking you to change your setup. The idea is that something is making your setup less stable than other Linux Cores, and there are a number of possible factors here: things like the contents of your library, type of storage used, networking gear used, networking settings in place, etc. By changing these elements one by one and seeing when things get more stable, we’re generally able to isolate key factors and resolve issues.

Ultimately, the goal is to identify the key factor or factors, and make this bug happen in our dev environment so we can understand the issue and resolve it. We know these tests can be time consuming, so I appreciate your patience with the various configurations we’ve asked you to try. Our intention was never to have you change your setup long term.

The good news is that we’ve made some progress on this issue over the last week or so, and hopefully we are getting close to the point at which we’ll be able to put a fix in place, test it, and release it. We will keep you updated on our progress, but obviously this is still an active investigation.

The second point I wanted to make is that, reading back over this thread, I think I could use some clarifications about the exact symptoms and testing that’s taken place. Apologies if I am making you repeat yourself, but I want to make sure we’re 100% clear here so we can resolve this for you soon.

  • The issue with the dCS network bridge after sample rate changes is different from the more generalized “death spiral issue”, correct?

    • For both issues, how often do they occur? Once every 30 mins of listening? Once every 5 hours of listening? Do they appear at the same time, or completely unrelated?

    • I assume you can’t reproduce this instability at will, right? It happens randomly?

  • When you say “playback instability”, can you explain exactly what you mean? Some helpful details would be to know:

    • Is there an error? Or does playback just stop?

    • Do your remotes lose connectivity when playback stops? If you use more than one remote, do they all fail to connect when this instability happens, or are they all staying connected? Or a mix?

    • Does this only happen with the dCS network bridge? Or have you seen this happen while playing to other zones?

  • Have you tested with a simplified network setup?

    • Ideally, you would put a couple files on a USB stick, plug it into the Nucleus, connect the Nucleus and dCS network bridge directly to the router (no switches, wifi, EoP, etc) and see if things are stable?

    • For the moment I’m not asking you to run this test, but this would be the first thing I would do to eliminate networking factors and confirm the basics are working. If you’ve already run this test, please let us know how it went.

I know this has dragged out, Steve, and I appreciate you taking the time to summarize the above so we can be clear on where we stand. Noris and I will be looking for your response, and I’m confident we can get this stable for you.

Hope you’re able to identify and fix these crashes. My thought about a likely source, as I said above on June 16th, is a memory leak. Some process failing to return resources, like memory, eventually leads to an OS stall and crash, even with modern operating systems that avoid pinning code, even the kernel. As for the “Transport: Failed to select Roon as the current source” error, I’ve had this happen occasionally but Roon seems to recover relatively quickly. In the past I’d sometimes need to switch off the dCS’s power to get things working again. Of late, Roon seems to “negotiate” itself out of the issue, I suppose, with the dCS Network Bridge.

See responses:

The issue with the dCS network bridge after sample rate changes is different from the more generalized “death spiral issue”, correct?

For both issues, how often do they occur? Once every 30 mins of listening? Once every 5 hours of listening? Do they appear at the same time, or completely unrelated?

These are relatively infrequent though unpredictable. I can get them many weeks apart or, as you’ve seen in the past week in the case of what I call a “death spiral”, a couple of times for the sudden outages and restarts of the Roon Core: May 1st and then June 11th, as I recall. As for the sample rate change issue: this happens only occasionally and seems to self-correct more quickly than the “death spiral”. I am unclear whether this is a problem with the dCS or the downstream Berkeley DAC. I have been using a 1000ms resync delay, thinking that I might need to push this to 1500ms if the DAC is not responding in a timely manner to the dCS and that, in turn, is affecting how the Roon Core is behaving. But your guys likely can help me understand if I’m barking up the wrong tree.

I assume you can’t reproduce this instability at will, right? It happens randomly?

Yes, it’s really random. Since the last reported occurrence I’ve been playing music via Roon for 12 hours each day and haven’t had either issue crop up.

When you say “playback instability”, can you explain exactly what you mean? Some helpful details would be to know:

Is there an error? Or does playback just stop?

I receive no error message. Just sudden music stoppage. I go to the iPad app called Fing to see if I can ping the Nucleus and see that its services are running. In all cases both are OK. However, attempting to open a browser to the Nucleus fails. About a minute later I can get to the Nucleus browser window. That suggests a restart of Roon Core by the system, on its own. And eventually I can use the iPad Roon app to start music playing once again.
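The manual check described above (confirm the Nucleus is reachable, then try its web page) could be automated so that each outage window gets a timestamp. The following is only a minimal sketch, not a Roon tool: the address is a placeholder, and port 80 for the web interface is an assumption that does not come from this thread.

```python
import socket
import time
from datetime import datetime

NUCLEUS_HOST = "192.168.1.50"  # placeholder address (assumption)
WEB_UI_PORT = 80               # assumed port for the web interface


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_transitions(samples):
    """Given (timestamp, reachable) pairs, return the points where the state flips."""
    transitions = []
    previous = None
    for stamp, reachable in samples:
        if previous is not None and reachable != previous:
            transitions.append((stamp, "UP" if reachable else "DOWN"))
        previous = reachable
    return transitions


def watch(host=NUCLEUS_HOST, port=WEB_UI_PORT, interval=10.0):
    """Poll forever; print a line whenever the web interface goes down or comes back."""
    previous = None
    while True:
        reachable = is_port_open(host, port)
        if previous is not None and reachable != previous:
            state = "UP" if reachable else "DOWN"
            print(f"{datetime.now().isoformat(timespec='seconds')} web UI {state}")
        previous = reachable
        time.sleep(interval)
```

Calling `watch()` from any always-on machine on the same network would give a list of DOWN/UP times that can be lined up against the Roon logs. Only state changes are printed, so the output stays short even over weeks.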

Do your remotes lose connectivity when playback stops? If you use more than one remote, do they all fail to connect when this instability happens, or are they all staying connected? Or a mix?

Yes. Only have the one iPad for this. Have yet to try and access the Roon Core via desktop, since it’s not in the same room.

Does this only happen with the dCS network bridge? Or have you seen this happen while playing to other zones?

I have never, ever seen it happen with anything but the dCS Network Bridge. I have two other Roon Ready units talking to the Nucleus, an Audio Alchemy DMP-1 and microRendu. These are wired to other Eero WiFi access points (mesh network) that communicate to the one Eero that is wired to the Cisco SG200-08 8-port Gigabit Smart Switch, which in turn is wired to the QNAP NAS, Nucleus, and dCS devices.

Have you tested with a simplified network setup?

I think what I have is about as simple as it gets. Need the WiFi connection to talk to the Nucleus and the QNAP for music while the dCS and Nucleus need their connections, as well. And there was at least one instance where a failure happened while listening to an over the air radio station (no QNAP involved) which behaved in similar fashion to a local playback of an album from the NAS.

Ideally, you would put a couple files on a USB stick, plug it into the Nucleus, connect the Nucleus and dCS network bridge directly to the router (no switches, wifi, EoP, etc) and see if things are stable?

What you suggest is plausible, but given that the problem happens rather infrequently, and very randomly, this would pretty much hinder my enjoyment of music for weeks or months. Not something I’m willing to do. I’d rather your guys build a software mouse-trap to catch Roon in the act.

For the moment I’m not asking you to run this test, but this would be the first thing I would do to eliminate networking factors and confirm the basics are working. If you’ve already run this test, please let us know how it went.

The only other means for doing what you suggest would be to put my 6 TB of QNAP-based music on another external hard drive, and USB attach it to the Nucleus. Since the Eero has two Ethernet attachments I could make this work, if necessary. If it comes to this, I’d like a procedure to preserve my existing Roon database (offline) while creating a new one to support the USB-based drive.

I know this has dragged out, Steve, and I appreciate you taking the time to summarize the above so we can be clear on where we stand. Noris and I will be looking for your response, and I’m confident we can get this stable for you.

@stevebythebay,

I think that you are missing a key detail here. What you are observing is a failure that is being seen out in the wild, but it’s only being seen consistently in your system. The fact that the failure is occurring in your setup means that your setup is the only place that Roon can troubleshoot it. If I could reproduce the behavior consistently in my lab (and I’ve tried) then Roon wouldn’t be making these requests of you.

I understand your frustration, but without your assistance this will not get fixed in a timely manner unless Roon is able to find another setup which has all of the knobs and levers aligned just right to produce a failure. @mike is asking you to assist by performing some of the troubleshooting steps that his team would perform in their lab if they could produce the failure in their lab. If you don’t want to take the time to do that I completely understand, but that effectively limits the amount and quality of work that Roon can do to troubleshoot and correct this issue.

I wanted to address some specific points:

I’m sorry, but your statement is false. Your network is fairly complex as it involves two classes of hardware which were never intended to work together. The Cisco managed switch is a known trouble spot due to the complexity of configuration surrounding multicast, QoS, EEE, flow control, and spanning tree. I would have to sit down and calculate it, but I would guess that the configuration permutations on that switch for those configuration classes likely total in the thousands. If they’re not ‘just right’ then some strange and inconsistent behavior results. This is one of the reasons why Roon (and dCS) recommend against the use of managed switches. They become unsupportable unless the end user knows exactly how to configure them for a particular use case.
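To give a rough feel for how quickly those permutations multiply, here is a small sketch. The option counts per feature are made-up illustrative values, not taken from the Cisco SG200-08 documentation; the only point is that independent settings multiply rather than add.

```python
from math import prod

# Hypothetical number of distinct settings per feature (illustration only).
OPTION_COUNTS = {
    "multicast_snooping": 4,
    "qos": 8,
    "eee": 2,
    "flow_control": 3,
    "spanning_tree": 8,
}


def count_configurations(option_counts):
    """Each feature is set independently, so the totals multiply."""
    return prod(option_counts.values())


if __name__ == "__main__":
    print(count_configurations(OPTION_COUNTS))
```

With per-port settings the real number would grow further, which is why swapping in an unmanaged switch removes so many variables at once.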

The Eero mesh network is in a completely different class of device and mesh networking itself is in a completely different class of networking. These are consumer-oriented devices and in attempting to be plug-and-play they tend to make certain assumptions about the configurations and capabilities of other devices on the network. Because of the way that mesh networks behave I can see a scenario in which a misconfiguration of any of the items I listed above could cause some of the problems that you’re observing. As a side note, this is the same reason why Sonos has such trouble on managed networks and, in fact, Sonos will not provide any support at all if a managed switch is in the network.

‘As simple as it gets’ means your core, NAS, and endpoint(s) connected to a typical router/switch/access point (typically sold as a ‘Router’ at the local electronics store) that is known to work without issues. If you can’t plug everything into your cable company router then I’m certain that between dCS and Roon we can come up with the hardware to loan you to try this testing. This would eliminate or confirm the Cisco switch and Eero Wi-Fi as potential failure points and significantly streamline the troubleshooting process. You could plug this device into your router and connect the Core, NAS, and endpoint to it as Mike requested for testing (just let it play while you aren’t listening) and then go back to your setup for actual listening.

Your DAC has a grand total of zero to do with this problem. You could disconnect it and bury it in the back yard and you would still observe the crashes that you’re seeing. There is no communication from your DAC upstream to Roon or the Bridge, so neither one of them knows that it’s there.

Please do not adjust the resync delay setting, as this only has an impact on whether or not the beginning of a track is cut off when your DAC changes sample rates. Having said that, in the past the delay has modified the behavior of the bridge during rate changes, and although it’s not been observed to do that with the current firmware I’m not 100% certain that the setting is benign. Playing with it will add yet another multiplier to the configuration permutations.

Again, I completely understand the pain factor here and agree that it would be a hardship. That’s fine, but without your assistance Roon’s ability to troubleshoot this issue is greatly reduced. They will fix the problem eventually, but it’s going to take them much longer to do so.

Your comment about a ‘software mouse-trap’ made me chuckle. Where should the Roon team put that trap? Roon is a collection of different software modules that run independently and talk to each other to accomplish the end goal. As of yet the module which is experiencing the failure that’s producing the symptoms cannot be determined. So, where should the Roon development team put that trap? Which of the hundreds of thousands (millions?) of lines of code should be monitored? Roon could theoretically log every single function call, but the process of doing that would likely change the behavior of the software so much due to log file I/O that it might completely mask the bug in the process.

If you aren’t willing to participate in the troubleshooting process that is your prerogative, but that means that you are likely going to live with this behavior for some time to come. If you’re upset that Roon charged you for the software and can’t wave a magic wand to correct this problem, then I suggest that you talk to them about a refund and then go looking for a piece of software which better fits your requirements. If you’re concerned that the Bridge is to blame here then I’m certain that something similar can be worked out with dCS.


Seems I’m not the only one seeing problems according to Mike:

“The first is that we’re currently looking into some memory usage issues that can affect Roon on Linux. Obviously, we have a huge number of users running on Linux, so one major component of our investigation has been identifying why this seems to affect some users and not others.”

I’m willing to help try and isolate the problem but have never been told that my Cisco switch or any managed switch should be avoided. I do agree that these are complex beasts and will run to Home Depot today and pick up a “TP-LINK TL-SG1005D” 10/100/1000Mbps Unmanaged 5-Port Gigabit Desktop Switch. That will allow me to avoid using the Eero for any switching and force me to move my music collection over from the QNAP to a USB drive. Makes one simple change in the environment, rather than many which would cloud problem source identification.

By the way, I recall your suggesting that I turn off flow control on my Cisco. However, in a recent (June 15th) document on the Roon Knowledge Base http://kb.roonlabs.com/Networking_Best_Practices under the heading Advanced Networking … Managed Switches … it states:

"Managed switches can be very robust, but they are often designed for professional installation, so in many cases the out-of-box configuration is not right. If your switch has a “flow control” setting, please make sure that it is enabled. Also, make sure that the switch is not performing any sort of throttling that might impact communication between cores, storage, remotes, and/or audio endpoints. Finally, ensure that the switch is configured to pass multicast and broadcast traffic. If in doubt about any of this, try temporarily replacing your managed switch with a “dumb” switch to see if things improve."

To avoid any “confusion” I’ll employ the unmanaged switch I’m purchasing today, and see if that has any positive effect.

As for Eero (mesh networking generally) again Roon in the same document does not suggest avoiding such networks. And in my case the Eero is simply functioning as a WiFi access point allowing the iPad to communicate with Roon. The only noticeable issue I regularly have with using this setup today is losing connection with the Roon core, having to kill the Roon app on the iPad and restarting it. Again, this may be due to the Cisco or some other gremlin. Not even clear that the Roon logs show such events.

I fully understand the level of complexity in debugging code, especially when the running code has been pared down to the bare minimum so as to optimize performance of Roon. That’s why I suggested, even if only on a temporary basis, adding code into my specific Roon Core which would better trap/trace. I do know that the downside is that introducing additional code might itself “hide” the symptom.
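One common way to keep a trap like this cheap is an in-memory "flight recorder": keep only the last N events in a fixed-size buffer and write them out only after a failure, so there is no log file I/O during normal playback. This is a generic sketch of that idea, not a description of how Roon logs; the class name, method names, and event strings are all hypothetical.

```python
from collections import deque
import time


class FlightRecorder:
    """Keep only the most recent events in memory; write nothing until asked."""

    def __init__(self, capacity=1000):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._events = deque(maxlen=capacity)

    def record(self, message):
        """Store the message with a monotonic timestamp; no disk access here."""
        self._events.append((time.monotonic(), message))

    def dump(self):
        """Return the retained messages, oldest first."""
        return [message for _, message in self._events]


recorder = FlightRecorder(capacity=3)
for step in ["open stream", "set sample rate", "start playback", "zone lost", "restart"]:
    recorder.record(step)
```

After the loop, `recorder.dump()` holds only the three most recent events. The trade-off is that anything older than the buffer is gone, and even this small amount of extra work can still shift timing enough to hide a race.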

Have now replaced the Cisco smart switch with the TP-Link unmanaged switch. Only question is whether there is any reason to use the designated WAN port for the Eero or if this really doesn’t matter. I’ve found no documentation at the TP-Link site that clarifies if the designated WAN port really is any different, either in function, priority, etc., versus the other 4 LAN-designated ports.

Still not sure why this happens, but switching this evening (9:25 PM PDST) from music (bit stream 96k, a Simon & Garfunkel song) to another 96k album, Fleetwood Mac, resulted in a Transport error message as in the past: “Transport failed to select Roon as the current source”. dCS behaving like a deer caught in the headlights.

Even trying to play local radio station resulted in the same error message. Tried everything to revive things on the Roon side including restarting the Core and rebooting the Nucleus. All failed until finally resorting to powering off/on the dCS Network Bridge. That returned everything to normal.

This instance may simply be a case of switching playback too quickly from one album to another. The Nucleus may just outrun the dCS’s ability to keep up.

Let me know if you wish to have me send latest logs.

Again this evening I found that playback stopped. I had set up radio play in Roon. Sometime afterward playback failed.

Rebooted the Roon core. The dCS was no longer visible as a zone device.

Had to power off the dCS to get Roon to see the dCS once again. After reboot I didn’t need to specify the dCS as Roon was pointing to it once again.

Clearly there’s something occurring between Roon and the dCS that’s not quite right.

Once again, let me know if you wish to review current logs.

One bit of positive news: ever since I swapped the Cisco for the unmanaged TP-Link switch I’ve had few, if any, problems accessing Roon via the iPad. Previously the app had to be restarted or I had to wait till Nucleus/Roon became “available” as a connectable server.

Swapped the TP-Link switch for a D-Link that offered me the chance to use my Uptone Audio LPS-1 instead of the provided power supply. Has proven a quieter solution.

Roon core died at 3:24 PM PDST today during playback of Lionel Richie “You Are”. Roon restarted itself about a minute or so thereafter, and pointed to the next song on the album, “You Mean More To Me”. I started it playing and paused it and restarted “You Are” just to verify that the song data was not a problem.

As always, let me know if you’d like me to upload the logs.

Hope you’re making progress on fixing this issue. Guess the notable Roon stoppages really began at the end of June. So, it began for me in build 323 and continues with 334.

Hello @Stevebythebay,

Thank you for your recent reports. I can confirm that our systems have successfully received the most recent logs and they have been added to QA’s ongoing investigation. I will be sure to update you once I hear anything else.

Thanks,
Noris

Great. I suppose you’ve enabled something to identify when my Roon core’s status has a significant change.

I’ve also gone ahead and added JB Radio2 to the internet radio list, just to see if its 192/16 stream has any impact on triggering the problem I’ve been experiencing.

Interesting that this evening around 9:35 PM PDST I was playing a radio station from Roon called JB Radio-2 to my Audio Alchemy DMP when it stopped. Not only did Roon crash but it didn’t seem to revive itself for about a half hour. Prior to that I tried to access its web page via IP address and then added :9100, but that still did not work. Then I tried enabling Roon Bridge from one of my Macs just to see if that might “kick” things off. Maybe that got the Nucleus to take some action, though I doubt that’s what caused it to relaunch Roon.

I see numerous references to a fiveaccountserver, especially after reboots that seem to happen quite frequently in the log. Might someone or a bot at your end happen to trigger reboots? Anyway I do see quite a bit of activity in trace/debug action. Hope it’s helping.

This morning, not clear when, Roon was playing a local radio station and suddenly it seemed like no sound was coming out of the system. I’d thought something really severe had occurred. After playing with the preamp’s volume control it appeared that, for whatever reason, the volume had been reduced quite drastically. This, even though Roon has been set for fixed volume.

That led to working with the guys at my dealership in Berkeley to posit that either the Berkeley volume had somehow readjusted itself, which proved incorrect, or the preamp had somehow gone “bad” on volume/output, which also turned out not to be the case. Or something else.

The culprit, after starting up the iPad’s dCS app, which I rarely ever use, turned out to be the dCS. It had altered its volume from 0 dB to -43 dB. I’ve no idea how this might have happened as I’d been playing the Roon radio for a local station all morning and it happened “out of the blue”.

Maybe you’ll find something in the current Roon log on this glitch. Wonder if anyone else has seen this happen with their Network Bridge.

Once again, and as always, let me know if you need me to do anything to help you in problem determination/ source identification.

Today another failure of the Roon Core. Unable to get on to the browser for quite some time. Finally came up and you can see the result attached.

Soon after it restarted the Roon Server on its own. And once again this happened while listening to a radio station over the 'net, no local music retrieved at all.

Hope your guys/gals are getting closer to some answer for my issues. Is it possibly a Nucleus hardware problem?

Hey @stevebythebay,

Thanks for your updates here. I’ve passed your new information along to the team for their continuing investigation. As soon as new information becomes available we will be sure to update you.

Thank you for your patience here, Steve. We genuinely appreciate it!

-Dylan


I feel like I’ve become a “hot potato” problem in support, now with so many people having had a hand in responding. Been quite some time since I began reporting the issue. Am I the only one with these symptoms of the Roon core crashing? If so, could it be somehow associated with hardware rather than software? Assuming the dCS has nothing at all to do with what’s been happening, it really appears to fall back on the Nucleus / Roon Core as being the culprit. If you’d rather someone private message me, let me know and I’ll provide you with my email if you don’t already have it from prior communication with Danny or others in support.

Hey Steve,

Dylan has been working on your issue, and I wanted to make sure we’re clear about our next steps regarding the instability you’ve reported with your Roon Core (as distinct from the dCS Bridge issues) and I also wanted to reassure you on a few of these questions. I understand this has gone on longer than any of us would like.

It’s August and we have some staff on vacation, so that’s what you’re feeling here. Everyone you’ve dealt with on our support team reports to me, as does the QA team that’s been working on the issues you’ve reported. I understand you’ve reached out to Danny as well over private message and I think his cell phone, and I can assure you those come to me as well.

So first, let me just say if you’re not happy with the support you’re getting from our team, you can always shoot me a PM directly.

Moreover, I want to be clear that your issue has been on my radar, and I can assure you that everyone on our side wants these issues resolved as much as you do. Sometimes we’re able to take a report, turn it into a reproducible bug, and have our developers resolve the problem in a day or two. In this case things have not been quite as easy, since we still don’t have a clear way to trigger this instability on demand.

Your case has repeatedly come up in our regular meetings with the senior development team, and our QA team continues to test against theories about what might be going on here, with the goal of finding clear, crisp steps for reproducing the problem. So I can assure you this issue has been a priority across the support, QA, and dev teams. It’s just a very tricky issue.

I should also mention that this issue is not limited to Nucleus, or even Roon OS. We don’t have any reason to think it has anything to do with your specific hardware or Nucleus at all: we’ve seen similar reports on ROCK, and on Linux Cores that are not running Roon OS. So, I want to be clear that this is a bug in Roon software, and one we’re actively working on resolving.

Based on what we’ve heard, it seems like this issue is affecting you more often than other reports. This is one of the reasons we were hoping to get more information about how Roon performs in your environment – understanding why you’re seeing this more often is likely to yield invaluable clues about the underlying causes of this issue.

The tests we’ve asked you to run on your setup are all targeted at helping us understand why your setup seems to be affected more often than others. Seeing whether you experience these issues, for example, without local content loaded or with a fresh database will help us understand this issue, but of course it’s not your responsibility to test for us. It just means we all need to wait until QA finds a way to provoke these symptoms on command.

So, to summarize, we are looking at this on two tracks: why does this known issue happen occasionally for some members, and why does it happen so frequently for Steve? The testing we’d requested in the past would have helped with the latter, but at this point I think it’s probably better for us all to focus on the former, which means we’ll let you know if we have any more questions.

If you haven’t heard from us, we simply don’t have an update yet. I can assure you our team is looking forward to resolving this soon, but for now there’s simply no way I can give you an ETA about when a breakthrough is going to happen.


I really appreciate your detailed response. And I’m surprised that the problem appears to pop up in so many environments. What is clear of late is that even without playback of local files, that is, simply playing radio stations over the internet (in my case KDFC in particular), the Roon Core has crashed, independent of Roon Ready device, as I recall. So that removes my NAS and local playback as possible sources.

I gather you can determine any and all details of my Roon software configuration. The only other bits I’ve not brought up are the specific cabling: Wireworld Ethernet connects the Nucleus to the D-Link Ethernet switch, a DGS-1005 unmanaged. The Eero WiFi access point is wired with a more conventional Cat 6a cable. I typically use my system at least 5 hours a day, so that may figure into my level of failures. And I’ve no idea whether or not the restarts of the Core are fully shown in the logs. I never power off the Nucleus unless there is absolutely no way to kick the system into providing access via the browser. Normally I can always ping the Nucleus and see all the relevant services, and usually reboots of the Core eventually (within at most an hour) happen.

Seems there is some resource or set of resources that are getting used/abused over time, or some unique set of conditions that conspire (a perfect storm situation) to cause a restart. What I’ve found in my many decades in tech is to never rule out anything. I’ve seen pretty weird environmental factors conspire to create similar problems.
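If a resource really is being consumed over time, it should show up as a steady upward trend in periodic measurements. As a rough sketch of that idea, the function below fits a straight line to (hours elapsed, memory in MB) samples and flags sustained growth. How the samples are collected is left open, and the 5 MB/hour threshold is an arbitrary assumption, not a figure from Roon.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


def looks_like_leak(samples, threshold_mb_per_hour=5.0):
    """Flag sustained growth above an (assumed) threshold."""
    return growth_per_hour(samples) > threshold_mb_per_hour
```

A trend line is less easily fooled than comparing the first and last reading, since a single spike during a library scan would not by itself look like a leak.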

Hope your guys are able to pinpoint what is triggering all of this.