Find / Choose Roon Core (OSX -> Linux)

Core Machine (Operating system/System info/Roon build number)
Roon 1.6 (416) on CentOS 7.6 (192.168.1.5)

Network Details (Including networking gear model/manufacturer and if on WiFi/Ethernet)
2x Unifi AP AC-Lite wifi (all devices on same SSID)
1x Netgear Prosafe switch

OK: 192.168.1.15 (control, OSX10.12.6) -> wifi -> 192.168.1.5 (core)
OK: 192.168.1.106 (control, iOS12.2) -> wifi -> 192.168.1.5 (core)
NOT OK: 192.168.1.10 (control, OSX10.10.5) -> netgear prosafe -> 192.168.1.5 (core)
SOMETIMES OK: 192.168.1.14 (control, OSX10.14.4) -> wifi -> 192.168.1.5 (core)

Audio Devices (Specify what device you’re using and its connection type - USB/HDMI/etc.)

  • KEF LSX wifi
  • Grace Mondo+ wifi
  • Squeezebox Radio wifi

Description Of Issue

Controllers are unable to find core or inconsistently find and lose core. I’ve verified via tcpdump and netstat that 9003udp and 9100-9200tcp are listening, connectable from controllers, and receiving data.
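For reference, the TCP-side connectability check amounts to something like this (a minimal Python sketch; the host and port range are the ones from my setup above, while UDP 9003 was checked separately with tcpdump):

```python
import socket

def probe_tcp(host, ports, timeout=1.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    open_ports = []
    for port in ports:
        try:
            # create_connection performs a full TCP handshake, then closes
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass  # closed, filtered, or timed out
    return open_ports

# The host and range from my environment above:
# probe_tcp("192.168.1.5", range(9100, 9201))
```

A closed or filtered port simply drops out of the returned list.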

With the controller on .14 (OSX) I can consistently reproduce: find Roon core, connect, restart, ‘looking for core…’, restart, maybe find core / maybe not, etc.

With a couple of recently installed controllers on OS X, discovery is stuck on ‘looking for core’.

Using the scan feature (under ‘looking for core’ -> help), there is no progress indicator or output.

What are some tests I can run, or logs I can examine, to determine what’s wrong here?

Hi @Me_Van,

Welcome to the community! You mentioned that the WiFi connections mostly work as expected but the Prosafe connections do not. Which Prosafe model exactly are you using, and is it managed?

If the switch is managed, I would make sure that “multicast routing” is enabled on it, and also check that multicast in general is set up properly.

Since you mentioned that you have CentOS here, you might want to also check out this thread to make sure that you have your multicast firewall properly configured:

Although the above thread is mostly for Android, it may contain the necessary info to set this up properly for iOS as well. Do let me know if that helps!
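As a general sanity check that the OS and firewall allow multicast joins and delivery at all, a small loopback self-test can help. This is only a sketch: the group and port below are arbitrary test values, not Roon’s actual discovery parameters (which aren’t documented here).

```python
import socket

GROUP = "239.1.2.3"  # arbitrary test group, NOT Roon's discovery group
PORT = 50000         # arbitrary test port

def multicast_self_test(group=GROUP, port=PORT, ifaddr="0.0.0.0", timeout=2.0):
    """Join `group` on interface `ifaddr`, send one datagram to it, and
    confirm it loops back. Returns False on timeout or any socket error;
    a failure usually points at a firewall or interface setting."""
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        recv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        recv.bind(("", port))
        # join the group on the given interface (0.0.0.0 = kernel default)
        mreq = socket.inet_aton(group) + socket.inet_aton(ifaddr)
        recv.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        recv.settimeout(timeout)
        # send out the same interface, with loopback of our own datagram enabled
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(ifaddr))
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
        send.sendto(b"probe", (group, port))
        data, _ = recv.recvfrom(64)
        return data == b"probe"
    except OSError:
        return False
    finally:
        recv.close()
        send.close()
```

If this returns False on an interface you expect to work, the problem is at the OS/firewall level rather than in Roon.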

Thanks,
Noris


Hello Noris --There is no firewall on the core. Just to be certain, I ran a test with explicit multicast rules as shown in the post you provided. This didn’t produce consistent results: I can sometimes connect to the core with the rules in place, sometimes not. Also, I can sometimes connect from certain clients via wifi, sometimes not. The client with a wired connection is also inconsistent. The only consistency I see is from the host 192.168.1.15 running OSX10.12.6. This is the only client that can connect to the core consistently.

Given these findings, I’m nearly certain this is not a network issue.

Here’s an update from what I posted earlier:

As noted earlier, I’ve verified the flow of traffic, port availability, multicast groups, etc. using tcpdump and netstat. However, since I’m not very knowledgeable regarding multicast, I’m not sure what specifically to look for.

Are there any low level tests I can run to see if this is a discovery issue? What does the scan do in “looking for core -> help”? Is there a configuration cached on the clients showing last core? What is the sequence of network traffic for a successful connection to the core? What core side log should I be looking at / what should I look for?

Hi @Me_Van,

Thanks for giving those suggestions a try. Before looking into specific settings on multicast I think it’s best that we try to simplify things to isolate the issue further to just one cause. With this in mind can you let me know the following info?

  • What is your router here? Are you using the CentOS machine to perform routing, or do you have a standard router? If you have a router, can you let me know the model/manufacturer?

  • If you connect the 192.168.1.10 remote via WiFi instead of through the Prosafe switch, does that change the behavior? Does it then remain stable?

  • Let’s try to take the CentOS out of the equation for the time being. Can you please try using one of your OSX remotes to host the Roon Core as a temporary test?
    You can switch Cores as many times as you’d like with one Roon subscription and the current Core will simply be un-authorized while you perform this change.

Please let me know whether changing the Core’s OS results in different behavior here, along with the other information above, when possible.

Thanks,
Noris

Hi,

Although it wasn’t the problem (as seen in the thread referenced), what was important in finding the root cause was taking out as much of the network stack between the CentOS server and the Android (in your case iOS) remotes as possible, to see what remained to cause issues.

Are iptables / firewalld (which are you using?) turned off / have no rules on the CentOS Roon core?

And your switching is dumb (non-managed)?

Hello --To answer the questions asked earlier:

  • iptables: There are no rules whatsoever on any of the Roon cores tested.
  • There are two switches in Diagram 1 below. The Netgear is managed, but given the various test cases I’ve done I’d rule out its configuration as the cause. Same with the ERX. In fact, I’ve moved configurations (Roon controller/core) established in Diagram 2 into the environment in Diagram 1 and seen the same results.
  • Only default routing tables on all hosts.
  • And to anticipate questions re: Diagram 2 below: the WRT54G has very limited logic. There’s a UPnP option set to ‘enable’, but that’s the only item that might apply. As it’s used in this configuration, I’m pretty sure it qualifies as a hub (not a switch).

The behavior I’m seeing is very consistent and reproducible: when .15 is a controller it will locate whichever core is available, or reconnect to whichever core it’s assigned to, in whatever configuration I set up. This happens 100% of the time. The same is true of the iOS 12.2 device.

With other controllers:

I restart a controller repeatedly and find ‘Choose your Roon Core’ or ‘Waiting for Remote Core…’ presented 90% of the time. 10% of the time the controller is either offered a core or the reconnect completes successfully. This is the case with different cores used in Diagram 1 and in the isolated case shown in Diagram 2. Cores have been run on: .5, .25, and .15.

Something I’ve found which allows the core to be available almost consistently is restarting the core just seconds before the controller restarts.

I’ve spent a huge amount of time moving the fundamental components around. What I’d like to look at now are the questions I asked earlier. If there are details you could provide, particularly on the discovery/multicast side, it would save me a lot of time.

Diagram 1

Diagram 2

Is there a way to increase verbosity of logs? What should be happening after this initial negotiation? I’m able to connect from 192.168.1.10 to Roon Core @ 192.168.1.5:9101 using netcat/telnet. Why would the rest of this be failing?

05/22 16:45:52 Trace: [raat] RAATServer discovered: RaatServer dirty @ 192.168.1.10:54679
05/22 16:45:52 Info: [raatserver] GOT SERVER 004288cf-2741-9f09-7011-f06686e27b82::1dd4f225-e9c3-416e-97cf-afb8dfc8a5fd @ 192.168.1.10:54679 dirty PROTOVER=1 RAATVER=1.1.36
05/22 16:45:52 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 1)
05/22 16:46:02 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Retrying in 500ms
05/22 16:46:02 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 2)
05/22 16:46:12 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Retrying in 750ms
05/22 16:46:13 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 3)
05/22 16:46:23 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Retrying in 1125ms
05/22 16:46:24 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 4)
05/22 16:46:34 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Retrying in 1687ms
05/22 16:46:36 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 5)
05/22 16:46:46 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Giving up
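As an aside, the retry intervals in that trace (500, 750, 1125, 1687 ms) look like a plain 1.5x exponential backoff, truncated to whole milliseconds. A sketch reproducing the sequence (inferred from the log above, not from any Roon documentation):

```python
def backoff_delays(initial_ms=500, factor=1.5, attempts=4):
    """Reproduce the retry delays visible in the trace above."""
    delays, d = [], float(initial_ms)
    for _ in range(attempts):
        delays.append(int(d))  # truncation matches the logged 1687 (not 1688)
        d *= factor
    return delays

# backoff_delays() -> [500, 750, 1125, 1687]
```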

From one client (controller) I see a successful connection made to the core. From another I see it connect to itself and not the core. Why is this happening?

This client is successful:

/var/roon/RoonServer/Logs: 
05/22 19:41:08 Trace: [raat] RAATServer discovered: RaatServer hello @ 192.168.1.15:51436

hello# netstat -na | grep 51436
tcp        0      0 192.168.1.5:36682       192.168.1.15:51436      ESTABLISHED

This one is not:

/var/roon/RoonServer/Logs: 
05/22 19:40:22 Trace: [raat] RAATServer discovered: RaatServer arvik00 @ 192.168.1.14:60965

arvik00# netstat -na | grep 60965
tcp4       0      0  127.0.0.1.60965        127.0.0.1.60966        ESTABLISHED
tcp4       0      0  127.0.0.1.60966        127.0.0.1.60965        ESTABLISHED
tcp4       0      0  *.60965                *.*                    LISTEN
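To make the contrast concrete, here’s a trivial Python check over those netstat endpoints (it handles both the Linux "ip:port" and macOS "ip.port" formats):

```python
import ipaddress

def split_addr(addr):
    """Split a netstat address: "ip:port" on Linux, "ip.port" on macOS."""
    host, _, port = addr.replace(":", ".").rpartition(".")
    return host, int(port)

def is_loopback_conn(local, remote):
    """True when both endpoints are loopback, i.e. the client connected
    to itself rather than reaching the core over the LAN."""
    return all(ipaddress.ip_address(split_addr(a)[0]).is_loopback
               for a in (local, remote))

# From the outputs above:
# is_loopback_conn("192.168.1.5:36682", "192.168.1.15:51436")  -> False (good)
# is_loopback_conn("127.0.0.1.60965", "127.0.0.1.60966")       -> True  (bad)
```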

Hi @Me_Van,

It is not clear whether you ran the previous test I suggested; can you confirm that this was performed?

The best way to troubleshoot this issue is to make sure that everything is stable in a standard environment first:

  • OSX as the Core
  • Another OSX as the controller
  • Connected through an un-managed switch or directly to the router via Ethernet
  • Make sure that the router + OSX firewall are passing multicast properly

Only once this kind of setup is stable should you try adding more complexity back in. Can you confirm if the standard environment runs as expected on your end?

– Noris

Please review my response above. It answers all of the questions you are asking here.

I’ve provided quite a bit here as I’m hoping to find some good technical details of how Roon functions. This would help not only with this issue I’m seeing with controller -> core inconsistency, but in testing Roon integrity when changes to neighboring systems or the network are done.

My experience thus far gives me the impression that Roon operates as a fragile black box in its environment, where troubleshooting is resolved by playing musical chairs / mix-n-match with existing gear and configurations. There are other ways of going about this.

I wouldn’t be so determined to get this functioning (with high integrity and consistency) if I weren’t entirely impressed with the overall offering. Yes, with what I have working so far the UI is fantastic, the versatility / flexibility of devices is what I’ve been dreaming of, and the presentation of my local library is beyond expectations. I’m sure I’ll have more positive items to add once I have time to explore without obstacles.

So you have great potential here for making not just a loyal customer but a full-blown ambassador / evangelist. However, since my background is technical, I must see some technical transparency, especially when I’m doing the legwork in providing details about my environment.

In closing I’d really like to understand the questions I’ve asked above about the fundamentals in establishing a connection between controller and core. I’m particularly interested in why after RAAT discovery one client successfully connects to the core at the negotiated port and another ends up connecting to itself (see logs/netstat output above).

Hello support? Anything on this?

I’ve created a video that demonstrates what I’ve described above. In this case, it’s a restart of the controller (client) NIC that allows the controller/core to complete the connect sequence.

https://vimeo.com/339027403

Hey @Me_Van – sorry for the slow response here. Noris had asked me to look over this thread to see what kind of additional feedback we might be able to provide here.

First, the symptoms you’re describing here are consistent with a network that is failing to properly pass multicast traffic between the Core and Remote devices – I don’t know why that would be, but if you have specific questions about why that might be happening I’d be happy to discuss with our developers.

That said, across more than a decade of building, shipping, and supporting networked audio protocols, I can tell you that we’ve been asked for more detailed technical information about how our discovery protocols work over and over again.

Even though we know this stuff works across tens of thousands of networks, we’ve occasionally had more technical discussions in the past, and I will tell you that I’ve almost never seen these discussions go somewhere productive – we either end up debating the merits of the technical approach, or we eventually move onto more practical troubleshooting and end up figuring out that a switch or router has a multicast setting disabled, or a device isn’t passing multicast or has a firewall active, or whatever.

No idea if anything like that is happening here, and not saying there’s no bug here – it’s possible, of course, but from what I am seeing right now this seems like a case where I would continue with the kinds of troubleshooting you’ve been doing already, in order to localize the issue to something in your environment that’s not consistently passing this traffic, which is preventing this install from working like Roon is in other environments.

To be clear, these kinds of “musical chairs” techniques aren’t a way for us to avoid sharing details, or for our support team to save time. This is how our senior technical staff troubleshoots issues in their own homes, for a simple reason: it always works. Once you’ve localized the issue to a particular piece of the environment, you’re 95% of the way there.

Again, if you have specific technical questions we can look into them, but for now I think clarity about the tests you’ve actually run here is probably the next step.

It sounds like you’ve tried running the simplified network and the Mac as the Core. Is that correct?
We’ll know a lot more if we have a clear understanding (and preferably logs + timestamps) for whether this issue occurs with:

  • Simplified network & CentOS Core
  • Regular network & Mac Core

Thanks @Me_Van – looking forward to getting this working for you soon!

Hello and thanks for getting back. As reported a few times above the issue is this:

I am unable to consistently connect to a Roon core on CentOS 7.6 OR a Roon core on OS X 10.12.6. I am using a mix of clients as noted above. This inconsistency is the same whether the environment is that noted in Diagram 1 (above) or Diagram 2.

The video recently posted demonstrates that the client is able to connect to the core if the NIC is disabled then immediately re-enabled. I am able to reproduce this behavior with several different clients connecting to a Roon core on CentOS 7.6 OR a Roon core on OS X 10.12.6, in both environments noted above (Diagram 1 and Diagram 2).

What would be very helpful is understanding the expected network flow, even if in summary. Also, while I’ve confirmed the UDP ports are available I’m not sure how to test SSDP.
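For the SSDP part, a generic M-SEARCH probe is easy to script. To be clear, whether Roon’s discovery actually uses plain SSDP is an assumption on my part; this only verifies that multicast UPnP discovery works on the segment at all:

```python
import socket

SSDP_ADDR = ("239.255.255.250", 1900)  # standard SSDP multicast group/port
MSEARCH = "\r\n".join([
    "M-SEARCH * HTTP/1.1",
    "HOST: 239.255.255.250:1900",
    'MAN: "ssdp:discover"',
    "MX: 2",
    "ST: ssdp:all",
    "", ""
]).encode()

def ssdp_probe(timeout=3.0):
    """Send one SSDP M-SEARCH and collect any unicast responses.
    An empty result on a segment with UPnP devices suggests multicast
    (or its replies) is being dropped somewhere."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    responses = []
    try:
        sock.sendto(MSEARCH, SSDP_ADDR)
        while True:
            data, addr = sock.recvfrom(65507)
            responses.append((addr[0], data.decode(errors="replace")))
    except OSError:
        pass  # timeout ends collection; unreachable network yields []
    finally:
        sock.close()
    return responses
```

Any UPnP-capable device on the segment (the WRT54G with UPnP enabled, possibly the KEF LSX) should answer; silence would point at multicast or its unicast replies being dropped.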

Hi @Me_Van,

Thanks for sharing that video and for the update. We discussed this issue with our CTO and technical team, and want to make sure we have a clear understanding of how things are failing in the simplified environment.

This is often a good strategy in cases where multiple issues might be at play – if you can identify one issue in the simpler setup and get things working there, it’s often much easier to isolate the problem in the more complex environment.

So, we’d like to be methodical and gather some data, and hopefully our team is able to help make progress here.

Here’s what we’d like to try next:

  1. Let’s get the simplified setup of the two machines (OSX Core + OSX Remote), both restarted and connected via Ethernet to your Linksys WRT54G router. Start Roon on the Core machine first, then on the remote machine.

  2. Once you’ve done this, can you let us know the exact local time + date in your country of when the disconnecting behavior occurs? E.g. 1:12PM on 5/31/19? After getting this info, we can then enable diagnostics mode for your Core + Remote and review the Roon logs.

Also, some additional questions:

  1. Are you using any VLANs or VPNs on the Core or Remotes affected by this issue?
  2. Do you have any firewall or multicast blocking applications on your Core or Remotes such as “Little Snitch”, “Bullguard” or other tools? If so we would suggest that you disable these, reboot, and run the test above.

This exercise will help us get a clean view on what’s going on between these two machines - once that works, we’ll be able to see if that extends to the more complex environment, and hopefully get this working for you soon.

Thanks,
Noris

You’ll see another one of my isolated test cases from 2019-06-01T18:00:00Z through 2019-06-01T18:30:00Z (1100-1130am PDT). This involved the WRT54G as an isolated switch, uplink to Internet, DHCP, 2x OSX connected via RJ-45, 1x OSX connected via WiFi. Same illustration as above in Diagram 2, minus the CentOS.

Everything was powered down first. Little Snitch was installed on one controller but has never had active rules; this time it was entirely removed and the machine rebooted. All connections were verified via tcp and icmp.

Same results as always: intermittent connections, and restarting the NIC allows the controller to find the core. On the OSX 10.10.5 desktop I also removed Roon with ccleaner, reinstalled, and rebooted; interestingly, this one never found the core, even with a NIC restart.

I’m eager to hear of your findings.

Other questions:

VLAN: Yes, VLANs are in use in Diagram 1 supporting multiple subnets consisting of VMs. This is entirely separate from the network supporting Roon. Tagging is done on separate physical interfaces. No tagging on the Netgear in Diagram 1, only on the ERX.

VPN: Yes. 1) From the OS X 10.12.6 machine to a secured/private network; this has been explicitly turned off in all tests involving Roon. 2) There’s an haproxy on 192.168.1.5 load balancing 2x http/https proxies which use VPN tun0 interfaces; this is routable / internet-bound traffic only. Also, proxy settings have been explicitly disabled in many tests involving Roon.

Just being honest. I could’ve easily responded ‘no’ to the vlan/vpn question. I’m highly confident neither of these is the culprit. If so, I will eat my shoe.

I will continue to ask for further detail on how you are using multicast / discovery. My understanding is minimal here, so exploring is very time-consuming. I’d be willing to provide pcaps --but preferably through more direct, secure means. Also, let me know if the iperf test described here would be useful:

Hi @Me_Van,

Thanks for letting me know the above information. The presence of VLANs, VPNs, and firewalls is probably related to these discovery issues, especially since your network isn’t performing like other Roon installs, even when the topology is simplified.

Obviously, we can’t advise our users about how to configure any of these applications, but the good news is that thousands of Roon users have figured out how to ensure everything connects as expected, even in more complex environments.

My advice here would be to confirm none of these products are running on your devices, and possibly reset your network adapter so you’re running with a factory-default network stack and OS settings.

With a clean setup like this, everything should connect as expected every single time, and then you can add complexity back in and see where things break. This process can take a little time, but the good news is that it always works as a troubleshooting technique.

Let us know how that goes, and we can definitely dig in some more once everything is connecting, and once we’ve identified where things are going wrong in your environment once you start adding the complexity back in.

– Noris

The test case I provided you with was entirely isolated and without VPN or VLAN tagging. The results have been the same in this most recent configuration, without tagging, and with all the other configurations described above.

You didn’t provide any detail with regards to the test I did on 6/1. I understood that to be our agreement when I invested the precious time in doing it.

Based on this response, and many of the responses you’ve given over the past two weeks, it looks like you are looking for ways to evade the issue. I was hoping you, or someone at Roon, could do better.

Help me see this differently. I don’t think it will involve much more effort than what you’ve provided thus far.

The symptoms you’re describing here happen when a network isn’t configured properly. I’m not saying there isn’t a bug here, but if there is, I need more information, preferably a description of how to make this issue occur in a QA environment – then I can get this in front of a developer and fix the bug.

If we can’t demonstrate that there’s a bug here, the next step is troubleshooting. There’s some reason your environment performs differently from thousands of others, and we can advise how to figure out why that is, or we can investigate reproducible bugs. But those are really the only options here.

We’ve resolved hundreds of issues just like this, but I’m sure you can understand that we have to draw the line somewhere, and the focus of our support team is always going to be identifying bugs and helping users troubleshoot environmental issues.

Thanks again @Me_Van.