Find / Choose Roon Core (OSX -> Linux)

Core Machine (Operating system/System info/Roon build number)
Roon 1.6 (416) on CentOS 7.6 (192.168.1.5)

Network Details (Including networking gear model/manufacturer and if on WiFi/Ethernet)
2x Unifi AP AC-Lite wifi (all devices on same SSID)
1x Netgear Prosafe switch

OK: 192.168.1.15 (control, OSX10.12.6) -> wifi -> 192.168.1.5 (core)
OK: 192.168.1.106 (control, iOS12.2) -> wifi -> 192.168.1.5 (core)
NOT OK: 192.168.1.10 (control, OSX10.10.5) -> netgear prosafe -> 192.168.1.5 (core)
SOMETIMES OK: 192.168.1.14 (control, OSX10.14.4) -> wifi -> 192.168.1.5 (core)

Audio Devices (Specify what device you’re using and its connection type - USB/HDMI/etc.)

  • KEF LSX wifi
  • Grace Mondo+ wifi
  • Squeezebox Radio wifi

Description Of Issue

Controllers are unable to find core or inconsistently find and lose core. I’ve verified via tcpdump and netstat that 9003udp and 9100-9200tcp are listening, connectable from controllers, and receiving data.
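For reference, the TCP-side connectability check amounts to something like this (a minimal Python sketch; the host and port range are the ones from my setup above, while UDP 9003 was checked separately with tcpdump):

```python
import socket

def probe_tcp(host, ports, timeout=1.0):
    """Return the subset of `ports` that accept a TCP connection on `host`."""
    open_ports = []
    for port in ports:
        try:
            # create_connection performs a full TCP handshake, then closes
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass  # closed, filtered, or timed out
    return open_ports

# The host and range from my environment above:
# probe_tcp("192.168.1.5", range(9100, 9201))
```

A closed or filtered port simply drops out of the returned list.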

With the controller on .14 (OSX) I can consistently reproduce: find Roon core, connect, restart, ‘looking for core…’, restart, maybe find core / maybe not, etc.

With a couple of recently installed controllers on OS X, discovery is stuck on ‘looking for core’.

Using the scan feature (under ‘looking for core’ -> help), there is no progress indicator or output.

What are some tests I can run, or logs I can examine, to determine what’s wrong here?

Hi @Me_Van,

Welcome to the community! You mentioned that the WiFi connections mostly work as expected but the Prosafe connections do not. Which Prosafe model exactly are you using, and is it managed?

If the switch is managed, I would make sure that “multicast routing” is enabled on it, and also check that multicast in general is set up properly.

Since you mentioned that you have CentOS here, you might want to also check out this thread to make sure that you have your multicast firewall properly configured:

Although the above thread is mostly for Android, it may contain the necessary info to set this up properly for iOS as well. Do let me know if that helps!
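As a general sanity check that the OS and firewall allow multicast joins and delivery at all, a small loopback self-test can help. This is only a sketch: the group and port below are arbitrary test values, not Roon’s actual discovery parameters (which aren’t documented here).

```python
import socket

GROUP = "239.1.2.3"  # arbitrary test group, NOT Roon's discovery group
PORT = 50000         # arbitrary test port

def multicast_self_test(group=GROUP, port=PORT, ifaddr="0.0.0.0", timeout=2.0):
    """Join `group` on interface `ifaddr`, send one datagram to it, and
    confirm it loops back. Returns False on timeout or any socket error;
    a failure usually points at a firewall or interface setting."""
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        recv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        recv.bind(("", port))
        # join the group on the given interface (0.0.0.0 = kernel default)
        mreq = socket.inet_aton(group) + socket.inet_aton(ifaddr)
        recv.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        recv.settimeout(timeout)
        # send out the same interface, with loopback of our own datagram enabled
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(ifaddr))
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
        send.sendto(b"probe", (group, port))
        data, _ = recv.recvfrom(64)
        return data == b"probe"
    except OSError:
        return False
    finally:
        recv.close()
        send.close()
```

If this returns False on an interface you expect to work, the problem is at the OS/firewall level rather than in Roon.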

Thanks,
Noris


Hello Noris --There is no firewall on the core. Just to be certain, I ran a test with explicit multicast rules as shown in the post you provided. This didn’t produce consistent results: I can sometimes connect to the core with the rules in place, sometimes not. Also, I can sometimes connect from certain clients via wifi, sometimes not. The client with a wired connection is also inconsistent. The only consistency I see is from the host 192.168.1.15 running OSX10.12.6. This is the only client that can connect to the core consistently.

Given these findings, I’m nearly certain this is not a network issue.

Here’s an update from what I posted earlier:

As noted earlier, I’ve verified the flow of traffic, port availability, multicast groups, etc. using tcpdump and netstat. However, since I’m not very knowledgeable regarding multicast, I’m not sure what specifically to look for.

Are there any low level tests I can run to see if this is a discovery issue? What does the scan do in “looking for core -> help”? Is there a configuration cached on the clients showing last core? What is the sequence of network traffic for a successful connection to the core? What core side log should I be looking at / what should I look for?

Hi @Me_Van,

Thanks for giving those suggestions a try. Before looking into specific settings on multicast I think it’s best that we try to simplify things to isolate the issue further to just one cause. With this in mind can you let me know the following info?

  • What is your router here? Are you using the CentOS machine to perform routing, or do you have a standard router? If you have a router, can you let me know the model/manufacturer?

  • If you connect the 192.168.1.10 remote via WiFi instead of through the Prosafe switch, does that change the behavior? Does it then remain stable?

  • Let’s try to take the CentOS out of the equation for the time being. Can you please try using one of your OSX remotes to host the Roon Core as a temporary test?
    You can switch Cores as many times as you’d like with one Roon subscription and the current Core will simply be un-authorized while you perform this change.

Please let me know whether changing the Core’s OS results in different behavior here, along with the other information above, when possible.

Thanks,
Noris

Hi,

Although it wasn’t the problem (as seen in the thread referenced), what was important in finding the root cause was taking out as much of the network stack between the CentOS server and the Android (in your case iOS) remotes as possible, to see what remained to cause issues.

Are iptables / firewalld (which are you using?) turned off / have no rules on the CentOS Roon core?

And your switching is dumb (non-managed)?

Hello --To answer the questions asked earlier:

  • iptables: There are no rules whatsoever on any of the Roon cores tested.
  • There are two switches in Diagram 1 below. The Netgear is managed, but given the various test cases I’ve done I’d rule out its configuration as the cause. Same with the ERX. In fact, I’ve moved configurations (Roon controller/core) established in Diagram 2 into the environment in Diagram 1 and seen the same results.
  • Only default routing tables on all hosts.
  • And to anticipate questions re: Diagram 2 below: the WRT54G has very limited logic. There’s a UPnP option set to ‘enable’, but that’s the only item that might apply. As it’s used in this configuration, I’m pretty sure it qualifies as a hub (not a switch).

The behavior I’m seeing is very consistent and reproducible: when .15 is a controller it will locate whichever core is available, or reconnect to whichever core it’s assigned to, in whatever configuration I set up. This happens 100% of the time. The same is true of the iOS 12.2 device.

With other controllers:

I restart a controller repeatedly and find ‘Choose your Roon Core’ or ‘Waiting for Remote Core…’ presented 90% of the time. 10% of the time the controller is either offered a core or the reconnect completes successfully. This is the case with different cores used in Diagram 1 and in the isolated case shown in Diagram 2. Cores have been run on: .5, .25, and .15.

Something I’ve found which allows the core to be available almost consistently is restarting the core just seconds before the controller restarts.

I’ve spent a huge amount of time moving the fundamental components around. What I’d like to look at now are the questions I asked earlier. If there are details you could provide, particularly on the discovery/multicast side, it would save me a lot of time.

Diagram 1

Diagram 2

Is there a way to increase verbosity of logs? What should be happening after this initial negotiation? I’m able to connect from 192.168.1.10 to Roon Core @ 192.168.1.5:9101 using netcat/telnet. Why would the rest of this be failing?

05/22 16:45:52 Trace: [raat] RAATServer discovered: RaatServer dirty @ 192.168.1.10:54679
05/22 16:45:52 Info: [raatserver] GOT SERVER 004288cf-2741-9f09-7011-f06686e27b82::1dd4f225-e9c3-416e-97cf-afb8dfc8a5fd @ 192.168.1.10:54679 dirty PROTOVER=1 RAATVER=1.1.36
05/22 16:45:52 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 1)
05/22 16:46:02 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Retrying in 500ms
05/22 16:46:02 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 2)
05/22 16:46:12 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Retrying in 750ms
05/22 16:46:13 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 3)
05/22 16:46:23 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Retrying in 1125ms
05/22 16:46:24 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 4)
05/22 16:46:34 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Retrying in 1687ms
05/22 16:46:36 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] connecting (attempt 5)
05/22 16:46:46 Trace: [raatserver] [RaatServer dirty @ 192.168.1.10:54679] client connection failed. Giving up
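As an aside, the retry intervals in that trace (500, 750, 1125, 1687 ms) look like a plain 1.5x exponential backoff, truncated to whole milliseconds. A sketch reproducing the sequence (inferred from the log above, not from any Roon documentation):

```python
def backoff_delays(initial_ms=500, factor=1.5, attempts=4):
    """Reproduce the retry delays visible in the trace above."""
    delays, d = [], float(initial_ms)
    for _ in range(attempts):
        delays.append(int(d))  # truncation matches the logged 1687 (not 1688)
        d *= factor
    return delays

# backoff_delays() -> [500, 750, 1125, 1687]
```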

From one client (controller) I see a successful connection made to the core. From another I see it connect to itself and not the core. Why is this happening?

This client is successful:

/var/roon/RoonServer/Logs: 
05/22 19:41:08 Trace: [raat] RAATServer discovered: RaatServer hello @ 192.168.1.15:51436

hello# netstat -na | grep 51436
tcp        0      0 192.168.1.5:36682       192.168.1.15:51436      ESTABLISHED

This one is not:

/var/roon/RoonServer/Logs: 
05/22 19:40:22 Trace: [raat] RAATServer discovered: RaatServer arvik00 @ 192.168.1.14:60965

arvik00# netstat -na | grep 60965
tcp4       0      0  127.0.0.1.60965        127.0.0.1.60966        ESTABLISHED
tcp4       0      0  127.0.0.1.60966        127.0.0.1.60965        ESTABLISHED
tcp4       0      0  *.60965                *.*                    LISTEN
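To make the contrast concrete, here’s a trivial Python check over those netstat endpoints (it handles both the Linux "ip:port" and macOS "ip.port" formats):

```python
import ipaddress

def split_addr(addr):
    """Split a netstat address: "ip:port" on Linux, "ip.port" on macOS."""
    host, _, port = addr.replace(":", ".").rpartition(".")
    return host, int(port)

def is_loopback_conn(local, remote):
    """True when both endpoints are loopback, i.e. the client connected
    to itself rather than reaching the core over the LAN."""
    return all(ipaddress.ip_address(split_addr(a)[0]).is_loopback
               for a in (local, remote))

# From the outputs above:
# is_loopback_conn("192.168.1.5:36682", "192.168.1.15:51436")  -> False (good)
# is_loopback_conn("127.0.0.1.60965", "127.0.0.1.60966")       -> True  (bad)
```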

Hi @Me_Van,

It is not clear whether you ran the previous test I suggested; can you confirm that this was performed?

The best way to troubleshoot this issue is to make sure that everything is stable in a standard environment first:

  • OSX as the Core
  • Another OSX as the controller
  • Connected through an un-managed switch or directly to the router via Ethernet
  • Make sure that the router + OSX firewall are passing multicast properly

Only once this kind of setup is stable should you try adding more complexity back in. Can you confirm if the standard environment runs as expected on your end?

– Noris

Please review my response above. It answers all of the questions you are asking here.

I’ve provided quite a bit here as I’m hoping to find some good technical details of how Roon functions. This would help not only with this issue I’m seeing with controller -> core inconsistency, but in testing Roon integrity when changes to neighboring systems or the network are done.

My experience thus far gives me the impression that Roon operates as a fragile black box in its environment, where troubleshooting is resolved by playing musical chairs / mix-n-match with existing gear and configurations. There are other ways of going about this.

I wouldn’t be so determined to get this functioning (with high integrity and consistency) if I weren’t entirely impressed with the overall offering. Yes, with what I have working so far the UI is fantastic, the versatility / flexibility of devices is what I’ve been dreaming of, and the presentation of my local library is beyond expectations. I’m sure I’ll have more positive items to add once I have time to explore without obstacles.

So you have great potential here for making not just a loyal customer but a full-blown ambassador / evangelist. However, since my background is technical, I must see some technical transparency, especially when I’m doing the legwork in providing details about my environment.

In closing I’d really like to understand the questions I’ve asked above about the fundamentals in establishing a connection between controller and core. I’m particularly interested in why after RAAT discovery one client successfully connects to the core at the negotiated port and another ends up connecting to itself (see logs/netstat output above).

Hello support? Anything on this?

I’ve created a video that demonstrates what I’ve described above. In this case, it’s a restart of the controller (client) NIC that allows the controller/core to complete the connect sequence.

https://vimeo.com/339027403

Hey @Me_Van – sorry for the slow response here. Noris had asked me to look over this thread to see what kind of additional feedback we might be able to provide here.

First, the symptoms you’re describing here are consistent with a network that is failing to properly pass multicast traffic between the Core and Remote devices – I don’t know why that would be, but if you have specific questions about why that might be happening I’d be happy to discuss with our developers.

That said, across more than a decade of building, shipping, and supporting networked audio protocols, I can tell you that we’ve been asked for more detailed technical information about how our discovery protocols work over and over again.

Even though we know this stuff works across tens of thousands of networks, we’ve occasionally had more technical discussions in the past, and I will tell you that I’ve almost never seen these discussions go somewhere productive – we either end up debating the merits of the technical approach, or we eventually move onto more practical troubleshooting and end up figuring out that a switch or router has a multicast setting disabled, or a device isn’t passing multicast or has a firewall active, or whatever.

No idea if anything like that is happening here, and not saying there’s no bug here – it’s possible, of course, but from what I am seeing right now this seems like a case where I would continue with the kinds of troubleshooting you’ve been doing already, in order to localize the issue to something in your environment that’s not consistently passing this traffic, which is preventing this install from working like Roon is in other environments.

To be clear, these kinds of “musical chairs” techniques aren’t a way for us to avoid sharing details, or for our support team to save time. This is how our senior technical staff troubleshoots issues in their own homes, for a simple reason: it always works. Once you’ve localized the issue to a particular piece of the environment, you’re 95% of the way there.

Again, if you have specific technical questions we can look into them, but for now I think clarity about the tests you’ve actually run here is probably the next step.

It sounds like you’ve tried running the simplified network and the Mac as the Core. Is that correct?
We’ll know a lot more if we have a clear understanding (and preferably logs + timestamps) for whether this issue occurs with:

  • Simplified network & CentOS Core
  • Regular network & Mac Core

Thanks @Me_Van – looking forward to getting this working for you soon!

Hello and thanks for getting back. As reported a few times above the issue is this:

I am unable to consistently connect to a Roon core on CentOS 7.6 OR a Roon core on OS X 10.12.6. I am using a mix of clients as noted above. This inconsistency is the same whether the environment is that noted in Diagram 1 (above) or Diagram 2.

The video recently posted demonstrates that the client is able to connect to the core if the NIC is disabled then immediately re-enabled. I am able to reproduce this behavior with several different clients connecting to a Roon core on CentOS 7.6 OR a Roon core on OS X 10.12.6, in both environments noted above (Diagram 1 and Diagram 2).

What would be very helpful is understanding the expected network flow, even if in summary. Also, while I’ve confirmed the UDP ports are available I’m not sure how to test SSDP.
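For the SSDP part, a generic M-SEARCH probe is easy to script. To be clear, whether Roon’s discovery actually uses plain SSDP is an assumption on my part; this only verifies that multicast UPnP discovery works on the segment at all:

```python
import socket

SSDP_ADDR = ("239.255.255.250", 1900)  # standard SSDP multicast group/port
MSEARCH = "\r\n".join([
    "M-SEARCH * HTTP/1.1",
    "HOST: 239.255.255.250:1900",
    'MAN: "ssdp:discover"',
    "MX: 2",
    "ST: ssdp:all",
    "", ""
]).encode()

def ssdp_probe(timeout=3.0):
    """Send one SSDP M-SEARCH and collect any unicast responses.
    An empty result on a segment with UPnP devices suggests multicast
    (or its replies) is being dropped somewhere."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    responses = []
    try:
        sock.sendto(MSEARCH, SSDP_ADDR)
        while True:
            data, addr = sock.recvfrom(65507)
            responses.append((addr[0], data.decode(errors="replace")))
    except OSError:
        pass  # timeout ends collection; unreachable network yields []
    finally:
        sock.close()
    return responses
```

Any UPnP-capable device on the segment (the WRT54G with UPnP enabled, possibly the KEF LSX) should answer; silence would point at multicast or its unicast replies being dropped.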

Hi @Me_Van,

Thanks for sharing that video and for the update. We discussed this issue with our CTO and technical team, and want to make sure we have a clear understanding of how things are failing in the simplified environment.

This is often a good strategy in cases where multiple issues might be at play – if you can identify one issue in the simpler setup and get things working there, it’s often much easier to isolate the problem in the more complex environment.

So, we’d like to be methodical and gather some data, and hopefully our team is able to help make progress here.

Here’s what we’d like to try next:

  1. Let’s get the simplified setup of the two machines (OSX Core + OSX Remote), both restarted and connected via Ethernet to your Linksys WRT54G router. Start Roon on the Core machine first, then on the remote machine.

  2. Once you’ve done this, can you let us know the exact local time + date in your country of when the disconnecting behavior occurs? E.g. 1:12PM on 5/31/19? After getting this info, we can then enable diagnostics mode for your Core + Remote and review the Roon logs.

Also, some additional questions:

  1. Are you using any VLANs or VPNs on the Core or Remotes affected by this issue?
  2. Do you have any firewall or multicast blocking applications on your Core or Remotes such as “Little Snitch”, “Bullguard” or other tools? If so we would suggest that you disable these, reboot, and run the test above.

This exercise will help us get a clean view on what’s going on between these two machines - once that works, we’ll be able to see if that extends to the more complex environment, and hopefully get this working for you soon.

Thanks,
Noris

You’ll see another one of my isolated test cases from 2019-06-01T18:00:00Z through 2019-06-01T18:30:00Z (1100-1130am PDT). This involved the WRT54G as an isolated switch, uplink to Internet, DHCP, 2x OSX connected via RJ-45, 1x OSX connected via WiFi. Same illustration as above in Diagram 2, minus the CentOS.

Everything was powered down first. Little Snitch was installed on one controller but has never had active rules; this time it was entirely removed and the machine rebooted. All connections were verified via tcp and icmp.

Same results as always: intermittent connections, and restarting the NIC allows the controller to find the core. On the OSX 10.10.5 desktop I also removed Roon with ccleaner, reinstalled, and rebooted; interestingly, this one never found the core, even with a NIC restart.

I’m eager to hear of your findings.

Other questions:

VLAN: Yes, VLANs are in use in Diagram 1 supporting multiple subnets consisting of VMs. This is entirely separate from the network supporting Roon. Tagging is done on separate physical interfaces. No tagging on the Netgear in Diagram 1, only on the ERX.

VPN: Yes. 1) From the OS X 10.12.6 machine to a secured/private network; this has been explicitly turned off in all tests involving Roon. 2) There’s an haproxy on 192.168.1.5 load balancing 2x http/https proxies which use VPN tun0 interfaces; this is routable / internet-bound traffic only. Also, proxy settings have been explicitly disabled in many tests involving Roon.

Just being honest. I could’ve easily responded ‘no’ to the vlan/vpn question. I’m highly confident neither of these is the culprit. If so, I will eat my shoe.

I will continue to ask for further detail on how you are using multicast / discovery. My understanding is minimal here, so exploring is very time-consuming. I’d be willing to provide pcaps --but preferably through more direct, secure means. Also, let me know if the iperf test described here would be useful:

Hi @Me_Van,

Thanks for letting me know the above information. The presence of VLANs, VPNs, and firewalls is probably related to these discovery issues, especially since your network isn’t performing like other Roon installs, even when the topology is simplified.

Obviously, we can’t advise our users about how to configure any of these applications, but the good news is that thousands of Roon users have figured out how to ensure everything connects as expected, even in more complex environments.

My advice here would be to confirm none of these products are running on your devices, and possibly reset your network adapter so you’re running with a factory-default network stack and OS settings.

With a clean setup like this, everything should connect as expected every single time, and then you can add complexity back in and see where things break. This process can take a little time, but the good news is that it always works as a troubleshooting technique.

Let us know how that goes, and we can definitely dig in some more once everything is connecting, and once we’ve identified where things are going wrong in your environment once you start adding the complexity back in.

– Noris

The test case I provided you with was entirely isolated and without VPN or VLAN tagging. The results have been the same in this most recent configuration, without tagging, and with all the other configurations described above.

You didn’t provide any detail with regards to the test I did on 6/1. I understood that to be our agreement when I invested the precious time in doing it.

Based on this response, and many of the responses you’ve given over the past two weeks, it looks like you are looking for ways to evade the issue. I was hoping you, or someone at Roon, could do better.

Help me see this differently. I don’t think it will involve much more effort than what you’ve provided thus far.

The symptoms you’re describing here happen when a network isn’t configured properly. I’m not saying there isn’t a bug here, but if there is, I need more information, preferably a description of how to make this issue occur in a QA environment – then I can get this in front of a developer and fix the bug.

If we can’t demonstrate that there’s a bug here, the next step is troubleshooting. There’s some reason your environment performs differently from thousands of others, and we can advise how to figure out why that is, or we can investigate reproducible bugs. But those are really the only options here.

We’ve resolved hundreds of issues just like this, but I’m sure you can understand that we have to draw the line somewhere, and the focus of our support team is always going to be identifying bugs and helping users troubleshoot environmental issues.

Thanks again @Me_Van.