Grouped Playback exhibiting Clock Drift

Brian_Lloyd · October 25, 2016, 6:12pm

I was getting ready to create a new thread to talk about zone synchronization and discovered that it essentially started here and probably should continue here. I am also going to take some minor exception to some of the things posted here if only to clear some things up.

A bit about my background. I have been involved in data communications professionally for quite some time. Suffice it to say, if you use the Internet, you are using technology that I, in part, helped create. I also experiment with software-defined radio, and next to some of the extreme cases I play with (software-defined radio interferometry), the problem of synchronizing playback clocks seems moderately trivial.

In the case of a single streaming player, having the DAC clock control the transfer is necessary and sufficient. It works just peachy. You can have a relatively inexpensive local oscillator with low phase noise (results in low jitter) to provide clocking to the DAC and there will be few, if any, IM products produced by the phase modulation of the clock. The only problem is, once you have more than one oscillator, i.e. you have more than one zone, you end up with clock slip, as mentioned above.

I currently have 6 zones in my house. The difference in clock accuracy is readily apparent when I group overlapping zones. I get clock/buffer slip relatively often. It is annoying. I have been playing with RPi3-based devices from IQaudIO and they are especially bad. I suspect their “badness” stems from using the RPi’s single clock which, I suspect, is not a nice integer multiple of 44.1kHz or 48kHz. Where I use DACs with better clocks, I see less of a problem. So the question is: do we use an isochronous source and keep everything in lock-step (the S/PDIF approach with a master word clock), or is there a way to preserve the natural asynchronous transfer but still lock the clocks in all the devices to a central reference?

I have done both. Both work. It is possible to have a really good, low-phase-noise voltage-controlled local oscillator that is “steered” by a phase-locked loop that locks to a master reference. AES/EBU, TOSlink, and S/PDIF support this, and really good DACs extract the clock from the data stream and synchronize the local clock while the controlled oscillator retains its low-phase-noise quality. So it is possible to have slaved clock quality (phase noise and jitter) just as high as that of a free-running crystal oscillator.

Of course, we could just slave all the clocks to a single master source. GPS-disciplined clocks are pretty common and relatively cheap now. This seems (to me) to be the way to go for keeping things in sync. It shouldn’t cost more than about $200 to add GPS-disciplining to the clocks in a DAC. (And one can get the frequency right by multiplying up from 1Hz rather than down from some higher frequency, since every whole-Hz frequency is an integer multiple of 1Hz.) GPS receivers with a 1 pulse-per-second (PPS) output are cheap (under $50 in unit quantities) and the circuitry needed to frequency-lock the clocks to the 1PPS is relatively inexpensive too. Time to bug some DAC and/or S/PDIF streamer makers about this.
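For anyone who wants to see the 1PPS idea in numbers, here is a minimal sketch of the measurement side of a disciplined clock: count how many local oscillator ticks arrive between consecutive PPS edges and compare against the nominal frequency. The 24.576 MHz nominal frequency and the tick counts are made-up illustration values, not measurements from any real DAC, and the steering loop that would act on the result is left out.

```python
def ppm_error(ticks_per_pps, nominal_hz):
    """Frequency error of a local oscillator in parts per million.

    ticks_per_pps: oscillator ticks counted between two 1PPS edges (one second).
    nominal_hz:    the frequency the oscillator is supposed to run at.
    """
    return (ticks_per_pps - nominal_hz) / nominal_hz * 1e6


def average_ppm_error(tick_counts, nominal_hz):
    """Average the error over several PPS intervals to reduce counting noise."""
    errors = [ppm_error(count, nominal_hz) for count in tick_counts]
    return sum(errors) / len(errors)


# Hypothetical values: a 24.576 MHz audio master clock (512 x 48 kHz)
# that runs slightly fast relative to the GPS second.
NOMINAL_HZ = 24_576_000
counts = [24_576_246, 24_576_245, 24_576_247]

error = average_ppm_error(counts, NOMINAL_HZ)
print(f"oscillator is off by about {error:.2f} ppm")
```

A positive result means the local oscillator is fast and would need to be steered down; a real design would feed this error into a slow control loop rather than correcting in one step.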

But there is one thing Roon could do to make the problem less noticeable. Basically you just need to insert a long enough gap between cuts to let the buffers run dry. It is a standard async resynchronization approach. (After all, what are stop-bits for?)

And maybe they have done it but I have missed it. Crossfade time maybe? But if it is crossfade time, a period of 1 second is unnecessarily long. Since Roon has the information about buffer size and sample rate, it can calculate the worst-case resync time, on the order of 25ms for a 1kB buffer at 44.1kHz. I doubt a 25ms gap is going to be noticeable … unless you are listening to something inherently gapless. (Abbey Road anyone?) Even then it is probably better than the mis-synchronization continuing to drift farther out.
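The resync-time estimate above can be checked with a few lines. This sketch assumes “1kB” means a buffer of 1024 sample frames; if it instead means 1024 bytes of 16-bit stereo audio, the buffer holds fewer frames and drains faster. The function names are just for illustration.

```python
def frames_in_buffer(buffer_bytes, channels, bytes_per_sample):
    """Number of whole sample frames that fit in a buffer of the given size."""
    return buffer_bytes // (channels * bytes_per_sample)


def drain_time_ms(buffer_frames, sample_rate_hz):
    """Time for a full buffer to play out (run dry), in milliseconds."""
    return buffer_frames / sample_rate_hz * 1000.0


# Interpretation 1: the buffer holds 1024 frames.
print(f"{drain_time_ms(1024, 44_100):.1f} ms")

# Interpretation 2: the buffer is 1024 bytes of 16-bit stereo audio.
frames = frames_in_buffer(1024, channels=2, bytes_per_sample=2)
print(f"{frames} frames, {drain_time_ms(frames, 44_100):.1f} ms")
```

Under the first interpretation the drain time comes out at roughly 23ms, which is consistent with the “on the order of 25ms” figure.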

Other ideas about this?

hifi_swlon · October 25, 2016, 6:59pm

What does the slip sound like? I have a RoonReady device (microrendu), a pi2 with hifiberry, and a pi3 with iqaudio and have never ‘heard’ clock slip to my knowledge. I guess three zones are a lot easier to sync than six?

I don’t know what the solution to this problem is, but I certainly wouldn’t want gaps appearing in my gapless material.

Geoff_Coupe · October 25, 2016, 7:59pm

Hmm, I’m with Steve on this one - I have the same three types of endpoint (mR, pi2 with HiFiBerry and pi3 with IQaudIO), and I’ve not heard clock slip either. The mR and pi2 are actually in the same room, in the same Group, and I often have them active together. Never noticed any sync problems…

brian · October 25, 2016, 9:55pm

They’re trivial if you’re thinking like an architect, starting from scratch and defining a full hardware/software stack with this problem at the center. Certainly, the concept isn’t hard.

In the world we live in, there are a few limitations that are worth mentioning up front:

Solutions that depend on a hardware PLL are impractical, since it’s very rare that we actually have the ability to make fine clock adjustments in software via audio drivers as they most commonly exist today.
Depending on awesome top-down ideas like the GPS clock is impractical, since we play music on hardware that comes from different manufacturers at different price points, all the way from mid-range Android phones up to a $100k+ dCS Vivaldi stack, and we must work with things already in the market.
Doing accurate clock synchronization over WiFi networks can be difficult because WiFi hops sometimes impose unexpected asymmetrical delays on network traffic (i.e. delays that undermine 1/2 RTT reasoning). We are actively working on figuring out ways to mitigate this problem.
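To illustrate the “1/2 RTT reasoning” mentioned in the last point: a common way to estimate the offset between two clocks is the NTP-style four-timestamp exchange, which assumes the outbound and return network delays are equal. The sketch below simulates one exchange and shows how an asymmetric path biases the estimate by half the difference between the two delays. This is a generic illustration, not a description of how RAAT performs its exchanges; the function names and delay values are hypothetical.

```python
def estimate_offset(t0, t1, t2, t3):
    """NTP-style offset estimate (remote clock minus local clock).

    t0: local time when the request was sent
    t1: remote time when the request arrived
    t2: remote time when the reply was sent
    t3: local time when the reply arrived
    Assumes the outbound and return delays are equal.
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0


def round_trip_delay(t0, t1, t2, t3):
    """Total network delay, excluding the remote processing time."""
    return (t3 - t0) - (t2 - t1)


def simulate_exchange(true_offset, fwd_delay, back_delay, remote_proc=0.0, t0=0.0):
    """Produce the four timestamps for one exchange with known delays."""
    t1 = t0 + fwd_delay + true_offset       # arrival, read on the remote clock
    t2 = t1 + remote_proc                   # reply sent, remote clock
    t3 = (t2 - true_offset) + back_delay    # reply arrival, back on the local clock
    return t0, t1, t2, t3


TRUE_OFFSET = 0.250  # seconds, made-up value

# Symmetric path: 5 ms each way, so the estimate is exact.
symmetric = estimate_offset(*simulate_exchange(TRUE_OFFSET, 0.005, 0.005))

# Asymmetric path: 2 ms out, 12 ms back (e.g. a WiFi retry on the return hop).
asymmetric = estimate_offset(*simulate_exchange(TRUE_OFFSET, 0.002, 0.012))

print(f"symmetric error:  {(symmetric - TRUE_OFFSET) * 1000:+.1f} ms")
print(f"asymmetric error: {(asymmetric - TRUE_OFFSET) * 1000:+.1f} ms")
```

With these numbers the asymmetric exchange is wrong by 5ms, i.e. (2ms - 12ms) / 2, even though both clocks are perfectly stable.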

I agree, the underlying techniques (measuring clock drift and correcting for it) aren’t really that hard. In my experience, they are sometimes non-obvious to people who haven’t spent time thinking about the problem before, but that’s about it.

In any case, RAAT already measures and corrects for clock drift. This was designed in from the start, as it should have been.

Usually when we see clocks drifting audibly, it can be traced back to some networking detail–either WiFi or a managed switch–interfering with clock synchronization exchanges.

The way RAAT handles multi-zone playback/clock management is like this:

First, we elect one of the devices to be the “clock master”. That device drives the pace of the playback stream. We could just as easily use the computer’s system clock for this, but choosing a master from among the playback devices potentially gives us access to a better clock, and allows one of the zones to operate without performing corrections at all, which can be desirable for the “best” room in the house.

The “clock slave” devices receive the stream at rates that may mismatch their personal clock rates. They are responsible for periodically performing clock synchronization against the master device, modeling the master clock rate, and then applying corrections to compensate for the clock rate discrepancy–either by altering the stream in software/DSP, or by controlling a hardware PLL.
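As a rough illustration of “modeling the master clock rate”, here is a minimal sketch that fits a straight line through pairs of (local time, master time) samples collected from periodic synchronization exchanges. The slope is the master's rate relative to the local clock. This is a generic least-squares fit written for this thread, not RAAT's actual algorithm, and the sample values are invented.

```python
def fit_clock_model(local_times, master_times):
    """Least-squares fit of master_time = rate * local_time + offset.

    Returns (rate, offset). A rate above 1.0 means the master clock runs
    faster than the local clock.
    """
    n = len(local_times)
    if n < 2 or n != len(master_times):
        raise ValueError("need at least two matching samples")
    mean_local = sum(local_times) / n
    mean_master = sum(master_times) / n
    variance = sum((l - mean_local) ** 2 for l in local_times)
    covariance = sum(
        (l - mean_local) * (m - mean_master)
        for l, m in zip(local_times, master_times)
    )
    rate = covariance / variance
    offset = mean_master - rate * mean_local
    return rate, offset


def drift_ppm(rate):
    """Express a rate ratio as a drift in parts per million."""
    return (rate - 1.0) * 1e6


# Invented samples: the master runs 100 ppm fast and started 5 s ahead.
local = [0.0, 10.0, 20.0, 30.0]
master = [5.0 + 1.0001 * t for t in local]

rate, offset = fit_clock_model(local, master)
print(f"rate ratio {rate:.6f}, drift {drift_ppm(rate):.1f} ppm, offset {offset:.3f} s")
```

The fitted rate is what a slave would then use to decide how much to stretch or compress the stream, or how far to steer a hardware PLL where one is available.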

I am familiar with some commercial products that take this approach as an alternative to handling drift correction as a first class problem (as we have).

Clock rate discrepancies of 100ppm from one device to the next are not uncommon in our observations. That means by the end of a 10 minute song, we’d see 60ms of drift, bad enough to be easily noticeable to a layman. Forget about listening to a full symphony or a concept album comfortably.
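The arithmetic behind that figure is simple enough to check directly. A small sketch (function names are illustrative only):

```python
def drift_ms(ppm, seconds):
    """Accumulated drift in milliseconds for an uncorrected rate mismatch."""
    return ppm * 1e-6 * seconds * 1000.0


def seconds_until_drift(ppm, threshold_ms):
    """How long uncorrected playback takes to drift by threshold_ms."""
    return threshold_ms / 1000.0 / (ppm * 1e-6)


print(f"{drift_ms(100, 10 * 60):.0f} ms after a 10 minute song at 100 ppm")
print(f"{seconds_until_drift(100, 10):.0f} s to accumulate 10 ms at 100 ppm")
```

At 100ppm, 600 seconds of uncorrected playback accumulates 600 × 0.0001 = 0.06 seconds, i.e. the 60ms quoted above.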

This is a band-aid that compromises quality. Not an option for us.

Your ideas so far have rested on an implicit assumption that we missed the important detail of “compensating for clock drift” when engineering RAAT. Since that is not the case, I think the next steps here are to do some troubleshooting and figure out what aspect of your environment is exacerbating the problem.

One of the tests we perform when certifying Roon Ready devices is to ensure that they can sustain grouped playback for 24hrs without drifting relative to other devices. This test is performed regularly enough over here that we’re confident that–in general–the drift correction mechanism is capable of working properly.

You might be running into an edge case, or a bug, or maybe your network is behaving in a way that we didn’t anticipate. I’d like us to get to the bottom of this regardless.

(cc @support)

Brian_Lloyd · October 26, 2016, 5:33pm

Thanks for responding to this. I must apologize for making assumptions. I must admit, I am amazed at what Roon does and how well it does it. I should have realized that you had addressed the problem. Of course, one [now] obvious [to me] solution is non-integer sample rate conversion in DSP to keep everything in step.
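For readers unfamiliar with the idea, here is a deliberately simple sketch of non-integer sample rate conversion using linear interpolation. It only shows the principle of stretching a stream by a ratio very close to 1; real resamplers use much higher-quality filters (polyphase or windowed-sinc), and nothing here is taken from Roon's implementation.

```python
def resample_linear(samples, ratio):
    """Resample a mono block by a non-integer ratio using linear interpolation.

    ratio = output_rate / input_rate. A ratio of 1.0001 produces about
    100 ppm more output samples than input samples.
    """
    if len(samples) < 2:
        return list(samples)
    last = len(samples) - 1
    n_out = int(last * ratio) + 1
    out = []
    for i in range(n_out):
        pos = i / ratio          # fractional read position in the input
        idx = int(pos)
        frac = pos - idx
        if idx >= last:
            out.append(samples[-1])
        else:
            out.append(samples[idx] * (1.0 - frac) + samples[idx + 1] * frac)
    return out


# Doubling the rate of a short ramp inserts the midpoints.
print(resample_linear([0.0, 1.0, 2.0], 2.0))

# A 100 ppm correction on one second of 44.1 kHz audio adds a handful of samples.
block = [float(n) for n in range(44_100)]
stretched = resample_linear(block, 1.0001)
print(len(stretched) - len(block), "extra samples")
```

The point is that the correction needed for a 100ppm mismatch is only a few samples per second, which is why it can be done continuously without gaps.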

As an aside, how is the “master” elected? Are there any things that would make it easier/better?

Right now my players (Roon Bridge) are either dedicated Mac Minis driving BADA Alpha DACs, or RPis with either IQaudIO DAC+ or DigiAmp+ HATs. I have some IQaudIO DIGI+ HATs coming to try in place of the Mac Minis streaming to the BADA Alpha DACs.

OK, so we can probably lay the problem at the feet of the network. For information, my Roon server is a 3GHz I7 Mac Mini with an SSD boot drive, with the audio storage on a hardware RAID connected with Thunderbolt. The server runs the Roon server and my home automation server (Indigo). It really isn’t doing anything else. Indigo doesn’t consume much. Top says that Roon is the top consumer of CPU, peaking at about 20%.

My L2 ethernet network is implemented by a Cisco Catalyst 4948 switch. This is a managed switch running IOS 15.0(2). I currently have about 35 active ethernet drops in my house. Right now I am running a single L2 broadcast domain (VLAN 0) for everything. (I have been thinking of segregating the audio system into its own VLAN/subnet so high-bandwidth video won’t impact audio but I figured at 1Gbps it wouldn’t be a problem.) The router is PfSense running on a dedicated PC but it isn’t in the path other than to provide DHCP and DNS services as well as a default route to the Internet. I use WiFi only for devices that don’t have ethernet, e.g. my iPhone and iPads, and they are only used for control functions, nothing that is latency-jitter sensitive.

I can provide the cisco config file if you like but I haven’t turned on any of the policy management features … yet. So right now it is functioning more or less as a dumb switch.

Your suggestions? Would you like me to start a new thread for support and get it out of this thread?

brian · October 26, 2016, 10:56pm

I think this should be approached from 3 fronts:

1. Experiment with the linking configurations and try to find a configuration with just 2 zones that exhibits noticeable drift. You may find that this is limited to only certain pairs of devices, or a certain device, or something similar, which will help point us in a direction.
2. Let’s have @support get in touch with you, and capture a support package at a point right after you’ve noticed substantial drift. There are some useful stats being captured/logged during playback. We’ll at least have an estimate of the drift of each of your devices relative to the system clock, and a sense of how much variance is happening in the clock measurements.
3. If possible, temporarily mess with the network so it is just your Roon Core machine + the minimum number of audio devices needed to reproduce the problem, plugged into either a dumb switch or a consumer-grade WiFi router. You’ve got a big network, so if isolating like this modulates the problem, it gives us clues of where to look.

One last thing to try–I don’t know the current status of IQaudIO’s Roon Ready image (or if that’s how you’re using the device), but it’s fairly likely that the version of RAAT that is on that image is on the older side since they were one of our first partners. There have been some improvements to clock synchronization stuff over the past few months. Nothing that I would characterize as “critical”, but we did implement a better low-pass filter to smooth out noisy clock synchronization exchanges and it’s fairly likely that that’s not made it to their images yet.
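The thread does not say what kind of low-pass filter RAAT uses, so the following is only a generic illustration of the idea: a single-pole (exponential) smoother applied to a series of noisy clock-offset measurements, so that one bad exchange does not yank the estimate around. The `alpha` parameter and the sample values are assumptions made for this sketch.

```python
def smooth_offsets(measurements, alpha=0.1):
    """Single-pole low-pass filter over successive offset measurements.

    alpha close to 0 trusts history (heavy smoothing);
    alpha close to 1 trusts the newest measurement (little smoothing).
    """
    smoothed = []
    estimate = None
    for value in measurements:
        if estimate is None:
            estimate = value                      # seed with the first sample
        else:
            estimate += alpha * (value - estimate)
        smoothed.append(estimate)
    return smoothed


# Invented offsets in milliseconds; the fourth exchange hit a network delay spike.
raw = [1.0, 1.0, 1.0, 11.0, 1.0]
for r, s in zip(raw, smooth_offsets(raw)):
    print(f"raw {r:5.1f} ms -> smoothed {s:4.2f} ms")
```

With `alpha=0.1` the 10ms outlier moves the estimate by only 1ms, and the estimate then decays back toward the true value over the following exchanges.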

If you are running their image, you could try installing Roon Bridge on Raspbian instead of using their image. This would at least determine if the latest RAAT bits make the difference for you. If so, we’ll encourage them to push out a new version.

I’m going to pull this out into a support thread.

AMP · October 27, 2016, 12:08am

Hey @brian and @support I’m experiencing this problem right now…

2 Zones grouped

“Primary” is an Ubuntu 16.04 server (Intel Core i3) machine running the bridge code and feeding a Simaudio MOON 380D via USB.

“Secondary” is a dCS Vivaldi upsampler (Roon Ready) running their version 307 firmware.

Core is an i5 NUC running Ubuntu 16.04 server with 8GB RAM and locally-attached storage.

Network is Cisco SG-200 series switches (unmanaged) and all of the connections involved are hardwired.

Delay amount is variable as time goes on, but it got close to a second at one point. The Vivaldi is always the one lagging behind the bridge. When I started playback the sync was perfect and stayed that way for a while. I’m not 100% certain when the drift got bad as I was on the phone for a bit.

I’ve set aside a copy of the RoonServer log for review and will keep this playing as long as I can. I can arrange a remote access connection for support in case someone wants to watch it in real time.

AMP · October 27, 2016, 12:19am

… and then there was a stutter and dropout shortly after a track change. Roon re-started the stream and everything fell back into sync. This is not the first time I’ve experienced this and I’m willing to bet that the two zones will drift apart over the course of the next 90 minutes.

brian · October 27, 2016, 12:30am

@AMP,

dCS uses a different RAAT implementation than just about everything else, that is also very freshly developed/released. If your problem is specific to the Vivaldi, then it’s likely not the same as Brian’s.

The first thing I would suggest is to temporarily remove the Vivaldi from the setup and see if you have issues linking your other zone with something else. This will point in one way or another regarding dCS’s RAAT implementation.

AMP · October 27, 2016, 12:42am

@brian

Unfortunately, right now all I have that can be grouped is a Vivaldi, Rossini, and this Ubuntu bridge. I can try lighting up another bridge tomorrow morning just to see if I can replicate the same problem.

As to the Vivaldi’s RAAT implementation I understand that it’s different. Regardless, group playback involving a dCS device is a real problem and I’m fairly convinced that I can reproduce it any time in case anyone is interested in investigating.

Thanks!

Bob_Welsh · October 27, 2016, 12:55am

I also have this drift/sync problem and may have some insight on a workaround or solution. I have a 3 Roon endpoint system. I was experiencing drift on one of the endpoints but then resolved the problem and the solution had nothing to do with the network but rather to make sure all DACs were asynchronous USB or directly connected to the server.

Problem: Roon endpoint drifting relative to other Roon endpoints. (2 of my endpoints are USB and HDMI connected directly to my MacBook Air server running Roon; the 3rd and problem endpoint was a Sonic Orbiter SE connected via Toslink to an iFi Retro 50 DAC.)

Solution: to switch to direct USB connection from the endpoint to the iFi Retro DAC (instead of toslink connection between the same Sonic Orbiter endpoint and same iFi Retro DAC). So the only variable changed was how I connected the endpoint “computer” to the actual DAC.

My theory: a bit long below, but given for reference. I would like to use toslink endpoints in the future and am not sure how to do this reliably. Thanks for any further clarifications. Here’s more on my theory on the resolution which aligns with what others have written in earlier posts:

I’ve been pondering this drift issue and realized that the level of accuracy in cheap clocks is pretty low (and the Sonic Orbiter (SOSE) for sure has a cheap clock, possibly not even a crystal oscillator, maybe just RLC - not sure). But even modestly high-end crystal (or Rubidium) oscillators are far, far more accurate. There are 3600 seconds in an hour, so if my oscillator in the SOSE is 100ppm accurate (assumption), then after one hour of listening I’d be out of sync by 1 part in 10,000 of 3600 seconds - or ~1/3 of a second. This seems about what’s happening. But then I have the sense that the crystal oscillator in the iFi Retro 50 is probably a lot better than this - probably accurate to 1ppm - again an assumption, but I’m guessing it may even be better than that. That means that after one hour of listening I’d be out of sync by about 3 msec or perhaps much less (barely audible). I’m of course making an educated guess about the accuracy of the iFi Retro clock. But it has an asynchronous USB DAC, so if I connect via USB (instead of Toslink, which is synchronous to what’s being sent by the SOSE, which in turn gets its data from the network and has to reclock it out via Toslink), then the iFi becomes the clock reference and is likely good enough.

Brian_Lloyd · October 27, 2016, 1:52am

[quote=“brian, post:6, topic:15407”]
Experiment with the linking configurations and try to find a configuration with just 2 zones that exhibits noticeable drift. You may find that this is limited to only certain pairs of devices, or a certain device, or something–which will help point us in a direction.[/quote]

I see the problem whenever I run any two zones grouped. And now that I am paying more attention, the problem seems to be sync slip that goes in jumps. It runs fine for a while and then suddenly there is a noticeable difference. If it runs for a long period, e.g. 30 minutes, it gets really bad, probably 500+ ms. And I have also noticed that pausing the stream and restarting it doesn’t necessarily get it to resync.

Tell me how to grab the data and I will.

I have several spare switches (just pulled a cisco SG-200-26 because I ran out of ports) I can use to do this but I don’t think it is a problem. Most devices, while connected to the network, are quiescent. But I’ll run a test with a dumb switch and two RoonBridge devices to see what happens.

[quote]One last thing to try–I don’t know the current status of IQaudIO’s Roon Ready image …
If you are running their image, you could try installing Roon Bridge on Raspbian instead of using their image. This would at least determine if the latest RAAT bits make the difference for you. If so, we’ll encourage them to push out a new version. [/quote]

I put a fresh load of Raspbian on each Pi; did the update/upgrade; and then used your RoonBridge loading script to install and configure RoonBridge. I did not use the image from IQaudIO. I’ve been messing with Linux long enough to know you can’t really count on anyone’s custom image.

Regardless, the problem occurs with my Mac Minis driving the BADA Alpha DACs too. So it afflicts both Linux and Mac RoonBridge implementations.

OK. For what it’s worth, I’m really good at breaking things. I make an excellent beta tester for that reason. Not only that, because I’m a hacker AND a tech-pubs guy, I write good bug reports.

Brian_Lloyd · October 27, 2016, 2:43am

Sounds like you are experiencing the same thing I am. I am beginning to suspect a bug.

BTW, the cisco SG-200 series is, IMHO, a managed switch, it just doesn’t have the L3 features of the SG-300 or the Catalyst switches. The SG-200s will do 802.1q VLANs, report traffic stats with SNMP, do some L2 policy/traffic-shaping, let you monitor a port or a VLAN through a mirror port, and you can configure/monitor it through its built-in web server. So I would definitely lump it into the “managed switch” category.

AMP · October 27, 2016, 3:07am

Ooops… my bad… They’re SG-100s. I’m using the SG-200s on a different network.

This is exactly the same behavior that I’m experiencing. In my case if I let it run long enough the audio will drop out, the track will skip, and everything will sync back up for a bit. Based on the response from @brian it sounds like my situation may be exacerbated by the newer RAAT implementation in the dCS Vivaldi.

I may have been imagining it, but I do recall reading somewhere that the master clock for the RAAT session is based on the primary member of the zone group. Ultimately I tried it two ways and regardless of which device was the master the end result was the same.

On the bright side I’m confident that I can easily reproduce the behavior.

brian · October 28, 2016, 3:38am

We definitely need to look into what’s going on. I briefed our support team today on next steps with this issue. They’ll be getting in touch and collecting/organizing the information so that we can proceed.

(cc @support)

Brian_Lloyd · February 19, 2017, 10:02pm

Well, it is now toward the end of February. There has been a major release of Roon and the clock-slip between players in a group is still an issue. I really would love to solve this problem.

Geoff_Coupe · February 20, 2017, 7:15am

1.3 has introduced sync adjustments and clock master priority settings for zones. Have you tried these?

brian · February 20, 2017, 7:47am

We identified and corrected some problems in this area with 1.3. I’m surprised to hear you’re still broken.

@support, let’s get logs of a playback session that documents the point in time from the start of playback until drift is at the ~500ms level quoted above, and details of the setup where this occurs so we can try to reproduce this.

Brian_Lloyd · February 20, 2017, 6:31pm

Just to be sure I am not crying wolf, I manually upgraded all my roon-bridge (ArmV7 - RPi) players and am running several players in a group. It took about 30 minutes but then there was a sudden jump of about 200-300ms. So the problem is still there. Roon is v1.3 build 204 running on MacOS. All the players are Bridge 1.0 (build 66) on RPi using IQaudIO DAC, Digi, and Amp HATs.

mike · February 20, 2017, 7:42pm

Hey @Brian_Lloyd – would you mind following Brian’s instructions below, and getting us some logs from the most recent occurrence? Instructions on sending us logs can be found here.

Thanks!