Grouped Playback exhibiting Clock Drift

Brian_Lloyd · February 20, 2017, 8:24pm

OK. Logs zipped up and sent to support@roonlabs.com.

Brian_Lloyd · February 20, 2017, 8:27pm

Logs not sent as apparently "support@roonlabs.com" is not a valid email address.

Rugby · February 20, 2017, 9:39pm

The instructions on transferring the log files from the link above ask you to either, put the log file into Dropbox and give them a link to them. Or, tell them you don’t have access to Dropbox and ask them to provide a way to upload the logs directly to them.

brian · February 24, 2017, 9:44am

@Brian_Lloyd, I was hoping for a smoking gun in your logs (like huge round-trip times during clock synchronizations), but it wasn’t there.

So I tried to reproduce this–I set up two Pi’s. One with a IQaudIO DAC Pro, and the other with an IQaudIO Digi+. I’m at three hours of playback, checking on sync every 10-15 minutes and it’s still perfect. I’ll let it run overnight and check it in the morning. Both are running latest DietPi+stable Roon Bridge.

Do you think there’s anything else I might need to do to see this fail?

Brian_Lloyd · February 24, 2017, 1:54pm

There shouldn’t be huge round-trip times given that all there is between the server and the players is a cisco switch. There is buffering going on in the switch because the server’s e-net is running at 1Gb and the Pi’s are running at 100Mb but it seems to me that should actually reduce the arrival time differential.

You mention DietPi. What is that? I am running the standard release of Debian for the Pi. Is there a different kernel I should be running on my Pi’s?

And the server is running on a Mac Mini. What are you running your server on?

I just don’t know what to tell you. The sync slip is there. Sync is good for awhile and then it suddenly changes. If I leave it I can her the time different increase over time. It always seems to go in jumps. More than that I can’t tell you.

Perhaps the question is, by what mechanism is it possible for there to be a substantial meta-stable time differential?

Sorry to be a pain. I really like the product. If we could solve this it would be nearly perfect.

AMP · February 24, 2017, 2:06pm

What kernel version?

As far as I know dietPi runs a very recent release whereas Debian tends to be quite a bit more conservative.

Brian_Lloyd · February 24, 2017, 2:38pm

Does the current kernel version on the Pi matter? What in the various revs of the kernel could effect this issue? Again, it gets back to, “What COULD cause the sync slip problem?” Since the slip is a meta-stable state that goes in jumps of a significant and quite-audible number of milliseconds, and not constant drift, it strikes me as a buffer offset, not a bit offset, in the stream.

Since I don’t know the details of your synchronization protocol I cannot make any sort of reasoned suggestion. But I do recall a similar kind of problem from the dim, dark recesses of the past. Many years ago (about 30) I was experimenting with running TCP/IP over packet radio. The radio was a half-duplex link. I was varying link-layer parameters in an attempt to optimize throughput on links with varying reliability and found I could make the TCP round-trip timer trend to zero. When this happened the source would start going into retransmission before the link could even transmit and turn-around to receive the ACKs. When I relayed this to Phil Karn, he was able to recreate the problem. His discovery was that, once TCP fell into retransmission, it was no long possible to determine RTT since you could not tell the difference between an ACK for the original segment and an ACK for the retransmitted segment. In that case you needed to toss any RTT values derived during retransmissions. This fix solved a lot of problems and became known as the Karn Round-Trip Timing Algorithm.

So this is only a thought about how assumptions we make about RTT could be erroneous.

And I am completely shooting in the dark here.

AMP · February 24, 2017, 2:51pm

Well you’re seeing an issue that @brian couldn’t reproduce on what is likely a much newer rev so most other things being equal this very well could be the difference that makes a difference.

To answer your question there could be any number of things in the kernel… buffer sizes and clearing mechanisms, process scheduling, network drivers, ALSA drivers, firmware, and any number of other things. RAAT playback is two kernel threads (IRQ handlers) moving data with RAATServer controlling the process. Most of what is going on is taking place inside the kernel.

The next logical step here is for either @brian to replicate your Debian install on his Pis or for you to try loading up dietPi and seeing if that makes a difference.

Brian_Lloyd · February 24, 2017, 3:18pm

I have a fair amount of experience with kernel timing delays causing problems with stream buffer processing in software-defined radio. (Very similar to audio processing.) There one ends up with a transient anomaly, e.g. pops and or stuttering, not a meta-stable time slip.

No, that is not completely true. I can remember where the the delay kept increasing indefinitely in quantum jumps. I think it had to do with an error in buffer replication in a stream due to retransmissions.

And part of the problem is in not knowing which end is introducing the added latency.

brian · February 24, 2017, 4:46pm

My overnight test is still playing in sync ~11hrs later. I can see in your logs that this isn’t a network quality issue or an issue with round-trip times during synchronization. The logs don’t contain every rtt, but the worst case times are around 1ms, and the clock measurements are low-pass-filtered, so it would take consistent bad measurements to jerk it around.

We use UDP for clock synchronization. Sometimes we get a misleading result when a WiFi router buffers packets asymmetrically…but you’re not on WiFi and your RTT’s are very stable and small.

Anyways–a reckoning issue related to round-trip-time was my first guess too. That’s what the logs disproved.

The sudden-ness is the strange thing. I don’t think it’s a buffer-sized error–the buffers are timestamped, much smaller than hundreds of milliseconds, and the system is designed to play silence when data is missing or not there in time. This has to be some confusion in one of the endpoints about what time it is.

Also strange is that even uncorrected, your clocks wouldn’t have drifted by hundreds of milliseconds. In the most recent log, the relative drift between the most distant pair was ~65ms.

(In my overnight test, that number–the total correction amount was 1.5 seconds–I have a bigger difference in clock rates from Pi to Pi than you do, and the test ran much longer)

A couple of other thoughts…but not good ones. The key to fixing this is going to be reproducing it:

Is there any chance you have time synchronization stuff (NTP) running periodically on the Pi’s? I don’t like the theory (we should be isolated from it).
Any chance that there are thermal issues? Sometimes when CPUs “save themselves” it does really weird things to their idea of time.

I don’t know of a reason why the kernel should make a difference, but we’re not in a normal place with this one. I used DietPi because it’s quick/easy to install, you can pick the IQaudIO and Roon Bridge stuff off of a menu and have it setup without fooling around, and their maintainer hangs around here and fixes problems–so it’s become the most popular choice for our users. Before that I ran Raspbian-based Pis for months and never ran into this, but I wasn’t doing attentive testing like today.

We did have one user who found that one of his several Pis had a hardware problem that messed with zone linking. Symptoms were different than yours, but not 100% inconsistent, swapping the hardware (With same SD + HAT) made is troubles disappear. We never figured out what was wrong with it–after the problem was isolated, he just replaced it and moved on.

Brian_Lloyd · February 24, 2017, 5:38pm

OK, thanks for the info. Strange problem.

Yes, all my Pi’s are running ntpd. If ntpd is present I usually start it on whatever machine I am running.

I have noticed that the ntp implementation on the Mac is broken. They don’t run it as a background process, but rather kick it off with chron periodically which results in clock correction in jumps. I keep meaning to get in and figure out where they are doing that so I can make it work properly (true chimer) and be a peer for all my other clocks. (Maybe I should just punt MacOS and switch to Linux. MacOS is definitely headed down-hill.)

Thermal problems, eh? That is a possibility. I am having an airflow problem in my AV cabinet for the main system and I know temps get higher than normal in there at times. Since that is my central system I can hear in all rooms, I tend to make it one of the systems I group with. The other two systems are in my master bedroom and in my office, which are on opposite sides of the house. I’ll try just grouping those two to see if that might be the issue. Of course, if I open up the front of the AV cabinet there is no thermal issue.

I’ll group just the MB and office systems to see if that has an effect.

brian · February 24, 2017, 7:16pm

Ok. Let me know how it goes. I’d like to get to the bottom of this. If you can isolate to one Pi, or determine that the problem is kicked off during NTP sync on one of the devices or something like that it will be a clue.

Brian_Lloyd · February 24, 2017, 7:53pm

ntpd is running normally on the Pi’s. That means there is no discontinuous time adjustment on them. Their clocks are being continuously “tweaked”. MacOS is another issue. I know its clock adjustments are discontinuous. Hmm, it would be pretty easy to test if I just disabled ntp on the server. I’ll try that.

So, things to try:

Different clients, not my main system;

disabling ntp on the mac server.

Time to let it run and see what happens.

Brian_Lloyd · February 24, 2017, 8:10pm

OK, about 10 minutes in listening to an album I am VERY familiar with I was able to hear the FORWARD jump on the system in the office even without being able to hear the system in the MB. When I heard the jump I went out to where I could compare the two. Now there is about a 50-75ms difference.

(I sing and play jazz. I am VERY sensitive to time changes in music.)

Hmm, is it possible that a sudden discontinuity in the RTC on the server could create a sync jump between two of the grouped systems? I know that happens on the server because of the brain-dead way Apple is using ntpd.

Brian_Lloyd · February 24, 2017, 11:18pm

OK, I turned off automatic time update in MacOS. (Apple’s brain-dead use of ntp.) The system has been playing with three zones in a group for over 2 hours with no sync slip. Is it possible that an RTC discontinuity in the server could cause a mismatch of timing?

Brian_Lloyd · February 25, 2017, 12:43am

And about an hour later I finally got the sync slip.

brian · February 27, 2017, 7:06pm

I don’t think the RTC jump on the server could cause a sync slip. It would have to be on an endpoint–they are synchronizing with each other directly.

So the office skipped 50-75ms of audio at once. That’s not behavior consistent with the drift compensation stuff (we don’t do sudden adjustments–it’s always pulled in line over time). That sounds more like the audio driver flushed some audio data down the drain without telling us. The scale is right too–typical buffer sizes are in the 40-100ms range. So that could be two 40ms buffers or one larger one.

This is really starting to sound like a lower-level issue than us. If the audio driver were to skip part of the stream, we would lose our sense of time permanently.

It is weird that that would only happen during grouped playback. I almost think you should be able to reproduce this symptom in the office without grouping too.

In terms of isolating…you might play with this setting. It lets you pick a priority order for determining the “master” clock. The master device is doing the exact same thing as it would be during single zone playback (only slaves are doing drift compensation).

The question is–if you set clock master priority on the office to 1, and leave the others at default:

Do you still get the slip?
Is the slip still in the office?

Brian_Lloyd · March 8, 2017, 12:28am

Sorry for the delay. Evidently the notification got lost and I didn’t know you had replied.

I have changed the clock master priority to make Office #1, Living Room (main system) #2, Master Bedroom #3, and back porch and hangar are “default”. Office and Living Room are grouped now. Let’s see.

I do think I have heard sync-slip in the office when just listening to that zone (not grouped) but if I am not listening critically I can’t say for certain. So you think it might be a low-level driver problem in Jessie Lite? ALSA? That is what I usually run on my RPi’s.

Brian_Lloyd · March 8, 2017, 4:40am

After making the recommended change and making office the clock master I haven’t heard any loss of sync even after 4 hours of playing. hmm.

I think I will try refreshing the office Pi with DietPi. It seems to me that, the fewer things that are running, the fewer problems one is likely to have.

Brian_Lloyd · March 11, 2017, 5:51pm

Ok, converted from Jessie-lite to DietPi. All is working. I have roonbridge on the Pi in the office set as the highest priority clock master. I have it running, grouped with the main system in the living room running for a couple hours now. No sync slip.

Several hours later: Yup, it is working perfectly now. No sync-slip anymore. I call this one closed.