Qua Continuum, Continuum One causes Roon Server to crash during Audio Analysis (see February 3, 2019 posts)

Thanks @vova!

I had hoped that you might have had a -DDEBUG build available with more ASSERT()s built in. Hopefully the latest crash, which did trigger an ASSERT(), may help … but yeah, the crashes are all over the map. The Good News™ is that at least the stalls and the other failures all seem to be related to crashes of one sort or another, some causing Roon Server restarts and some merely lobotomizing the Roon Server, making it incapable of functioning.

Casey

@noris, @vova,

So is there any more information I can provide? Tests to run? Should I upload my current Roon Server Database so it can be checked for corruption? Are there special binaries/debug flags we can try?

I’ve pretty much given up trying to listen to music since the Roon Server is so flaky. At some point I’ll have to give up, uninstall Roon, and reinstall the old Logitech Squeeze Server. This would be sad because: A. I bought a Lifetime Roon Subscription, and B. at some point I’m hoping to get a new DAC …

Casey

@noris, @vova, @mike,

New experiment:

I’m currently trying to do a complete regeneration of my Roon Database. I had mentioned the hypothesis that it could be corrupted in some fashion and causing the crashes. @mike said to just stop the Roon Server, move /var/roon/RoonServer/ off to the side, and then restart the Roon Server. It’s been running now for ~10 minutes and has imported ~8,200 tracks, so it should be done in a bit over two hours. I’ll leave the system alone and won’t try playing anything till it’s finished.
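For anyone wanting to repeat that estimate, here is a minimal sketch of the arithmetic. The collection size is not stated in this thread; the 100,000 tracks below is an assumed figure, chosen only because it is consistent with the "bit over two hours" estimate, and the function name is made up for illustration.

```python
def remaining_minutes(imported: int, elapsed_minutes: float, total: int) -> float:
    """Estimate minutes left, assuming the import rate stays constant."""
    rate = imported / elapsed_minutes      # tracks per minute so far
    return (total - imported) / rate


# ~8,200 tracks in ~10 minutes comes from the post above.
# total=100_000 is an ASSUMED library size, not a number from the thread.
eta = remaining_minutes(imported=8_200, elapsed_minutes=10, total=100_000)
print(f"about {eta:.0f} more minutes")
```

A constant import rate is itself an assumption; large DSD files may import more slowly than small FLAC files.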

Casey

At the risk of getting ahead of myself – if the instability comes back with the new database, another interesting experiment might be to:

  • Make a fresh db
  • Only import a few pieces of content, like a folder with 20 tracks in it – in fact, if you have a streaming service, maybe even leave the local media aside completely
  • Let the subset of content finish importing
  • Start playback, turn on loop, and leave it overnight

If the instability comes back, it will be really interesting to see if this is related to media.

Thanks Casey!

Well, the “Good News™” is that the new Roon Database rebuild ended up with the Roon Server crashing again with SIGSEGVs and other aborts. So something about my FLAC/DSD/etc. collection is causing the Roon Server extreme heartburn. I’ll see if I can figure out any pattern when I get back from lunch …

Casey

Tomorrow I’ll start trying to use a restricted set of my music library to cause the Roon Server failures. But in the meantime, I thought you might enjoy the latest ASSERT() that I triggered tonight:

Why are we accessing an entry that is not allocated
Stacktrace:
      at <unknown> <0xffffffff>
      at (wrapper managed-to-native) object.__icall_wrapper_mono_gc_alloc_obj (intptr,intptr) [0x00008] in <370a0c27f4b74d1a81431037df6d75bf>:0
      at (wrapper alloc) object.AllocSmall (intptr,intptr) [0x00077] in <370a0c27f4b74d1a81431037df6d75bf>:0
      at Sooloos.Broker.Query`3<INTERNALTYPE_REF, PUBLICTYPE_REF, THREAD_REF>.NotifyEndUpdates () [0x00099] in <dcbcf4f91fec401e800a83ba373c6044>:0
      at Sooloos.Broker.QueryManager/<>c.<NotifyEndUpdates>b__14_2 (Sooloos.Broker.IQuery) [0x00001] in <dcbcf4f91fec401e800a83ba373c6044>:0
      at Sooloos.Broker.QueryManager._ForEachQuery (System.Action`1<Sooloos.Broker.IQuery>) [0x00033] in <dcbcf4f91fec401e800a83ba373c6044>:0
      at Sooloos.Broker.QueryManager.NotifyEndUpdates () [0x00159] in <dcbcf4f91fec401e800a83ba373c6044>:0
      at Sooloos.Broker.Music.Library.EndMutation () [0x00232] in <dcbcf4f91fec401e800a83ba373c6044>:0
      at Sooloos.Broker.Music.Module.ev_exit () [0x0000c] in <dcbcf4f91fec401e800a83ba373c6044>:0
      at (wrapper delegate-invoke) <Module>.invoke_void () [0x0006d] in <370a0c27f4b74d1a81431037df6d75bf>:0
      at Sooloos.SynchronizationContextThread.OnExit () [0x0000b] in <7f0a74b68d2a4a0ba3084b62b8028591>:0
      at Sooloos.SynchronizationContextThread._Dispatch (Sooloos.SynchronizationContextThread/SendOrPostWrapper&) [0x00029] in <7f0a74b68d2a4a0ba3084b62b8028591>:0
      at Sooloos.SynchronizationContextThread._Go () [0x00025] in <7f0a74b68d2a4a0ba3084b62b8028591>:0
      at System.Threading.ThreadHelper.ThreadStart_Context (object) [0x0001f] in <370a0c27f4b74d1a81431037df6d75bf>:0

There’s more to the stack trace but it’s all just initial thread startup/dispatch …

Casey

This may be a stretch, but have you considered testing your RAM modules? RAM flakiness has been known to cause random crashes on code that uses a lot of memory (like Roon). The OS might work OK because one DIMM is fine, but when more memory is needed, the other DIMM gets used, and crashes occur.

@Fernando_Pereira,

Yeah, I thought about Memory Errors. (It’s very frustrating to me that no reasonably priced home computers have Parity, let alone ECC when we’re looking at home systems now sporting 32GB like mine.) I’ve been running “memtester” from the Ubuntu distribution and it hasn’t found any issues. Right now I have 7 copies running, each sucking up 4GB of memory …

Meanwhile, I’ve been working to construct smaller subsets of my music collection centered around items which were added just before mid-December 2018 when the problems arose. The theory I’m exploring is that perhaps one of them is causing problems for the Roon Server or one of its support libraries. On Friday I managed to reproduce the error with a relatively small 51GB subset comprised of 32 albums. Today I just managed to reproduce it with a 29GB subset of that 51GB collection, comprised of only 11 albums. It’s a difficult process since the issue reproduces randomly. So now I’ll have to slowly create a third subset pulling albums from the second subset …
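The narrowing process described above is essentially a bisection over albums, with retries because the crash does not reproduce every time. A rough sketch of the idea follows; it is not a script that was actually used here. The `crashes` callback is a stand-in for "import this subset into a fresh database and watch whether Roon Server faults", which in reality is a manual step that can take hours per trial.

```python
from typing import Callable, List, Sequence


def reproduces(subset: Sequence[str],
               crashes: Callable[[Sequence[str]], bool],
               attempts: int) -> bool:
    """Return True if any of `attempts` trial runs on `subset` crashes."""
    return any(crashes(subset) for _ in range(attempts))


def narrow_down(albums: Sequence[str],
                crashes: Callable[[Sequence[str]], bool],
                attempts: int = 3) -> List[str]:
    """Halve `albums` repeatedly while one half alone still reproduces the crash."""
    current = list(albums)
    while len(current) > 1:
        mid = len(current) // 2
        for half in (current[:mid], current[mid:]):
            if reproduces(half, crashes, attempts):
                current = half
                break
        else:
            # Neither half crashed on its own within `attempts` tries:
            # stop and report the smallest subset known to fail.
            break
    return current


# Hypothetical stand-in: pretend exactly one album triggers the fault.
albums = [f"album{i:02d}" for i in range(32)]
culprit = narrow_down(albums, lambda subset: "album17" in subset)
print(culprit)
```

With a fault that only shows up sporadically, a subset that survives `attempts` trials is not proven clean, so a larger `attempts` value trades time for confidence.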

Casey

@Fernando_Pereira, @noris, @vova, @mike,

(woof) And now I’ve isolated the problem down to a single 8.4GB album consisting of a single 53m:11s track in DSD256/DSF format: Continuum One by Qua Continuum. The Roon Server was busy “Analyzing” that track (which is a very quiet ambient track all the way through) when it crashed. I’m looking for the source of this now …

Hhmmm, this seems to be the source:

and here’s an interesting article on how the above in the DSD256 format is giving some players fits:

https://dsd-guide.com/can-your-playback-system-handle-53-minute-song-dsd256-let-us-know

Casey

@Fernando_Pereira, @noris, @vova, @mike, @Alan_Maher, @Martin_Webster,

Ah ha! And it turns out that someone else has complained about this exact issue:

I’ll have to grind through that thread to see what was discovered …

Casey

By the way, just a thought on what might be causing the problem: This is a 53 minute Ambient Music track with very little in the way of dynamic volume changes. I wonder if this is causing a fault in the Dynamic Volume Analysis code which leads to something akin to a Divide-by-Zero or an Off-by-One error and a write beyond allocated memory?

Casey

Hello @Casey_Leedom,

Thanks for posting your findings and for narrowing the issue down to that one 53 minute track! I didn’t quite expect this issue to stem from just one variable, but I’m glad that you were able to locate it.

I will be discussing that track with QA and Dev team later this week, but since you have isolated the issue to that specific track, has removing it allowed Roon to work as expected again with no issues?

Thank you again for your patience and willingness to dive deeper in troubleshooting this issue.

– Noris

Hey there @noris, I’ve started this test now and I’ll let it run for a few days without that track in the collection. In the past it’s run for up to several hours before crashing. But I’m fairly confident that we’ve tracked the problem down. There may be other tracks which trigger the same or similar problems, but that will have to await your team determining exactly what’s gone wrong and looking for other similar issues.

The Good News™ is that this has exposed several issues that I think Roon Labs will need to start tackling over the long haul, scaling being a really big one. Roon has become one of the most popular Music Servers out there. What will you guys do when you have several hundred thousand users all hitting your servers? You’re definitely going to need a CDN if you don’t already have one. Then there are issues like the DNS TTL on accounts5.roonlabs.com being set to 60s and others like metadataserver.roonlabs.net being set to 5m; these are ridiculously small TTLs. I see logins to accounts5.roonlabs.com on the order of once an hour. With, say, 100,000 users that would be ~28 logins/s. Probably doable, but I don’t know how much processing you’re doing on your side for that. And what happens if there’s a network partition? Does that mean that Roon will no longer work? I have a friend who regularly loses power and network connectivity for several days on end during the winter …
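For reference, the ~28 logins/s figure is just the user count spread evenly over an hour. A quick sketch of that back-of-the-envelope calculation is below; the function name is made up for illustration, and it assumes logins are uniformly distributed with no bursts after outages or updates.

```python
def logins_per_second(users: int, logins_per_user_per_hour: float = 1.0) -> float:
    """Average login rate if every user logs in at a steady hourly cadence."""
    return users * logins_per_user_per_hour / 3600.0


# 100,000 users and roughly one login per hour are the figures from the post above.
rate = logins_per_second(100_000)
print(f"{rate:.1f} logins/s")  # 27.8 logins/s
```

Real traffic would be burstier than this average, so peak load could be well above the steady-state number.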

Casey

@noris, @mike, @vova,

Progress report 24 hours in:

No Roon Server faults! Yay!!

The Log got rotated around 3am this morning, but that wasn’t associated with a Roon Server restart, as I’ve been seeing for the last six weeks.

I want to see it run without fault for at least a week before declaring victory since it has sporadically “worked” for up to two days before. However, I’m fairly convinced that the Roon Server faults I’ve been dealing with for the last six weeks are almost certainly related to the 8.5GB DSD256/DSF Qua Continuum, Continuum One single-track album. I expect that it should be fairly easy for your team to figure out what’s going wrong with the Audio Analysis process for this. My guess is a Buffer Overrun and Memory Corruption.

Casey

Hello @Casey_Leedom,

Thanks for the heads up. We’ve already requested the track and will file a bug report once we receive the media.

Regards,
Vova

Hi @Casey_Leedom,

I just wanted to let you know that I brought these infrastructure suggestions to the dev team and they’re definitely in line with some changes we’re already planning. They appreciate the feedback, and so do I!

Thanks,
Noris

So, progress report something like 9 days in: no more Roon Server faults and the User Interface no longer suffers long delays. So I think that it’s nearly 100% certain that all my problems were the result of the Roon Server’s indigestion with the Continuum One 53m:11s DSD256/DSF 8.5GB track. Hopefully a fix can be deployed since I’ve now got my Music Collection rolled back and frozen. Thanks!

Casey

@noris, @mike, @vova,

Have you guys received a copy of the Qua Continuum, “Continuum One” track yet? Have you been able to reproduce the problem if so? Curious Minds want to know …

Casey

Hi @Casey_Leedom,

We have received a copy of Qua Continuum from Cookie but the ticket is still pending review by the dev team and they have not had a chance to investigate the underlying cause yet.

We still plan to address this issue, but unfortunately it’s not looking like it will make it into our next release. As soon as I am able to provide more information regarding this issue, I’ll be sure to let you know.

Thanks,
Noris

Hi @Casey_Leedom,

We’ve made a change here in Roon 1.6 (Build 416) that we believe should address the behavior reported here. Please update Roon and give this a try. You can read the full release notes here: