Nucleus Roon Core death spiral

stevebythebay · May 1, 2018, 4:34pm

My Nucleus faded away earlier today. Started with sudden music stop from Internet radio. Able to restart but then it stopped again. Lost access to Core via iPad app. Could not get to web-based interface, though I could ping the Nucleus. Eventually even that failed. Had to literally pull the power and plug in to get things working again.

Running Core 1.4 build 310 and I doubt this is a network problem as everything else has been solid. Nucleus is wired to Cisco smart switch. That switch is wired to Eero WiFi access point, QNAP NAS, and dCS Network Bridge.

Almost seems like Roon or some other service may have a memory leak, consuming and failing to release real system resources leading to eventual crash.

Anthony_B · May 1, 2018, 5:58pm

I don’t think the logs are hidden as such:
https://kb.roonlabs.com/Logs

stevebythebay · May 1, 2018, 7:05pm

Thanks for pointing this out. I was able to follow and substitute my Nucleus IP to find the appropriate folder and logs. Time stamps are a bit odd. Guessing that what I’m seeing is GMT 0. I don’t recall in the installation any ability to set time zone nor that NTP or any service for that is running, and likely as not, that’s probably a good thing to keep the system as light as possible. Anyway I did see quite a few warnings in the log prior to the current one about running out of threads. And from my cursory reading of the file, there were seeming reboots initiated on an almost hourly basis – not by me. Hoping someone in support can get a handle (no pun intended) on what might have led to the system hanging, and forcing me to literally pull the power.
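For anyone wanting to do the same cursory pass over a log file, here is a minimal sketch that counts lines mentioning a keyword, grouped by hour. It assumes each line starts with a `MM/DD HH:MM:SS` stamp, which may not match the real log layout. The sample lines are made up for illustration and are not taken from an actual Nucleus log.

```python
import re
from collections import Counter

# ASSUMPTION: each log line starts with "MM/DD HH:MM:SS".
# Adjust this pattern to whatever the real log format turns out to be.
STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}):\d{2}:\d{2}\b")


def count_by_hour(lines, keyword):
    """Count lines containing `keyword` (case-insensitive), keyed by 'MM/DD HH'."""
    counts = Counter()
    for line in lines:
        match = STAMP.match(line)
        if match and keyword.lower() in line.lower():
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, invented for this example.
sample = [
    "05/01 20:15:03 Warn: running out of threads",
    "05/01 20:47:10 Warn: running out of threads",
    "05/01 21:02:44 Info: started",
    "05/01 21:30:00 Warn: running out of threads",
]

for hour, n in sorted(count_by_hour(sample, "threads").items()):
    print(hour, n)
```

Running the same function with a different keyword (for example a word that appears in the startup lines) would show whether restarts really cluster at roughly hourly intervals.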

danny · May 6, 2018, 8:22pm

Your analysis is mostly correct. We need a timestamp, not a timezone. This is extremely common for servers and not at all odd. NTP is running to keep the time correct, but it also doesn’t care about where your server is on Earth.

Looking at your system, something is definitely funny here. This only seems to have happened once – does that sound right? It looks like your system was fine for ages and then went berserk over a span of just a handful of seconds. Is there a chance you might have a web browser on the web UI for the Nucleus? Do you have anything connected to the Nucleus’s SMB share? I’m trying to figure out what parts of the system were active (besides Roon).

stevebythebay · May 6, 2018, 9:15pm

I’ve done nothing to the Nucleus environment that you’ve mentioned, no browser active or even used, only to gain access after this occurrence. And no SMB shares that I created, just what Roon points to as the folder on my NAS. And you’re right that this is the first and only time I had this happen. That’s why I thought it was a problem with the Roon app rather than the OS. But it could be something running on the OS other than Roon which is the source of the problem.

I gather that the logs are no help in finding the source of the problem, right? And since it’s not a regular issue is it safe to say there’s no obvious way to trap or trace the problem?

danny · May 6, 2018, 9:28pm

You are spot on. The logs are focused on the hardware and RoonServer – but it appears something else caused your system to freak out and RoonServer was affected by this – the logs from RoonServer point to this reasoning, but have no direct evidence of the misbehaving component.

I am putting on the dev schedule the need to gather more information about the rest of the system so we can diagnose situations like this.

Please let me know if you experience this again. Thanks.

stevebythebay · June 11, 2018, 6:47pm

Just happened again today. Was playing internet radio all morning, no problems. Then started up an album and by the 3rd song it suddenly stopped. Went to restart the song and it failed again. Then lost iPad access altogether. Able to ping the Nucleus and even scan the services but cannot get to the web interface.

What should I do at this point?

Update: waited about 10 minutes and then got the web page back. Able to reconnect with the Nucleus via iPad and restarted the “failed” song. All is well. Decided this time not to do any restart of Roon or reboot of the Nucleus pending your look-see at the unit. I didn’t want to wipe out any bread crumbs that may have been left in the wake of this failure. Would still hope to know what you discover and what you might need me to do going forward.

AndersVinberg · June 13, 2018, 7:09pm

@danny A bit more detail on a few Nucleus crashes.
Happened all within minutes — more precisely, three crashes followed each other without successfully playing between them.

I wanted to compare 24/192 and MQA 24/192 versions of the same album (Coltrane’s Crescent).
I had played 2’40” of the Crescent track.
Hit the MQA version, selected the Crescent track, Add Next, skip to next, pointed to about 2’20” into the progress (all this before any music started), and it crashed. This time the OS crashed, box didn’t respond to the browser. Didn’t need to power cycle, it came back, although it seemed to take longer than normal.
It remembered its state, the position on the MQA track, I hit Play, crash immediately. This time the OS was responsive, the server came back.
I think this happened again. Eventually I got it to play.

Later, I was able to do those operations successfully. I tried a more gentle procedure, AddNext, SkipForward, wait for sound, jump to 6 minutes, and the more abrupt approach, jumping to six minutes immediately after skipping to the new track without waiting for sound, both worked.

The signal path: [screen capture not preserved]

stevebythebay · June 13, 2018, 9:53pm

Danny: It would certainly help us all with Roon (on any platform) to have some guidance on how best to both capture and report these random events. Maybe some guide to troubleshooting and steps to take that will help your team get to the source of these. How best to preserve the “scene of the crime” so others can do their computer type forensics most effectively. Maybe a procedure or two on helping to investigate things on our own as well as best approaches to recovery. I know we customers are all trying to help Roon be the best it can be.
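One way to capture when a random event like this starts and ends is to probe the box from another machine and record a UTC timestamp for each check. The sketch below is only an illustration under assumptions: the address `192.168.1.50` is hypothetical, and port 80 is assumed to be where the web interface listens.

```python
import socket
from datetime import datetime, timezone


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def format_event(host, port, ok, when=None):
    """Build one log line with a UTC timestamp; `when` should be timezone-aware."""
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    status = "up" if ok else "DOWN"
    return f"{when.strftime('%Y-%m-%dT%H:%M:%SZ')} {host}:{port} {status}"


def probe_once(host, port):
    """Check the host once and return the formatted log line."""
    return format_event(host, port, is_reachable(host, port))
```

Calling `probe_once("192.168.1.50", 80)` every 30 seconds or so and appending the result to a file on the probing machine would leave a record of when the web interface stopped answering, even if the box itself has to be power cycled.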

mike · June 14, 2018, 3:39am

Hey @stevebythebay @AndersVinberg – the most important thing for us is timestamps for when the instability occurred. If you let us know when this occurred, we can enable diagnostics and look at what was happening.

Obviously, reproduction steps are great too, but general details about what was happening ahead of the crash (like you’ve both provided here) are also really helpful.

We’re going to enable some diagnostics now, but if you have some rough timestamps for the reports above, that would be helpful.

Thanks guys – appreciate your patience on this.

AndersVinberg · June 14, 2018, 5:08am

It was a few minutes before my post, and a few minutes before the time stamp of the signal path screen capture.

stevebythebay · June 14, 2018, 5:23am

Based on my post it happened on the 11th probably at 11:15 AM PDT or thereabouts. Do you have access to my system’s logs? If not I believe I can access them and try and make sense of them or send them to you. Thought Danny was able to do this the last time this occurred on May 6th. Tell me how best to get access to whatever files you need and how best to package them up and who to send them to, if that’s the process you’d want us to follow.
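If the log timestamps are in UTC, as guessed earlier in the thread, a local time like this has to be shifted before it can be matched against the log. A minimal sketch of the conversion, using a fixed UTC-7 offset for Pacific Daylight Time:

```python
from datetime import datetime, timedelta, timezone

# Pacific Daylight Time is a fixed 7 hours behind UTC.
PDT = timezone(timedelta(hours=-7), "PDT")


def to_utc(local):
    """Convert a timezone-aware datetime to UTC."""
    return local.astimezone(timezone.utc)


reported = datetime(2018, 6, 11, 11, 15, tzinfo=PDT)
print(to_utc(reported).isoformat())  # 2018-06-11T18:15:00+00:00
```

So 11:15 AM PDT on the 11th corresponds to about 18:15 in a UTC-stamped log.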

noris · June 14, 2018, 8:46pm

Hello @stevebythebay,

I can confirm that we have successfully received your logs and they have been added to your case. I will be sure to let you know once QA has completed their review and has passed us their report. I appreciate your patience here in the meantime, we will be in touch soon.

Thanks,
Noris