Recurring database corruption due to image issue requiring manual intervention (ref#13MXR2) [Ticket In]

Since my last post, another restart sadly permanently broke playback. So, for 12 out of the last 13h, no playback (neither local nor Qobuz) has succeeded. Another restart just now provided no relief.

UPDATE Random observations to add:

  • ROCK refuses to reboot some of the time, while Roon is failing to play back on some device.
  • Then when rebooting succeeds, it still fails playback.
  • Meanwhile, ROCK being inscrutable makes it impossible for me to track any system health indicators beyond the Roon log files.
  • Loading the home screen takes about 2 minutes after a restart. O darn, it’s still not fully loaded. It apparently loads top-down in sequence.
  • History also fails to load, maybe I’ll let that sit for some time too: Nope. Not even after 10 minutes.
  • No rescans stuck this time. Maybe it’s because of the failed backup? Redirecting my backup to a different location… Just gets stuck doing database snapshot. Yup. Restarting the client shows it stuck in exactly the same state as usual: no playback, no artist page load, no history page load. No new backups have been made.

To rule out complete hardware failure, let me move this entire party to different hardware. It would be nicer if I still had my old Linux installation so I could actually diagnose it live.

Oh. Darn. I booted from a live Nix installation and proceeded to get smartcontrol data: http://stackoverflow-sehe.s3.amazonaws.com/25/03/16/7a3ab39c-f4d7-488f-8601-5dc3784f5861/M.2_SSD_128GB_2208V0226A000322_2025-03-16_0049.txt

Looks clean and even the extended self test uncovered no bad blocks. Same with badblocks from e2fsprogs.

HOWEVER. Then I moved focus to RAM. And the story is different:

I think we have ourselves a solid lead here. This doesn’t look right at all.

Thanks for the additional information @S_Heeren - we’ll share it with development and discuss potential next steps.

Thank you for your continued patience in the meantime! :raised_hands:

Hi @S_Heeren,

The failure in the screenshot you’ve included is pointing at bad RAM - we suggest replacing it with new RAM, and then testing for similar issues.

@benjamin That’s why I told you guys… I assumed you got that the first time you replied :slight_smile:

I’ve been running with the faulty DIMM removed, but without restoring any database. Looks fine now. However, I realized that a LOT of work went into tidying up my Roon library aside from the parts I already mentioned before (playlists, MUSE profiles, history and statistics). So I’ll probably select a suitably old backup (I should have one from early 2024) and re-test with that somewhere next week. If that works (:crossed_fingers: ) then I should be out of the woods.

Sounds good @S_Heeren, certainly keep us in the loop on how things perform over the next week or so. :+1:

I’ve run with a new database-from-scratch for a few days, realizing that I was missing a LOT of album edits (identifying box-sets and customizing cover-art between different re-releases) as well as all the play-lists and MUSE profiles.

So I went back to the oldest of my recent backups (December 25th 2024). It ran well for about 8 days, but then disaster struck and the database was found corrupt at a restart.

So now, I’m trying my luck with a very old backup that I found lingering on an old medium (Jan 13th 2023). That’s really old. But at least contains some of the work I’ve done.

Sadly, I still have intermittent glitches:

  • covers not loading until I zoom the art in Album View
  • specific library albums missing (e.g. “Viola Sonata, Shostakovich”)
  • the search for that specific string consistently crashes the Windows client and locks up the Android client (until force closed).
  • Worse, both these clients got stuck in a crash loop because a restart brings back the same search screen. The only way to get out was to reinstall the clients.

Interestingly, the iOS client executes the same search without problems (literally repeated it from the “recent searches” drop down, no server restarts in between) and shows the album that was missing from the play history (no album art, “album not found” when clicked)… Even AFTER that, the Android and Windows clients still kept crashing on the same search.

I don’t know what to do here.

[PS. I’ve repeated about an hour of memtest86 a few days back just to make sure no new failures had appeared with the remaining DIMM - it ran completely clean, can’t be paranoid enough]

Some screenshots to illustrate:


image (clicking on that Shostakovich album)
(the crashing search in the search history)

Hey @S_Heeren,

Sorry to hear you’re still running into issues. Could you please share a set of Roon logs from the affected Windows machine? Please follow the directions found here and send the logs over to our File Uploader.

Thank you!

@benjamin Thanks for the interest. I’ve not been able to reproduce the crashes (I think (?) I’ve updated the server from 1510 to 1517 in the meantime). However, the weird album art glitches still present themselves:

And the album cannot be displayed: image

However, the search now completes:

I’ve uploaded the log file for these three actions from the Windows client.

Hi @S_Heeren,

We’ve taken a look at the diagnostics you shared from the client-side and compared them to logs from RoonServer during the corresponding time period. Here are the patterns we observe:

  • RoonServer is logging generic network failures when requesting content from Qobuz’s servers. Qobuz playback requests occasionally time out. This occurs when playing to System Output.
  • Network reachability changes interrupt the connection between RoonServer and the upstream servers responsible for certain background processes.
  • Certain image requests from Qobuz or Roon’s discovery service return 404. There aren’t any caching issues.
  • Roon doesn’t show signs of corruption events during indexing.

Reviewing this thread, it seems you’d like to target either RAM/resource constraints or latent corruption as the underlying problem here and not focus on loading failures.

For background: Roon attempts to detect corruption “on the fly” during background indexing by checksumming every page in the database, on a single core. That doesn’t guarantee that tiny changes can’t accumulate.
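To make the idea of per-page checksumming concrete, here is a minimal sketch in Python. It is not Roon's implementation: the page size, the choice of CRC32, and the function names are all assumptions made purely for illustration.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, illustrative only


def page_checksums(data: bytes, page_size: int = PAGE_SIZE) -> list[int]:
    """Return one CRC32 value per fixed-size page of `data`."""
    return [
        zlib.crc32(data[offset:offset + page_size])
        for offset in range(0, len(data), page_size)
    ]


def find_corrupt_pages(data: bytes, expected: list[int],
                       page_size: int = PAGE_SIZE) -> list[int]:
    """Return the indices of pages whose CRC32 no longer matches the recorded value."""
    actual = page_checksums(data, page_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Hypothetical three-page "database" with a single bit flipped in page 1.
original = bytes(3 * PAGE_SIZE)
recorded = page_checksums(original)
damaged = bytearray(original)
damaged[PAGE_SIZE + 10] ^= 0x01
print(find_corrupt_pages(bytes(damaged), recorded))  # → [1]
```

A check like this only catches changes made after the checksum was recorded. If faulty RAM alters a page before its checksum is computed, the checksum will match the already-damaged data, which is one way corruption could stay latent.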

The only way to ensure that a DB doesn’t contain latent corruption at this point would be to fully recreate it; we recognize how time-consuming and frustrating this can be. However, a current hardware failure inducing corruption independently in each restored backup is also a possibility.

What do you see in the RoonOS Web UI when you reboot and it fails? What happens if you reinstall RoonOS itself?

Also, we understand you’re resistant to troubleshooting your network. But what is the basic topology serving this ROCK?

I don’t remember the exact message, but the web page dryly informs me that “the roon server failed to reboot” (or similar message). I’m not currently considering re-implementing ROCK (since it doesn’t give me the basic means of system monitoring or maintenance; I prefer to keep my system up-to-date and be able to monitor for health).

Not really hesitant, but I had the feeling it was an unlikely culprit. I think the finding of the faulty RAM DIMM confirmed that hunch. The network topology is really simple: it’s all directly-attached 1Gbit LAN cables to the source router (Asus RT-AX92U, which is 2.5Gbit capable) connected to a fiber-optic WAN operating at 1Gbit.

The LAN is rock solid (the non-wired clients use a WiFi mesh with 3 WiFi6 access-points, bridging to the same subnet as the LAN, but the server and desktop are connected by cable anyways).

That’s good to hear. I’ll probably keep running with the current DB then. I have a suspicion that particular album I focused on to show “missing artwork” is somehow no longer available on Qobuz. The fact that the search triggered a crash is no longer reproducible so I’ll write it off as a freak incident.

The only reachability changes should be:

  • A daily router restart at 4:30am (the router firmware has a bug in its TZ database so for one week that would appear as 3:30am in the logs). I’m unlikely to be listening at that time, but it does happen.
  • We’ve gotten an upgrade enabling 8Gbit upstream fiber, which left us with no internet for a few hours on April 1st (not a joke, or at least not a good one :slight_smile: )

I might have tinkered a bit with the firewall setup (to allow for ARC) but that was longer ago, I think. I’ll leave the network config stable for a few weeks at least, so we can see that the only “holes” are at 4:30am now.
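As a rough sketch of how those “holes” could be checked against the logs, the Python below separates reachability changes that fall near the scheduled router restart from those that don't. The timestamps, the 15-minute tolerance, and the function names are hypothetical; real log entries would need to be parsed into `datetime` values first.

```python
from datetime import datetime, time

# Scheduled restart at 4:30am, plus 3:30am for the week the router's TZ bug applies.
EXPECTED_RESTARTS = (time(3, 30), time(4, 30))
TOLERANCE_MINUTES = 15  # assumed slack around the restart, illustrative only


def is_expected_outage(ts: datetime) -> bool:
    """True if `ts` falls within the tolerance of a scheduled router restart."""
    minutes = ts.hour * 60 + ts.minute
    return any(
        abs(minutes - (t.hour * 60 + t.minute)) <= TOLERANCE_MINUTES
        for t in EXPECTED_RESTARTS
    )


def unexpected_outages(timestamps: list[datetime]) -> list[datetime]:
    """Return only the reachability changes not explained by the daily restart."""
    return [ts for ts in timestamps if not is_expected_outage(ts)]


# Hypothetical reachability-change timestamps, not taken from any real log.
events = [
    datetime(2025, 4, 2, 4, 31),
    datetime(2025, 4, 3, 3, 29),
    datetime(2025, 4, 3, 14, 5),
]
print(unexpected_outages(events))  # only the 14:05 event remains
```

Anything left in the result would be a reachability change worth a closer look, such as the multi-hour outage on April 1st.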

PS. There’s still 8GB of RAM for basically just Roon, so that seems plenty. Disk is SSD. 1Gbit LAN.

Thanks again for the detailed update @S_Heeren!

Certainly valid points - keep us in the loop on how things perform over the next few days and we’ll keep the thread monitored for your results. :+1:

Slight update to keep this alive.
Another update made things unstable again. I’ve now gone back to regenerating the entire database from scratch :frowning: Will post conclusions later.

Hi @S_Heeren ,

Sorry to hear that the issue is still ongoing. Yes, please let us know your conclusions when possible.