Corruption issue - "Can't read library - restore from backup"

Hi, folks.

My Roon Server is ROCK on a supported NUC. I was out of town for a few days, came back yesterday, launched a Roon client and received an error indicating that my library couldn’t be read and I needed to restore from a backup.

Same error on all clients. Used the web interface to restart, same issue. Power cycled the server, same issue.

Attempted restoration from my most recent backup from 3 days ago, same issue. Attempted restoration from the backup prior to that, and that got everything up and running.

I’m glad to be up and running again but this issue was a little surprising. I’ve used Roon for a few years and this is the first time I’ve had an issue like this.

Roon version is build 1311 (production) for Core and all remotes.

Posting here in case anyone has a theory for what happened and in case I’m not the only one to hit this. Any thoughts?


Glad you are back up. I suppose one possibility would be SSD corruption; another is a database bug. The really concerning thing is that Roon apparently backed up what seems to be a corrupt database without warning, although they took measures some time ago to prevent that. (There have been reports since then, though, that it can still happen under some circumstances.)

I would recommend keeping the corrupt backup files and opening a support case. Maybe they can learn something from the files if you can provide them.

Thanks - I still have the backup that didn’t work.

SSD corruption feels like a bit of a stretch but of course isn’t impossible. Far more likely that it’s something up the stack. Maybe a database bug, maybe a Roon bug. Anyhow, doesn’t seem to be hitting tons of people or we’d have heard about it by now.

Thanks for the response - if Roon sees this and wants the backup, I’m happy to provide it.

Unlikely that Roon will see this in Roon Software Discussion, IMHO, without a Support post.

Reminds me of my support post here

@gTunes I could move this thread over to the support category, if you want, but you’d then need to add the standard info requested there… just let me know!


You probably had some event that caused corruption of your library during the time you were gone. Did you leave your Roon Server running?

Mostly, though, in spite of what Roon devs believe, your experience and (recently) others’ experiences show that Roon (in some circumstances) still backs up a corrupted library.

@danny?

I suppose it’s also possible that Roon did confirm backup health correctly at backup time, but that the backup was later corrupted on the disk where it’s stored.

Yes, that’s possible. However, multiple recent backups failed, while an older backup worked. If a backup had been corrupted after it was saved, you’d expect that to be a one-off affecting a single backup; and if the entire backup medium had problems, the older backup probably wouldn’t have worked either.

Dunno.

Dunno either, but if the corrupt backup files were stored separately from the working one, it could have been a bit flip or the like in specific disk sectors/cells unique to the affected backups.

I am forced to invoke Occam’s Razor, brother. :upside_down_face:


In our company, we get tens of millions of automatic error reports from our software product per year. There’s a huge long tail of errors deep within some OS function that happened exactly once in many years, never before or after, even though the code path is executed all the time and nothing changed in that area. We have resigned ourselves to attributing them to cosmic rays.


My core is a headless NUC running ROCK which sits in my rack. I don’t power it off (and my wife was home while I was out of town). There was no power outage while I was traveling.

Appreciate the offer (and I can move threads, too) but I don’t think it’s worth the effort. I hadn’t seen your thread before - interesting read.

If I weren’t on ROCK, I’d run some tool to check the disk, including checking SMART data. My gut tells me this isn’t a disk issue (we tend to roll our credentials out here to support the points we’re making, but I do have decades of experience with massive on-prem and cloud-based stores, storage systems, and complex stacks built on top of them). Does ROCK have any facility for doing a SMART test (some URL that offers diagnostics, or some CLI thing)? Sure would be nice if it did.

I’m pretty sure that the NUC BIOS doesn’t display SMART data anywhere but I could be wrong. Or possibly SMART errors get displayed during the boot sequence. If I run into this again, I’ll dig deeper.
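
If I do dig deeper, one option might be to boot the NUC from a Linux live USB and check the drive with smartctl from there, since ROCK itself has no shell. A rough sketch, with guessed device names:

# Check SMART health from a Linux live USB booted on the NUC (smartmontools is on most live images).
# Device names are guesses - run `lsblk` first to see what's actually there.
sudo smartctl -H /dev/nvme0n1    # quick pass/fail for an NVMe SSD
sudo smartctl -a /dev/nvme0n1    # full SMART report (use /dev/sda for a SATA drive)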

If support wants logs or whatever, I can help (I also think they may be able to grab them on their own) but I don’t think I’ll work too hard to make that happen.

From the experience of my support thread, I can
a) back up a database that’s been identified as corrupt, and
b) restore that database from the backup made in a),
without Roon even burping…


Using Samsung’s own tools, my disk checked out as 100% healthy…


This doesn’t surprise me one bit.

My SSD is a Samsung part, also. I can’t easily run something like Samsung’s Magician on a ROCK NUC. I suppose I could try something like PartedMagic. I may do that if the problem recurs.

Don’t know how often you do a backup, but the corruption needn’t have occurred during the period you were away. It occurred between the last good backup you could use and the oldest bad backup you couldn’t use.

That’s the real problem with backing up a corrupted library, in that the corruption can be latent, i.e. it doesn’t show up for some period of time.


Default backup regime, which is every four days, I believe.

I grepped the logs to see what was there. With a bit of help from ChatGPT, here’s the command line I used, which finds all instances of leveldb, drops lines without dates (those are the indented stack-trace lines), and sorts by date.

grep -riE "[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*leveldb" . | sort -t '/' -k2,2 -k1,1
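
A variant that should sort strictly chronologically across all the rotated log files, rather than grouping by file, would I think be to drop the filename with -h and key the sort on the month/day fields (assuming all the logs are from the same year; the trade-off is you lose which file each line came from):

grep -rhiE "[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*leveldb" . | sort -t '/' -k1,1 -k2,2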

Logs in my case go back to 9/6. The first evidence of corruption is on 9/16:

./RoonServer_log.08.txt:09/16 19:48:50 Critical: Library.EndMutation: LevelDb.Exception: Corruption: corrupted compressed block contents

I was out of town at that point and nobody was using Roon. Nothing again until an error on 9/19 (these timestamps are in GMT, so actually my 9/18, at the time I tried to play something after getting back).

At that time, the same Library.EndMutation error shows up. It’s then followed by hundreds of errors similar to:

./RoonServer_log.06.txt:09/19 04:10:10 Warn: [queue] failed to load trackid 9815242: LevelDb.Exception: Corruption: corrupted compressed block contents

all referencing unique track ids.

I think I can see the point in time where I tried to restore the backup that didn’t work. After that, there are no additional entries related to corruption, so at least I know that no corruption has been detected since restoring the working backup.

I just made a copy of my backup directory. I do believe I have some good backups in there and I don’t want Roon to age them out. I’d rather go back to a month-old copy than have to start from scratch.
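
Something along these lines is enough for an archive copy that Roon’s own backup rotation won’t touch (paths are illustrative; point it at wherever your RoonBackups folder and archive drive actually live):

# Paths are made up - adjust to your backup share and archive location.
rsync -a /mnt/roon_backups/RoonBackups/ /mnt/archive/RoonBackups-$(date +%F)/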


At least that answers the question about whether the corruption occurred in the library or in the backed-up data.

Not to belabor the point, but if Roon Server was running then Roon can be doing background updates on its own without anybody ‘using’ Roon.

Yeah, it seems the only way to have a reliable backup is to bring down Roon Server, back up all associated directories using third-party backup software, and set that aside as an archived copy that won’t be touched by Roon Backup.
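
On a regular Linux install (not ROCK, which gives you no shell on the server), that could look roughly like this; the service name and paths are guesses and vary by install:

# Service name and paths are guesses - adjust for your install.
sudo systemctl stop roonserver
rsync -a /var/roon/RoonServer/ /mnt/archive/RoonServer-$(date +%F)/
sudo systemctl start roonserver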

I have no idea what point you’re making or if you think you’re correcting a misunderstanding I have about how software works.

I’m not sure why this would be more or less reliable than one of Roon’s backups. If there is corruption in the database, then it’ll be there in the database and the backups unless something prevents that.

The error that is showing up is what you’ll see when the checksum (CRC) at the head of a storage block is inconsistent with the decompressed contents of that block. If a block is bad, it’s bad. This is, of course, about a logical block and doesn’t clarify whether the root cause is at the physical drive level or is software related.

Anyhow…I don’t think we’re debating anything, are we? :slight_smile:

Enough stuff for @support to dig into