Corruption issue - "Can't read library - restore from backup"

Not at all. I wouldn’t presume to do that to someone who took the time to read Roon’s logs. I’m just saying that corruption can occur when Roon isn’t being used and therefore could have occurred while you were out of town.

The real problem, besides the corruption itself, is that a backup of the corrupted library eventually overlays a potentially good backup. By archiving a third-party software backup out of the reach of Roon’s Backup, you partially eliminate the chance that a good backup gets overlaid. Of course, nothing guarantees that backup isn’t of a corrupted library, unless one runs a Roon Backup at the same time and Roon reports corruption. A lot of work, and that’s assuming Roon can always find the corruption.
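
For instance, a minimal sketch of that archiving step in Python — the paths are assumptions; the point is just a timestamped copy that Roon’s own rotation never touches:

```python
import shutil
from datetime import datetime
from pathlib import Path

ROON_BACKUP = Path("/mnt/nas/roon_backups")    # where Roon writes its backups (assumed)
ARCHIVE_ROOT = Path("/mnt/nas/roon_archive")   # a location Roon never touches (assumed)

# Copy the current backup set to a timestamped folder outside Roon's rotation,
# so a later backup of a corrupted library can't overlay it.
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
destination = ARCHIVE_ROOT / f"roon-backup-{stamp}"
shutil.copytree(ROON_BACKUP, destination)
print(f"Archived Roon backup to {destination}")
```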

No.

You can also do that with Roon Backup by setting up different backup schedules to different, separate locations. In addition, whenever I make extensive edits, I save a manual backup to yet another separate location.

You mean, like this :slightly_smiling_face:

I got whacked a couple of times with corrupted backups and had to redo tags, playlists, etc. from scratch. Which is why I get on my horse when it seems like the problem still exists.

Sure, that sucks, but whether you back up a corrupt DB with separate Roon schedules or 3rd-party software makes no difference. I do very much recommend keeping separate backups in either case, not just one incremental Roon backup in one place (which may eventually replace a good one with corrupt files).

Yep, no scheme’s foolproof if there’s a bug in the software.

Gentlemen, we’re missing a more important point here, since Roon introduced measures to prevent backup of corrupted databases in build 882; see excerpt:

Starting in Build 880, Roon now validates the entire database before performing a backup, which helps prevent Roon from backing up a corrupt database. This will ensure that the databases that are backed up will be valid for restore (assuming the media they reside on is not failing).
This change mostly affects behavior around the backup process, but also makes Roon more vigilant about detecting corruption during normal use of the product.

My disk still checks out 100% healthy, yet there’s latent corruption rearing its ugly head in a seemingly random pattern, which can be backed up and restored without Roon stumbling over it.

Looking at the evidence, I have a hard time believing my hardware is failing.
I think Roon needs to investigate further to find out what exactly is happening.

We’re not missing it - we know, but it is precisely the issue that @gTunes apparently still experienced a corrupt backup, without warning, after that change, leading to this thread. And so did you and @xxx, of course.

My criticism in this thread, perhaps not made strenuously enough.

And in this thread -

And here -

And here -

Roon’s idea that a bad disk causes corruption has never been borne out in the times I experienced corruption. After a restore, I continued to run with the disks that I had always been using.

Backup problems hurt everybody, not just the OP.

I had maybe the very same issue: Roon ROCK on a NUC, current Roon builds, etc. I would get the very same error message, “Can’t read library…”. It started periodically, and soon was happening daily.

It turned out that my NUC’s RAM was either incompatible or I had a bad stick. The NUC and RAM worked fine with Windows and in my laptop. But once the RAM was changed out for new, the NUC worked flawlessly. I have not had any issues since then. And Kingston replaced the RAM under lifetime warranty, so I was happy.

Just to add my 2p: I live in a country where we get regular planned power outages, euphemistically called Load Shedding.

My NUC BIOS is set up to “Restart on Power Restore” which it does quite happily and reliably.

I know power spikes are not a good thing, but in the 18 months of NUC/ROCK running like this I have yet to see any corruption of my library. I will fix it if and when I do; I keep meticulous backups off the NUC.

Before I get hordes of advice about UPSes: they are of little use in this instance, as the internet goes down too and is outside of my control. The lead-acid (Pb) UPSes also discharge so deeply that they trash the battery in a few months. I have other battery-driven means of playing music.

I have a UPS on my main PC simply to allow me to close Windows nicely.

How backups work and fail, from my onlooker’s perspective:

Initial situation: Roon works on a master DB held in RAM. There is also a (main) copy of that DB in file(s) on disk for persistence, which may not be current at all times (deferred write-back of changes for performance reasons).

What does backup in Roon do? The master DB in RAM gets checked for consistency/errors. If none are found, the content gets flushed to disk (making the main copy on disk current) and Roon then halts activities to prevent further changes during the backup process. The backup process then sends a copy of the main copy to the specified backup location, assuming that the main copy is error-free. When this is finished, error-free or not, the main service resumes its work (displaying an error message if the backup [copy process] failed). If the initial DB check fails, you see the dreaded “There was an issue loading your database …” and Roon refuses to create a backup. If there is a flaw with the backup location, the backup might become corrupted, but unless this flaw leads to an abort of the copy process, Roon doesn’t know about it.
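
A minimal sketch of that flow, in Python; every name here is hypothetical, since this is an onlooker’s model rather than Roon’s actual code:

```python
import shutil

class DatabaseError(Exception):
    """Stands in for the dreaded 'There was an issue loading your database ...'."""

def backup(master_db, main_copy_path, backup_location):
    """Hypothetical model of the backup flow described above (not Roon's code)."""
    if not master_db.check_consistency():       # only the in-RAM master DB is checked
        raise DatabaseError("There was an issue loading your database ...")
    master_db.flush_to_disk(main_copy_path)     # make the main copy on disk current
    master_db.halt_writes()                     # freeze changes during the backup
    try:
        # The main copy is assumed to be error-free and is never re-checked here,
        # so silent damage on the system drive would travel into the backup.
        shutil.copytree(main_copy_path, backup_location, dirs_exist_ok=True)
    except OSError as exc:                      # only an *abort* of the copy is noticed
        print(f"Backup failed: {exc}")
    finally:
        master_db.resume_writes()               # the service resumes either way
```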

Assuming that my assumptions about “How Roon’s DB works” are correct, the only DB that is ever checked by Roon is the master DB in RAM. If there are (so far undetected) issues with the storage device that holds the main copy of the DB, then Roon doesn’t know about them (no check) – the main copy is then potentially flawed/defective but gets backed up anyway, and the user ends up with a backup he can’t restore. Furthermore, if a situation arises that leads to Roon relying on the main copy (reboot of the machine, or a stop and later start of the Roon Server service), he may get confronted with the dreaded “There was an issue loading your database …” seemingly out of the blue.
Note: Many Roon Server installations run 24/7 for very long time spans, during which they rely solely on the master DB in RAM.

Does this sound familiar? Does it explain the experience users describe in support threads? IMHO yes.

I don’t know how feasible or practical a check of the main copy on disk – and preferably also of the created backup copy – would be as an integrated part of the offered backup process. The latter can be done manually by simply restoring a backup. As with all manual tasks, it tends to be tedious and therefore often gets ignored by users. Doing it anyway might also uncover serious issues with the storage device that holds the main copy, but there is likely no way to immediately recover should that be the case – the user has to fix the issue first before he can run Roon Server again on that machine. (Note: If only the backup is flawed for some reason, Roon will auto-recover.)
Why might it not be practical for Roon to do those checks as part of the offered backup process? Some users have big libraries, and thus big Roon DBs too. It might already consume a substantial amount of time just to check the master DB in RAM. The storage that holds the main copy, even if it is an SSD as recommended by Roon Labs, is orders of magnitude slower, and access to the backup location might be slower still (thumbdrive, network, internet). So Roon might be down (current implementation scheme), or show signs of degraded performance (if those steps and checks after the initial check of the master DB got offloaded into a separate process so that Roon Server could resume its service before the backup ends), for (many) hours – and it might become impractical for users with really large libraries (runtime of days rather than hours).
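
A back-of-envelope illustration of those time scales — all numbers are assumptions, chosen only to show the orders of magnitude involved:

```python
# Assumed: a 40 GB database, RAM at ~10 GB/s, SSD at ~500 MB/s, network at ~50 MB/s.
GB, MB = 1024**3, 1024**2
db_size = 40 * GB

for medium, throughput in [("RAM", 10 * GB), ("SSD", 500 * MB), ("network", 50 * MB)]:
    seconds = db_size / throughput
    print(f"{medium:8s} ~{seconds:8.0f} s (~{seconds / 60:6.1f} min) for one full pass")
```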

I had a similar issue with my initial ROCK NUC build - every time I went to do an OS update, it would come back with the dreaded database message, and often the database was corrupted. Roon suggested it might be a hardware issue. I rebuilt it with new RAM and an SSD in a fanless case, and there have been no problems in the few years since.

Not in my case. Let me recap, and for the record, I start up/shut down my Roon server every day:

  • Database corruption detected one day during use, demanding a restore.
  • Rebooted instead; the system came back without a problem.
  • Took a database backup of that database.
  • Another corruption occurrence two months later, this time with no success on reboot, so I restored said last backup and voilà, back running again.

So?

Your case is not about a corrupt backup or the backup process but rather about a corrupt master DB in RAM. In the second occurrence, at least some of the corrupted data had already made it to the main copy on disk.

So this still totally fits in. I can see no contradiction to the observations and assumptions I shared.

All these suppositions are interesting, but database corruption would be less of an issue if Roon (still) didn’t blithely back up a corrupted database until all good backups were also corrupted.

I have had the exact same behavior. I did extensive checks on the memory with the help of support and found nothing out of the ordinary. Now I no longer turn off my Core (ROCK), and it has happened only once since. Feels like a bug to me.

Just to add my 2p, but I had exactly the same problem. I back up to two separate locations: the local system drive and, across the network, a NAS drive. Frequency is a couple of times per week. The local backup started to fail. Then the network backup started to fail. I run a headless server. I scratched my head for a bit until I realised that the local system drive was full!

As far as I can see, Roon was trying to check the local DB before backing up and failing because it ran out of space, either during the validation or the snapshotting process. If this proves to be correct, it might help others if Roon added a check for sufficient disk space prior to initiating these processes and reported the result via the Roon UI (to save other muppets from scratching their heads :wink: )
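
A sketch of such a pre-flight check — the path and the 2x safety factor are assumptions, not anything Roon actually does:

```python
import shutil
from pathlib import Path

def enough_space_for_backup(db_path: str, safety_factor: float = 2.0) -> bool:
    """Pre-flight check: is there room for validation/snapshot scratch space?"""
    db_size = sum(f.stat().st_size for f in Path(db_path).rglob("*") if f.is_file())
    free = shutil.disk_usage(db_path).free
    return free >= db_size * safety_factor

# Hypothetical path; report the result in the UI instead of failing silently.
if not enough_space_for_backup("/roon/database"):
    print("Not enough free disk space for validation/snapshotting - backup aborted")
```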

Why is this pertinent to this post? Well, what pushed me to investigate this in the first place was that every night or so, I opened Roon and received the error “Can’t read library - restore from backup” … I ignored it for a time, as rebooting the server seemed to solve the problem, but eventually I gave up and tried to restore from backup, which sent me down the rabbit hole described above.

That’s a slightly different problem from a corrupted library…

Here, the real problem is that Roon puts out the same non-descript, generic error message for many different errors that are unrelated.

It’s like the devs don’t know how to Throw or Catch errors or don’t care…

@BlackJack - some of what you’ve written is accurate, but I believe you’ve made some faulty assumptions, too. There’s more to the execution model than you’ve represented, including a write-ahead transaction log, background compaction of the log, and data compression.

At the end of the day, LevelDB’s storage format is a collection of 4k (at least by default) blocks on disk. Each of these blocks contains one or more records, and LevelDB compresses these disk blocks. Each block carries a small trailer with a CRC checksum of the block’s on-disk contents. This gives LevelDB the ability to verify the contents of a block at read time: it can read the block, checksum the bytes, and verify that against the checksum on disk before decompressing.
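
For illustration, a sketch of that per-block check in Python, assuming the third-party crc32c package; the trailer layout and masking constant follow LevelDB’s table/format.cc and util/crc32c.h:

```python
import struct
import crc32c  # third-party package providing CRC32C (pip install crc32c)

BLOCK_TRAILER_SIZE = 5  # 1-byte compression type + 4-byte masked CRC32C

def unmask_crc(masked: int) -> int:
    """Undo LevelDB's CRC masking (constant from util/crc32c.h)."""
    rot = (masked - 0xA282EAD8) & 0xFFFFFFFF
    return ((rot >> 17) | (rot << 15)) & 0xFFFFFFFF

def verify_block(raw: bytes) -> bool:
    """Check one on-disk block (data + 5-byte trailer) against its stored CRC.

    The CRC covers the (possibly compressed) block bytes plus the type byte
    and is verified *before* decompression; a mismatch is what surfaces as a
    "corruption" error in the logs.
    """
    data, trailer = raw[:-BLOCK_TRAILER_SIZE], raw[-BLOCK_TRAILER_SIZE:]
    type_byte, masked = struct.unpack("<BI", trailer)
    expected = unmask_crc(masked)
    actual = crc32c.crc32c(data + bytes([type_byte]))
    return actual == expected
```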

From the logs and from my reading of LevelDB’s block format, the “corruption” issue I ran into is specifically corruption detected as a result of checksum failure. I could be wrong here, but that’s what the log errors appear to imply.

If I’m right about the nature of the corruption, then it’s hard to point a finger at Roon as the culprit. It’s not impossible that some Roon software issue or regression is related, but it seems far more likely that it’s an issue in or below LevelDB - in other words, a lurking LevelDB bug or an issue with the drive or RAM.

Where Roon is most certainly able to do better is in backing up corrupt databases. There are at least two ways that Roon could do a better job of preventing this:

  1. LevelDB has a “paranoid” mode (off by default) that causes it to verify checksums for every read. Roon could run in this mode at all times (this possibly would hurt performance in a measurable way, possibly not - it’s by no means a given that it would, but it’s worth benchmarking). Or, while doing a backup, Roon could close the database, re-open it in paranoid mode, and do sequential reads of everything: if no error, do the backup; if any error, skip the backup and alert the user (see the sketch after this list). Even if this took 5 minutes, which is probably longer than it would take in actuality, that’s a tradeoff I suspect most of us would make.

  2. Switch to RocksDB for all or part of their operations. RocksDB is a Facebook extension to LevelDB (which originally came from Google, though I’m not sure how much active development is still happening on it). RocksDB has more options for data validation, including background integrity checking.
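
To make option 1 concrete, here is a sketch using plyvel, a Python LevelDB binding that exposes the paranoid-checks option; db_path, run_backup, and alert_user are hypothetical stand-ins for Roon’s internals:

```python
import plyvel  # Python LevelDB binding (pip install plyvel)

def alert_user(message: str) -> None:      # hypothetical stand-in for Roon's UI
    print(message)

def run_backup(db_path: str) -> None:      # hypothetical stand-in for the copy step
    print(f"backing up {db_path} ...")

def verified_backup(db_path: str) -> bool:
    """Sequentially read the whole DB with checksum verification, then back up."""
    try:
        # paranoid_checks=True makes LevelDB verify checksums on every read
        db = plyvel.DB(db_path, paranoid_checks=True)
        try:
            for _key, _value in db:        # a full scan is enough to trip the CRCs
                pass
        finally:
            db.close()
    except plyvel.CorruptionError as exc:
        alert_user(f"Database failed validation, skipping backup: {exc}")
        return False
    run_backup(db_path)                    # only reached if the scan came back clean
    return True
```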

In any case, I do think Roon has some culpability here and should be trying very, very hard to not create corrupt backups when avoidable.

Apols - I’ve added a para explaining why it’s relevant.
