The solution to DB corruption

mikeb · December 19, 2021, 9:21pm

So, if DB corruption is detected on a ‘working’ setup, it just refuses to backup? Is there any repair mechanism, or is it then a Schrödinger’s Roon? Only solution is to restore from a good backup?

danny · December 19, 2021, 9:52pm

Correct. Backing up a corrupt database is wrong and we should never have done it, and we don’t do it anymore.

No. There are some leveldb repair tools out there, but they basically are only fixing index errors, and in our experience, that’s not usually the case in our world.

In the 2 cases noted above, non-reversible damage has occurred. If the bits on the disk were modified or files were deleted, there is no recovery possible. That’s why we suggest backups and have built backup functionality into the core system. Unfortunately, pre-880 backups may also have saved corrupt databases because of that unfortunate lack of integrity check pre-backup.
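To make the "integrity check pre-backup" idea concrete, here is a minimal sketch in Python. It is not Roon's code and not leveldb's on-disk format: the record layout (a 4-byte CRC32 in front of each payload) and the names `seal`, `is_intact`, and `backup_if_clean` are assumptions made up for illustration. It only shows the general shape of "verify everything first, refuse to back up if anything fails".

```python
import zlib

def seal(payload: bytes) -> bytes:
    """Prefix a payload with its CRC32 so later bit flips can be detected."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def is_intact(record: bytes) -> bool:
    """Return True if the stored CRC32 still matches the payload."""
    if len(record) < 4:
        return False
    stored = int.from_bytes(record[:4], "big")
    return zlib.crc32(record[4:]) == stored

def backup_if_clean(records: list) -> list:
    """Copy the records only if every one verifies; otherwise refuse."""
    bad = [i for i, rec in enumerate(records) if not is_intact(rec)]
    if bad:
        # Hypothetical policy: a backup of a damaged database is worse than none.
        raise ValueError(f"refusing to back up: corrupt records at {bad}")
    return list(records)
```

Note that a check like this can only detect damage. Once a record fails verification, nothing in the record itself says what the original bytes were, which is why a known-good backup is the only way back.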

SKBubba · December 19, 2021, 10:03pm

So, people were running for years with corrupt databases but no noticeable problems because the corrupt bits weren’t touched in a while.

In that case, could you write (or maybe it already exists) a utility to dump the database to xml or whatever, ignoring the errors and any related pieces, and then reimport the remaining intact parts, keeping 99% or whatever parts survive?

(Yeah, I know, everybody’s a software engineer on the internet.)

simon_pepper · December 19, 2021, 10:27pm

Hi, I have tried every Roon backup I have and no joy with restoring into a B880/882 build.

I take a backup every 4 days and maintain a depth of 10 on a NAS (so over a month of backups), then back those up to another NAS, where there is nearly a year’s worth.
I have tried with a backup ROCK server, which had a Roon Core image from over a year ago, and no joy with B880/882.

So I have lost all my playback and library history from 2015, since Build 30, when I last had a problem with the Roon Core database.
I have lost all of my Roon based Playlists

If you could work out a method of extracting these from a B831 based backup, that B880/882 identifies as corrupt, that would be a step forward.

I have started a fresh build in B880/B882, all Albums have now imported but the analysis still has days to go (using all 4 cores of the NUC).
I have over 700 titles to re-identify, which was down under 100, and the Album count is way off what it should be, when compared to the UPnP server scan of the same files (I had this matching exactly) so there are days of rework required on a library to recover back to where I had over the last 6-years.

Sadly the user experience of upgrading and maintaining Roon has not been a good one with any of the 1.8 releases made. This really does take away from the functionality delivered in these updates.

danny · December 20, 2021, 12:03am

There are so many bad conclusions being deduced here from bad information (some on Wikipedia, some from misunderstandings, and some due to the lack of knowledge about how Roon works and what Roon means by “corrupt db”).

First, let’s start with the bad Wikipedia info and the poor interpretation of this information:

I just followed the very first citation to see what nonsense they were talking about because this statement is just false, and lo and behold, I found this:

Repairing LevelDB

In the event of major hardware or filesystem problems, LevelDB can become corrupted. These failures are uncommon, but they could happen, as heavy loads can push I/O limits.

Well, I’ll be damned! When the hardware doesn’t do what it’s supposed to do, and you lack fault tolerance in the system the db sits on, things go bad! Tell me about anything on your current Mac or Windows system that isn’t subject to the same issue. Adobe Lightroom had this situation all the time. Then they started being pushy about backups and corruption. I still have a local database for Adobe Lightroom (too many photos for their cloud offering), and it reminds me to back up my database every time I close Lightroom. It also does an integrity check every time it does a backup. I’ve had 2 integrity checks fail on aging SSDs in the last ~10 years. Everyone’s just used to having things in the cloud where there is fault tolerance. Do you know what Chrome does when it comes across a corrupted database? It deletes it and you get logged out of everything, or it’s so bad Chrome won’t start, so you end up uninstalling and reinstalling, blowing it away. Now they save all your passwords and credit cards in the cloud to avoid this.

If you talked about databases 20 years ago, sure… but nowadays, there are more “noSQL” key-value stores out there than traditional RDBMS. Embedded databases usually stay away from being relational (besides the hugely popular sqlite). Additionally, leveldb is used in many places on your system already. One notable example is the IndexedDB implementation for Chrome.

While this is a 100% true statement, it’s also talking about something completely unrelated to high-performance local-disk databases. RAFT is a way to come to a consensus with distributed systems. The idea of using it locally on 1 system would be silly.

A better way to do this would be FEC/parity, but everyone expects hardware to be reasonable at not going bad and journaled filesystems to do their thing properly. If you really must have fault tolerance, you set that up.

The other fault-tolerant system is ZFS, and that’s a system where they’ve built parity into the system. They use a fast CRC (Fletcher4) to catch errors and then rebuild blocks from parity when errors are detected. They are doing basic drive fault tolerance, no different in concept to some of the RAID systems (5, 6).
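A rough sketch of that detect-then-rebuild idea, in Python. This is an illustration under assumptions, not ZFS's implementation: the checksum follows the Fletcher-4 pattern of four running 64-bit sums over 32-bit words (little-endian is assumed here), and the parity is plain single-block XOR, closer in spirit to RAID 5 than to any real on-disk layout. The function names are made up for the example.

```python
import struct

MASK64 = (1 << 64) - 1

def fletcher4(data: bytes) -> tuple:
    """Fletcher-4 style checksum: four running 64-bit sums over 32-bit words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)

def xor_parity(blocks) -> bytes:
    """XOR equal-sized blocks together; the result can rebuild one missing block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def read_with_repair(blocks, checksums, parity) -> list:
    """Return the blocks, rebuilding at most one that fails its checksum."""
    bad = [i for i, blk in enumerate(blocks) if fletcher4(blk) != checksums[i]]
    if not bad:
        return list(blocks)
    if len(bad) > 1:
        raise ValueError("more than one bad block; single parity cannot repair")
    survivors = [blk for i, blk in enumerate(blocks) if i != bad[0]]
    # XOR of the surviving blocks and the parity block yields the lost block.
    rebuilt = xor_parity(survivors + [parity])
    if fletcher4(rebuilt) != checksums[bad[0]]:
        raise ValueError("rebuilt block still fails its checksum")
    repaired = list(blocks)
    repaired[bad[0]] = rebuilt
    return repaired
```

The point of the split is that the checksum only tells you which block is wrong, and the redundancy stored elsewhere is what lets you fix it. A database file sitting on a single plain disk has the first part at best, never the second.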

No databases I know of, especially that meet our needs, would build fault tolerance into them. This is a problem for the lower layers.

While some RDBMS are more resilient, they are much slower and their resiliency usually stops at the indexes. If table data has been altered, most of these databases are not fault-tolerant. Very few of these databases build ECC into themselves for the actual data. Also, none of this has to do with the R in RDBMS. For example, the most popular RDBMS (sqlite) has no resiliency to hardware-driven corruption.

Not a bad idea! You’ve taught me something new. It’s so obvious in hindsight!

It does have checksumming to detect corruption, but as stated earlier, none of these databases actually have anti-corruption mechanisms that protect from hardware-driven corruption.

Those are installed, long-running databases meant for multiple clients and to be run in a managed environment; software running on your Mac/PC tends to use embedded databases. You will find leveldb listed there as it is a popular solution.

This I fully agree with, but the “leveldb” part is unfair. Running any database (or software) on these aging platforms puts your data at risk. There is nothing especially bad about leveldb here.

We see corruption rates in line with SSD failure rates. No surprise that some of these systems that have been running for 5-6 years are starting to fail. It’s a pretty common path for SSDs to go kaput. Thus the backup recommendation. Lightroom is super aggressive about bugging you about backups; maybe we should get more aggressive as well.

The thing I’m talking to the team about now is seeing if we can do a few other things:

automatically, whether you set it up or not, do our own mini forced backup. One copy, just in case.
back up some parts of your database in the cloud, where we can hold your data in places that already have error-corrected storage.

Oh yeah, about what Roon means when it says “corrupt db”: I wrote about that a few posts up:

crowlem · December 20, 2021, 12:22am

It would be cool to have a “Roon Recommends” config in the backup menu. Since these issues came up, a few people have been discussing backup strategies and what they have set up.

Presumably someone from the roon team would be able to give the best config to guard against data loss. Click the button, point to storage and done.

You could even have three options:

Maximum resilience (Requires most space, estimate xxGB)

Balance (Requires xxGB)

Lightweight (For storage limited situations only, requires xxGB)

Nathan_Wilkes · December 20, 2021, 12:44am

@danny , if Roon is planning to invest in Roon backup effectiveness, may I suggest that there are some aspects of backup visibility and viability that could be improved as part of the project. I wrote some ideas about this a few years ago, after discovering that “success” didn’t mean “usable”.

Most importantly though, and particularly relevant to this thread, is if a backup fails because of corruption, the user is informed in a visible and persistent way, not buried in settings.

Thanks.

danny · December 20, 2021, 12:45am

Oh it’ll be visible. If corruption is detected, you won’t be able to use roon until you reboot the Roon Core.

Chris_Davis2 · December 20, 2021, 1:00am

I assume that if someone is using a Nucleus or ROCK you are notifying them via the app or RoonOS web page that there are bad blocks on their SSD then?

I’m assuming this based on the fact that RoonOS blocks user access to the OS so we can’t run or access smartmontools/smartctl/smartdaemon to check the SSD errors/status ourselves, and especially given that Roon’s recommended platform is ROCK?

danny · December 20, 2021, 1:04am

No… but neither Windows nor macOS tells you… Linux does spit out stuff to the console by default, so you will see it on Roon OS, but you’d have to have an HDMI monitor/TV plugged in. If you want more effective hardware monitoring, you can run any other Linux distribution.

People running Roon OS shouldn’t care about bad blocks. Set up your backups and be done.

Chris_Davis2 · December 20, 2021, 1:10am

But you are saying that SSD failures/errors are causing the corruption but that roon users shouldn’t care about it?

People have been running backups but that didn’t prevent them from losing all their data?

PS windows/macOS and Linux can tell you it detects errors, it can even email you the errors using the daemon. You just have to set it up which is something you’ve blocked on RoonOS.

danny · December 20, 2021, 1:16am

Correct. One of Roon OS’s goals is to keep “computers” out of Roon. Just set the backup and use build 880. It’s not the solution for everything, just the majority that doesn’t want to futz around with computers and operating systems. Is it going to be as perfect as a system managed by a full time administrator? Of course not.

Yes, an error in the verification process. It would have been impossible for you to catch if you were monitoring for bad blocks as well since you have no way to check if the DB is corrupt without restoration.

Yes, but the entire point of Roon OS is that it can not do that. Why not just run Ubuntu if you require it?

Chris_Davis2 · December 20, 2021, 1:21am

So your answer is to wait for the corruption to occur and then replace the SSD.

The purpose of SMART is to warn you that a drive is beginning to fail so you can resolve before it becomes too serious and ECC can no longer be effective.

Jim_F · December 20, 2021, 1:23am

880 or 882?

danny · December 20, 2021, 1:27am

882 just has some fixes for some cases that were being mistaken for corruption, but 880 is the one that checks for corruption after things have gotten going and before backups.

The price you pay for running something trivial to manage.

Adding SMART support is not a bad idea, but I’m not convinced it’ll do anything. Have you ever seen it do anything predictive on an SSD?

We run SMART on Roon OS already, but the data is not reported anywhere via “notifications”. We’ve never seen any signs of bad on dead SSDs in Nucleus units that come back to us for repair.

ipeverywhere · December 20, 2021, 1:34am

Yes, but I bet you, if you offered a system to install 3, or 5, or 9, or whatever odd number of Cores for this sole purpose that someone would do it. Silly? Most people I know already think I’m silly with the infrastructure I have in my house. Would I do it? No because this is considerably more practical…

This and ECC memory are the real answers. Linus recently (Why don’t PCs use error correcting RAM? “Because Intel,” says Linus | Ars Technica) ripped Intel for not supporting ECC memory on their i processors. You could use ZFS on ROCK, although a proper ZFS requires a few disks and cache etc. It would / could be a very interesting Nucleus build when combined with a board / CPU that supported ECC memory. I know I’d be significantly more interested in Nucleus if it was built this way than in the current consumer-grade build it is today.

Chris_Davis2 · December 20, 2021, 1:34am

Yes we see it predicting drive failures within our VMware SSD cache and EMC SSD & HDD storage arrays.

Bill_Janssen · December 20, 2021, 2:11am

Sure, and it’s Google, after all, but anyone who’s had to struggle through Date’s An Introduction to Database Systems (or teach it) wouldn’t be quick to call a persistent key-value store a database. I take your meaning, though.

danny · December 20, 2021, 2:38am

An example I can think of from that era of computing history would be the famous Berkeley DB.

SKBubba · December 20, 2021, 2:41am

I actually still have that book on my bookshelf.

I once met Dr. E. F. Codd at a seminar on the concept of relational databases. Shook his hand after his presentation, and had a brief conversation about the possible applications for this new (at the time) technology. One of my few times being in the presence of genius.

Totally off topic, but your remark brought back memories.