The solution to DB corruption

Looks like that is one of the fixes in 880.

If so, it’s a start. Repair tools?

Apparently the server just stops if it detects corruption. You need to restore from a backup.

Well, if the db really is checked before backing up, that should at least limit the damage to a few days’ work. Not ideal, but surely better than what some users are reporting.


There is this: Smithfarm - the Brain: How to repair a leveldb database

Wonder what it actually does?


Good question. I suppose it is like CHKDSK - making the db correct in structure, though possibly with data loss. What else could it do?

Reliability in modern software is far better than I remember from 20, or even 5, years ago. I can’t remember the last time I experienced a problem. I hope that Roon Labs will find a way to bring their database up to modern standards.

I looked at the code (https://files.pythonhosted.org/packages/48/6e/9da3c29c0cbeb5871241ef154f11196867712ee3adc4077a94fa5f7d9dbd/leveldb-0.201.tar.gz, the file leveldb-0.201/leveldb/db/repair.cc, if you’re curious). It rebuilds the DB from the log files.
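To make "rebuilds the DB from the log files" a bit more concrete, here is a toy sketch of the general idea: replay an append-only log and keep only the records whose checksum still matches. The record layout and the names `append_record` and `rebuild_from_log` are made up for illustration; this is not LevelDB's actual log format or the logic in repair.cc.

```python
import struct
import zlib

# Hypothetical framing: CRC32 of the payload, then the payload length.
HEADER = struct.Struct("<II")


def append_record(log: bytearray, key: bytes, value: bytes) -> None:
    """Append one key/value record, framed with a CRC32 and a length."""
    payload = struct.pack("<I", len(key)) + key + value
    log += HEADER.pack(zlib.crc32(payload), len(payload)) + payload


def rebuild_from_log(log: bytes) -> tuple:
    """Replay the log and keep every record whose CRC still matches.

    Returns (rebuilt_table, dropped_count). This toy assumes the length
    fields are intact; only payload damage is handled.
    """
    table, dropped, pos = {}, 0, 0
    while pos + HEADER.size <= len(log):
        crc, length = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        payload = bytes(log[start : start + length])
        pos = start + length
        if len(payload) < length or zlib.crc32(payload) != crc:
            dropped += 1  # damaged record: skip it and accept the data loss
            continue
        (klen,) = struct.unpack_from("<I", payload)
        table[payload[4 : 4 + klen]] = payload[4 + klen :]
    return table, dropped
```

The result is structurally valid again, but anything that lived in a damaged record is simply gone, which matches the CHKDSK comparison above.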


Interesting, but see the suggestion? Use a better backup system. Is he being sarcastic, or does he just not care? I’m puzzled by this attitude; it defeats the whole purpose of having backups.

If Google Chrome & AutoCAD use such a DB, don’t they have checksum mechanisms? I mean, we are not talking about a toy for a few thousand audiophiles but enterprise-class applications serving millions of users.


https://community.roonlabs.com/t/i-think-roon-should-not-be-crowd-sourcing-it-s-metadata-for-content-enrichment/182415/44?u=gigatoaster

I got an answer from Danny.

So saying leveldb is prone to corruption is false in the case of Roon?

I would say it would be better to stop attacking a straw man if ever there was one. How the corruption was caused is largely irrelevant. The relevant bit I see is why Roon would let corrupt database backups happen for such a long time. Corrupt backups defeat their purpose and are not apt to help recover from an unusable database. How that database became unusable is another discussion altogether.


This thread is very useful; Danny weighs in quite a bit in this message and later on.

https://community.roonlabs.com/t/i-think-roon-should-not-be-crowd-sourcing-it-s-metadata-for-content-enrichment/182415/44?u=johnny_ooooops

In fairness Danny recognised this

.sjb

Yes, I have seen this.

leveldb, from my understanding of its history, was designed with no data integrity checks of its own. It expects that to be handled outside the DB. This is the natural way of things in data centers, but not so much in home / residential setups.

There are databases where data integrity is the predominant feature, but they are difficult to set up and maintain. (Something like this is used by a few of these DBs: https://raft.github.io/)

Then there are the in-betweens, where the inherent nature of the writes can prevent corruption. That is, the database itself won’t cause issues for the software reading the DB, but you can still lose data. Put another way: Roon wouldn’t crash, it would just look like you never did the thing you know you did. The drawback here is, again, speed and complexity, and you still have data loss.
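One common way to get that "old state or new state, never a broken mix" behaviour is to write to a temporary file and then rename it over the original. A minimal sketch, assuming a single file holds the state; `atomic_write` is a hypothetical helper name and nothing here reflects how Roon or leveldb actually write data.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` via a temp file and a rename.

    If the process dies before the rename, the old file is untouched:
    the latest change is lost, but nothing is left half-written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to storage before the swap
        os.replace(tmp_path, path)  # rename over the target (atomic on POSIX)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up the temp file on failure
        raise
```

The extra `fsync` and rename on every write is exactly the speed cost mentioned above.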

leveldb is a fine choice of DB as long as you’ve got mechanisms in place to fill the data integrity gaps inherent in its design. This is where Roon can keep working to make things more robust if they choose to continue to use leveldb. Making sure a backup is sane is a very good first step, and thank you.

On a personal note: My Roon install keeps the DB on a mirrored set of drives formatted with ZFS. ZFS carries a monster of overhead for a filesystem, with the primary goal being data integrity. The Core also utilizes ECC memory. A study of ZFS: https://www.usenix.org/legacy/events/fast10/tech/full_papers/zhang.pdf


Straight to the point from Danny

Correct. Backing up a corrupt database is wrong and we should never have done it, and we don’t do it anymore.


I’ll add one more thing while I’m thinking about this. Not all filesystem formats are created equal, and your OS of choice matters here. There are even changes within NTFS between older and newer versions of Windows. With High Sierra, Apple released a new default filesystem, Apple File System (APFS), which provides benefits for SSDs (among other things). Those running on really old platforms are not seeing any benefits from these advancements. Running something like levelDB on old, aging platforms puts your data at risk. Get your stuff upgraded :slight_smile:


Can you fix my DB then if you’re bored, Danny?!

jk-lah, get well soon. :kissing_heart:

Please check out the release notes for build 882. If that build didn’t fix your problem, then you have a corrupt database due to either a hardware failure changing the bits on the storage or a critical file being deleted (for example, by your operating system due to improper shutdown).

A restore from backup is the only solution, assuming you have one and it wasn’t corrupt as well. In build 880 onward, we always check the database for corruption before backing it up, to avoid the situation where you could have corrupt backups.


I was joking Danny, sorry.

jk-lah means I’m joking, apologies for the confusion.

I have 2 cores: I updated one for my mom flawlessly, but I’m too scared to update mine.

I don’t think the corruption is due to HW failure to be honest. I believe it’s because of leveldb which is known for having a lot of corruption bugs.

Hope you’re not letting the kid win!!

When we state a database is corrupt, it does not always mean the level db is corrupt.

It can mean:

  1. the level db has reported corruption. LevelDB is notorious for corruption, but only in 2 situations:
    1a) the bits of the level db indexes have been manipulated by underlying hardware, or:
    1b) parts of the level db are missing. This can sometimes happen due to antivirus software, or an operating system that has been shut down improperly and decides to kill the file that was not properly closed out.

  2. the leveldb read fine, but when we read the records, the data inside the record could not be parsed as valid. This always points to hardware errors. This is the more common of the two cases, and it makes sense since the content of the db is much larger than the overhead of the DB structure itself.
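As a rough illustration of the cases in the list above, here is a toy classifier. It assumes a made-up storage format (a CRC32 in front of a JSON record) and hypothetical names `frame` and `classify`; Roon's real record format and checks are not described in this thread.

```python
import json
import zlib


def frame(payload: bytes) -> bytes:
    """Toy store-level framing: 4-byte CRC32 followed by the payload."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload


def classify(blob):
    """Return 'missing', 'store-corrupt', 'record-corrupt', or 'ok'."""
    if blob is None:
        return "missing"         # like case 1b: part of the store is gone
    crc, payload = int.from_bytes(blob[:4], "little"), blob[4:]
    if zlib.crc32(payload) != crc:
        return "store-corrupt"   # like case 1a: bits changed under the store
    try:
        json.loads(payload)
    except ValueError:
        return "record-corrupt"  # like case 2: store reads fine, record does not parse
    return "ok"
```

For example, `classify(frame(b'{"title": "Kind of'))` passes the store-level check but fails to parse, so it lands in the second category.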

We also track both of these types of corruption statistically, and as a percentage they stay constant as our userbase grows and match the failure rates we see of SSDs in the wild. The SSD failure rates are based on the tens of thousands of Nucleus units we have sold as well as industry numbers out there (which match pretty closely).

I know it’s going around these forums that level db is easily corruptible, but that’s just people trying to make sense of something they don’t know the full picture about.

Our advice is to have a backup of your Roon database.

Unfortunately, we had a shameful fault here, in that the backup functionality backed up the database files and did not do a full analysis of the db’s content before backing up. Therefore, it would back up any databases that were “corrupt”. This was resolved in build 880, in that it catches the corruption before backing up.
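The "check before backing up" idea can be sketched in a few lines. This is only an illustration under stated assumptions: `backup_if_healthy`, `CorruptDatabaseError`, and the `verify` callback are hypothetical names, and the actual analysis Roon runs in build 880 is not shown in this thread.

```python
import shutil
from pathlib import Path


class CorruptDatabaseError(Exception):
    """Raised when the verification step rejects the database."""


def backup_if_healthy(db_dir, backup_dir, verify) -> Path:
    """Copy db_dir to backup_dir only if verify(db_dir) returns True.

    `verify` stands in for whatever full content check the application
    runs; backup_dir must not exist yet.
    """
    db_dir, backup_dir = Path(db_dir), Path(backup_dir)
    if not verify(db_dir):
        # Refuse to create a backup from known-bad data.
        raise CorruptDatabaseError(f"{db_dir} failed verification; backup skipped")
    shutil.copytree(db_dir, backup_dir)
    return backup_dir
```

The point of raising instead of silently skipping is that the user finds out while older, good backups still exist.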
