In fairness Danny recognised this
Yes, I have seen this.
leveldb, from my understanding of its history, was designed with no data integrity checks of its own. It expects that to be handled externally to the DB. This is the natural way of things in data centers but not so much on home / residential set-ups.
There are databases where data integrity is the predominant feature, but they are difficult to set up and maintain. (Something like this is used by a few of these DBs: https://raft.github.io/)
Then there are sort of the in-betweens where the inherent nature of the writes can prevent corruption, that is the database itself won’t cause issues for the software reading the DB, but would result in data loss or put another way… Roon wouldn’t crash it would just look like you never did the thing you know you did. The drawback here is, again, speed and complexity and you still have data loss.
leveldb is a fine choice of DB as long as you’ve got mechanisms in place to provide the missing data integrity gaps inherent in its design. This is where Roon can keep working to make things more robust if they choose to continue to use leveldb. Making sure a backup is sane is a very good first step and thank you.
On a personal note: My Roon install keeps the DB on a mirrored set of drives formatted ZFS. ZFS is a monster of overhead for a filesystem with the primary goal being data integrity. The Core also utilizes ECC memory. A study of ZFS https://www.usenix.org/legacy/events/fast10/tech/full_papers/zhang.pdf
Straight to the point from Danny
Correct. Backing up a corrupt database is wrong and we should never have done it, and we don’t do it anymore.
I’ll add one more thing while I’m thinking about this. Not all filesystem formats are created equal and your OS of choice matters here. There are even changes within NTFS between older and newer versions of Windows. Apple released a new default filesystem, called Apple File System (APFS), with High Sierra, which provides benefits for SSDs (among other things). Those running on really old platforms are not seeing any benefits from these advancements. Running something like levelDB on old, aging platforms puts your data at risk. Get your stuff upgraded.
Can you fix my DB then if you’re bored, Danny?!
jk-lah, get well soon.
Please check out the release notes for build 882. If that build didn’t fix your problem, then you have a corrupt database due to either a hardware failure changing the bits on the storage, or a critical file being deleted (for example, by your operating system due to improper shutdown).
A restore from backup is the only solution, assuming you have one and it wasn’t corrupt as well. From build 880 onward, we always check the database for corruption before backing it up, to avoid the situation where you could have corrupt backups.
I was joking Danny, sorry.
jk-lah means I’m joking, apologies for the confusion.
I have 2 cores, one I updated for my mom flawlessly but mine I’m too scared to do the update.
I don’t think the corruption is due to HW failure to be honest. I believe it’s because of leveldb which is known for having a lot of corruption bugs.
Hope you’re not letting the kid win!!
When we state a database is corrupt, it does not always mean the level db is corrupt.
It can mean:
1) the level db has reported corruption. LevelDB is notorious for corruption, but only in 2 situations:
1a) the bits of the level db indexes have been manipulated by underlying hardware, or:
1b) parts of the level db are missing. This can sometimes happen due to antivirus software, or an operating system that has been shut down improperly and decides to kill the file that was not properly closed out.
2) the leveldb read fine, but when we read the records, the data inside the record could not be parsed as valid. This always points to hardware errors. This is the more common of the two cases, and it makes sense since the content of the db is much larger than the overhead of the DB structure itself.
We also check both of these types of corruptions statistically, and as a %, they are constant with the growth of our userbase and match the failure rates we see of SSDs in the wild. The SSD failure rates are based on the tens of thousands of Nucleus units we have sold as well as industry numbers out there (which are pretty closely matching).
I know it’s going around these forums that level db is easily corruptible, but that’s just people trying to make sense of something they don’t have the full picture of.
Our advice is to have a backup of your Roon database.
Unfortunately, we had a shameful fault here, in that the backup functionality backed up the database files and did not do a full analysis of the db’s content before backing up. Therefore, it would back up any databases that were “corrupt”. This was resolved in build 880, which catches the corruption before backing up.
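As a rough illustration of "check before you back up", here is a minimal Python sketch. It is not Roon's code and does not use the LevelDB API; the record layout (a key mapped to a CRC32 and a JSON payload) and the names `find_corrupt_records` and `backup` are assumptions made up for the example. It mirrors the two cases above: stored bits that no longer match their checksum, and records that read fine but do not parse.

```python
import json
import zlib


def find_corrupt_records(records):
    """Return keys whose checksum or payload fails validation.

    `records` maps key -> (crc32, payload_bytes). This layout is hypothetical.
    """
    bad = []
    for key, (crc, payload) in records.items():
        if zlib.crc32(payload) != crc:
            bad.append(key)      # bits changed since the record was written
            continue
        try:
            json.loads(payload)  # stands in for "parse the record"
        except ValueError:
            bad.append(key)      # reads fine, but the content is not valid
    return bad


def backup(records, destination):
    """Copy records into `destination` only if every record validates."""
    bad = find_corrupt_records(records)
    if bad:
        raise RuntimeError(f"refusing to back up: {len(bad)} corrupt record(s)")
    destination.update(records)
    return len(records)
```

The point of the design is that a failed check stops the backup outright, so an older good backup is never rotated out in favour of a damaged one.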
So, if DB corruption is detected on a ‘working’ setup, it just refuses to backup? Is there any repair mechanism, or is it then a Schrödinger’s Roon? Only solution is to restore from a good backup?
Correct. Backing up a corrupt database is wrong and we should never have done it, and we don’t do it anymore.
No. There are some leveldb repair tools out there, but they basically are only fixing index errors, and in our experience, that’s not usually the case in our world.
In the 2 cases noted above, non-reversible damage has occurred. If the bits on the disk were modified or files were deleted, there is no recovery possible. That’s why we suggest backups and have built backup functionality into the core system. Unfortunately, pre-880 backups may also have saved corrupt databases because of that unfortunate lack of integrity check pre-backup.
So, people were running for years with corrupt databases but no noticeable problems because the corrupt bits weren’t touched in a while.
In that case, could you write (or maybe it already exists) a utility to dump the database to xml or whatever, ignoring the errors and any related pieces, and then reimport the remaining intact parts, keeping 99% or whatever parts survive?
(Yeah, I know, everybody’s a software engineer on the internet.)
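For what it's worth, the dump-and-reimport idea could look something like the Python sketch below. Everything here is an assumption for illustration: records are modelled as a plain key-to-bytes mapping, JSON stands in for whatever the real record format is, and `salvage` is a made-up name. Whether Roon's actual schema would tolerate this is unknown.

```python
import json


def salvage(raw_records):
    """Split raw records into parseable ones and the keys that were lost.

    `raw_records` maps key -> bytes; JSON stands in for the real record format.
    """
    intact, lost = {}, []
    for key, payload in raw_records.items():
        try:
            intact[key] = json.loads(payload)
        except ValueError:
            lost.append(key)  # skip the damaged record and keep going
    return intact, lost
```

The list of lost keys matters as much as the surviving data: any intact record that refers to a lost one (the "related pieces") would still have to be cleaned up before reimporting.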
Hi, I have tried every Roon backup I have and no joy with restoring into a B880/882 build.
I take a backup every 4 days and maintain a depth of 10 on a NAS (so over a month of backups), then back those up to another NAS, where there are nearly a year’s worth.
I have tried with a backup ROCK server, which had a Roon Core image from over a year ago, and no joy with B880/882.
So I have lost all my playback and library history from 2015, since Build 30, when I last had a problem with the Roon Core database.
I have lost all of my Roon based Playlists
If you could work out a method of extracting these from a B831 based backup, that B880/882 identifies as corrupt, that would be a step forward.
I have started a fresh build in B880/B882, all Albums have now imported but the analysis still has days to go (using all 4 cores of the NUC).
I have over 700 titles to re-identify, where before it was down under 100, and the Album count is way off what it should be when compared to the UPnP server scan of the same files (I had this matching exactly), so there are days of rework required on the library to get back to where I had it over the last 6 years.
Sadly the user experience of upgrading and maintaining Roon has not been a good one with any of the 1.8 releases made. This really does take away from the functionality delivered in these updates.
There are so many bad conclusions being deduced here from bad information (some on Wikipedia, some from misunderstandings, and some due to the lack of knowledge about how Roon works and what Roon means by “corrupt db”).
First, let’s start with the bad Wikipedia info and the poor interpretation of this information:
I just followed the very first citation to see what nonsense they were talking about because this statement is just false, and lo and behold, I found this:
In the event of major hardware or filesystem problems, LevelDB can become corrupted. These failures are uncommon, but they could happen, as heavy loads can push I/O limits.
Well, I’ll be damned! When the hardware doesn’t do what it’s supposed to do, and you lack fault tolerance in the system the db sits on, things go bad! Tell me about anything on your current Mac or Windows system that isn’t subject to the same issue. Adobe Lightroom had this situation all the time. Then they started being pushy about backups and corruption. I still have a local database for Adobe Lightroom (too many photos for their cloud offering), and it reminds me to back up my database every time I close Lightroom. It also does an integrity check every time it does a backup. I’ve had 2 integrity checks fail on aging SSDs in the last ~10 years. Everyone’s just used to having things in the cloud where there is fault tolerance. Do you know what Chrome does when it comes across a corrupted database? It deletes it and you get logged out of everything, or it’s so bad Chrome won’t start, so you end up uninstalling and reinstalling, blowing it away. Now they save all your passwords and credit cards in the Cloud to avoid this.
If you talked about databases 20 years ago, sure… but nowadays, there are more “noSQL” key-value stores out there than traditional RDBMS. Embedded databases usually stay away from being relational (besides the hugely popular sqlite). Additionally, leveldb is used in many places on your system already. One notable example is the IndexedDB implementation for Chrome.
While this is a 100% true statement, it’s also talking about something completely unrelated to high-performance local-disk databases. RAFT is a way to come to a consensus with distributed systems. The idea of using it locally on 1 system would be silly.
A better way to do this would be FEC/parity, but everyone expects hardware to be reasonable at not going bad and journaled filesystems to do their thing properly. If you really must have fault tolerance, you set that up.
The other fault-tolerant system is ZFS, and that’s a system where they’ve built parity into the system. They use a fast checksum (Fletcher4) to catch errors and then rebuild blocks from parity when errors are detected. They are doing basic drive fault tolerance, no different in concept to some of the RAID systems (5, 6).
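To make the "checksum to detect, parity to rebuild" idea concrete, here is a small Python sketch. It is only an illustration of the concept, not ZFS's implementation: the checksum is Fletcher-4-style (four running sums over 32-bit little-endian words, wrapping at 64 bits), the zero-padding of odd-length blocks is my own assumption, and the single XOR parity block is the simplest RAID-5-like scheme. All function names are made up for the example.

```python
import struct


def fletcher4(data):
    """Fletcher-4-style checksum: four running sums over 32-bit LE words."""
    if len(data) % 4:
        data += b"\x00" * (4 - len(data) % 4)  # pad to whole words (assumption)
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d


def xor_blocks(blocks):
    """XOR equal-length blocks together; used for both parity and rebuild."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def read_with_repair(blocks, checksums, parity):
    """Return the blocks, rebuilding at most one bad block from parity."""
    bad = [i for i, blk in enumerate(blocks) if fletcher4(blk) != checksums[i]]
    if len(bad) > 1:
        raise IOError("more damage than single parity can repair")
    repaired = list(blocks)
    if bad:
        others = [blk for i, blk in enumerate(blocks) if i != bad[0]]
        rebuilt = xor_blocks(others + [parity])
        if fletcher4(rebuilt) != checksums[bad[0]]:
            raise IOError("rebuilt block still fails its checksum")
        repaired[bad[0]] = rebuilt
    return repaired
```

Note the division of labour: the checksum only tells you which block is wrong, and the redundancy is what lets you get the data back. A checksum on its own detects corruption but cannot undo it.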
No databases I know of, especially that meet our needs, would build fault tolerance into them. This is a problem for the lower layers.
While some RDBMS are more resilient, they are much slower and their resiliency usually stops at the indexes. If table data has been altered, most of these databases are not fault-tolerant. Very few of these databases build ECC into themselves for the actual data. Also, none of this has to do with the R in RDBMS. For example, the most popular RDBMS (sqlite) has no resiliency to hardware-driven corruption.
Not a bad idea! You’ve taught me something new. It’s so obvious in hindsight!
It does have checksumming to detect corruption, but as stated earlier, none of these databases actually have anti-corruption mechanisms that protect from hardware-driven corruption.
Those are installed, long-running databases meant for multiple clients and to be run in a managed environment. Software running on your Mac/PC tends to use embedded databases. You will find leveldb listed there as it is a popular solution.
This I fully agree with, but the “leveldb” part is unfair. Running any database (or software) on these aging platforms puts your data at risk. There is nothing especially bad about leveldb here.
We see corruption rates in line with SSD failure rates. No surprise that some of these systems that have been running for 5-6 years are starting to fail. It’s a pretty common path for SSDs to go kaput. Thus the backup recommendation. Lightroom is super aggressive about bugging you about backups; maybe we should get more aggressive as well.
The thing I’m talking to the team about now is seeing if we can do a few other things:
Oh yeah, about what Roon means when it says “corrupt db”, I wrote about that a few posts up:
It would be cool to have a “Roon Recommends” config in the backup menu. Since these issues, a few people have been discussing backup strategies and what they have set up.
Presumably someone from the roon team would be able to give the best config to guard against data loss. Click the button, point to storage and done.
You could even have three options:
Maximum resilience (Requires most space, estimate xxGB)
Balance (Requires xxGB)
Lightweight (For storage limited situations only, requires xxGB)
@danny , if Roon is planning to invest in Roon backup effectiveness, may I suggest that there are some aspects of backup visibility and viability that could be improved as part of the project. I wrote some ideas about this a few years ago, after discovering that “success” didn’t mean “usable”.
Most importantly though, and particularly relevant to this thread, is if a backup fails because of corruption, the user is informed in a visible and persistent way, not buried in settings.
Oh it’ll be visible. If corruption is detected, you won’t be able to use roon until you reboot the Roon Core.
I assume that if someone is using a Nucleus or ROCK you are notifying them via the app or RoonOS web page that there are bad blocks on their SSD then?
I’m assuming this based on the fact that RoonOS blocks user access to the OS so we can’t run or access smartmontools/smartctl/smartdaemon to check the SSD errors/status ourselves, and especially given that Roon’s recommended platform is ROCK?
No… but neither Windows nor macOS tells you either… Linux does spit out stuff to the console by default, so you will see it on Roon OS, but you’d have to have an HDMI monitor/TV plugged in. If you want more effective hardware monitoring, you can run any other Linux distribution.
People running Roon OS shouldn’t care about bad blocks. Set up your backups and be done.
But you are saying that SSD failures/errors are causing the corruption but that roon users shouldn’t care about it?
People have been running backups but that didn’t prevent them from losing all their data?
PS: Windows, macOS and Linux can all tell you when they detect errors, and can even email you the errors using the daemon. You just have to set it up, which is something you’ve blocked on RoonOS.