The solution to DB corruption

Alex_Reusch · December 24, 2021, 6:12am

Seriously? Who uses the “contact me” for support, when there is a link named: “Help Center”? Especially if the only thing you can specify is: “I think something is broken”. So that’s how you should open a Support-Ticket?

ipeverywhere · December 24, 2021, 7:41am

You should update this with info about the 2 releases since the initial 880 which your article references. A lot of what you say is fair but Roon has identified a few issues since 880 which are software bugs not DB corruption and that isn’t reflected in your article. It’s not just about data integrity but software stability. Roon had a miss on the latter and knocked it down relatively quickly. The communication could have been better.

Alex_Reusch · December 24, 2021, 9:49am

The main problem is not fixed. Customers with data corruption are still left behind. Bad luck that Roon recommended their integrated backup tool, which is basically useless as it did not do a consistency check prior to 880. So there is no reason to update the post. Also, I think that just introducing a consistency check in the backup process is not good enough.

To be frank: Roon should have prevented data corruption in the first place. Why are they only now doing consistency checks as part of the backup process? Guys, this has been best practice for years, not just since yesterday! Why is there no consistency check outside of the backup tool? In my opinion, such checks should run regularly in the background and create snapshots, with an automated fallback to the last good snapshot if the system detects data corruption. That’s how a professional company that deals with valuable customer data would handle this. By the way, I can’t understand why a company would strategically choose performance over data consistency. Data consistency should always come first.
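The background check described above could look roughly like the sketch below. It is only an illustration and says nothing about how Roon actually stores or verifies its database: `check_and_snapshot`, the `snap-*.db` naming, and the `is_consistent` callback are all hypothetical names, and the real consistency test would have to come from the database engine itself.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def check_and_snapshot(db_path, snap_dir, is_consistent):
    """Run one check cycle on a single-file database (an assumption).

    Returns "snapshot" if the database passed the check and was copied,
    "restored" if it failed and the newest snapshot was copied back,
    or "no-good-snapshot" if it failed and nothing could be restored.
    `is_consistent` is a hypothetical callback supplied by the caller.
    """
    db_path, snap_dir = Path(db_path), Path(snap_dir)
    snap_dir.mkdir(parents=True, exist_ok=True)
    snapshots = sorted(snap_dir.glob("snap-*.db"))

    if is_consistent(db_path):
        # Only a database that passed the check ever becomes a snapshot.
        target = snap_dir / f"snap-{len(snapshots):06d}.db"
        shutil.copy2(db_path, target)
        if file_digest(target) != file_digest(db_path):
            target.unlink()
            raise IOError("snapshot copy does not match source")
        return "snapshot"

    if not snapshots:
        return "no-good-snapshot"

    # Fall back to the most recent snapshot that passed the check.
    shutil.copy2(snapshots[-1], db_path)
    return "restored"
```

A scheduler would call this periodically while the database is not being written to. The point of the design is that a corrupt database never overwrites a good snapshot.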

Bernd_Kurte · December 24, 2021, 10:06am

Obvious one. Regretfully nothing again. Not even an apology, just the ‘not a crisis’ statement, because not a significant number of customers(!) are affected.

simon_pepper · December 24, 2021, 10:32am

Frankly insulting to those who are affected, i.e. you aren’t important enough to warrant our attention for a problem we created but can’t be bothered to sort out for you, or even to help with tools, special builds, or recovery scripts for the existing backups!

It’s worse than the Windows ‘Blue Screen of Death’ because it is your individual data that is lost.

simon_pepper · December 24, 2021, 10:37am

And they have experienced corruption in the Roon Core database since 1.0 Build 30.
Here is an email from Mike Faas, dated 16 July 2015, from when I had a corrupt Roon Core database.

Knowing that, some level of protection, whether on the data itself, in flight, or streamed down into backups, or tools to scan for corruption, could have been built into the product to protect the user’s data.

Daiyama · December 24, 2021, 10:40am

At least Danny admitted that they did badly …

… and with “not a crisis” he obviously meant for Roon, but in my opinion they have underestimated the value of the data for their customers. For them, it is a crisis when the data is lost.

Question: should I be scared of build 884, because I have a lot of tracks with strange letters in them?

simon_pepper · December 24, 2021, 10:44am

Back up before updating, using the Backup process, but also copy the RoonServer folder on your Roon server.
You need to stop Roon first before touching this folder, though.
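As a rough sketch of that second step, the folder copy could be scripted as below. The function name, the paths, and the `is_running` callback are hypothetical; how you stop Roon and detect that it has stopped depends on your platform and is not shown.

```python
import shutil
import time
from pathlib import Path


def copy_server_folder(src, dest_root, is_running):
    """Copy `src` into a new timestamped folder under `dest_root`.

    `is_running` is a hypothetical callback that returns True while the
    server process is still alive; the copy is refused in that case,
    because copying a live database folder can capture a half-written state.
    """
    if is_running():
        raise RuntimeError("stop the server before copying its data folder")
    src = Path(src)
    dest = Path(dest_root) / f"{src.name}-{time.strftime('%Y%m%d-%H%M%S')}"
    shutil.copytree(src, dest)  # fails if dest already exists, never overwrites
    return dest
```

Keeping each copy in its own timestamped folder means an older good copy is never replaced by a newer, possibly damaged one.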

Michael_Harris · December 24, 2021, 11:08am

But build 884 is the fix for the problem.
If you are on 880 or 882 it looks like you should definitely install 884

Daiyama · December 24, 2021, 11:22am

Thanks for the advice, will do that.
I had no issue with 880/882; that’s why I was worried that the fix would scramble some of my tracks (as seems to have happened to others).

simon_pepper · December 24, 2021, 11:24am

Trying a restore of a B831 backup to B884 to see if anything has changed in terms of the application of the update on the database once restored.

I still have my new Roon database on my other NUC, which is now on 884 also.
In terms of stability this has been solid, just working on the structuring of my local library so it matches how Asset reads it from the metadata associated with the files, and re-identifying where Roon hasn’t been able to.

AceRimmer · December 24, 2021, 12:39pm

You totally missed the point, it was a rebuttal of your comment, nothing more.
Thanks and bye.

simon_pepper · December 24, 2021, 3:08pm

And no - restoring a B831 backup into a B884 build just bombs out with a system halt, so no change there.

Deckeda · December 24, 2021, 10:08pm

Been using Roon for about 2, maybe 3 weeks.

Nothing is corrupt. Please let me know the nature of whatever this is about because this post is written as an assumption.

Thanks.

Carl · December 24, 2021, 10:18pm

Hi,

There’s a huge amount of discussion on this topic, I’d recommend using the forum’s search function to investigate.

In the meantime, here’s a couple of topics to start your quest…

Mike-48 · December 25, 2021, 8:59am

I maintain it would be nicer to detect the beginning of the failure and tell the user “change your SSD now and restore from a backup.” I don’t think that’s too technical. Even cars now tell you when to change the oil, my home air cleaner and furnace tell me when to change their filters, and so on. Much nicer to know it’s time than to find one day that the thing doesn’t work.

Alex_Reusch · December 25, 2021, 9:42am

Guys, I have been working in the storage industry for 15+ years. Solid State Drives (SSDs) don’t behave like that, because SSDs do not contain erasable magnetic coatings. The claim that data is suddenly lost or corrupted in untouched regions is incorrect. That does not mean that an SSD cannot fail suddenly, but creeping data corruption in “untouched regions” is not a common phenomenon with this technology.

SSDs do use cells to store data, and these have a limited number of write cycles. While this is true, wear leveling is a technique that most SSD controllers use to increase the lifetime of the memory cells. The principle is simple: distribute writes evenly across all blocks of an SSD so they wear evenly, instead of writing too often to the same blocks. The lifetime of the cells differs for each NAND flash memory technology. Also, every NAND flash device uses Error Correcting Code (ECC) on the controller. The bad-block management of SSDs ensures that data is moved from faulty areas (cells) to functioning cells. The defective cell is then excluded from future data storage and a spare one takes its place.

In other words: a lot of safeguards (wear leveling, ECC, bad-block management, etc.) take care of your data on SSDs. While it is still possible that SSD devices fail or lose parts of data, it is very unlikely.
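The wear-leveling principle can be illustrated with a toy model. This is a deliberately simplified sketch, not how any real SSD controller works: real firmware also handles static data, garbage collection, and spare blocks, and the class name `WearLeveler` is made up for this example.

```python
import heapq


class WearLeveler:
    """Toy model: every write goes to the block with the fewest erase cycles."""

    def __init__(self, num_blocks):
        # Min-heap of (erase_count, block_id), so the least-worn block is on top.
        self._heap = [(0, block) for block in range(num_blocks)]
        heapq.heapify(self._heap)

    def write(self):
        """Pick the least-worn block, count one more cycle on it, return its id."""
        count, block = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (count + 1, block))
        return block

    def erase_counts(self):
        """Return all per-block cycle counts in ascending order."""
        return sorted(count for count, _ in self._heap)


leveler = WearLeveler(num_blocks=4)
for _ in range(10):
    leveler.write()
print(leveler.erase_counts())  # [2, 2, 3, 3]
```

In this model the most-worn and least-worn blocks never differ by more than one cycle, which is the "wear evenly" idea in its purest form.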

Joachim_Herbert · December 25, 2021, 10:56am

Unfortunately not the case: SSD failure rates not far behind HD

Alex_Reusch · December 25, 2021, 11:04am

Well, again… Not according to my personal experience. I had multiple customers with hundreds of petabytes of storage and many thousands of drives. The failure rate of “traditional” magnetic spinning drives was several times higher. And believe me, those customers hammered their systems with millions of IOs on a daily basis…

And one more time: an SSD still might fail, but creeping data corruption in “untouched regions” is not a common phenomenon of SSD technology. It was more of a “possible” (also not common) problem on magnetic spinning disks, which is why our systems used special formatting that included ECC (as it was not a common technology on disk controllers then). However, SSDs have mostly solved that issue in hardware already.

Joachim_Herbert · December 25, 2021, 12:14pm

I think you need to put things into perspective. The short version (from the story linked above):

"In the first table, Backblaze shows the lifetime SSD and HDD failure rates starting from 2013. You can see that HDDs have a significantly higher failure rate than SSDs, making us think that SSDs are indeed much more durable than HDDs, like we’ve been told all along.

However, there are a few problems with this, the main one being drive age. Backblaze only began installing SSDs in 2018. But the company has data pertaining to hard drive health going all the way back to 2013, which is skewing the results quite a bit.

After taking into account drive age and equalizing it between SSDs and HDDs, we can see that the results have changed significantly. SSDs aren’t that far behind hard drives in failure rate, with a 1.05% annualized failure rate compared to 1.38%."
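For anyone who wants to check such percentages, an annualized failure rate is commonly computed as failures per drive-year, expressed in percent. The sketch below uses that common definition; it is an assumption that the quoted article computed its figures exactly this way, and the numbers in the example are made up for illustration, not taken from any report.

```python
def annualized_failure_rate(failures, drive_days):
    """Return failures per drive-year as a percentage.

    `drive_days` is the total number of days all drives in the group were
    in service, so a young fleet and an old fleet are compared per unit of
    exposure rather than per drive.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365
    return failures / drive_years * 100


# Illustrative numbers only: 10 failures over 365,000 drive-days.
print(annualized_failure_rate(10, 365_000))  # 1.0 (percent)
```

Normalizing by drive-days removes the difference in how long each group has been in service, but it does not by itself account for failure rates rising as drives age, which is why the article additionally compares drives of similar age.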