Warning, Samsung 870 EVO 4TB SSD prone to failure

I think the discussion hasn’t really taken into account the two main aspects of digital storage.

tl;dr: data corruption is an issue that can occur on error-free and 100% healthy storage media.

Totally simplified, make sure to distinguish between:

  1. The content (the user data, in this case your files of music in some format like FLAC perhaps)

  2. The physical medium
    a. the permanent storage (be it HDD, SSD, NVME etc.)
    b. the active storage (RAM - in the form of DIMMs or SODIMMs etc.)

1 and 2 are completely different things.

It’s important to distinguish the two, because the OP has the problem that his content (the files) has become unreadable, and he looks to his nice top-tier physical storage medium, the 4TB Samsung SSD, for the cause, but the SSD tools (e.g. Samsung Magician) don’t find any errors on the medium.

But your content can go bad (get corrupted) without any fault or defect in your storage medium.

Unless you have a complete chain of error correction (ECC RAM in your server / NUC / PC or whatever, PLUS an error-correcting filesystem on your physical storage medium), your content / data can get corrupted and destroyed at any time.

And data does get corrupted, all the time. The more data you have, the more likely it is, statistically, that some of it will get corrupted. It’s simple math.

You simply MUST assume that, statistically, your data will in fact get corrupted, through bit-flip errors and cosmic radiation.
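
To make that concrete, here is a minimal sketch of how you could checksum your music folder while it is known good and later detect silent corruption, even on a drive that reports itself as perfectly healthy (the paths and manifest name are just examples, adapt them to your setup):

```python
import hashlib
import json
from pathlib import Path

MUSIC_DIR = Path("/mnt/music")     # example path, point this at your library
MANIFEST = Path("checksums.json")  # where the known-good hashes are kept

def sha256(path: Path) -> str:
    """Hash a file in chunks so large FLACs don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest() -> None:
    """Record a hash for every file while the data is known to be good."""
    hashes = {str(p): sha256(p) for p in MUSIC_DIR.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(hashes, indent=2))

def verify_manifest() -> None:
    """Re-hash everything later; any mismatch is silent corruption."""
    hashes = json.loads(MANIFEST.read_text())
    for name, expected in hashes.items():
        p = Path(name)
        if not p.exists():
            print(f"MISSING   {name}")
        elif sha256(p) != expected:
            print(f"CORRUPTED {name}")

if __name__ == "__main__":
    build_manifest()      # run once after ripping and tagging
    # verify_manifest()   # run periodically, e.g. monthly
```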

Additionally, enterprise-class storage documents the likelihood of an error occurring through the media itself, without any defect in the media: how many bits can be transferred to or from the medium before an unrecoverable error is expected. It also specifies how the medium deals with an error when one occurs (TLER). Notice that the error cannot be corrected by the medium, even if it detects that an error occurred.

This is why enterprise storage systems (the whole storage setup) use ECC memory and error-correcting, resilient filesystems → for example ZFS.

And no: RAID is not error correcting. RAID (a mirror being the simplest implementation) simply protects you from one or more disks failing. It will not protect you from data corruption that occurs in your RAM or in your filesystem and then gets stored with total accuracy, because the medium doesn’t know anything about the consistency of your >content< (data); it simply needs to know whether it stored whatever it was fed accurately, whether that was an unreadable file or not.

The takeaway: if you value your content, your data, your precious music that you spent weeks and months ripping and tagging and storing [I’m speaking about myself!], then your Roon setup will need to run the Core on a server / NUC with ECC DIMMs (error correction) and with something like ZFS as the filesystem (error correction), with two or more physical drives (the RAID part of the equation: drive-failure resilience).
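
For illustration only, assuming you already have a ZFS mirror pool (here just called “tank” as an example) and the standard zpool command available, a periodic scrub is what actually catches and repairs silent corruption:

```python
import subprocess

POOL = "tank"  # example pool name, use your own

# Kick off a scrub: ZFS re-reads every block and checks it against its
# stored checksum, repairing from the mirror copy if a mismatch is found.
subprocess.run(["zpool", "scrub", POOL], check=True)

# Show pool health and any checksum errors found so far.
print(subprocess.run(["zpool", "status", POOL],
                     capture_output=True, text=True, check=True).stdout)
```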

3 Likes

My friend, who is using an Intel NUC7i7 ROCK with the same SSD, asked me today if I could check his disk as well. The manufacture date is 2021.02. The first thing I did was plug the SSD into my Windows 11 PC and start the Samsung Magician application. The first thing I noticed was a high Uncorrectable Error Count and ECC Error Rate, see the picture below.

Then, I started Ubuntu Linux to check the disk further:
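
(In case anyone wants to do the same check from Linux, here is a rough sketch; it assumes smartmontools is installed and that the SSD shows up as /dev/sda, and the attribute names may differ slightly between drive models and smartctl versions.)

```python
import subprocess

DEVICE = "/dev/sda"  # example device node, adjust for your system

# Dump the full SMART attribute table with smartmontools.
result = subprocess.run(["smartctl", "-A", DEVICE],
                        capture_output=True, text=True, check=True)

# Print only the attributes discussed above; exact names can vary by model.
for line in result.stdout.splitlines():
    if ("Uncorrectable_Error_Cnt" in line
            or "ECC_Error_Rate" in line
            or "Reallocated_Sector_Ct" in line):
        print(line)
```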

He has a backup, but I said he can have my new, unused Samsung 870 EVO 4TB drive while he RMAs his drive. My new drive was manufactured in 2021.11, but I wanted to check and update the firmware on the SSDs, so I connected the old and new Samsung 4TB SSDs to the Windows 11 PC and used Samsung Magician.

The firmware version on both the old and the new SSD was SVT01B6Q, and there was an update available for both drives. The newest firmware version is SVT02B6Q. Both disks were updated:

The new 4TB disk was formatted as EXT4 and named “ROONSTORAGE”, same as the old one.
I’m currently copying the files from the old to the new SSD drive in Ubuntu Linux at 440MB/s.
I also noticed that the bad sector count was 657 in Ubuntu before I did anything; now it’s 738. This probably means that if the disk is already bad, a firmware upgrade won’t help in this case…

Will let you know if there are any bad sectors on the new SSD after the copy, it’s around 3.5TB to write to the new SSD :crazy_face:
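
(If you want to be sure the 3.5TB actually arrived intact, one way is to hash source and destination after the copy; a minimal sketch, with made-up mount points, would be:)

```python
import hashlib
from pathlib import Path

SRC = Path("/mnt/old_roonstorage")  # example mount point of the old SSD
DST = Path("/mnt/new_roonstorage")  # example mount point of the new SSD

def sha256(path: Path) -> str:
    """Hash a file in 1 MB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

mismatches = 0
for src_file in SRC.rglob("*"):
    if not src_file.is_file():
        continue
    dst_file = DST / src_file.relative_to(SRC)
    if not dst_file.exists() or sha256(src_file) != sha256(dst_file):
        print(f"MISMATCH: {src_file}")
        mismatches += 1

print(f"{mismatches} file(s) differ between old and new SSD")
```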

EDIT:
Before I started the transfer to the new disk, I tried to copy a few folders from the ROCK via Windows SMB and got copy errors on a few files. I tried to play them in Roon; they would play for maybe 20-30 seconds, then the song stopped.

However, from your Wikipedia link on cosmic radiation:

All in all, the error rates as observed by a CERN study on silent corruption are far higher than one in every 10^16 bits

10^16 bits, however, is 1136.87 tebibytes (about 1.25 PB). So based on this, it can happen and probably will happen to most people for one file once in their life or something. (Personally, I have used computers privately and professionally for 35 years and, without rigorous checksumming, have never encountered a file where I noticed corruption.)
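
(Just to make the arithmetic explicit, taking the quoted 1-in-10^16 figure at face value and a 3.5TB library as an example:)

```python
# Back-of-the-envelope: expected number of silently corrupted bits when
# reading a whole library once, at an assumed rate of 1 error per 1e16 bits.
error_rate = 1 / 1e16        # errors per bit transferred (assumed, per the quote)
library_bytes = 3.5e12       # ~3.5 TB music library (example)
library_bits = library_bytes * 8

expected_errors_per_pass = library_bits * error_rate
print(f"{expected_errors_per_pass:.4f} expected bit errors per full read")
# ~0.0028 per pass, i.e. roughly one expected error per ~360 full read
# passes at that rate - which is why most people never notice it,
# but it is not zero.
```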

Oh, I have had corrupted image files - Photoshop files of scanned negatives, several hundred MB to 1GB in size each - time and again “going bad”, whereby a filesystem scan (NTFS) reported absolutely no errors, but bit flips resulted in rows of the image with reversed colors and other image errors.

In large image files any corruption is easily detectable - it’s immediately visible, literally. The file still opens, but it has a problem. If the data corruption were in a video file, or in a binary (executable), then maybe you’d not be able to access the file at all.

The only thing that stopped that for me was switching to ECC DIMMs and an error-correcting filesystem (on my NAS).

The likelihood of data getting corrupted, the extent to which it gets corrupted, and whether you care or not - all of that depends on your individual circumstances.

For me - my music collection and my photo collection are precious, so… spending a few more bucks to take that line of failure out of the picture was a no-brainer.

Our experiences differ, and yours seems to be way over the likelihood stated in the articles, don’t know why :man_shrugging:

Not at all, it’s a totally normal occurrence with modern hardware of such high capacity.
Just look for articles about bit flips, bit rot and soft errors.

To quote that first article: “Research has shown that a computer with 4GB of memory has a 96% chance of having a random “bit flip” every three days.”

Folks, that’s what ECC memory is there to protect against.

The list goes on.

BTW - I’ve never been in a car accident, but I do wear my seatbelt every day. Look at the statistics - depending on where I live, where I drive, how I drive, and when I drive, it’s more or less likely that I am involved in an accident. I still wear the seatbelt, and my car will attempt to stop automatically if it detects an oncoming crash, and there are various safety and protective systems for the event, etc. All unused. (May it stay so).

And the point of that is, in the event of a crash, which is a bad thing, you want to be as protected as you can.

And maybe you want to do the same thing to your data and protect that, or maybe not. :wink:

For a very interesting explanation of this, there is a great video by Veritasium entitled The Universe is Hostile to Computers.

2 Likes

Can’t this be overridden by the OS, like in the old days with defragmentation, for example?

No, over-provisioning is handled by the controller, as it is a low-level feature that moves blocks around to help wear levelling when blocks start to have problems reading or writing. The OS does not know it exists.

The amount of over-provisioning also impacts the price, as you are paying for flash memory that is not available for use.

1 Like

FYI, all

The SSD in the Nucleus or ROCK is formatted with the EXT4 filesystem, a well-known filesystem used by Linux distros for years. Defragmentation is rarely required.

The default filesystems in Ubuntu, ext4 (and, until recently, ext3), are designed to limit fragmentation of files as far as possible. When writing files, they try to keep the blocks used sequential or close together. This renders defragmentation effectively unnecessary.

The main thing for keeping the SSD healthy is not to overfill it to 90 or 95% used capacity; 85% maximum, I would say. This will give the ext4 filesystem some “breathing room”, if you understand what I mean :grinning:
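
(If you want to keep an eye on that, here is a trivial sketch that warns once the music disk passes 85% used; the mount point is just an example:)

```python
import shutil

MOUNT = "/mnt/roonstorage"  # example mount point of the music SSD
LIMIT = 0.85                # stay below 85% used, per the advice above

usage = shutil.disk_usage(MOUNT)
used_fraction = usage.used / usage.total
print(f"{used_fraction:.0%} of {usage.total / 1e12:.2f} TB used")

if used_fraction > LIMIT:
    print("Warning: consider a bigger or additional drive "
          "to give the filesystem and the SSD some breathing room.")
```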

3 Likes

Fragmentation happens on SSDs, and it doesn’t matter. Wear leveling on the SSD will purposely fragment the data for drive-health reasons. So the claim that EXT4 prevents fragmentation is a bit silly… They may be referring to trying to keep the blocks of data for a file together. But disk fragmentation as a whole is not relevant with SSDs.

I always figure drive size as 20% less than what it says you have. SSDs and HDDs do not like to be near full capacity… If you’re at that point, you should already have been thinking about a bigger drive or an additional drive when you hit the 80% full mark.

Another thing with NVMe drives: they all post Error Log Entries. These entries should not be taken as a sign of, or related to, drive failure.
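
(The distinction is actually visible in the NVMe SMART log itself; a small sketch, assuming nvme-cli is installed and the drive is /dev/nvme0:)

```python
import subprocess

DEVICE = "/dev/nvme0"  # example NVMe device, adjust for your system

# nvme-cli's smart-log reports both counters side by side.
out = subprocess.run(["nvme", "smart-log", DEVICE],
                     capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    # 'media_errors' counts real data-integrity errors on the media;
    # 'num_err_log_entries' is just the number of error-log records and
    # can be non-zero on a perfectly healthy drive.
    if "media_errors" in line or "num_err_log_entries" in line:
        print(line)
```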