What is your test method for new HDDs before adding to array?

RileyKennels@alien.top · 1 year ago

What is your test method for new HDDs before adding to array?

EchoGecko795@alien.top · 1 year ago

Here is my over the top method.

++++++++++++++++++++++++++++++++++++++++++++++++++++

My Testing methodology

This is something I developed to stress both new and used drives so that if there are any issues they will appear.
Testing can take anywhere from 4-7 days depending on hardware. I have a dedicated testing server setup.

I use a server with ECC RAM installed, but if your RAM has been tested with MemTest86+ then you are probably fine.

SMART Test, check stats

smartctl -i /dev/sdxx

smartctl -A /dev/sdxx

smartctl -t long /dev/sdxx
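Once the long test has finished, the attribute table from `smartctl -A` is the part worth reading closely. As a minimal sketch, assuming an ATA drive (where `smartctl -A` prints a ten-column table with the raw value in the last column), the helper below flags three attributes that are commonly treated as warning signs when their raw value is above zero. The function name `flag_smart_attrs` and the choice of attributes are assumptions for illustration, not part of the method above; SAS drives print a different layout and would need a different parser.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -A` output (ATA layout) on stdin and
# prints NAME=RAW for selected attributes whose raw value is greater than 0.
# Returns 1 if anything was flagged, 0 otherwise.
flag_smart_attrs() {
    awk '
        BEGIN { bad = 0 }
        # Column 2 is the attribute name, column 10 is the raw value.
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && ($10 + 0) > 0 {
            print $2 "=" $10
            bad = 1
        }
        END { exit bad }
    '
}

# Usage (sdxx is the placeholder device name used above):
#   smartctl -A /dev/sdxx | flag_smart_attrs || echo "inspect this drive"
```

The result of the long test itself can be read with `smartctl -l selftest /dev/sdxx`.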

BadBlocks -This is a complete write and read test, will destroy all data on the drive

badblocks -b 4096 -c 65535 -wsv /dev/sdxx > $disk.log
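The `$disk` variable in the command above is not defined anywhere in the post; presumably it holds the device name. Without `-o`, `badblocks` prints the list of bad block numbers to standard output, one per line, so the redirected log should be empty for a healthy drive. A minimal sketch of that check follows; the helper names `log_name_for` and `count_bad_blocks` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive a log file name from a device path,
# e.g. /dev/sdxx -> sdxx.log
log_name_for() {
    printf '%s.log\n' "$(basename "$1")"
}

# Hypothetical helper: count the non-empty lines in a badblocks log.
# Each such line is one bad block number, so 0 means nothing was logged.
count_bad_blocks() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Usage (the badblocks -w run DESTROYS ALL DATA on the drive):
#   disk=/dev/sdxx
#   log=$(log_name_for "$disk")
#   badblocks -b 4096 -c 65535 -wsv "$disk" > "$log"
#   [ "$(count_bad_blocks "$log")" -eq 0 ] && echo "no bad blocks logged"
```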

Real-world surface testing: format to ZFS. Yes, you want compression on. I have found checksum errors that having compression off would have missed. (I noticed it completely by accident. I had a drive that would produce checksum errors when it was in a pool, so I pulled it and ran my test without compression on. It passed just fine. I put it back into the pool and the errors appeared again. The pool had compression on, so I pulled the drive and reran my test with compression on, and got checksum errors. I have asked around, and no one knows why this happens, but it does. This may have been a bug in early versions of ZFS on Linux (ZoL) that is no longer present.)

zpool create -f -o ashift=12 -O logbias=throughput -O compress=lz4 -O dedup=off -O atime=off -O xattr=sa TESTR001 /dev/sdxx

zpool export TESTR001

sudo zpool import -d /dev/disk/by-id TESTR001

sudo chmod -R ugo+rw /TESTR001

Fill test using F3, followed by a ZFS scrub to check for any read, write, or checksum errors.

sudo f3write /TESTR001 && f3read /TESTR001 && sudo zpool scrub TESTR001
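`zpool scrub` normally returns as soon as the scrub has started, so the error counters only mean something once `zpool status` reports that the scrub has finished. As a minimal sketch, the hypothetical helper `pool_error_rows` reads `zpool status` output on stdin and prints any row of the config table whose READ, WRITE, or CKSUM column is not `0`. It assumes the usual `NAME STATE READ WRITE CKSUM` table layout.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` output on stdin and prints
# "NAME READ WRITE CKSUM" for every config row with a non-zero counter.
# Returns 1 if any such row was found, 0 otherwise.
pool_error_rows() {
    awk '
        BEGIN { in_table = 0; bad = 0 }
        # The header row marks the start of the config table.
        $1 == "NAME" && $3 == "READ" && $4 == "WRITE" && $5 == "CKSUM" { in_table = 1; next }
        # A blank line ends the table.
        in_table && NF == 0 { in_table = 0 }
        # Counters can be abbreviated (e.g. 1.2K), so compare as strings.
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, $3, $4, $5
            bad = 1
        }
        END { exit bad }
    '
}

# Usage, after the scrub has completed:
#   zpool status TESTR001 | pool_error_rows || echo "errors found"
```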

If everything passes, the drive goes into my good pile. If something fails, I contact the seller to get a partial refund for the drive or a return label to send it back. I record the WWN and serial number of each drive, along with a copy of any test notes:

8TB wwn-0x5000cca03bac1768 -Failed, 26 -Read errors, non recoverable, drive is unsafe to use.

8TB wwn-0x5000cca03bd38ca8 -Failed, CheckSum Errors, possibly recoverable, drive use is not recommended.

++++++++++++++++++++++++++++++++++++++++++++++++++++

MakingMoneyIsMe@alien.top · 1 year ago

I don’t test. I just slap it in.

Celcius_87@alien.top · 1 year ago

Run the health check in HD Tune and then also look at the SMART stats. Also check drive health using the WD Dashboard software. Then I put it to work and observe it.

Xenkath@alien.top · 1 year ago

Long SMART test, dd if=/dev/urandom of=/dev/[new disk], long SMART test. That’s pretty much it.
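With no `bs=` argument, `dd` copies in 512-byte blocks, which tends to be slow across a whole disk. A minimal sketch with a larger block size follows; the helper name `fill_random`, the 1 MiB block size, and the optional count argument (so it can be tried on a scratch file first) are all assumptions, not part of the comment above.

```shell
#!/bin/sh
# Hypothetical helper: overwrite TARGET with random data in 1 MiB blocks.
# With a second argument, write only that many MiB (useful for a dry run
# on a scratch file). Without it, dd runs until the target is full and
# then exits non-zero with "No space left on device", which is expected
# when the target is a whole disk.
fill_random() {
    target=$1
    count_mib=${2:-}
    if [ -n "$count_mib" ]; then
        dd if=/dev/urandom of="$target" bs=1048576 count="$count_mib"
    else
        dd if=/dev/urandom of="$target" bs=1048576
    fi
}

# Usage (DESTROYS ALL DATA on the target disk):
#   fill_random /dev/sdxx
```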

EtherMan@alien.top · 1 year ago

Don’t need to. I use Ceph with replication, so the drive can die after 5 minutes and it wouldn’t matter at all to the array. I just put a sticker on it that tells me when it was bought so I can RMA it if it dies quickly. Beyond that, I can basically lose two thirds of my drives and still be perfectly fine. If I’m unlucky, I could lose data from 3 drives failing, but only if they are in three different servers, one of which is off site. Those failures also have to happen within a couple of hours of each other, or Ceph will have time to create new replicas. Even then, only a small subset of the data gets priority once the second drive fails, since it only needs to create new replicas of the data that exists solely on those two drives, which isn’t going to be more than a couple of hundred megs really.

Basically, I love Ceph if that doesn’t show ;P