The strange formatting of Oracle-branded HGST SSDs (WHL #71)
A little tale from the world of enterprise computing (again). Well, I needed something to do alongside filing my taxes, and I recently received a bunch of hardware, so there’s that.
I ordered, after long and unfruitful discussion and some eBay detour that also didn’t work because of, well, eBay, a couple of replacement hard disks of the spinning rust type, as well as SSDs to support them. The original ZFS RAID volume that kept *ALL* my data crashed via some obscure file system bug, and since I’m not a paying Oracle customer (not that they would help me out – everything is plastered with “restore from backup” notes), I finally migrated to Ubuntu and OpenZFS.
Back when I started with ZFS, Oracle was clearly ahead, but nowadays file-system encryption without any additional inner or outer layer such as geli is present in OpenZFS. They also got compression and even persistent L2ARCs now; the latter isn’t possible with Oracle ZFS. Okay, those folks aim for 100% uptime so that’s no priority, but this just shows that open source works well for a broad audience since it’s not necessarily driven by the needs of commercial customers. Now’s a great time to address a sincere “fuck you” to Oracle for buying Sun Microsystems and doing Oracle things to them – squeezing out cash from existing customers until they migrate away and letting software rot that is not useful to them. Solaris, ZFS, Java, MySQL, VirtualBox, even the SPARC architecture, you name it.
Rant aside: I tested the HDDs and SSDs with a couple of read and write cycles in a dedicated Windows environment, checked SMART data after running self-tests (HDSentinel works fine but gsmartcontrol only shows blank pages) and all seemed fine. When I switched to Ubuntu to format them and run a couple more tests, this happened:
kernel: [ 332.693627] mpt2sas_cm0: log_info(0x3112043b): originator(PL), code(0x12), sub_code(0x043b)
kernel: [ 332.693644] scsi_io_completion_action: 30 callbacks suppressed
kernel: [ 332.693652] sd 1:0:1:0: [sdb] tag#3170 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_SENSE cmd_age=0s
kernel: [ 332.693659] sd 1:0:1:0: [sdb] tag#3170 Sense Key : Illegal Request [current]
kernel: [ 332.693663] sd 1:0:1:0: [sdb] tag#3170 Add. Sense: Logical block reference tag check failed
kernel: [ 332.693667] sd 1:0:1:0: [sdb] tag#3170 CDB: Read(32)
kernel: [ 332.693671] sd 1:0:1:0: [sdb] tag#3170 CDB: 7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 01
kernel: [ 332.693674] sd 1:0:1:0: [sdb] tag#3170 CDB: 5d 50 a3 00 5d 50 a3 00 00 00 00 00 00 00 00 08
kernel: [ 332.694391] blk_update_request: protection error, dev sdb, sector 5860532992 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
kernel: [ 332.694401] buffer_io_error: 10 callbacks suppressed
kernel: [ 332.694403] Buffer I/O error on dev sdb, logical block 732566624, async page read
kernel: [ 332.695235] mpt2sas_cm0: log_info(0x3112043b): originator(PL), code(0x12), sub_code(0x043b)
Tons and tons of these errors, many on the exact same blocks across ALL devices. dd writes are working, dd reads completely fail. gparted formatting is completely broken. Back in Windows, everything is alright.
Are those SSDs broken or do they run a custom Oracle firmware that doesn’t work well with Linux?
Here’s the trick. The key is already present in the error messages, so bonus points for reading comprehension. It’s a protection error – something new to me, but googling the interwebs yields an easy solution. For SSDs, that’s also a quick fix; for HDDs it takes an entire write cycle, so several hours up to a day or so.
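For the curious, the failing command in the log can be picked apart by hand. What follows is a minimal Python sketch, assuming the READ(32) field offsets as I understand them from the SBC-3 spec (the log itself does not spell them out); `decode_read32` is just a name made up for this example. It shows that the LBA inside the CDB is the very sector the kernel complains about, and that the expected reference tag is simply the lower 32 bits of that LBA.

```python
import struct

# The two CDB lines from the kernel log, joined into one 32-byte command.
CDB_HEX = (
    "7f 00 00 00 00 00 00 18 00 09 20 00 00 00 00 01 "
    "5d 50 a3 00 5d 50 a3 00 00 00 00 00 00 00 00 08"
)


def decode_read32(cdb_hex: str) -> dict:
    """Decode a SCSI READ(32) CDB. Field offsets are an assumption based on SBC-3."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 32 or cdb[0] != 0x7F:
        raise ValueError("not a 32-byte variable-length CDB")
    (service_action,) = struct.unpack_from(">H", cdb, 8)
    if service_action != 0x0009:
        raise ValueError("service action is not READ(32)")
    # Bytes 12-19: LBA, 20-23: expected reference tag, 24-25: application tag,
    # 26-27: application tag mask, 28-31: transfer length in logical blocks.
    lba, ref_tag, app_tag, app_mask, blocks = struct.unpack_from(">QIHHI", cdb, 12)
    return {
        "rdprotect": cdb[10] >> 5,  # top three bits of byte 10
        "lba": lba,
        "expected_ref_tag": ref_tag,
        "app_tag": app_tag,
        "app_tag_mask": app_mask,
        "transfer_blocks": blocks,
    }


info = decode_read32(CDB_HEX)
print(info)
print("ref tag equals low 32 bits of LBA:",
      info["expected_ref_tag"] == info["lba"] & 0xFFFFFFFF)
```

The decoded LBA can be compared directly against the `sector` value in the `blk_update_request` line of the log.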
This is the smartctl output from one of those disks:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.11.0-38-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HBCAC2DH6SUN200G
Revision:             A170
Compliance:           SPC-4
User Capacity:        200.049.647.616 bytes [200 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 1 protection
8 bytes of protection information per logical block
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca09b03bf30
Serial number:        001837J31W8X 70V21W8X
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Oct 27 19:33:31 2021 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Stands out like a sore thumb!
Here’s what it should probably look like:
=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HBCAC2DH6SUN200G
Revision:             A170
Compliance:           SPC-4
User Capacity:        200.049.647.616 bytes [200 GB]
Logical block size:   4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca09b03bf30
Serial number:        001837J31W8X 70V21W8X
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Oct 27 19:50:49 2021 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
(LBPRZ=1 seems to be a deterministic “read zeroes after TRIM” feature)
“type 1 protection” has to go. While certainly useful on top of all the ZFS checksumming, the extra 8 bytes of protection information after each logical 512 byte sector (essentially a checksum plus tags, which I do not even use) have to be supported by software. It seems that most tools do, including ZFS which I tested before finding the solution, but e.g. gparted doesn’t. When there’s a mismatch, the thing suddenly throws errors and goes read-only. That’s no good…
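To get a feeling for what those 8 bytes contain, here is a rough Python sketch. The layout (2-byte guard CRC, 2-byte application tag, 4-byte reference tag) and the CRC parameters (polynomial 0x8BB7, initial value 0) are my assumptions from the T10 DIF scheme, not something the drive or smartctl reports; the function names are invented for illustration.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def type1_protection_info(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte tuple that would trail one 512-byte sector (assumed layout)."""
    if len(sector) != 512:
        raise ValueError("expected exactly one 512-byte logical block")
    guard = crc16_t10dif(sector)   # checksum over the user data
    ref_tag = lba & 0xFFFFFFFF     # type 1: low 32 bits of the LBA
    return struct.pack(">HHI", guard, app_tag, ref_tag)


# Example: an all-zero sector at the LBA from the kernel log.
pi = type1_protection_info(bytes(512), 5860532992)
print(pi.hex())
```

A reader that ignores or mangles the reference tag would produce exactly the kind of "reference tag check failed" complaint seen in the log, which is why software support matters here.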
sg3_utils is the first step to solve this.
sg_readcap --long /dev/sdb yields this:
Read Capacity results:
   Protection: prot_en=1, p_type=0, p_i_exponent=0 [type 1 protection]
   Logical block provisioning: lbpme=1, lbprz=1
   Last LBA=390721967 (0x1749f1af), Number of logical blocks=390721968
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block length=4096 bytes]
   Lowest aligned LBA=0
Hence:
   Device size: 200049647616 bytes, 190782.2 MiB, 200.05 GB
sg_format --format --size=4096 --pfu=0 /dev/sdb and querying again yields this:
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=1, lbprz=1
   Last LBA=48840245 (0x2e93e35), Number of logical blocks=48840246
   Logical block length=4096 bytes
   Logical blocks per physical block exponent=0
   Lowest aligned LBA=0
Hence:
   Device size: 200049647616 bytes, 190782.2 MiB, 200.05 GB
Note that 4096b logical sectors need to be supported by the drive – these do, so I just went with logical = physical size. I assumed this would also force ZFS to use ashift=12 for my otherwise 512b native pool. No, it fucking doesn’t, and now I got an ashift=9 pool. Any drive, however, should be capable of using either 512b (for reformatting oddball 520b and 528b drives) or 4096b = 4K sectors.
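The two Read Capacity dumps can be compared mechanically. Below is a small Python sketch that parses the relevant fields from abbreviated copies of the output above; the regexes are tailored to these dumps and are not a general sg_readcap parser. The `ashift_for_block` value is only the base-2 logarithm of the logical block length, i.e. the ashift that would match that sector size, not a prediction of what ZFS picks on its own.

```python
import re

BEFORE = """\
Protection: prot_en=1, p_type=0, p_i_exponent=0 [type 1 protection]
Last LBA=390721967 (0x1749f1af), Number of logical blocks=390721968
Logical block length=512 bytes
"""

AFTER = """\
Protection: prot_en=0, p_type=0, p_i_exponent=0
Last LBA=48840245 (0x2e93e35), Number of logical blocks=48840246
Logical block length=4096 bytes
"""


def summarize(readcap_text: str) -> dict:
    """Extract protection state and geometry from sg_readcap --long output."""
    def field(pattern: str) -> int:
        match = re.search(pattern, readcap_text)
        if match is None:
            raise ValueError(f"field not found: {pattern}")
        return int(match.group(1))

    prot_en = field(r"prot_en=(\d+)")
    p_type = field(r"p_type=(\d+)")
    blocks = field(r"Number of logical blocks=(\d+)")
    block_len = field(r"Logical block length=(\d+) bytes")
    return {
        # With prot_en=1, p_type 0/1/2 corresponds to protection type 1/2/3.
        "protection_type": p_type + 1 if prot_en else 0,
        "size_bytes": blocks * block_len,
        "ashift_for_block": block_len.bit_length() - 1,  # log2 for powers of two
    }


for label, text in (("before", BEFORE), ("after", AFTER)):
    print(label, summarize(text))
```

Both dumps should come out at the same byte count: the reformat changes the block size and drops the protection, but not the usable capacity.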
Also note that not using the 8b per 512b checksum area does not increase the total number of LBAs, so no additional user space. It likely is available to the controller, but those HGST drives are allegedly overprovisioned by 312% already, so that doesn’t matter all that much. 200GB user data from 768GiB of flash. Obscene.
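The 312% figure is easy to reproduce. A quick Python back-of-the-envelope, assuming six NAND packages of 128GiB each (that count comes from the teardown further down in this post, and the per-package size is itself an educated guess there) and the user capacity reported by sg_readcap:

```python
GIB = 2**30

USER_BYTES = 200_049_647_616      # device size from sg_readcap
NAND_PACKAGES = 6                 # assumption: taken from the teardown below
PACKAGE_GIB = 128                 # assumption: 1Tb per package
LOGICAL_BLOCKS_512 = 390_721_968  # block count while formatted with 512b + 8b
PI_BYTES_PER_BLOCK = 8

raw_bytes = NAND_PACKAGES * PACKAGE_GIB * GIB
spare_bytes = raw_bytes - USER_BYTES
op_percent = spare_bytes / USER_BYTES * 100

# Space that the protection information occupied before reformatting.
pi_bytes = LOGICAL_BLOCKS_512 * PI_BYTES_PER_BLOCK

print(f"raw flash:        {raw_bytes / GIB:.0f} GiB")
print(f"overprovisioning: {op_percent:.0f} %")
print(f"protection info:  {pi_bytes / GIB:.2f} GiB")
```

Next to the spare flash, the few GiB freed by dropping the protection information are a rounding error.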
These benchmarks indicate that there might be a little bit of spare NAND inside of these beasts – this is the empty drive:
And this is the same thing with just 10GB free (software limitation) aside from the test data, which is where other drives totally nosedive in performance:
Crystal Disk Mark basically says the same and is also not caching anything, so 1GB tests are exactly the same as 16GB tests (or the cache is larger than those 16GB of test data), 5 times each. TRIM is not active here for controller/driver issues!
And that’s about it. When firing the tiny zpool create command every device is mentioned in the syslog, but no error to be seen. I’ve also checked after fully copying the entire backup pool back to the now main pool – not a single error. Here’s the little rascal:
zpool create -o ashift=12 -O compression=lz4 -O encryption=aes-256-gcm -O keylocation=prompt -O keyformat=passphrase mylittleraid \
  raidz2 \
    /dev/disk/by-id/wwn-0x5000c5008488b8db \
    /dev/disk/by-id/wwn-0x5000c50084114cc7 \
    /dev/disk/by-id/wwn-0x5000c50084890d07 \
    /dev/disk/by-id/wwn-0x5000c500847143ff \
    /dev/disk/by-id/wwn-0x5000c500847945e7 \
    /dev/disk/by-id/wwn-0x5000c500a664d77f \
    /dev/disk/by-id/wwn-0x5000c5008410ffd7 \
    /dev/disk/by-id/wwn-0x5000c500846ebfbb \
    /dev/disk/by-id/wwn-0x5000c5009465bd23 \
    /dev/disk/by-id/wwn-0x5000c500864728e3 \
  special mirror \
    /dev/disk/by-id/wwn-0x5000cca09b03bf30 \
    /dev/disk/by-id/wwn-0x5000cca09b040e80 \
  cache \
    /dev/disk/by-id/wwn-0x5000cca09b02d478
This yields the following pool:
raidz2-0                      ONLINE  0  0  0
  wwn-0x5000c5008488b8db      ONLINE  0  0  0
  wwn-0x5000c50084114cc7      ONLINE  0  0  0
  wwn-0x5000c50084890d07      ONLINE  0  0  0
  wwn-0x5000c500847143ff      ONLINE  0  0  0
  wwn-0x5000c500847945e7      ONLINE  0  0  0
  wwn-0x5000c500a664d77f      ONLINE  0  0  0
  wwn-0x5000c5008410ffd7      ONLINE  0  0  0
  wwn-0x5000c500846ebfbb      ONLINE  0  0  0
  wwn-0x5000c5009465bd23      ONLINE  0  0  0
  wwn-0x5000c500864728e3      ONLINE  0  0  0
special
  mirror-1                    ONLINE  0  0  0
    wwn-0x5000cca09b03bf30    ONLINE  0  0  0
    wwn-0x5000cca09b040e80    ONLINE  0  0  0
cache
  wwn-0x5000cca09b02d478      ONLINE  0  0  0
10-disk RAIDZ2 so conforming to the old 2^n + 2 formula for Z2 pools for when lz4 compression doesn’t work, plus a mirror of special devices and a single cache. The mirror needs the -f parameter as it doesn’t match main pool Z2 redundancy, and (at least) a mirror is necessary here because a missing special device will take down the pool. A missing cache device however will just cause the pool to read from rust instead of solid state. I believe a Z2 or triple-mirror isn’t worth the effort since mirroring 25GB of data on a flash drive (see below) doesn’t take all that much time compared to replacing an entire TB of data on a classic hard disk. Therefore, unless they both die within an hour, I should be fine. If they do, chances are a third or fourth drive would also be dead because of a power surge or, you know, atmospheric interference.
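The 2^n + 2 rule of thumb can be written down in a few lines of Python. This is only a sketch of the usual reasoning as I understand it (an uncompressed record should split evenly across the data disks), assuming the default 128K recordsize and 4K sectors as set via ashift=12; `raidz_layout` is a made-up helper, not anything ZFS exposes.

```python
def raidz_layout(total_disks: int, parity: int,
                 recordsize: int = 128 * 1024, sector: int = 4096) -> dict:
    """Check a RAIDZ vdev width against the old power-of-two rule of thumb."""
    data_disks = total_disks - parity
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    sectors_per_record = recordsize // sector
    return {
        "data_disks": data_disks,
        # A power of two has exactly one bit set.
        "power_of_two": data_disks & (data_disks - 1) == 0,
        # Does a full, uncompressed record divide evenly over the data disks?
        "splits_evenly": sectors_per_record % data_disks == 0,
    }


print("10-disk Z2:", raidz_layout(10, 2))
print(" 4-disk Z1:", raidz_layout(4, 1))
```

With lz4 enabled the blocks written are usually smaller than the recordsize anyway, so the rule only matters for incompressible data.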
Quick word on performance of this: While of course it is an unfair comparison, the 4-disk Z1 backup compared to the 10-disk Z2 + special + cache is slightly at a disadvantage.
Opening the property window of dolphin locally and querying the entire ~400k files / 6TiB (without redundancy) takes 95 seconds on the Z1 pool. It takes 9s on the Z2.
zdb -Lbbbs poolname (which also traverses all file metadata for debugging purposes) takes 168 seconds vs. 45s. Not a 10-fold decrease but still a whopping performance boost. This command also shows used capacity on the special devices, which is increased here due to my pool-wide setting of special_small_blocks=64K. Any blocks smaller than that, and these make up around 260k of the total 400k files, are directly written to the special device for faster access. This is basically tiered storage, which is also not possible with the regular “free” Oracle ZFS and another big reason I migrated my data to OpenZFS. Tuning special_small_blocks per ZFS file system instead of on a pool level allows finer control over the data that is sent to SSD instead of HDD, up to the case where everything is on SSD until they’re almost filled. Here, those 260k files do not take up all that much of the 200GB/186GiB available, so that’s probably the optimal setting until I’m able to increase the general recordsize=128K from its default value.
                    capacity     operations    bandwidth    ---- errors ----
description        used  avail   read  write   read  write  read write cksum
mylittleraid      8.28T  19.2T  3.46K      0  35.1M      0     0     0     0
raidz2            8.26T  19.0T      1      0   115K      0     0     0     0
mirror (special)  24.0G   162G  3.46K      0  35.0M      0     0     0     0
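The placement rule behind special_small_blocks boils down to a size comparison. Here is a simplified Python sketch, assuming the behaviour described in the OpenZFS documentation as I read it (metadata always goes to the special class, data blocks at or below the threshold follow, a threshold of 0 disables the latter); it ignores compression and the fallback when the special vdev runs full. The fill figure uses the numbers from the table above.

```python
SPECIAL_SMALL_BLOCKS = 64 * 1024  # pool-wide setting from this post


def allocation_class(block_size: int, is_metadata: bool = False,
                     threshold: int = SPECIAL_SMALL_BLOCKS) -> str:
    """Simplified model of where a block lands: 'special' (SSD) or 'normal' (HDD)."""
    if is_metadata:
        return "special"
    if threshold and block_size <= threshold:
        return "special"
    return "normal"


for size_k in (4, 64, 128):
    print(f"{size_k:>4}K data block -> {allocation_class(size_k * 1024)}")

# Fill level of the special mirror, from the used/avail columns above.
used_g, avail_g = 24.0, 162.0
special_fill_percent = used_g / (used_g + avail_g) * 100
print(f"special mirror fill: {special_fill_percent:.1f} %")
```

Raising the threshold to the recordsize would send every data block to the SSDs, which is the "everything on SSD" extreme mentioned above.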
Bonus pictures from the drive:
Couple of questions and remarks already:
1) This is the smallest capacity drive of the series, so no wonder only half of the NAND channels are even populated. Strangely, it’s 2x 3 packages (top and bottom), is there a 6- or 12-channel controller out there?
2) I need to have suitable silpads ready if I want to take off the brittle gunk for checking the NAND type, so that I can put it back into service afterwards. They’ll likely survive without, but look at the effort these guys have put into heatsinking. The case is two pieces of die-cast alloy and they weigh a ton, further making this 15mm height SSD an absolute tank against regular SSDs.
Note: No need to chicken out, since I had a spare razor blade from CPU delidding around, I just cut it away a bit and moved the flap back down afterwards with a bit of paste. Hehe.
So, it’s 29F01T2ANCMG4 NAND. While I have no Intel datasheet about this (morons!), everything lines up and this should be a 1Tb/128GiB chip, likely MLC but not 100% sure (the SSD would have a different model number, see below, plus TLC might not be a “MG4” part). Excluding the remote possibility that there could be a mix of NAND present, 6x 128GiB = 768GiB, ridiculous overprovisioning figure confirmed.
Fully populated this would be 1.5TiB, and populated with the 2Tb/256GiB sister model 29F02T2AOCMG4, it’d be 3.0TiB. Now there’s also a strange 29F04T2AWCMG4 part which makes a 720GB SSD (wtf?) with just one chip populated or a 2TB SSD with three of them on board (even more wtf), so I would say this is not a 4Tb/512GiB chip but rather a 3Tb/384GiB one thanks to 3D stacking. Using that part x 12 the maximum would be 4.5TiB for this board layout, enough to fit the largest unit of the speculated origin series of that drive – see below.
3) The power-loss protection cap is actually a THT part and not an array of parallel SMD caps. It’s a 1000µF 35V 105°C from Chemicon, so that’ll do…
4) There’s a SK Hynix DRAM chip, H5AN4G8NAFR. 4Gb = 512MB of DDR4, nice!
5) I’m curious: This is an HGST drive, so Western Digital. Why on earth is there an Intel logo present!? Just because Intel NAND is used?
6) As for what this actually is…I think this could be a relabelled Ultrastar SS300.
It’s clearly not an Ultrastar 800MM drive, but this is the most recent teardown review of an HGST/WD enterprise product that I have found. The older SSD400S.B, still a Hitachi drive, also does not match at all, and the more recent SS200 has a different case.
There’s a shady press photo of the SS300 that does not match the PCB layout, but I wouldn’t call that convincing evidence. On the other hand, SS300 drives are advertised with this image, which absolutely matches the cooling fins of this SSD:
Bitdeals also has this 800GB SS300 listed for sale, which matches the label and the construction of the upper case with the two black screws on one side:
Cisco lists identical A170 firmware for SS300 drives. There is no 200GB model in that series; that should however not be a problem for a company like Oracle with enough purchase volume to get their devices customized. The SS300 capacities of 3200, 1600, 800 and 400 GB would match this type of drive, just with halved NAND (or rather absurd overprovisioning).
The “HGST PN 0B40021” sticker on the bottom doesn’t yield much either, but one of the four Google hits is from Walmart, stating the following:
HGST HUSMM3240ASS201 Ultrastar SS300 400GB 2.5″ MLC NAND TCG SAS 12Gb/s SSD Solid State Drive 0B35244.
Part Number: 0B35244
Series/Family: Ultrastar SS300
Compatible Part Numbers: 118000323-02 0B40021
Furthermore, the internal name HBCAC2DH6SUN200G (Oracle PN) is supplemented by another model number only visible on the drive label: HUSMH4020ASS210. This is also speculated upon by a couple of random dudes here, and they’re right about the (somewhat known) model number structure:
HUSMH4020ASS210 is a
* Ultrastar device “HUS”
* with MLC high endurance NAND “MH” – used e.g. on the 800MM drive (25nm MLC NAND)
* in a series with 400GB max “40”
* with capacity 200GB “20”
* first generation of this type “A”
* 2.5″ drive “S”
* dualport SAS 12Gb/s “S2”
* customization number “1”, apparently very common for customized Netapp, Sun, Oracle drives from HGST
* secure erase security option “0” (1 would be TCG, 5 TCG-FIPS, 4 no encryption)
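The breakdown above translates nicely into a regular expression. A hedged Python sketch: the field widths are inferred from the three model numbers mentioned in this post, and the security codes are the ones listed above, so this is guesswork that happens to fit rather than an official HGST decoder. The two capacity fields are kept as raw digits because their scale is ambiguous ("40" is 400GB here, "16" is 1.6TB elsewhere).

```python
import re

# Field layout inferred from the model numbers in this post (an assumption).
MODEL_PATTERN = re.compile(
    r"^(?P<family>HUS)"
    r"(?P<nand>[A-Z]{2})"
    r"(?P<series_max>\d{2})"
    r"(?P<capacity>\d{2})"
    r"(?P<generation>[A-Z])"
    r"(?P<form_factor>[A-Z])"
    r"(?P<interface>[A-Z]\d)"
    r"(?P<customization>\d)"
    r"(?P<security>\d)$"
)

SECURITY_CODES = {
    "0": "secure erase",
    "1": "TCG",
    "4": "no encryption",
    "5": "TCG-FIPS",
}


def decode_model(model: str) -> dict:
    """Split an Ultrastar model number into its (presumed) fields."""
    match = MODEL_PATTERN.match(model)
    if match is None:
        raise ValueError(f"unexpected model number format: {model}")
    fields = match.groupdict()
    fields["security"] = SECURITY_CODES.get(fields["security"], "unknown")
    return fields


for model in ("HUSMH4020ASS210", "HUSMM3240ASS201", "HUSMM1640ASS200"):
    print(model, decode_model(model))
```

Reassuringly, the Walmart listing's HUSMM3240ASS201 decodes to the TCG security option, which matches its description.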
While I cannot find any 400GB drive of that series and no Oracle 400GB drive aside from the older HUSMM1640ASS200 (that maxes out at 1.6TB – clear indicator this series is arbitrarily limited), my money would be on the SS300 series. It’s a shame that one hasn’t been reviewed in the entire interwebs yet…