Random thoughts on Supermicro Risers, LSI LBAs, power supplies, ZFS and other fun server things (#P34)
With the failed 450W SFX power supply already split off into its own post last week, now’s the time to waffle on about everything else that happened along the way.
Well, Solaris 11.4 (evil developer/home use licensing, so no current service packs) still defaults to the minimal ashift the drives report instead of a fixed value of 12. ashift is the base-2 exponent of the smallest assignable portion of data, so basically the sector size. The “old” default is 9, as 2^9 is the well-known 512 byte sector size that most old hard disks had as both logical and physical size, and which they reported as such. Interestingly, the ST3000NM0043 drives that I migrated to have a user-selectable sector size – I got mine with 512B formatting, but 516B, 520B, 524B and 528B are apparently also possible (the datasheet contradicts itself in the details). The total sector count is not preserved between formats, so the drives do not have “large” physical sectors that are partially wasted for ordinary 512B use; rather, the surface is reorganized to hold either the regular number of 512B sectors or a proportionally smaller number of e.g. 528B sectors. Switching sector sizes will void all data – no surprise here.
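If you want to see what a drive actually claims about itself, smartmontools prints both values – it’s not part of a stock Solaris install and the device path below is just a placeholder, but something along the lines of

sudo smartctl -i /dev/rdsk/c14t5000C500ABCDEF12d0

should, for a 512B-emulated drive, contain a line like “Sector Sizes: 512 bytes logical, 4096 bytes physical” (SAS drives print logical and physical block sizes on separate lines instead).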
Now, ashift was a bit of a hot topic when 4K sector hard drives with 512B emulation first came to market, as some of them lied. They fucking lied. Despite HORRIBLE performance penalties when thrashing them with 512B sync writes on top of their already abysmal mechanical performance, some manufacturers decided Microsoft OSes could not deal properly with devices truthfully announcing 4K physical and 512B logical sector sizes, so they made them report 512B physical and 512B logical instead. That’s when forcing ashift to 12 in ZFS became a thing, because not only did some drives need to be put on ashift 12 despite their own claim that 9 would do the job, but one also cannot place large-sector drives in a small-sector array (performance hit aside), and the ashift of a zpool cannot be changed later. So if one creates an ashift 9 pool today and ten years down the line one drive fails and needs to be replaced, one would be forced to find a 512B native drive or migrate the entire pool while it sits there degraded.
With my old pool, a mix of 512B-emulated and 512B-native drives that truthfully reported their physical sector sizes, that wasn’t an issue (but I did check…). The ST3000NM0043 however do report 512B native – whether that is correct or not is unknown, especially with the user-selectable sector sizes in mind. I wanted a future-proof array, so I needed ashift 12. Solaris does not offer a “force ashift” option when creating zpools like OmniOS/OpenIndiana/OpenZFS apparently do nowadays, so it needed to be tricked into doing that.
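Just for comparison, on OpenZFS the override is simply a property at pool creation time – pool and disk names here are made up, and Solaris 11.4 will not accept this:

sudo zpool create -o ashift=12 tank raidz2 sdb sdc sdd sde

Solaris needs the detour described next.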
/kernel/drv/sd.conf is the place to do that. It might be mostly empty by default, but can be edited to make the necessary ashift changes down the command chain. All one needs is the vendor and product description, e.g. my drives showed up like this:
Vendor: ATA ,Product: ST2000DM001-9YN1 ,Revision: CC62 ,Serial No: Z1E0Fxyz
Vendor: ATA ,Product: TOSHIBA DT01ACA2 ,Revision: ABB0 ,Serial No: 64UTWZxyz
Vendor: IBM-ESXS ,Product: ST3000NM0043 E ,Revision: EC5L ,Serial No: (none, surprisingly)
So the vendor field has a fixed 8-character length (only the SAS drive reports a real vendor string here, the SATA ones just show up as “ATA”), and the product code has 16 characters. In sd.conf, that translates to
sd-config-list = "IBM-ESXSST3000NM0043 E", "physical-block-size:4096";
(vendor and product are packed together without spacing)
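Several drive types can go into the same sd-config-list, with the entries separated by commas and a single semicolon only at the very end. A sketch based on the drive list from above (the two ATA lines are purely illustrative – note the vendor being space-padded to its full 8 characters):

sd-config-list =
    "ATA     ST2000DM001-9YN1", "physical-block-size:4096",
    "ATA     TOSHIBA DT01ACA2", "physical-block-size:4096",
    "IBM-ESXSST3000NM0043 E", "physical-block-size:4096";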
and after that is saved, running
sudo update_drv -vf sd
(-v for verbose, -f to force re-reading the config file) will make the sd driver pick up those changes. A zpool create on one of the matching drive types will then use a physical block size of 4096 and thus ashift 12 for the pool, unless an even larger ashift is found elsewhere. Apparently people are using ashift 13 = 8192 byte sectors as well, but I don’t see that becoming a trend in the near future.
After creating the pool, check with
zdb -C | grep ashift
My rpool is on ashift 9 but that one can be recreated any time – all data pools are on 12.
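The same check also works as a dry run before trusting the setup with real data: create a throwaway pool on one of the matching drives, look at its ashift, destroy it again. The device name below is made up, and obviously only do this on a disk whose contents are expendable:

sudo zpool create testpool c17somedrive
zdb -C testpool | grep ashift
sudo zpool destroy testpool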
So that is one important step in pool creation, the other being activation of compression and encryption at pool level instead of per filesystem. The first one matters for transferring snaps (anything written before compression is switched on – like a first snap – would stay uncompressed), the latter is a convenience thing as one password unlocks everything (of course, needing several different passwords rules that approach out). Enabling both is straightforward, just add the -O flags encryption=on and compression=lz4 to the zpool create command, e.g.
sudo zpool create -O encryption=on -O compression=lz4 wanhunglo raidz2 c12longstring c13bla c14foo c15bar c16lol and so on
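Everything created below the pool’s top-level dataset should then inherit both properties (and the wrapping key), which is exactly the one-password convenience mentioned above – a quick check, with the child dataset name made up:

sudo zfs create wanhunglo/archive
zfs get -r encryption,compression wanhunglo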
Now over to some hardware. As mentioned a thousand times already, I’m running a Supermicro X9SRW, a single-processor board with a somewhat limited number of PCIe lanes (x32 with bifurcation on the left-hand riser) in a proprietary form factor. Dual-processor boards such as the X9DRW series offer a few more lanes when both processors are present. Those use the same Supermicro WIO risers, but for them there’s a special PCIe 3.0 x48 riser board available – the RSC-R2UW+-2E16-2E8. And as I happened to have an eBay search running for these and all other WIO risers (>50€ new!), I recently found one, in Germany.
And missed bidding in time. And nobody else placed a bid on it, so the seller offered it once again.
And I fucking missed it again.
Since, again, nobody placed even a single Euro as a bid, it went up for a third auction, which I finally won with a final bid of 1.00€.
So here’s the mighty PCIe x48 (2x x16 + 2x x8) riser:
I wasn’t completely sure about the lane situation, but it was clear that the x16 slots would not be split between the CPUs. So the additional 16 lanes from the “SLOT1” piece on the back could have gone into two x8 slots or one x16 slot (with the others rearranged accordingly), but not into something like the quad-x8 RSC-R2UW-4E8 (overview page here) with two extra x8 links simply bolted onto the existing x8 slots. Luckily, it’s pretty obvious how the lanes are routed: the addition from CPU1 goes entirely to the top x16 slot and the others are connected to CPU0.
This effectively offers a new type of WIO card for single CPU boards that can hold the oversized riser – lanes as present on the riser from top to bottom:
RSC-R2UW-2E8E16 -> x16, empty, x8*, x8* (not 100% sure, judging from bad photos)
RSC-R2UW-2E8E16+ -> x16, x8*, x8, empty
RSC-W2-66 -> x16, empty, x16, empty
RSC-R2UW-4E8 -> x8, x8, x8, x8
RSC-R2UW+-2E16-2E8 -> empty, x8, x16, x8
(* offer x8 in physical x16 slots)
RSC-W2-66G4 is equivalent to the RSC-W2-66 and RSC-W2-8888G4 to the RSC-R2UW-4E8, both offering PCIe 4.0 speeds instead of 3.0. There are some 2U risers with fewer PCIe slots that aren’t lane-limited even by a single CPU. All 1U risers offer two slots at most, so they cannot be limited by lanes at all. The standard RSC-R1UW-2E16 with 2x x16 is probably best for >99% of all systems, and only very specific builds need the cut-down variants.
Now let me pull out the photo from the former Sandy/Ivy comparison shots to demonstrate the location of the riser slots:
Bottom of the PCB, from the very left to about the center of the board. Looking closely, one can see that the second PCIe-type slot on the board is positioned so that any card in it will clear the adjacent top white SATA (6Gb/s) port with virtually no spacing, and the black 3Gb/s SATA ports line up behind it. If one uses SATA cables without the metal clip (which would face upwards, i.e. against the card) – or skips those ports entirely – that sweet RSC-R2UW+-2E16-2E8 will actually fit the X9SRW.
Given my other RSC-R2UW-4E8 riser is somewhat stuck in the mounting situation described in the 836 chassis conversion project #P23, that additional super-cheap riser will allow me to test things without disassembling the card stack. Which, by the way, has been adapted to the airflow requirements of the new RAID cards, but that’ll be a separate blog post, since my current fix of adding a kitchen sponge is not suitable for permanent use.
The “300Pcs M3 Brass Hex Column Standoff Support Spacer Screw Nut Assortment Kit” for 9.39€ (bought March 2020 on AliExpress) was very helpful in getting those cards aligned and fixed in place, since it offers a nice selection of 4/6/8/10/12mm standoffs (male-male and male-female). I only wish these were available in nylon as well.
With another 90° riser stuck to that riser, mounting cards with their original full or low profile I/O bracket is possible – that’s a big plus compared to the former situation. I’m aware that those photos are a bit…crowded.
Last, but not least, a few words on the new RAID cards: I have swapped my two IBM BR10i (SAS1068e cards, so 8x SAS1 at PCIe 1.0 and a 2TiB per-disk limit) for the two Dell PERC H310 cards shown above (SAS2008 = SAS2, PCIe 2.0), as I was obviously breaking the 2.2TB/2.0TiB limit. These have a nicer alignment for the SFF-8087 cables, were easier to get hold of than the Dell H200 with its vertical ports, and are cheaper than the more widely known IBM M1015 – all of them are the same chipset. A 16-port card would be perfect, but those are dang expensive, probably because only LSI-branded ones exist and there are no IBM, Dell or HP variants being thrown out of standard servers by the dozen. I do have the space to run 2×8, see above, so I just went with that option. The servethehome comparison list was very helpful in deciding on a card.
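The 2.0TiB/2.2TB figure is simply what 32-bit LBAs can address at 512 bytes per sector, which is presumably where the SAS1068e limit comes from:

2^32 sectors × 512 bytes = 2,199,023,255,552 bytes = 2 TiB ≈ 2.2 TB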
I cannot really link the software required to flash those cards to IT mode since I cannot verify the files are 100% clean, but I bet you can find the ZIP yourself on the interwebs. sas2flash version P5 was required, since later versions are either buggy or refuse to flash non-LSI-branded cards. Mine, being Dells, were affected, so I had to use P5 to change the manufacturer code and then patch up to P20 (there are multiple subrevisions, and the first is buggy as well, according to people on said interwebs). Changing to IT mode = HBA mode with the RAID functionality disabled for ZFS use is just a matter of flashing a different firmware file. Don’t forget to note down the SAS addresses before doing any “non-recommended” flashing with the -f option, as they will get lost in the process. I didn’t, as both sas2flash and megacli refused to show them (megacli.exe -AdpAllInfo -aAll -page 20 ?) and I was too lazy to boot into Solaris again, so I just assigned some fake controller SAS addresses afterwards – doesn’t really matter, since there’s no sticker on the hardware indicating otherwise. They just need to be unique in the system, so run sas2flash -o -sasadd 5xxxxxxxxxxxxxxx with the controller parameter -c for each of the present cards (and don’t fuck up compatible expanders found by the tool!). The hardest part of the whole story was getting the board to boot from the USB key with all the necessary files on it in distinguishable 8.3 format…
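For orientation, the P20 part of the exercise boils down to something like the lines below – a sketch, not a verified recipe: the firmware and BIOS file names are the usual ones from the LSI 9211-8i IT package (skip -b if you don’t want the boot ROM), -c picks the controller, and the SAS address is the same placeholder as above:

sas2flash -listall
sas2flash -o -f 2118it.bin -b mptsas2.rom -c 0
sas2flash -o -sasadd 5xxxxxxxxxxxxxxx -c 0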
I’ll add another post about cooling these, as they get significantly warmer than the old cards even at idle and, according to servethehome, will actually die from lack of airflow. I myself experienced a very unresponsive Windows a few minutes after booting it for the initial zeroing of the disks, so they might throttle, but probably not enough to survive long-term.
Well, that should be it – data transferred successfully, even though a blown desktop mainboard capacitor, a blown server power supply, a cracked laptop hinge, and, earlier today, a pinhole leak in the 18mm warm water supply pipe just before the gate valve tried stopping me. Hope those all get fixed soon…