Friday, March 26, 2021

My ideal Elastic server

Over and over, the IT crowd at the places I have worked has insisted on virtualizing every server, and over and over, Elasticsearch falls to its knees due to the network latency of accessing SAN storage from their VMs.  I complain about the latency, and I hear that they'll switch me to the newest, fastest SAN technology, and that should "solve all my problems."

IT likes to virtualize things because they can cram many more virtual servers into high-powered hardware than if they stood up a physical server for each function.  However, that argument falls apart as the virtual servers get bigger and consume most or all of a hypervisor host's resources.  At the extreme, the hypervisor becomes just an abstraction layer and adds little value except the ability to migrate the guest VM.  If anything, the additional layers and complexity seem to increase the occurrences of outages and service degradation.  Plus, as the CPU core count goes up, the per-core clock speed drops.

Elasticsearch is designed to run well on commodity hardware with no component-level redundancy.  When no faults are present, it can fully utilize its hardware by spreading the load amongst all the working nodes that hold shards of the data being searched.  If a failure occurs, it keeps working, although performance is degraded: first because there is less hardware to spread a search across, and second because the failure likely kicked off some recovery action to restore redundancy and survivability.  This is the core of Elasticsearch's node-level redundancy, in contrast to the component-level redundancy favored by most enterprise IT groups.
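
To make that concrete, here is a minimal sketch of what node-level redundancy looks like from the software side, using Python with the requests library against a hypothetical node at es-node1:9200: an index with one replica per primary shard keeps a copy of every shard on two different nodes, so losing a single node costs performance, not data.

    import requests

    ES = "http://es-node1:9200"  # hypothetical address; any node in the cluster will do

    # One replica per primary: every shard exists on two different nodes, so a
    # single node failure degrades performance but loses nothing.
    resp = requests.put(
        f"{ES}/logs-2021.03",
        json={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
    )
    resp.raise_for_status()

    # Health goes "green" once every primary and replica shard is assigned to a
    # node; "yellow" means some replicas are still waiting for a home.
    health = requests.get(f"{ES}/_cluster/health").json()
    print(health["status"], health["number_of_nodes"])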

But let's face it: modern hardware is very reliable, and failures are rare.  Most failures occur because of user error or data center issues such as a power interruption or network faults external to the server.  Or fire or water damage...

Given all these issues, I wish I could convince IT that we're better off with physical servers for our Elastic stack.  Since high-availability clusters start at three nodes to avoid split-brain, here's what I envision as the perfect hardware for that task.  Hopefully the design is dense enough that IT wouldn't have anything to complain about.
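
As an aside, that three-master-node minimum is easy to sanity-check once a cluster is up.  A rough sketch in Python with requests, pointed at a hypothetical es-node1: list the nodes and count how many are master-eligible; with three, the cluster can always form a majority of two and avoid split-brain.

    import requests

    ES = "http://es-node1:9200"  # hypothetical address of any node in the cluster

    # 'm' in node.role marks a master-eligible node; '*' in the master column
    # marks the currently elected master.
    nodes = requests.get(
        f"{ES}/_cat/nodes", params={"format": "json", "h": "name,node.role,master"}
    ).json()

    eligible = [n for n in nodes if "m" in n["node.role"]]
    print(f"{len(eligible)} master-eligible nodes (want at least 3 to avoid split-brain)")
    for n in nodes:
        print(n["name"], n["node.role"], "ELECTED" if n["master"] == "*" else "")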

I want a 1RU rack-mount chassis with hot-swappable slots on the front for three 1/3rd-width physical node trays.  Each node tray would provide:

  • A mainboard with a mid-level AMD EPYC CPU (max 180W?)
  • 8x RDIMM slots (or 12?), enough for 512GB using 64GB DIMMs
  • All the usual mainboard features, like TPM, southbridge, etc.
  • A modular storage cage that comes in multiple flavors, each one with front-accessible USB/VGA ports and power/reset buttons, 2x internal M.2 SSDs similar to Dell's BOSS, and one of:
    • 1x hot-plug LFF SAS/SATA HDD (M.2s under the HDD)
    • 2x hot-plug SFF SAS/SATA HDDs/SSDs
    • 2x hot-plug U.2 SFF NVMe SSDs
    • 4x hot-plug E3.S/L EDSFF NVMe SSDs

The backplane would connect the three front trays to the back side of the chassis, which would provide:

  • central lights-out management (e.g., IPMI) for the chassis/nodes
  • 6x hot-plug OCP 3.0 SFF NIC slots, 2x per node
  • 6x hot-plug E3.S/L EDSFF NVMe slots, 2x per node
  • two or three hot-plug (n+1 redundant) PSUs

The storage controller could live either on the mainboard, with SAS/SATA/NVMe cables to the drives, or on the storage cage itself, connected to a PCIe slot on the mainboard.  Putting it on the cage would mean switching between SAS/SATA and NVMe swaps a single FRU rather than replacing both the cage and a controller on the mainboard.  In any case, RAID is not strictly required; the root device can use software RAID if you really want it, but remember that these are designed for node-level rather than component-level redundancy.  I generally favor a mirror for the OS disks (and redundant PSUs), even with node redundancy.
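
If you do go the software-RAID route for the OS mirror, keeping an eye on it is cheap.  A rough sketch (Linux md assumed, standard /proc/mdstat layout) that flags any array missing a member:

    import re

    # A degraded md mirror shows "[2/1] [U_]" instead of "[2/2] [UU]" in
    # /proc/mdstat; flag any array that is missing a member.
    with open("/proc/mdstat") as f:
        mdstat = f.read()

    for m in re.finditer(r"^(md\d+) :.*?\[(\d+)/(\d+)\]", mdstat, re.M | re.S):
        name, want, have = m.group(1), int(m.group(2)), int(m.group(3))
        state = "healthy" if have == want else f"DEGRADED ({have}/{want} members)"
        print(f"{name}: {state}")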

Regarding compute density, you're looking at a total of three sockets per chassis, each with up to 32 cores at 2.5GHz, assuming a maximum TDP of 180W as I write this.  At 96 2.5GHz cores per RU, or 192 threads, that's better than many dual-socket 1RU systems currently out there, and by spreading the cores across three sockets you keep both high core count and high per-core speed.  You get up to 3 LFF drives per RU, not far off the usual 4, or up to 6 SFF per RU, a good portion of the usual 8 or 10.  And that does not count the additional internal M.2 drives or the rear EDSFF slots.  With NVMe and EDSFF, there would be 6 E3.S/L per node, 4 up front and 2 in back; 18 per RU is close to the usual maximum of 20.  With thermal limits on the CPUs and no room for additional expansion cards, these would not be heat monsters, so 120VAC PSUs in the 1100W range should suffice.  It may even be possible to modularize the rear EDSFF slots and offer an optional cage with 3x LP PCIe slots instead of 6x EDSFF.
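
Here is that back-of-the-envelope math written out; the per-socket core count and clock are my assumptions for a mid-range ~180W EPYC at the time of writing.

    NODES_PER_RU = 3            # one socket per 1/3rd-width node tray
    CORES_PER_SOCKET = 32       # assumed for a mid-range ~180W EPYC
    THREADS_PER_CORE = 2
    EDSFF_PER_NODE = 4 + 2      # 4 front E3.S/L + 2 rear

    print("cores/RU:  ", NODES_PER_RU * CORES_PER_SOCKET)                     # 96
    print("threads/RU:", NODES_PER_RU * CORES_PER_SOCKET * THREADS_PER_CORE)  # 192
    print("EDSFF/RU:  ", NODES_PER_RU * EDSFF_PER_NODE)                       # 18, vs ~20 typical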

To sum it up, a chassis with 1/3rd-width nodes could scale from a single three-server cluster in one RU up to a huge farm spanning three or more racks, at well over 100 nodes per rack.  It's economical in that it doesn't require hypervisor licensing, oodles of redundant disk, or exotic network gear.  It isn't even specific to Elasticsearch, so it could serve other high-density use cases built around node-level redundancy, but at 3 nodes per RU it hits what I believe to be a sweet spot between compute density and enough physical space for meaningful per-node local storage.  With the LFF HDD option, the current and upcoming crop of 20+ TB drives makes these a good fit for cold-storage servers.

Not all systems require a TB of memory and 128 cores; there is still a place for small, dense physical servers.  Dell, make it so!