This post is part 2 of 2 of my NPX Architecture Guide how-to series. In this post we will cover sections 9 through 14 from the outline below. You can check out the first part of this series here. At the end I will also give you some more tips on the various standard tables that I used throughout the document.
The major sections in the architecture guide are:
- Overview
- Current State and Operational Assessment
- Design Overview
- Nutanix Capacity and Sizing
- Nutanix Cluster Design
- Host Design
- Network Design
- Storage Design
- Security and Compliance
- Management Components
- Virtual Machine Design
- Data Protection and Recoverability
- Datacenter Infrastructure
- Third-Party Integration
9.0 Security and Compliance
Security is often a weak point for many architects, so be sure not to skimp on this section. If it's light on details, be prepared for defense questions. Topics to cover here include, but are not limited to:
- RBAC design (Prism/hypervisor/applications)
- SSL certificates (Prism/NGT/hypervisor)
- System hardening (STIG, PCI DSS, etc.)
- Network security (microsegmentation, VLANs, ACLs, etc.)
- Patching/compliance reporting
- Use of SSH and hardening (e.g. SSH keys)
- Syslog configuration (TCP/UDP)
- PulseHD
- Use of two-factor authentication
- Nutanix password complexity settings
And depending on your technical and business requirements, there very well could be additional security areas you need to cover. Have you read the Nutanix security guide? Make sure every "i" is dotted and every "t" crossed.
10.0 Management Components
The control plane for your solution is very important. Don't make management overly complex, as one of the beauties of a Nutanix solution is simplicity. Things to consider here include:
- Prism configuration
- Nutanix patches and AOS upgrades
- How will you monitor Nutanix?
- What about OS patches?
- Hypervisor patching
- What tools are you using to monitor the network?
- What about monitoring the hypervisor?
- You of course have VMs, so what tools are monitoring them?
In the management arena don't forget about advanced automation tools such as Puppet, Chef, PowerShell, and Nutanix Calm. What are you using for Syslog? Splunk? Are you using Prism Central or Prism Pro? If so, why, or if not, why not?
11.0 Virtual Machine Design
As called out in the NPX blueprint, you must include your virtual machine design. What is that? It should cover topics such as VM templates and VM virtual hardware. Are you making use of SCSI UNMAP or not, and what are the implications of using or not using it? What's the difference between Linux and Windows UNMAP? How about VM affinity or anti-affinity rules? What is the lifecycle of your VMs (cradle to grave)? Do you have monster VMs? What is your NUMA boundary and do any VMs cross it? What are the implications of NUMA?
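To make the NUMA question concrete, here is a minimal sketch of how you might flag VMs that cross a NUMA boundary. It assumes one NUMA node per socket with memory split evenly across sockets, and it ignores hyperthreading, CVM reservations, and hypervisor-specific vNUMA settings. The `Host` and `VM` classes and the sample sizes are hypothetical, purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    sockets: int
    cores_per_socket: int
    memory_gib: int

    @property
    def numa_memory_gib(self) -> float:
        # Assumption: one NUMA node per socket, memory evenly distributed.
        return self.memory_gib / self.sockets


@dataclass
class VM:
    name: str
    vcpus: int
    memory_gib: int


def crosses_numa_boundary(vm: VM, host: Host) -> bool:
    """True if the VM cannot fit inside a single NUMA node of the host."""
    too_many_vcpus = vm.vcpus > host.cores_per_socket
    too_much_memory = vm.memory_gib > host.numa_memory_gib
    return too_many_vcpus or too_much_memory


if __name__ == "__main__":
    host = Host(sockets=2, cores_per_socket=12, memory_gib=512)
    for vm in (VM("web01", 4, 16), VM("sql01", 16, 128), VM("cache01", 8, 300)):
        status = "crosses" if crosses_numa_boundary(vm, host) else "fits within"
        print(f"{vm.name} {status} a single NUMA node")
```

A list like this is a handy starting point for the "monster VM" discussion: every VM that gets flagged deserves a sentence in your guide on why it is sized that way and what the performance implications are.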
12.0 Data Protection and Recoverability
What good is your solution if it's not protecting your business critical data and ensuring that you can recover from a disaster? Here think about covering:
- Backup software (NetBackup, Veeam, HYCU, etc.)
- Nutanix data protection (protection domains, replication, snapshot frequency, sync/async replication, etc.)
- Network configuration protection (e.g. nightly switch config backups)
- Storage protection
- Hypervisor control plane backups
- How to protect critical infrastructure services like AD/DNS
- What is your VM backup frequency?
- What are your operational procedures for areas such as change control, patch management, and business continuity?
13.0 Datacenter Infrastructure
To many people the datacenter infrastructure can be a scary topic, and it is often only lightly covered in the architecture guide. But it's critical. What if you miscalculate (or don't calculate at all) the heat load for your proposed solution and it melts down in the rack or causes a fire? What does your rack elevation look like? Did you allow space in the rack for logically placing new nodes? What is the datacenter rating for the maximum heat load of an individual rack? What types of PDUs are you using and how many? Did you adjust your estimated amp usage for the power factor? Are you running your PDUs at more than 80%, sustained, during a failure situation? What is your datacenter facility rated for in terms of downtime per year? And is that downtime planned or unplanned? How many more nodes can you add to the rack before you exceed the rated limits (cooling, power, or weight)? Is your solution going to fall through the floor because you didn't validate your assumption about the maximum load rating (did you even ask)?
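The power and heat questions above come down to some simple arithmetic, which can be sketched as follows. This assumes single-phase power, where amps = watts / (volts x power factor), and uses the standard conversion of 1 W to roughly 3.412 BTU/hr. The function names and the sample numbers (4,000 W rack load, 208 V, 0.95 power factor, 30 A PDUs) are hypothetical; substitute the figures from your own vendor spec sheets and facility.

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # 1 watt is approximately 3.412 BTU/hr
SUSTAINED_LOAD_LIMIT = 0.80    # the 80% sustained-load guideline from the text


def heat_load_btu_per_hour(total_watts: float) -> float:
    """Heat the rack must shed, assuming all power drawn becomes heat."""
    return total_watts * WATTS_TO_BTU_PER_HOUR


def amps_drawn(total_watts: float, volts: float, power_factor: float) -> float:
    """Single-phase current, adjusted for power factor."""
    return total_watts / (volts * power_factor)


def pdu_utilization(total_watts: float, volts: float, power_factor: float,
                    pdu_rated_amps: float, surviving_pdus: int) -> float:
    """Fraction of rated PDU capacity in use when only some PDUs survive."""
    amps = amps_drawn(total_watts, volts, power_factor)
    return amps / (pdu_rated_amps * surviving_pdus)


if __name__ == "__main__":
    watts, volts, pf = 4000.0, 208.0, 0.95  # hypothetical rack load
    util = pdu_utilization(watts, volts, pf, pdu_rated_amps=30.0, surviving_pdus=1)
    print(f"Heat load: {heat_load_btu_per_hour(watts):.0f} BTU/hr")
    print(f"Current:   {amps_drawn(watts, volts, pf):.1f} A")
    print(f"PDU utilization with one PDU failed: {util:.0%}")
    print("Within limit" if util <= SUSTAINED_LOAD_LIMIT else "OVER 80% LIMIT")
```

Run the same calculation for the rack as it stands today and again for the fully populated rack, so you can state exactly how many nodes you can add before hitting a limit.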
14.0 Third-Party Integration
Another scoring area for the NPX is your ability to cover third-party integrations. What does that mean? For the NPX, that's any non-Nutanix product which you include in your solution. I recommend a separate section, even if you have touched on these solutions throughout your guide. Why? It makes finding them easier, and your panelists will like that. The areas you will cover here are highly solution dependent, so you may have fewer, more, and likely different products to cover than I did. My solution used Splunk, NetBackup, VMware vSphere, and also VDI.
Sample Design Decision Table
Throughout your architecture guide you absolutely must thoroughly document your major design decisions. How many design decisions you will have totally depends on your solution, and how thorough you want to be. In my case I had 60 design decisions, and each one was captured using the template above. Then I placed each full design decision table in the appropriate major section of the guide (e.g. networking, security, etc.). At the front of my architecture guide I had another table, consisting of one line per design decision, for easy reference.
Now this design decision table is not "perfect" and in fact I would argue it needs supplementation. I could, and should, have done it better. But first let's start with what's in the table, and then what I left out which I think you should have in it. First, you need to label and capture a one sentence description of the decision. For example, are you using the VMware standard switch or distributed switch? Next, every single decision has an impact. What is that impact? Describe it. Nearly every decision has a risk. What is it? And every risk needs a mitigation, so what is that?
Now, what do I think you should include that I didn't? "Alternative design decisions". Why do I think you need alternative design decisions for nearly EVERY decision? Because you will likely be asked about them during your defense. For example, let's say Design Decision 40 was to use LACP. Ok, that's fine and dandy, but what are the alternative(s) to using LACP and why didn't you use them? Or, what if you chose the NX-3060-G6 node for your baseline node type? What would be an alternative node type that could also work? These are EXACTLY the types of questions you need to be prepared for during your live defense. By thinking about them beforehand AND documenting them in your guide, you are that much closer to successfully defending your NPX design.
So yes, IMHO, every single design decision you have should be documented with: impact, risks, risk mitigation, and alternatives.
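If you track your design decisions in a structured form before laying them out as tables, a quick completeness check becomes trivial. The sketch below is one hypothetical way to model a decision record with the fields discussed above; the class, field names, the `DD40` label format, and the sample text are my own illustration, not a required NPX format.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DesignDecision:
    identifier: str                      # e.g. "DD40" (hypothetical label format)
    description: str                     # one sentence describing the decision
    impact: str = ""
    risks: List[str] = field(default_factory=list)
    mitigations: List[str] = field(default_factory=list)
    alternatives: List[str] = field(default_factory=list)


def missing_parts(decision: DesignDecision) -> List[str]:
    """Names of any fields left empty, in declaration order."""
    return [f.name for f in fields(decision) if not getattr(decision, f.name)]


if __name__ == "__main__":
    # Sample content is invented purely to exercise the check.
    dd40 = DesignDecision(
        identifier="DD40",
        description="Use LACP for host uplinks.",
        impact="Requires matching port-channel configuration on the switches.",
        risks=["Switch misconfiguration can isolate a host."],
        mitigations=["Validate port-channel configuration during build-out."],
    )
    print(dd40.identifier, "is missing:", missing_parts(dd40))
```

Here the check reports `alternatives` as missing, which is exactly the gap you want to catch before your panelists do.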
Sample Assumptions Table
You might be thinking, what's so special about an assumption table? Don't you just capture all of your assumptions and call it a day? NOPE! Epic fail! You need to validate each assumption, if at all possible, and document how you will validate it.
Sample Summary Table
Now this table I think is optional, but I included it for both my VCDX and NPX designs. At the end of each major tech section (e.g. 4.0 - 14.0) I have a section called "Summary and Design Decisions". The summary is the table above, which captures and quickly displays all of the referenced requirements, constraints, assumptions, risks, and design qualities covered in that major section. I think of the table as a 'double checking' that I've covered all of the requirements, constraints, assumptions and risks applicable to that major section. Is this table required? Nope. Do I like it? Yup. Should you use it? Totally up to you.
Additional Architecture Guide Tips
One of the hallmarks of an "X" level (VCDX/NPX) architecture guide is traceability. What is that? It means that every labeled item (requirements, constraints, assumptions, risks, design decisions) needs to be called out at least ONCE in the main body of your document. NPX examiners DO use the search functionality quite often to see, for example, if risk RS05 is actually addressed in your document. As you write your guide, and as a final QA, take an afternoon and search for every single labeled item and MAKE DARN SURE it's referenced in the body of your guide.
Another tip that I find exceptionally helpful for starting a new architecture guide is this: First construct the outline of all the areas in the NPX (or VCDX) blueprint as major sections (e.g. network design, storage design, security and compliance, etc.). Then under each major heading, just like I have, construct your sub-headings of conceptual, logical and physical. Then at the end of each major section have a design justification summary, and then your summary table and design decision tables. After you do all of this 'pre' work, you will have a nicely outlined guide that you can now start filling in with details. Easy, right?
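If you can export the body of your guide as plain text, the traceability search can be partly automated. This is only a rough sketch: apart from `RS`, which matches the RS05 example above, the label prefixes (`RQ`, `CN`, `AS`, `DD`) and the two-digit numbering are assumptions, so adjust the pattern to whatever labeling scheme your guide actually uses.

```python
import re
from typing import Iterable, List

# Assumed label scheme: a two-letter prefix followed by two digits (e.g. RS05).
LABEL_PATTERN = re.compile(r"\b(?:RQ|CN|AS|RS|DD)\d{2}\b")


def unreferenced_labels(defined_labels: Iterable[str], body_text: str) -> List[str]:
    """Labels defined in your tables that never appear in the body text."""
    referenced = set(LABEL_PATTERN.findall(body_text))
    return sorted(set(defined_labels) - referenced)


if __name__ == "__main__":
    # Hypothetical labels and body text for illustration.
    defined = ["RQ01", "RS05", "AS03"]
    body = "Dual-homed uplinks mitigate RS05 and satisfy RQ01."
    print("Not referenced in body:", unreferenced_labels(defined, body))
```

A script like this does not replace reading your own document, but it makes the final QA pass a lot faster than searching for 100+ labels by hand.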
Applicable to VCDX?
So you may be thinking, well thanks for all of the tips for a successful NPX architecture guide, but does this apply to the VMware VCDX certification or other enterprise architect certs? And the answer is ABSOLUTELY! In fact, I used the *exact* same format for my VCDX certification and it was accepted and successfully defended on my first attempt. If it's good enough for NPX/VCDX, is it good enough for customer-facing docs? The answer here is also a resounding yes.
The tips I've provided here are for an enterprise level architecture guide, and "X" level certifications like VCDX and NPX are very similar in the skillset they attempt to assess. So can you take your NPX architecture guide, if based on vSphere, and submit it for VCDX? With only minor modifications to ensure you cover VCDX blueprint areas, the answer is yes! I did the reverse: I started out with a VCDX-level design, added Nutanix blueprint areas, and submitted it for the NPX.
Conclusion
As you can see through these two long blog posts, an "X" (expert) level architecture guide can be a monster. It covers a lot of areas, needs full traceability, must be concise and easy to read, must be organized so that the examiners can easily score it, and must also be technically accurate. It also needs enough depth to be considered "X" (expert) level. Although I will emphasize that there's no minimum page count, I would find it very hard to believe that something as short as, for example, 30 pages for an enterprise architecture guide would pass muster.
If you follow my advice in these two posts, it should get you well on your way to having a well organized, detailed, and easy to read/score architecture guide for your NPX (or VCDX) defense.