VMworld 2012: Avoiding 19 Biggest HA & DRS Mistakes INF-VSP1232

The speaker was Greg Shields, a partner at Concentrated Technology. This session focused on the main HA/DRS mistakes people make when virtualizing their infrastructure. Greg is a great speaker who has also presented at TechEd and other conferences. HA/DRS settings are so easy to set and forget, or to forget to set, that everyone should review all 19 mistakes and make sure they aren’t making them in their own environment.

  • HA/DRS solves two problems: protection from unplanned downtime, and load balancing and defragmentation of resources.
  • A large number of environments have HA/DRS settings configured incorrectly.
  • Mistake #1: Not planning for HW evolution.
    • vMotion requires similar processors.
    • Always set EVC mode
  • Mistake #2: Not planning for svMotion
    • VMs cannot have snapshots
    • VM disks must be persistent mode or RDMs
    • Host must have sufficient resources to support two instances of the VMs running concurrently.
    • Must be licensed and correctly configured with vMotion
    • Host must have access to both source and target datastores
  • Mistake #3: Not Enough Cluster Hosts
    • HA failover requires additional “wasted” hardware resources
    • Must plan for cluster reserve
    • A fully prepared cluster must set aside one full server’s worth of resources in preparation for HA.
    • Enable admission control. This is super important: it disallows starting VMs when resources are exhausted.
    • Set host failures cluster tolerates to 1 (or more). This ensures you always have at least one host’s worth of resources available.
  • Mistake #4: Setting Host Failures the Cluster Tolerates to 1
    • Not all your VMs are priority one
    • Some VMs can stay down if a host dies
    • With the percentage policy, you can reserve less than one server’s worth of resources, since not all VMs need to restart if a host fails.
  • Mistake #5: Forgetting to Prioritize VM restart
    • VM restart priority is one of those oft-forgotten settings
    • Comes into play when the Percentage policy is enabled
    • Restart policy is per-host
    • Per-VM settings must be configured for each VM
    • This can create a problem down the road, as VMs may restart in the wrong order
  • Mistake #6: Disabling Admission Control
    • Many young admins may turn it off and forget about it
    • Never disable admission control!!!
  • Mistake #7: Not updating Percentage Policy
    • Needs to be adjusted as your cluster size changes
    • Host failures the cluster tolerates needs no adjusting
  • Mistake #8: Buying (the occasional) Big Server
    • Host failures the cluster tolerates sets aside the amount of resources needed to protect every server.
    • It must set aside resources equal to the biggest server in the cluster
  • Mistake #9: Neglecting Host Isolation Response
    • Current recommendation is to leave powered on
    • On converged networks you may not want to leave VMs powered on
    • Heartbeat datastores – Adds redundancy
  • Mistake #10: Assuming that Datastore heartbeats Prevent isolation Events
    • Master determines the state of the unresponsive host
    • Isolation response is triggered by the slave
  • Mistake #11: Confusing your APD with your PDL
    • An All Paths Down (APD) scenario exists when all communication is severed between host and device
    • I/O is then queued until a SCSI response code officially reports the link is down
    • This can lead to infinite queuing of device I/O
    • A Permanent Device Loss (PDL) scenario exists when the host can see the device target but the target isn’t listening
      • Lets the host recognize the I/O
    • APD is a more common scenario and APD will not trigger vSphere HA
    • Look at new settings in 5.0 U1 and 5.1 – Most handy for metro clusters
  • Mistake #12: Overdoing Reservations, limits, and Affinities
    • HA may not consider these “soft affinities” at failover
    • Consider using shares over reservations and limits
      • Less impact on DRS and thus HA
  • Mistake #13: Considering Using Shares without Considering using Shares
    • Shares are only considered during periods of contention
    • But setting shares on resource pools can have unexpected results
    • Don’t treat resource pools like folders
  • Mistake #14: Doing memory limits at all
    • Don’t assign memory limits. Ever.
    • Limit memory as close to the application as possible (such as in the SQL app)
  • Mistake #15: Thinking you are smarter than DRS
    • Not using fully automated mode
  • Mistake #16: Not understanding DRS’ Equations
    • Every 5 minutes a DRS interval is invoked
    • Takes into account VM entitlements, host capacity
  • Mistake #17: Being too liberal with your migration threshold
    • Priority 1 recommendations are mandatory
  • Mistake #18: Combining VDI and Server Workloads in the same cluster
    • ESXi hosts running VDI workloads tend to experience more load than those running server workloads.
    • VDI forces DRS to work harder and more often
    • Create separate clusters for VDI and everything else
  • Mistake #19: Planning on Overcommit
    • Overcommit creates extra work for the hypervisor
    • Assign the right amount of memory to your VMs
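
The reserve arithmetic behind Mistakes #3, #7, and #8 can be sketched in a few lines of Python. This is a simplified capacity calculation, not vSphere's actual admission control algorithm; the function name `reserve_percent` and the GHz figures are hypothetical and chosen only for illustration. It estimates what percentage of the cluster would have to be held back to absorb the loss of its largest host(s).

```python
def reserve_percent(host_capacities, failures_to_tolerate=1):
    """Return the percentage of total cluster capacity that must stay free
    to absorb the loss of the largest `failures_to_tolerate` hosts.

    Simplified illustration only: real HA admission control also accounts
    for reservations, slot sizes, and memory, which are ignored here.
    """
    if not host_capacities:
        raise ValueError("cluster has no hosts")
    if not 0 < failures_to_tolerate < len(host_capacities):
        raise ValueError("failures_to_tolerate must be between 1 and hosts - 1")
    total = sum(host_capacities)
    # Worst case: the biggest hosts are the ones that fail.
    lost = sum(sorted(host_capacities, reverse=True)[:failures_to_tolerate])
    return 100.0 * lost / total


# Hypothetical clusters, capacities in GHz.
print(reserve_percent([20, 20, 20, 20]))  # 25.0
print(reserve_percent([20] * 8))          # 12.5
print(reserve_percent([20, 20, 20, 60]))  # 50.0
```

With equal hosts the required percentage shrinks as the cluster grows, which is why a fixed percentage policy needs revisiting whenever hosts are added or removed. A single oversized host pushes the required reserve up for the whole cluster.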
