PNOMA – A vSAN Troubleshooting Framework
Many questions revolve around the different layers of the vSAN back-end architecture, and around how to isolate a problem with vSAN or approach vSAN troubleshooting in general. This article breaks the architecture down into simple layers so that you can isolate, and potentially resolve, a problem in a vSAN environment.
The approach to troubleshooting a problem can be categorized into five layers, similar to the OSI model of networking: the Application layer, Management layer, Object layer, Network layer and Physical layer. We will see examples of how issues can be tagged to each layer, which will help you find the source of a problem and reach a solution quickly. You may not always be able to find the solution yourself; however, being able to categorize the issue under one of these layers and explain it accurately to the Technical Support team helps them narrow down the problem quickly and fix it.
The PNOMA architecture includes certain critical components of vSAN that help to isolate and troubleshoot vSAN problems; however, there are a few other components that revolve around these and depend on one or more of them.
This framework consists of certain key elements such as the vCenter Server (VPXD, vSAN health, etc.), VPXA and HOSTD services. These services are categorized in the application layer because all interactions between users and the vSAN back-end components happen over this layer, either through the hosts directly or through vCenter Server. Examples include virtual machine creation, VMDK creation or deletion, storage policy creation or modification, enabling vSAN features and services, vSAN health monitoring, disk group (DG) creation and fault domain (FD) creation. Users perform these tasks primarily through vCenter Server, either from the Web Client or from RVC (Ruby vSphere Console). The tasks translate to actions at the back end on individual hosts: the vCenter API passes them to the VPXA agent running on each host, and they are in turn translated to actions on the HOSTD service (running on the ESXi hosts), which invokes the libraries required to complete the task.
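To make the task path concrete, here is a minimal Python sketch that walks the hops described above and reports the first one that is not healthy. The hop names come from the text; the function `first_failing_hop` and the `status` dictionary are hypothetical helpers for illustration only, and how you actually determine whether each service is healthy is left out.

```python
# Hops a task crosses in the application layer, in order (names from the article).
VCENTER_PATH = ["vCenter Server (VPXD)", "VPXA", "HOSTD"]
# When the user works against the host directly, vCenter and VPXA are not involved.
DIRECT_HOST_PATH = ["HOSTD"]


def first_failing_hop(status, via_vcenter=True):
    """Return the first hop reported unhealthy, or None if all hops are healthy.

    `status` is a hypothetical dict of hop name -> bool that you fill in
    from your own checks; a missing hop is treated as unhealthy.
    """
    path = VCENTER_PATH if via_vcenter else DIRECT_HOST_PATH
    for hop in path:
        if not status.get(hop, False):
            return hop
    return None


if __name__ == "__main__":
    observed = {"vCenter Server (VPXD)": True, "VPXA": False, "HOSTD": True}
    print(first_failing_hop(observed))
```

If a task fails through vCenter but the same kind of task works against the host directly, a walk like this points at the vCenter or VPXA hop rather than at vSAN itself.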
Once a task is received at the host layer, and if it is related to vSAN, the host invokes the required vSAN libraries and daemons to carry it out. Some of these are DISKLIB, OSFSD, VSANVPD and a few others.
Why are we categorizing these services under management? These are the critical services needed to create, modify and delete an object within vSAN; if any of them does not function correctly, we will not be able to create, modify or delete objects.
- DISKLIB: DISKLIB's job is to invoke disk creation according to the type of datastore we choose, which can be VMFS, NFS, vSAN or vVol. Since we are discussing vSAN here, it invokes a vSAN object creation (vdisk, namespace, vswap, vmem, etc.).
- OSFSD, or OSFS daemon, is responsible for object creation and query tasks within the vSAN filesystem.
- vSANVPD, or vSAN VASA Provider Daemon: vSAN uses the vsanvpd service to expose its SPBM capabilities such as RAID, fault tolerance, object space reservation, striping, etc. The ESXi hosts (nodes) that are part of the vSAN cluster run the VASA provider and expose it to vCenter Server over port 8080 so that vCenter can understand all the features and capabilities of vSAN. If the vSANVPD services are down, you will not be able to create new VMs or change policies for existing virtual machines. See Troubleshooting Guide for vSAN VASA providers
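Since the VASA provider is exposed over port 8080, one quick first check for this layer is whether that port answers at all from the machine where vCenter runs. The sketch below is a generic TCP reachability test using only the Python standard library; the function name `vasa_port_reachable` and the host name in the usage comment are made up for illustration. A successful connect only tells you something is listening on the port, not that vsanvpd itself is healthy.

```python
import socket


def vasa_port_reachable(host, port=8080, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This is only a reachability test; it does not speak the VASA protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage (hypothetical host name, replace with one of your ESXi nodes):
# print(vasa_port_reachable("esxi01.example.local"))
```

If the port is not reachable from vCenter but the service is running on the host, the problem is more likely a firewall or network path issue than the daemon.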
The object framework is all about the object life cycle within the vSAN filesystem. The key components involved in a vSAN object's life cycle are DOM, CLOM and CMMDS. These components are responsible for creating objects with the specific configuration defined through SPBM and for validating whether that configuration can actually be satisfied. Once an object is created within vSAN, all the hosts have to be updated about the object type, the owner of the object, its policy/layout and the child components associated with it.
- Distributed Object Manager (DOM): The DOM is responsible for creating the components and distributing them across the cluster. Once a DOM object is created, one of the nodes (hosts) is nominated as the DOM owner for that object, and this host is responsible for handling all I/O to that DOM object (e.g. vdisk, snapshot, vmnamespace, vmswap, vmem, etc.) by locating the respective child components across the cluster and redirecting the I/O to them over the vSAN network.
- Cluster Monitoring, Membership and Directory Services (CMMDS): The purpose of CMMDS is to discover and maintain the vSAN cluster. It stores metadata such as policies and RAID configuration for all objects within vSAN.
- Cluster Level Object Manager (CLOM): Given a storage policy, CLOM checks whether there are enough disk groups to satisfy that policy. CLOM is the brain that decides which components and witnesses need to be created and where they need to be placed in the cluster.
- Reliable Datagram Transport (RDT): RDT is the protocol used by vSAN for communication between hosts over the vSAN vmkernel ports (CMMDS traffic, I/O flow, etc.). It is optimized to send very large files.
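To get a feel for the kind of question CLOM answers, here is a toy feasibility check in Python. It is not CLOM's actual algorithm: it only covers RAID-1 mirroring and relies on the commonly cited rule that tolerating n failures needs n+1 replicas plus n witnesses, so 2n+1 hosts (or fault domains). The function names and the idea of passing in a simple count of usable hosts are assumptions made for the example.

```python
def min_hosts_for_mirroring(failures_to_tolerate):
    """Hosts needed for RAID-1 mirroring: (n + 1) replicas + n witnesses = 2n + 1."""
    if failures_to_tolerate < 0:
        raise ValueError("failures_to_tolerate must be 0 or greater")
    return 2 * failures_to_tolerate + 1


def policy_can_be_satisfied(failures_to_tolerate, usable_hosts):
    """Toy check: are there enough usable hosts to place the mirrored object?

    `usable_hosts` is a hypothetical count of hosts that have a healthy disk
    group with enough free space; the real placement logic considers much more.
    """
    return usable_hosts >= min_hosts_for_mirroring(failures_to_tolerate)


if __name__ == "__main__":
    print(policy_can_be_satisfied(1, 3))  # mirroring with one failure tolerated, 3 hosts
    print(policy_can_be_satisfied(2, 4))  # two failures tolerated, only 4 hosts
```

When object creation or a policy change fails with a "cannot satisfy policy" style of error, this is the layer to look at first.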
This layer is where data traverses and resides within the vSAN filesystem, hence the name physical layer. The important components that constitute the physical layer are LSOM, the cache tier drives and the capacity tier drives.
- Local Log-structured Object Manager (LSOM): The LSOM is responsible for locally storing data on the vSAN filesystem as a vSAN component, or LSOM object (data component or witness component). These objects are created on top of the capacity tier drives depending on the geometry size advised by CLOM. LSOM also includes PLOG and LLOG, which store the metadata for the vSAN Virsto filesystem and the de-dupe metadata.
- Cache Tier drive: As the name suggests, the sole purpose of this drive is to help I/O traverse faster within vSAN. These are generally faster, higher-endurance SSD/NVMe drives that cache both reads and writes in a hybrid vSAN cluster and are dedicated to writes in an All-Flash environment.
- Capacity Tier drive: These drives store the LSOM objects created within the vSAN filesystem, and they directly service reads in an All-Flash environment. When de-dupe and compression is enabled, it is executed while the data gets de-staged from the cache tier to the capacity tier.
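Putting the layers together, the tagging idea can be sketched as a simple lookup in Python. The component names come from this article; the `layer_of` helper is hypothetical, and placing RDT under the Network layer is my assumption here, based on it being the transport used over the vSAN vmkernel ports.

```python
# Components named in this article, grouped by PNOMA layer.
PNOMA_LAYERS = {
    "Physical": ["LSOM", "Cache Tier drive", "Capacity Tier drive"],
    "Network": ["RDT"],  # assumption: RDT treated as the network layer component
    "Object": ["DOM", "CLOM", "CMMDS"],
    "Management": ["DISKLIB", "OSFSD", "vSANVPD"],
    "Application": ["vCenter Server", "VPXD", "vSAN health", "VPXA", "HOSTD"],
}


def layer_of(component):
    """Return the PNOMA layer for a component name (case-insensitive), or None."""
    wanted = component.strip().lower()
    for layer, components in PNOMA_LAYERS.items():
        if wanted in (name.lower() for name in components):
            return layer
    return None


if __name__ == "__main__":
    for name in ["clom", "OSFSD", "hostd", "RDT", "LSOM"]:
        print(name, "->", layer_of(name))
```

Once the failing component is known, naming its layer is usually enough to describe the problem clearly to the support team.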
I am planning to write more articles in the near future that illustrate different issues and categorize them under PNOMA, which should give us a better understanding. Please make sure to follow the blog for more content.