Cluster Service Component Architecture

Let's start learning together; I know it's going to be fun.

Please go through this design doc first. It was initially published on Jan 1, 1998, and it describes the MSCS features and the standards they are built on.

The core of the cluster service consists of dependent, modular components:
  • The majority of the components are combined into a single process – clussvc.exe
  • Some work is delegated outside of the process to the partition manager, the cluster disk driver and the cluster network driver
  • Cluster resources run in the rhs.exe process (Resource Hosting Subsystem)
  • Cluster management APIs and COM components control interactions among all the cluster components, the cluster resources and the cluster management interface

The cluster service (clussvc.exe) architecture consists of three tiers:

Top Tier involves Cluster Abstractions
  1. Examples are cluster nodes, groups and resources
  2. Includes resource management: tracking the state of resources, failure management and the response to failure conditions
Middle Tier involves Cluster Operations
  1. Defines the inner workings of the cluster
  2. Membership management, regroup operations and maintaining a consistent cluster configuration across all nodes
Bottom Tier involves Operating System Interaction
  1. Here the cluster interfaces with the OS
  2. Works with the partition manager, cluster disk driver (clusdisk.sys), cluster network driver (NetFT.sys), file system (CSVFS.sys), security, network interface management, etc.
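
To make the tiers a little more concrete, here is a tiny, purely illustrative Python snippet that maps each tier to the example components named above. It is only a reading aid; the groupings come from this post, not from any public header or SDK.

# Illustrative only: the three tiers of clussvc.exe and example components in each.
CLUSSVC_TIERS = {
    "Top Tier (Cluster Abstractions)": ["nodes", "groups", "resources", "failure handling"],
    "Middle Tier (Cluster Operations)": ["membership management", "regroup", "consistent configuration"],
    "Bottom Tier (OS Interaction)": ["clusdisk.sys", "NetFT.sys", "CSVFS.sys", "security", "network interfaces"],
}

for tier, examples in CLUSSVC_TIERS.items():
    print(f"{tier}: {', '.join(examples)}")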


Messaging

  • Provides all communication primitives within the cluster
        * Facilitates reliable point-to-point communication paths between all nodes in the cluster
  • Messaging components use both unicast and multicast communications in a Request-Reply format
        Note: these replace RPC communications
  • Relies on Object Manager for routing data
        * Object Manager tracks which node in the cluster owns which object(s), allowing messages to be routed to the right node for processing
  • Unicast Component
        * Reliable Unicast: a ‘TCP-like’ connection to a single node
  • Multicast Component
  1. Raw Multicast: unicast messages sent to multiple nodes in the cluster
  2. Good Enough Multicast (GEM): (see the toy sketch after this list)
        * A multicast interface that enforces virtual synchrony
        * If a message is delivered to one cluster node in the current view, it should be delivered to all nodes in the view
        * GEM messages include sequence numbers and view IDs (GID) for tracking
        * Retransmits messages if needed and confirms deliveries as part of a GEM Repair function, which is reported to Membership Manager so regroup processing can complete
  • Causal Multicast: enforces causal ordering of all multicast messages
  1. Ensures related messages are delivered to a node in the correct order
  2. Timestamps are used to make delivery decisions
  • Multicast Request-Reply (MRR): replaces RPC communications; tracks all messages using a request ID
        * Uses GEM if a repair action is needed
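
To illustrate the GEM behaviour described in the list above, here is a toy Python sketch, not the actual NetFT/clussvc wire protocol: every multicast carries a view ID and a sequence number, receivers acknowledge what they have seen, and the sender retransmits to any member of the current view that is missing a message (the repair step).

# Toy sketch of the GEM idea only; names and structure are illustrative.
from dataclasses import dataclass, field

@dataclass
class GemSender:
    view_id: int                               # identifier of the current membership view
    members: set                               # node IDs in that view
    next_seq: int = 1
    sent: dict = field(default_factory=dict)   # seq -> payload, kept for repair
    acked: dict = field(default_factory=dict)  # node -> highest sequence acknowledged

    def multicast(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.sent[seq] = payload
        for node in self.members:
            self._send(node, seq, payload)     # best-effort per-node send
        return seq

    def on_ack(self, node, seq):
        self.acked[node] = max(self.acked.get(node, 0), seq)

    def repair(self):
        # Virtual synchrony: a message delivered to one node in the view must
        # reach every node in the view, so re-send anything a member is missing.
        for node in self.members:
            for seq in range(self.acked.get(node, 0) + 1, self.next_seq):
                self._send(node, seq, self.sent[seq])

    def _send(self, node, seq, payload):
        print(f"view={self.view_id} seq={seq} -> node {node}: {payload}")

g = GemSender(view_id=7, members={1, 2, 3})
g.multicast("update A")
g.on_ack(1, 1); g.on_ack(2, 1)   # node 3 never acknowledged
g.repair()                        # retransmits seq 1 to node 3 only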


Object Manager
  1. Establishes and maintains an in-memory relational database of all cluster objects
  2. Objects are loaded into memory when the cluster service starts
  3. Object Manager knows which nodes own which objects
  4. Provides routing information to Messaging so that messages pertaining to cluster objects are routed to the owning nodes
  5. The GUM update process ensures that the Object Manager database is consistent across all nodes in the cluster
  6. Examples of objects include: Resources, Groups, Networks, etc.
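
A minimal sketch of the routing role described above, assuming a simple in-memory table; the class and method names are made up for illustration and are not the real Object Manager API.

# Illustrative only: an in-memory table of cluster objects and their owning node,
# which Messaging could consult to pick the destination for a request.
class ObjectManager:
    def __init__(self):
        self._owners = {}                          # (object_type, name) -> owning node ID

    def register(self, obj_type, name, owner_node):
        self._owners[(obj_type, name)] = owner_node

    def owner_of(self, obj_type, name):
        return self._owners[(obj_type, name)]

om = ObjectManager()
om.register("Group", "SQL Group", owner_node=2)
om.register("Resource", "Cluster IP Address", owner_node=1)

# Messaging would route a request about "SQL Group" to node 2:
print(om.owner_of("Group", "SQL Group"))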

Host Manager

  1. Listens for incoming network connections from cluster members (these connections are established on port 3343)
  2. Confirms all raw inbound network connections and then hands them off to the Security component
  3. The Secure Messaging Library encrypts or signs, validates, and authorizes the connection (encryption is optional; signing is the default)
  4. After the connection is secured, it is routed to the network driver so the NetFT route table can be built (updated)
  5. Host Manager hands connections off to the Membership Manager
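
A hedged sketch of that hand-off sequence, written in Python for readability; the function names are placeholders and the real Host Manager is native code inside clussvc.exe.

# Sketch of the flow only: listen on 3343, secure/authorize, then hand off.
import socket

CLUSTER_PORT = 3343   # the intra-cluster listener port mentioned above

def host_manager_accept_loop(secure, hand_to_membership):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("0.0.0.0", CLUSTER_PORT))
    listener.listen()
    while True:
        conn, peer = listener.accept()          # 1. raw inbound connection
        secured = secure(conn, peer)            # 2. sign/encrypt, validate, authorize (Security component)
        if secured is not None:
            hand_to_membership(secured, peer)   # 3. Membership Manager / JPM takes over
        else:
            conn.close()                        # reject connections that fail validation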

Membership Manager
  • Responsible for arriving at a consensus, or common view, of the cluster membership
  1. Uses a multi-stage regroup algorithm to arrive at a common view
  2. Uses a gossip protocol to determine connectivity to the nodes
  3. Responsible for handling node join, node failure, intra-cluster communication failures and full-mesh topology failures
          * Initiates the regroup process on failures using the Regroup Protocol (RGP)
  • Receives secured inbound connections from Host Manager, which are then processed by the Join Policy Manager (JPM)
  1. JPM negotiates with all nodes in the view to determine which partition contains a majority of members and will therefore survive as the view of the cluster
  2. Members not in the surviving partition must re-join the formed cluster
  • Membership Manager maintains the Node objects in Object Manager
           * Initial node objects are pulled from the registry and are updated based on what JPM determines is the surviving partition
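
The JPM decision described above boils down to a majority rule; here is a toy model of that rule (not the real regroup protocol, just the outcome it converges on).

# Illustrative only: given the partitions that can still talk to each other,
# the partition holding a majority of the configured nodes survives as the
# cluster view; everyone else must re-join.
def surviving_partition(all_nodes, partitions):
    majority = len(all_nodes) // 2 + 1
    for partition in partitions:
        if len(partition) >= majority:
            return partition
    return None          # no partition has a majority -> no surviving view

nodes = {"N1", "N2", "N3", "N4", "N5"}
print(surviving_partition(nodes, [{"N1", "N2", "N3"}, {"N4", "N5"}]))   # {'N1', 'N2', 'N3'}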

Global Update Manager
  • Primary mechanism for replicating updates to the cluster configuration to all nodes in the membership view (received from Membership Manager)
  • All updates are serialized and atomic
  1. All updates are applied in the same order on all nodes
  2. If an update is successful on one node, it is assumed it will be successful on all nodes
  3. An update failure on a (non-locker) node results in that node being removed from the membership and its cluster service being terminated
  4. After a restart of the service, the node must re-join the membership view
  • GUM depends on Messaging to implement an ordered, atomic Request-Reply sequence for all updates (see the Messaging section above)
  • GUM updates are only permitted when the node owns the lock
  1. Causal Multicast is used to transfer lock ownership
  2. MRR is used for GUM updates
  3. GEM is used for retransmits of GUM update messages as needed
  • GUM is responsible for transferring cluster state to all joining members
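
A small sketch of the ordering and eviction behaviour described above, with illustrative names only: the locker applies the update first, every other node applies it in the same sequence, and a node that cannot apply it leaves the membership.

# Toy model of the GUM property, not the actual implementation.
class GumNode:
    def __init__(self, name):
        self.name = name
        self.applied_seq = 0
        self.config = {}

    def apply(self, seq, key, value):
        if seq != self.applied_seq + 1:
            raise RuntimeError(f"{self.name}: update {seq} out of order")
        self.config[key] = value
        self.applied_seq = seq

def evict(node):
    print(f"{node.name} evicted; its cluster service restarts and must re-join")

def gum_update(locker, others, seq, key, value):
    locker.apply(seq, key, value)            # locker applies first
    for node in others:
        try:
            node.apply(seq, key, value)      # same order on every node
        except Exception:
            evict(node)                      # a failed node leaves the membership

a, b, c = GumNode("A"), GumNode("B"), GumNode("C")
gum_update(a, [b, c], 1, "QuorumModel", "NodeMajority")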

Resource Control Manager
 
  1. External to the main cluster infrastructure but responsible for the proper functioning of infrastructure components
  2. Implements failover mechanisms and policies for the cluster service
  3. Manages resources, groups of resources and resource type objects
  4. Works in conjunction with Checkpoint Manager for resources that have registry and crypto checkpoint replication configured
  5. Resource states: Online, Offline, Failed, Online Pending, Offline Pending
  6. Group states: Online, Offline, Failed, Partial Online
  7. RCM actions: Move, Failover, Failback
  8. Establishes and maintains the resource dependency tree (uses the dependency tree to execute actions on upstream resources)
  9. Provides better resource isolation in case of a deadlock in RHS
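
The dependency tree in item 8 can be illustrated with a short sketch: bring a resource online only after everything it depends on is online. This is a simplified topological ordering, not RCM's actual implementation, and it assumes there are no circular dependencies.

# Illustrative ordering of an "online" action from a dependency tree.
def online_order(dependencies):
    """dependencies: resource name -> set of resources it depends on."""
    ordered, done = [], set()

    def visit(res):
        for dep in dependencies.get(res, set()):
            if dep not in done:
                visit(dep)
        if res not in done:
            done.add(res)
            ordered.append(res)

    for res in dependencies:
        visit(res)
    return ordered

deps = {
    "SQL Server": {"Network Name", "Cluster Disk"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Cluster Disk": set(),
}
print(online_order(deps))   # e.g. ['IP Address', 'Network Name', 'Cluster Disk', 'SQL Server']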

Topology Manager
  1. Discovers and maintains the network topology for the cluster
  2. Provides information to the cluster network driver (NetFT.sys) concerning endpoint connections to cluster nodes so it can build an internal routing table
  3. Handles all network and network interface API requests in the cluster
  4. Passes network configuration information to Database Manager for inclusion in the cluster configuration database (cluster registry hive)
  5. Receives notifications from NetFT regarding any changes to cluster networks which it then passes on to other resource managers
  6. Configures the IP address for the Microsoft Failover Cluster Virtual Adapter
  7. Periodically retries failed routes in case they become available
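
As a rough illustration of item 7, a background loop could keep per-route state, probe the failed routes on an interval, and notify NetFT when one recovers. The names below are placeholders, not real APIs.

# Sketch only: retry failed routes until they come back.
import time

def retry_failed_routes(routes, probe, notify_netft, interval=30):
    """routes: dict mapping (local_endpoint, remote_endpoint) -> 'up' | 'failed'."""
    while any(state == "failed" for state in routes.values()):
        time.sleep(interval)
        for route, state in routes.items():
            if state == "failed" and probe(route):
                routes[route] = "up"
                notify_netft(route)        # NetFT can add the route back to its table
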
Database Manager
  • Responsible for the cluster database (cluster registry hive)
  • Manages one or more replicas of the database
  • If there is a witness disk configured, Database Manager will place a full copy of the database on this disk
        Note: this is not true for a File Share Witness configuration
  • Database Manager ensures all replicas of the cluster database are synchronized (Paxos Tags)
  • The Common Log File System (CLFS) is used as a transaction log by DM for the cluster configuration database
  • If Database Manager cannot commit changes to the local node, the cluster service is terminated on that node
  • If changes cannot be committed to the replica located on a witness disk, the local copy of that witness disk hive (0.Cluster) will be unloaded and the witness disk will be marked as Failed
  • Checkpoint Manager works with Database Manager to ensure crypto and registry checkpoint information is maintained in the cluster database
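
A hedged sketch of the Paxos-tag reconciliation idea: each replica of the hive carries a monotonically increasing tag, and the copy with the highest tag is treated as authoritative when replicas are synchronized. The tuple form used here is illustrative, not the exact on-disk tag format.

# Illustrative only: pick the most recent replica by Paxos tag.
def most_recent_replica(replicas):
    """replicas: dict mapping replica name -> paxos tag (a comparable tuple)."""
    return max(replicas, key=replicas.get)

replicas = {
    "Node1 local hive": (5, 120),
    "Node2 local hive": (5, 118),
    "Witness disk 0.Cluster": (5, 120),
}
print(most_recent_replica(replicas))   # a replica carrying the highest tag
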
Quorum Manager
  • Determines if the current cluster membership (view) has achieved quorum or consensus
  • If quorum is not achieved or has been lost, Quorum Manager terminates the cluster service across all nodes
  • Works with Resource Control Manager, Database Manager and Global Update Manager, in cases where a witness disk is configured, to bring the disk ‘Online’ should the cluster be just one vote short of achieving quorum
             Note: if more than one vote is needed, no attempt is made to bring a witness disk online
  • Tracks the ownership of the quorum replica located on the witness disk
  • Handles all changes to the Quorum Model in the cluster
  • Four Quorum Models: Node Majority, Node and Disk Majority, Node and File Share Majority, and No Majority: Disk Only
  • Reconciles all replicas using Paxos tagging process and then updates all replicas as needed
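
The witness rule above can be expressed as a small function: quorum needs a majority of the configured votes, and the witness disk is only worth bringing online when the view is exactly one vote short. This is a simplification that ignores the differences between the four models.

# Toy illustration of the "one vote short" witness rule.
def has_quorum(total_votes, votes_present, witness_available):
    needed = total_votes // 2 + 1
    if votes_present >= needed:
        return True
    if witness_available and votes_present == needed - 1:
        return True            # bringing the witness disk online supplies the missing vote
    return False               # more than one vote short: the witness is not attempted

# Example: 4 node votes + 1 witness vote = 5 configured votes
print(has_quorum(total_votes=5, votes_present=3, witness_available=False))  # True: 3 of 5
print(has_quorum(total_votes=5, votes_present=2, witness_available=True))   # True: witness makes 3
print(has_quorum(total_votes=5, votes_present=1, witness_available=True))   # False: two votes short
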
Security Manager
  • Responsible for signing or encrypting all messages exchanged among nodes in the cluster. This function is handled by the Security Validator.
  • The default is to sign cluster communications, but this can be changed using cluster.exe
  1. ‘Signing’ does not involve certificates
  2. The cluster node name is included in the message as ‘proof’ of validity
  • Handles authentication and authorization functions for Host Manager
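
As a generic illustration of signing versus encrypting, here is an HMAC sketch; this is not the actual Secure Messaging Library algorithm or its key exchange, it only shows that signing proves integrity and origin without hiding the payload.

# Illustrative only: sign a message with a shared key and verify it.
import hmac, hashlib

def sign_message(shared_key: bytes, sender_node: str, payload: bytes) -> dict:
    body = sender_node.encode() + b"|" + payload       # the node name travels with the message
    return {
        "sender": sender_node,
        "payload": payload,
        "sig": hmac.new(shared_key, body, hashlib.sha256).digest(),
    }

def verify_message(shared_key: bytes, msg: dict) -> bool:
    body = msg["sender"].encode() + b"|" + msg["payload"]
    expected = hmac.new(shared_key, body, hashlib.sha256).digest()
    return hmac.compare_digest(expected, msg["sig"])

key = b"cluster-shared-secret"                          # placeholder key material
m = sign_message(key, "Node1", b"gum update 42")
print(verify_message(key, m))                           # True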

This is the Windows Failover Clustering whitepaper. WFC still follows the same standards it started with, so I would recommend starting with this doc first.
