Cluster Service Component Architecture

Let's start learning together; I know it's going to be fun.

Please go through this design doc first. It was initially published on Jan 1, 1998, and it describes the MSCS features and the standards they are built on.

The core of the cluster service consists of dependent, modular components:
  • The majority of the components are combined into a single process – clussvc.exe
  • Some work is delegated outside of the process to the partition manager, the cluster disk driver and the cluster network driver
  • Cluster resources run in the rhs.exe process (Resource Hosting Subsystem)
  • Cluster management APIs and COM components control interactions among all the cluster components, the cluster resources and the cluster management interface

The cluster service (clussvc.exe) architecture consists of three tiers:

Top Tier involves Cluster Abstractions
  1. Examples are cluster nodes, groups and resources
  2. Includes resource management: tracking the state of resources, failure management and the response to failure conditions
Middle Tier involves Cluster Operations
  1. Defines the inner workings of the cluster
  2. Membership management, regroup operations and maintaining a consistent cluster configuration across all nodes
Bottom Tier involves Operating System Interaction
  1. Here the cluster interfaces with the OS
  2. Works with the partition manager, cluster disk driver (clusdisk.sys), cluster network driver (NetFT.sys), file system (CSVFS.sys), security, network interface management, etc.
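
To make the tiers a little more concrete, here is a tiny, purely illustrative Python snippet that maps each tier to the example components named above. It is only a reading aid; the groupings come from this post, not from any public header or SDK.

# Illustrative only: the three tiers of clussvc.exe and example components in each.
CLUSSVC_TIERS = {
    "Top Tier (Cluster Abstractions)": ["nodes", "groups", "resources", "failure handling"],
    "Middle Tier (Cluster Operations)": ["membership management", "regroup", "consistent configuration"],
    "Bottom Tier (OS Interaction)": ["clusdisk.sys", "NetFT.sys", "CSVFS.sys", "security", "network interfaces"],
}

for tier, examples in CLUSSVC_TIERS.items():
    print(f"{tier}: {', '.join(examples)}")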


Messaging

  • Provides all communication primitives within the cluster
        * Facilitates reliable point-to-point communication paths between all nodes in the cluster
  • Messaging components use both unicast and multicast communications in a Request-Reply format
        Note: these replace RPC communications
  • Relies on Object Manager for routing data
        * Object Manager tracks which node in the cluster owns which object(s), allowing messages to be routed to the right node for processing
  • Unicast Component
        * Reliable Unicast: a ‘TCP-like’ connection to a single node
  • Multicast Component
  1. Raw Multicast: unicast messages sent to multiple nodes in the cluster
  2. Good Enough Multicast (GEM): (see the toy sketch after this list)
        * A multicast interface that enforces virtual synchrony
        * If a message is delivered to one cluster node in the current view, it should be delivered to all nodes in the view
        * GEM messages include sequence numbers and view IDs (GID) for tracking
        * Retransmits messages if needed and confirms deliveries as part of a GEM Repair function, which is reported to Membership Manager so regroup processing can complete
  • Causal Multicast: enforces causal ordering of all multicast messages
  1. Ensures related messages are delivered to a node in the correct order
  2. Timestamps are used to make delivery decisions
  • Multicast Request-Reply (MRR): replaces RPC communications; tracks all messages using a request ID
        * Uses GEM if a repair action is needed
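
To illustrate the GEM behaviour described in the list above, here is a toy Python sketch, not the actual NetFT/clussvc wire protocol: every multicast carries a view ID and a sequence number, receivers acknowledge what they have seen, and the sender retransmits to any member of the current view that is missing a message (the repair step).

# Toy sketch of the GEM idea only; names and structure are illustrative.
from dataclasses import dataclass, field

@dataclass
class GemSender:
    view_id: int                               # identifier of the current membership view
    members: set                               # node IDs in that view
    next_seq: int = 1
    sent: dict = field(default_factory=dict)   # seq -> payload, kept for repair
    acked: dict = field(default_factory=dict)  # node -> highest sequence acknowledged

    def multicast(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.sent[seq] = payload
        for node in self.members:
            self._send(node, seq, payload)     # best-effort per-node send
        return seq

    def on_ack(self, node, seq):
        self.acked[node] = max(self.acked.get(node, 0), seq)

    def repair(self):
        # Virtual synchrony: a message delivered to one node in the view must
        # reach every node in the view, so re-send anything a member is missing.
        for node in self.members:
            for seq in range(self.acked.get(node, 0) + 1, self.next_seq):
                self._send(node, seq, self.sent[seq])

    def _send(self, node, seq, payload):
        print(f"view={self.view_id} seq={seq} -> node {node}: {payload}")

g = GemSender(view_id=7, members={1, 2, 3})
g.multicast("update A")
g.on_ack(1, 1); g.on_ack(2, 1)   # node 3 never acknowledged
g.repair()                        # retransmits seq 1 to node 3 only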


Object Manager
  1. Establishes and maintains an in-memory relational database of all cluster objects
  2. Objects are loaded into memory when the cluster service starts
  3. Object Manager knows which nodes own which objects
  4. Provides routing information to Messaging so that messages pertaining to cluster objects are routed to the owning nodes
  5. The GUM update process ensures that the Object Manager database is consistent across all nodes in the cluster
  6. Examples of objects include: Resources, Groups, Networks, etc.
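
A minimal sketch of the routing role described above, assuming a simple in-memory table; the class and method names are made up for illustration and are not the real Object Manager API.

# Illustrative only: an in-memory table of cluster objects and their owning node,
# which Messaging could consult to pick the destination for a request.
class ObjectManager:
    def __init__(self):
        self._owners = {}                          # (object_type, name) -> owning node ID

    def register(self, obj_type, name, owner_node):
        self._owners[(obj_type, name)] = owner_node

    def owner_of(self, obj_type, name):
        return self._owners[(obj_type, name)]

om = ObjectManager()
om.register("Group", "SQL Group", owner_node=2)
om.register("Resource", "Cluster IP Address", owner_node=1)

# Messaging would route a request about "SQL Group" to node 2:
print(om.owner_of("Group", "SQL Group"))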

Host Manager

  1. Listens for incoming network connections from cluster members (these connections are established on port 3343)
  2. Confirms all raw inbound network connections and then hands them off to the Security component
  3. The Secure Messaging Library encrypts or signs, validates, and authorizes the connection (encryption is optional; signing is the default)
  4. After the connection is secured, it is routed to the network driver so the NetFT route table can be built (updated)
  5. Host Manager hands connections off to the Membership Manager
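
A hedged sketch of that hand-off sequence, written in Python for readability; the function names are placeholders and the real Host Manager is native code inside clussvc.exe.

# Sketch of the flow only: listen on 3343, secure/authorize, then hand off.
import socket

CLUSTER_PORT = 3343   # the intra-cluster listener port mentioned above

def host_manager_accept_loop(secure, hand_to_membership):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("0.0.0.0", CLUSTER_PORT))
    listener.listen()
    while True:
        conn, peer = listener.accept()          # 1. raw inbound connection
        secured = secure(conn, peer)            # 2. sign/encrypt, validate, authorize (Security component)
        if secured is not None:
            hand_to_membership(secured, peer)   # 3. Membership Manager / JPM takes over
        else:
            conn.close()                        # reject connections that fail validation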

Membership Manager
  • Responsible for arriving at a consensus, or common view, of the cluster membership
  1. Uses a multi-stage regroup algorithm to arrive at a common view
  2. Uses a gossip protocol to determine connectivity to the nodes
  3. Responsible for handling node join, node failure, intra-cluster communication failures and full-mesh topology failures
          * Initiates the regroup process on failures using the Regroup Protocol (RGP)
  • Receives secured inbound connections from Host Manager, which are then processed by the Join Policy Manager (JPM)
  1. JPM negotiates with all nodes in the view to determine which partition contains a majority of members and will therefore survive as the view of the cluster
  2. Members not in the surviving partition must re-join the formed cluster
  • Membership Manager maintains the Node objects in Object Manager
           * Initial node objects are pulled from the registry and are updated based on what JPM determines is the surviving partition
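
The JPM decision described above boils down to a majority rule; here is a toy model of that rule (not the real regroup protocol, just the outcome it converges on).

# Illustrative only: given the partitions that can still talk to each other,
# the partition holding a majority of the configured nodes survives as the
# cluster view; everyone else must re-join.
def surviving_partition(all_nodes, partitions):
    majority = len(all_nodes) // 2 + 1
    for partition in partitions:
        if len(partition) >= majority:
            return partition
    return None          # no partition has a majority -> no surviving view

nodes = {"N1", "N2", "N3", "N4", "N5"}
print(surviving_partition(nodes, [{"N1", "N2", "N3"}, {"N4", "N5"}]))   # {'N1', 'N2', 'N3'}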

Global Update Manager
  • Primary mechanism for replicating updates to the cluster configuration to all nodes in the membership view (received from Membership Manager)
  • All updates are serialized and atomic
  1. All updates are applied in the same order on all nodes
  2. If an update is successful on one node, it is assumed it will be successful on all nodes
  3. An update failure on a (non-locker) node results in that node being removed from the membership and its cluster service being terminated
  4. After a restart of the service, the node must re-join the membership view
  • GUM depends on Messaging to implement an ordered, atomic Request-Reply sequence for all updates (see the Messaging section above)
  • GUM updates are only permitted when the node owns the lock
  1. Causal Multicast is used to transfer lock ownership
  2. MRR is used for GUM updates
  3. GEM is used for retransmits of GUM update messages as needed
  • GUM is responsible for transferring cluster state to all joining members
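
A small sketch of the ordering and eviction behaviour described above, with illustrative names only: the locker applies the update first, every other node applies it in the same sequence, and a node that cannot apply it leaves the membership.

# Toy model of the GUM property, not the actual implementation.
class GumNode:
    def __init__(self, name):
        self.name = name
        self.applied_seq = 0
        self.config = {}

    def apply(self, seq, key, value):
        if seq != self.applied_seq + 1:
            raise RuntimeError(f"{self.name}: update {seq} out of order")
        self.config[key] = value
        self.applied_seq = seq

def evict(node):
    print(f"{node.name} evicted; its cluster service restarts and must re-join")

def gum_update(locker, others, seq, key, value):
    locker.apply(seq, key, value)            # locker applies first
    for node in others:
        try:
            node.apply(seq, key, value)      # same order on every node
        except Exception:
            evict(node)                      # a failed node leaves the membership

a, b, c = GumNode("A"), GumNode("B"), GumNode("C")
gum_update(a, [b, c], 1, "QuorumModel", "NodeMajority")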

Resource Control Manager
 
  1. External to the main cluster infrastructure but responsible for the proper functioning of infrastructure components
  2. Implements failover mechanisms and policies for the cluster service
  3. Manages resources, groups of resources and resource type objects
  4. Works in conjunction with Checkpoint Manager for resources that have registry and crypto checkpoint replication configured
  5. Resource states: Online, Offline, Failed, Online Pending, Offline Pending
  6. Group states: Online, Offline, Failed, Partial Online
  7. RCM actions: Move, Failover, Failback
  8. Establishes and maintains the resource dependency tree (uses the dependency tree to execute actions on upstream resources)
  9. Provides better resource isolation in case of a deadlock in RHS
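
The dependency tree in item 8 can be illustrated with a short sketch: bring a resource online only after everything it depends on is online. This is a simplified topological ordering, not RCM's actual implementation, and it assumes there are no circular dependencies.

# Illustrative ordering of an "online" action from a dependency tree.
def online_order(dependencies):
    """dependencies: resource name -> set of resources it depends on."""
    ordered, done = [], set()

    def visit(res):
        for dep in dependencies.get(res, set()):
            if dep not in done:
                visit(dep)
        if res not in done:
            done.add(res)
            ordered.append(res)

    for res in dependencies:
        visit(res)
    return ordered

deps = {
    "SQL Server": {"Network Name", "Cluster Disk"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Cluster Disk": set(),
}
print(online_order(deps))   # e.g. ['IP Address', 'Network Name', 'Cluster Disk', 'SQL Server']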

Topology Manager
  1. Discovers and maintains the network topology for the cluster
  2. Provides information to the cluster network driver (NetFT.sys) concerning endpoint connections to cluster nodes so it can build an internal routing table
  3. Handles all network and network interface API requests in the cluster
  4. Passes network configuration information to Database Manager for inclusion in the cluster configuration database (cluster registry hive)
  5. Receives notifications from NetFT regarding any changes to cluster networks which it then passes on to other resource managers
  6. Configures the IP address for the Microsoft Failover Cluster Virtual Adapter
  7. Periodically retries failed routes in case they become available
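
As a rough illustration of item 7, a background loop could keep per-route state, probe the failed routes on an interval, and notify NetFT when one recovers. The names below are placeholders, not real APIs.

# Sketch only: retry failed routes until they come back.
import time

def retry_failed_routes(routes, probe, notify_netft, interval=30):
    """routes: dict mapping (local_endpoint, remote_endpoint) -> 'up' | 'failed'."""
    while any(state == "failed" for state in routes.values()):
        time.sleep(interval)
        for route, state in routes.items():
            if state == "failed" and probe(route):
                routes[route] = "up"
                notify_netft(route)        # NetFT can add the route back to its table
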
Database Manager
  • Responsible for the cluster database (cluster registry hive)
  • Manages one or more replicas of the database
  • If there is a witness disk configured, Database Manager will place a full copy of the database on this disk
        Note: this is not true for a File Share Witness configuration
  • Database Manager ensures all replicas of the cluster database are synchronized (Paxos Tags)
  • The Common Log File System (CLFS) is used as a transaction log by DM for the cluster configuration database
  • If Database Manager cannot commit changes to the local node, the cluster service is terminated on that node
  • If changes cannot be committed to the replica located on a witness disk, the local copy of that witness disk hive (0.Cluster) will be unloaded and the witness disk will be marked as Failed
  • Checkpoint Manager works with Database Manager to ensure crypto and registry checkpoint information is maintained in the cluster database
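
A hedged sketch of the Paxos-tag reconciliation idea: each replica of the hive carries a monotonically increasing tag, and the copy with the highest tag is treated as authoritative when replicas are synchronized. The tuple form used here is illustrative, not the exact on-disk tag format.

# Illustrative only: pick the most recent replica by Paxos tag.
def most_recent_replica(replicas):
    """replicas: dict mapping replica name -> paxos tag (a comparable tuple)."""
    return max(replicas, key=replicas.get)

replicas = {
    "Node1 local hive": (5, 120),
    "Node2 local hive": (5, 118),
    "Witness disk 0.Cluster": (5, 120),
}
print(most_recent_replica(replicas))   # a replica carrying the highest tag
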
Quorum Manager
  • Determines if the current cluster membership (view) has achieved quorum or consensus
  • If quorum is not achieved or has been lost, Quorum Manager terminates the cluster service across all nodes
  • Works with Resource Control Manager, Database Manager and Global Update Manager, in cases where a witness disk is configured, to bring the disk ‘Online’ should the cluster be just one vote short of achieving quorum
             Note: if more than one vote is needed, no attempt is made to bring a witness disk online
  • Tracks the ownership of the quorum replica located on the witness disk
  • Handles all changes to the Quorum Model in the cluster
  • Four Quorum Models: Node Majority, Node and Disk Majority, Node and File Share Majority, and No Majority: Disk Only
  • Reconciles all replicas using Paxos tagging process and then updates all replicas as needed
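
The witness rule above can be expressed as a small function: quorum needs a majority of the configured votes, and the witness disk is only worth bringing online when the view is exactly one vote short. This is a simplification that ignores the differences between the four models.

# Toy illustration of the "one vote short" witness rule.
def has_quorum(total_votes, votes_present, witness_available):
    needed = total_votes // 2 + 1
    if votes_present >= needed:
        return True
    if witness_available and votes_present == needed - 1:
        return True            # bringing the witness disk online supplies the missing vote
    return False               # more than one vote short: the witness is not attempted

# Example: 4 node votes + 1 witness vote = 5 configured votes
print(has_quorum(total_votes=5, votes_present=3, witness_available=False))  # True: 3 of 5
print(has_quorum(total_votes=5, votes_present=2, witness_available=True))   # True: witness makes 3
print(has_quorum(total_votes=5, votes_present=1, witness_available=True))   # False: two votes short
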
Security Manager
  • Responsible for signing or encrypting all messages exchanged among nodes in the cluster. This function is handled by the Security Validator.
  • The default is to sign cluster communications, but this can be changed using cluster.exe
  1. ‘Signing’ does not involve certificates
  2. The cluster node name is included in the message as ‘proof’ of validity
  • Handles authentication and authorization functions for Host Manager
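
As a generic illustration of signing versus encrypting, here is an HMAC sketch; this is not the actual Secure Messaging Library algorithm or its key exchange, it only shows that signing proves integrity and origin without hiding the payload.

# Illustrative only: sign a message with a shared key and verify it.
import hmac, hashlib

def sign_message(shared_key: bytes, sender_node: str, payload: bytes) -> dict:
    body = sender_node.encode() + b"|" + payload       # the node name travels with the message
    return {
        "sender": sender_node,
        "payload": payload,
        "sig": hmac.new(shared_key, body, hashlib.sha256).digest(),
    }

def verify_message(shared_key: bytes, msg: dict) -> bool:
    body = msg["sender"].encode() + b"|" + msg["payload"]
    expected = hmac.new(shared_key, body, hashlib.sha256).digest()
    return hmac.compare_digest(expected, msg["sig"])

key = b"cluster-shared-secret"                          # placeholder key material
m = sign_message(key, "Node1", b"gum update 42")
print(verify_message(key, m))                           # True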

This is the Windows Failover Clustering whitepaper. WFC still follows the same standards it started with, so I would recommend starting with this doc first.
