In this blog, we’ll discuss how Percona XtraDB Cluster certification works. Percona XtraDB Cluster replicates actions executed on one node to all other nodes in the cluster, and makes it fast enough to appear as if it is synchronous (aka virtually synchronous).
Let’s walk through everything involved in the process, and how it works without losing data integrity.
There are two main types of actions: DDL and DML. DDL actions are executed using Total Order Isolation (let’s ignore Rolling Schema Upgrade for now), and DML actions using the normal Galera replication protocol. This blog assumes the reader is aware of Total Order Isolation and the MySQL replication protocol.
- DML (Insert/Update/Delete) operations effectively change the state of the database, and all such operations are recorded in XtraDB by registering a unique object identifier (aka key) for each change (an update or a new addition). Let’s understand this key concept in a bit more detail.
- A transaction can change “n” different data objects. Each such object change is recorded in XtraDB using a so-called append_key operation. The append_key operation registers the key of the data object that has been changed by the transaction. The key for a row can be represented in three parts: db_name, table_name, and pk_columns_for_table (if the pk is absent, a hash of the complete row is calculated). In short, there is quick and compact meta information recording which rows this transaction has touched/modified. This information is passed on as part of the write-set for certification to all the nodes of a cluster while the transaction is in the commit phase.
- For a transaction to commit, it has to pass XtraDB-Galera certification, ensuring that the transaction doesn’t conflict with any other changes posted on the cluster group/channel. Certification will add the keys modified by the given transaction to its own central certification vector (CCV), represented by cert_index_ng. If a key is already part of the vector, then conflict resolution checks are triggered.
- Conflict resolution traces the reference transaction (the one that last modified this item in the cluster group). If this reference transaction is from some other node, that suggests the same data was modified by the other node, and the changes of that node have been certified by the local node that is executing the check. In such cases, the transaction that arrived later fails to certify.
- Changes made to DB objects are bin-logged. This is the same as how MySQL does it for replication with its Master-Slave eco-system, except that a packet of changes from a given transaction is created and named as a write-set.
- Once the client/user issues a “COMMIT”, XtraDB Cluster will run a commit hook. The commit hook ensures the following:
- Flush the binlogs.
- Check if the transaction needs replication (not needed for read-only transactions like SELECT).
- If a transaction needs replication, then it invokes a pre_commit hook in the Galera eco-system. During this pre-commit hook, a write-set is written to the group channel by a “replicate” operation. All nodes (including the one that executed the transaction) subscribe to this group channel and read the write-set.
- gcs_recv_thread is first to receive the packet, which is then processed through different action handlers.
- Each packet read from the group channel is assigned an “id”, which is a counter maintained locally by each node in sync with the group. When any new node joins the group/cluster, a seed-id for it is initialized to the current active id from the group/cluster. (There is an inherent assumption/protocol enforcement that all nodes read the packets from the channel in the same order. That way, even though each packet doesn’t carry “id” information, it is inherently established using the locally maintained “id” value.)
    /* Common situation -
     * increment and assign act_id only for totally ordered actions
     * and only in PRIM (skip messages while in state exchange) */
    rcvd->id = ++group->act_id_;

[This is an amazing way to solve the problem of id coordination in a multi-master system. Otherwise a node would first have to get an id from a central system or through a separate agreed protocol and then use it for the packet, thereby doubling the round-trip time.]
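To make the id scheme concrete, here is a minimal Python sketch. It is not Galera code: the `Node` class and the list standing in for the group channel are hypothetical. The only ideas taken from the text are that each node bumps a local counter for every packet it reads, and that a joining node seeds its counter from the group's current id.

```python
class Node:
    """Hypothetical stand-in for a cluster node; not a Galera type."""

    def __init__(self, name, seed_id=0):
        self.name = name
        self.act_id = seed_id   # local counter, playing the role of group->act_id_
        self.seen = []          # (id, payload) pairs in delivery order

    def receive(self, payload):
        # The packet carries no id; the node assigns one locally.
        self.act_id += 1
        self.seen.append((self.act_id, payload))


# Assumption: the channel delivers every packet to every node in the same order.
channel = ["ws-a", "ws-b", "ws-c", "ws-d"]

node1, node2 = Node("node-1"), Node("node-2")
for packet in channel[:2]:
    node1.receive(packet)
    node2.receive(packet)

# node-3 joins after two packets and seeds its counter from the group's current id.
node3 = Node("node-3", seed_id=node1.act_id)

for packet in channel[2:]:
    for node in (node1, node2, node3):
        node.receive(packet)

print(node1.seen)
print(node3.seen)
```

Because delivery order is identical everywhere, node-1 and node-2 end up with the same id for each packet without exchanging any id information, and node-3 agrees with them on every packet it sees after joining.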
- What happens if two nodes get ready with their packet at same time?
- Both nodes will be allowed to put the packet on the channel. That means the channel will see packets from different nodes queued one-behind-another.
- It is interesting to understand what happens if two nodes modify the same set of rows. Let’s take an example:
    create -> insert (1,2,3,4) .... nodes are in sync till this point.
    node-1: update i = i + 10;
    node-2: update i = i + 100;

    Let's associate a transaction-id (trx-id) with the update transaction that is executed on node-1 and node-2 in parallel.
    (The real algorithm is a bit more involved (with uuid + seqno), but conceptually the same, so for ease I am using trx_id.)

    node-1: update action: trx-id=n1x
    node-2: update action: trx-id=n2x

    Both node packets are added to the channel, but the transactions are conflicting. Let's see which one succeeds.
    The protocol says: FIRST WRITE WINS.
    So in this case, whoever is first to write to the channel will get certified.
    Let's say node-2 is first to write the packet, and then node-1 writes immediately after it.

    NOTE: each node subscribes to all packets, including its own packet. See below for details.

    Node-2:
    - Will see its own packet and will process it.
    - Then it will see the node-1 packet, which it tries to certify but fails. (Will talk about the certification protocol in a little while.)

    Node-1:
    - Will see the node-2 packet and will process it. (Note: InnoDB allows isolation, and so node-1 can process node-2 packets independent of node-1 transaction changes.)
    - Then it will see the node-1 packet, which it tries to certify but fails. (Note: even though the packet originated from node-1, it will undergo certification to catch cases like these. This is the beauty of listening to own events: it makes for a consistent processing path even if events are locally generated.)
- Now let’s talk about the certification protocol using the example cited above. As discussed above, the central certification vector (CCV) is updated to reflect the reference transaction.
    Node-2:
    - node-2 sees its own packet for certification, adds it to its local CCV and performs certification checks. Once these checks pass, it updates the reference transaction by setting it to "n2x".
    - node-2 then gets the node-1 packet for certification. The said key is already present in the CCV with the reference transaction set to "n2x", whereas the write-set proposes setting it to "n1x". This causes a conflict, which in turn causes the node-1 originated transaction to fail the certification test.

    This helps point out a certification failure, and the node-1 packet is rejected.

    Node-1:
    - node-1 sees the node-2 packet for certification, which is then processed; the local CCV is updated and the reference transaction is set to "n2x".
    - Using the same case explained above, node-1 certification also rejects the node-1 packet.

    Well, this suggests that the node doesn't need to wait for certification to complete, but just needs to ensure that the packet is written to the channel. The applier transaction will always win and the local conflicting transaction will be rolled back.
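The walk-through above can be sketched in a few lines of Python. This is a conceptual model under stated assumptions, not the Galera implementation: `WriteSet`, `Certifier` and the `last_seen` field are hypothetical names. `last_seen` stands for the last packet id the originating node had applied when the transaction ran, a simplification of the uuid + seqno bookkeeping mentioned earlier. Packet ids are counted from the point where the nodes were in sync.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    trx_id: str      # e.g. "n1x"; the real system uses uuid + seqno
    keys: tuple      # one (db_name, table_name, pk) entry per modified row
    last_seen: int   # assumption: last packet id applied at the origin node


class Certifier:
    """Hypothetical per-node certifier; cert_index plays the role of the CCV."""

    def __init__(self):
        self.act_id = 0       # 0 = the point where all nodes were in sync
        self.cert_index = {}  # key -> (packet id, trx_id) of last certified writer

    def certify(self, ws):
        self.act_id += 1      # every packet gets an id, pass or fail
        for key in ws.keys:
            ref = self.cert_index.get(key)
            # Conflict: the key was changed by a write-set the origin had not seen.
            if ref is not None and ref[0] > ws.last_seen:
                return False
        # Certification passed: this transaction becomes the reference for its keys.
        for key in ws.keys:
            self.cert_index[key] = (self.act_id, ws.trx_id)
        return True


rows = tuple(("test", "t", pk) for pk in (1, 2, 3, 4))
n1x = WriteSet("n1x", rows, last_seen=0)
n2x = WriteSet("n2x", rows, last_seen=0)

# node-2 wrote to the channel first, so every node reads n2x before n1x.
channel = [n2x, n1x]

results = {}
for node in ("node-1", "node-2"):
    certifier = Certifier()
    results[node] = [(ws.trx_id, certifier.certify(ws)) for ws in channel]

print(results)
```

Both nodes run the same deterministic check over the same ordered stream, so they reach the same verdict (n2x passes, n1x fails) without talking to each other. In this model, a later write-set on the same rows with `last_seen=1` would pass, since its origin had already applied n2x.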
- What happens if one of the nodes has local changes that are not synced with the group?
    create (id primary key) -> insert (1), (2), (3), (4);
    node-1: wsrep_on=0; insert (5); wsrep_on=1
    node-2: insert(5).

    insert(5) will generate a write-set that will then be replicated to node-1.
    node-1 will try to apply it but will fail with a duplicate-key error, as 5 already exists.
    XtraDB will flag this as an error, which would eventually cause node-1 to shut down.
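The failure mode can be reproduced in miniature with Python's built-in `sqlite3` module. This is only an analogy: two in-memory SQLite databases stand in for the two nodes, the hypothetical `make_node` helper sets up the shared starting state, and "replication" is just re-running the insert on the other database. No wsrep behavior is modeled.

```python
import sqlite3


def make_node():
    # Each in-memory SQLite database stands in for one cluster node.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
    db.executemany("INSERT INTO t (id) VALUES (?)", [(1,), (2,), (3,), (4,)])
    db.commit()
    return db


node1, node2 = make_node(), make_node()

# node-1 changes data locally without replicating it (the wsrep_on=0 step).
node1.execute("INSERT INTO t (id) VALUES (5)")
node1.commit()

# node-2 runs insert(5); in this sketch the write-set is just the row itself.
write_set = (5,)
node2.execute("INSERT INTO t (id) VALUES (?)", write_set)
node2.commit()

# Applying the write-set on node-1 hits the row that was added locally.
try:
    node1.execute("INSERT INTO t (id) VALUES (?)", write_set)
    apply_error = None
except sqlite3.IntegrityError as exc:
    apply_error = str(exc)

print("apply failed on node-1:", apply_error)
```

The write-set is valid from the cluster's point of view; it is node-1's unreplicated local row that makes it impossible to apply, which is why the inconsistent node is the one that has to leave.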
- With all that in place, how is GTID incremented if all the packets are processed by all nodes (including ones that are rejected due to certification)? GTID is incremented only when the transaction passes certification and is ready for commit. That way errant-packets don’t cause GTID to increment. Also, don’t confuse the group packet “id” quoted above with GTID. Without errant-packets, you may end up seeing these two counters going hand-in-hand, but they are in no way related.
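A tiny Python sketch shows how the two counters drift apart. The `replay` helper is hypothetical; it takes only the rule stated above (the packet id moves for every packet, GTID only for certified ones) and applies it to a list of certification outcomes.

```python
def replay(certification_results):
    """Return (packet_id, gtid_seqno) after processing a stream of packets.

    certification_results is a list of booleans: True if the packet
    passed certification, False if it was rejected.
    """
    packet_id = 0    # bumped for every packet read from the channel
    gtid_seqno = 0   # bumped only for packets that certify and commit
    for passed in certification_results:
        packet_id += 1
        if passed:
            gtid_seqno += 1
    return packet_id, gtid_seqno


# No rejected packets: the counters move together.
print(replay([True, True, True]))          # (3, 3)
# One rejected packet (like n1x in the example): they drift apart.
print(replay([True, False, True, True]))   # (4, 3)
```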
Peter Zaitsev: How Percona XtraDB Cluster certification works