-- BACKUP SIGNAL DIAGRAM COMPLEMENT TO BACKUP AMENDMENTS 2003-07-11 -- USER MASTER MASTER SLAVE SLAVE --------------------------------------------------------------------- BACKUP_REQ ----------------> UTIL_SEQUENCE ---------------> <--------------- DEFINE_BACKUP ------------------------------> (Local signals) LIST_TABLES ---------------> <--------------- FSOPEN ---------------> GET_TABINFO <--------------- DI_FCOUNT ---------------> <--------------- DI_GETPRIM ---------------> <--------------- <------------------------------- BACKUP_CONF <---------------- START_BACKUP ------------------------------> CREATE_TRIG --------------> <-------------- <------------------------------ WAIT_GCP --------------> <-------------- BACKUP_FRAGMENT ------------------------------> SCAN_FRAG ---------------> <--------------- <------------------------------ WAIT_GCP --------------> <-------------- STOP_BACKUP ------------------------------> DROP_TRIG --------------> <-------------- <------------------------------ BACKUP_COMPLETE_REP <---------------- ABORT_BACKUP ------------------------------> ---------------------------------------------------------------------------- USER BACKUP-MASTER 1) BACKUP_REQ --> 2) To all slaves DEFINE_BACKUP_REQ This signals contains info so that all slaves can take over as master Tomas: Except triggerId info... 3) Wait for conf 4) <-- BACKUP_CONF 5) For Each Table PREP_CREATE_TRIG_REQ Wait for Conf 6) To all slaves START_BACKUP_REQ Include trigger ids Wait for conf 7) For Each Table CREATE_TRIG_REQ Wait for conf 8) Wait for GCP 9) For each table For each fragment BACKUP_FRAGMENT_REQ --> <-- BACKUP_FRAGMENT_CONF 10) Wait for GCP 11) To all slaves STOP_BACKUP_REQ This signal turns off logging 12) Wait for conf 13) <-- BACKUP_COMPLETE_REP ---- Slave: Master Died Wait for master take-over, max 30 sec then abort everything Slave: Master TakeOver BACKUP_STATUS_REQ --> To all nodes <-- BACKUP_STATUS_CONF BACKUP_STATUS_CONF BACKUP_DEFINED BACKUP_STARTED BACKUP_FRAGMENT Master: Slave died -- Define Backup Req -- 1) Get backup definition Which tables (all) 2) Open files Write table list to CTL - file 3) Get definitions for all tables in backup 4) Get Fragment info 5) Define Backup Conf -- Define Backup Req -- -- Abort Backup Req -- 1) Report to others 2) Stop logging 3) Stop file(s) 4) Stop scan 5) If failure/abort Remove files 6) If XXX Report to user 7) Clean up records/stuff -- Abort Backup -- Reasons for aborting: 1a) client abort 1b) slave failure 1c) node failure Resources to be cleaned up: Slave responsability: 2a) Close and remove files 2b) Free allocated resources Master responsability: 2c) Drop triggers USER MASTER MASTER SLAVE SLAVE --------------------------------------------------------------------- BACKUP_ABORT_ORD: -------------------------(ALL)--> Set Master State ABORTING Set Slave State ABORTING Drop Triggers Close and Remove files CleanupSlaveResources() BACKUP_ABORT_ORD:OkToClean -------------------------(ALL)--> CleanupMasterResources() BACKUP_ABORT_REP <--------------- State descriptions: Master - INITIAL BACKUP_REQ -> Master - DEFINING DEFINE_BACKUP_CONF -> Master - DEFINED CREATE_TRIG_CONF -> Master - STARTED <---> Master - SCANNING WAIT_GCP_CONF -> Master - STOPPING (Master - CLEANING) -------- Master - ABORTING Slave - INITIAL DEFINE_BACKUP_REQ -> Slave - DEFINING - backupId - tables DIGETPRIMCONF -> Slave - DEFINED START_BACKUP_REQ -> Slave - STARTED Slave - SCANNING STOP_BACKUP_REQ -> Slave - STOPPING FSCLOSECONF -> Slave - CLEANING ----- Slave - ABORTING Testcases: 2. Master failure at first START_BACKUP_CONF error 10004 start backup - Ok 2. Master failure at first CREATE_TRIG_CONF error 10003 start backup - Ok 2. Master failure at first ALTER_TRIG_CONF error 10005 start backup - Ok 2. Master failure at WAIT_GCP_CONF error 10007 start backup - Ok 2. Master failure at WAIT_GCP_CONF, nextFragment error 10008 start backup - Ok 2. Master failure at WAIT_GCP_CONF, stopping error 10009 start backup - Ok 2. Master failure at BACKUP_FRAGMENT_CONF error 10010 start backup - Ok 2. Master failure at first DROP_TRIG_CONF error 10012 start backup - Ok 1. Master failure at first STOP_BACKUP_CONF error 10013 start backup - Ok 3. Multiple node failiure: error 10001 error 10014 start backup - Ok (note, mgmtsrvr does gets BACKUP_ABORT_REP but expects BACKUP_REF, hangs...) 4. Multiple node failiure: error 10007 error 10002 start backup - Ok ndbrequire(!ERROR_INSERTED(10001)); ndbrequire(!ERROR_INSERTED(10002)); ndbrequire(!ERROR_INSERTED(10021)); ndbrequire(!ERROR_INSERTED(10003)); ndbrequire(!ERROR_INSERTED(10004)); ndbrequire(!ERROR_INSERTED(10005)); ndbrequire(!ERROR_INSERTED(10006)); ndbrequire(!ERROR_INSERTED(10007)); ndbrequire(!ERROR_INSERTED(10008)); ndbrequire(!ERROR_INSERTED(10009)); ndbrequire(!ERROR_INSERTED(10010)); ndbrequire(!ERROR_INSERTED(10011)); ndbrequire(!ERROR_INSERTED(10012)); ndbrequire(!ERROR_INSERTED(10013)); ndbrequire(!ERROR_INSERTED(10014)); ndbrequire(!ERROR_INSERTED(10015)); ndbrequire(!ERROR_INSERTED(10016)); ndbrequire(!ERROR_INSERTED(10017)); ndbrequire(!ERROR_INSERTED(10018)); ndbrequire(!ERROR_INSERTED(10019)); ndbrequire(!ERROR_INSERTED(10020)); if (ERROR_INSERTED(10023)) { if (ERROR_INSERTED(10023)) { if (ERROR_INSERTED(10024)) { if (ERROR_INSERTED(10025)) { if (ERROR_INSERTED(10026)) { if (ERROR_INSERTED(10028)) { if (ERROR_INSERTED(10027)) { (ERROR_INSERTED(10022))) { if (ERROR_INSERTED(10029)) { if(trigPtr.p->operation->noOfBytes > 123 && ERROR_INSERTED(10030)) { ----- XXX --- DEFINE_BACKUP_REF -> ABORT_BACKUP_ORD(no reply) when all DEFINE_BACKUP replies has arrived START_BACKUP_REF ABORT_BACKUP_ORD(no reply) when all START_BACKUP_ replies has arrived BACKUP_FRAGMENT_REF ABORT_BACKUP_ORD(reply) directly to all nodes running BACKUP_FRAGMENT When all nodes has replied BACKUP_FRAGMENT ABORT_BACKUP_ORD(no reply) STOP_BACKUP_REF ABORT_BACKUP_ORD(no reply) when all STOP_BACKUP_ replies has arrived NF_COMPLETE_REP slave dies master sends OUTSTANDING_REF to self slave does nothing master dies slave elects self as master and sets only itself as participant DATA FORMATS ------------ Note: api-restore must be able to read all old formats. Todo: header formats 4.1.x ----- Todo 5.0.x ----- Producers: backup, Consumers: api-restore In 5.0 1) attrs have fixed size 2) external attr id (column_no) is same as internal attr id (attr_id). 3) no disk attributes Format: Part 0: null-bit mask for all nullable rounded to word Part 1: fixed + non-nullable in column_no order Part 2: fixed + nullable in column_no order Part 1: plain value rounded to words [value] Part 2: not-null => clear null bit, data words [len_in_words attr_id value] null => set only null bit in null-bit mask Note: redundancy in null-bit mask vs 2 word header 5.1.x ----- Producers: backup, Consumers: api-restore lcp-restore In 5.1 1) attrs can have var size, length encoded in value 2) column_no need not equal attr_id 3) disk attributes Null-bit mask (5.0) is dropped. Length encoded in value is not used. In "lcp backup" disk attributes are replaced by 64-bit DISK_REF. Format: Part 1: fixed + non-nullable in column_no order Part 2: other attributes Part 1: plain value rounded to words [value] Part 2: not-null => data words [len_in_bytes attr_id value] null => not present