Site Reliability Engineering Lead (Global, Open to Location)

UAE - Dubai
SRE Leader JD

What you'll do

Strategy and Governance
Formulate and implement the company-level reliability strategy and the SLO/error budget mechanism, and establish a reliability measurement system centered on business impact.
Establish release and change governance (access control, canary, rollback, freeze windows), and drive the quantification and standardization of change risk.
Establish a unified incident response system (SEV classification, IM/IC command mechanism, internal and external communication), and promote blameless reviews and systematic improvement.

Team and Organization
Build and manage the SRE team (platform SRE, domain SRE, NOC), clarify roles, responsibilities, and rotation mechanisms, and build an engineering culture and talent pipeline.
Collaborate across teams with R&D, architecture, DBA, network, security, and legal/compliance to get reliability goals included in roadmaps and KPIs.

Platform and Engineering Implementation
Observability platform: Unify metric/log/tracing standards and SDKs; build high-availability data pipelines, alert noise-reduction systems, emergency response platforms, and support ticket systems.
Delivery platform: CI/CD, GitOps, feature flags, progressive release, policy checks, and image signing to improve release quality and frequency.
Capacity and performance engineering: Benchmarking and stress testing, capacity forecasting and elasticity, hotspot isolation and degradation strategies, ensuring controlled degradation under extreme market conditions.
Disaster recovery and business continuity: Multi-AZ/multi-region architecture, RTO/RPO objectives and drills, data backup/recovery, and reconciliation consistency guarantees.
Chaos engineering and fault drills: Support core businesses in conducting fault drills, and build a chaos engineering platform to identify and promptly address potential system risks.

Exchange Scenario Special Projects
Matching and low latency: End-to-end latency SLI, matching confirmation and replay, sequence-number consistency and idempotency, isolation of hot trading pairs.
Wallet and on-chain interaction: Multi-chain node operation and maintenance, congestion and reorg handling, MPC/HSM, risk control and approval flows for deposits and withdrawals, closed-loop handling of reconciliation errors.
API and market data: WS backpressure, partitioning and sharding, GSLB/nearby access, rate limiting and downsampling during sudden market moves.
Security and compliance: DDoS/WAF/bot governance, dual-person review and audit of sensitive operations, meeting requirements such as SOC2/ISO 27001/PCI-DSS.

Indicators and Improvements
Define and align reliability/KPI metrics to drive a closed improvement loop: SLO attainment, MTTA/MTTR, change failure rate, incident recurrence rate, toil share, cost-to-transaction-volume ratio, etc.

What you'll need
Over 8 years of experience in back-end/platform/operations engineering, over 4 years of SRE or production engineering experience, and over 2 years of team management/leadership experience.
A proven track record of stability governance and incident handling in high-concurrency, low-latency businesses (trading/payments/advertising/large-scale real-time systems).

Skills
SLO/SLI and error budget practices; building observability systems (Prometheus/Grafana/ELK or similar, OpenTelemetry, tracing).
Kubernetes/service mesh, microservice gateways (Nginx/Envoy), CI/CD (GitHub Actions/GitLab CI, etc.), GitOps (Argo CD).
Design and implementation of progressive delivery (canary/batched rollout/feature flags) and automatic rollback strategies.
Data and storage: MySQL sharding, replication, and failover; Redis/Kafka; backup and disaster recovery drills; a consistency and reconciliation mindset.
Performance and capacity engineering: Stress testing, benchmarking, analysis and tuning (flame graphs, CPU, GC, network, TCP kernel parameters, etc.).
Incident management: SEV classification, IM/IC command, cross-team collaboration and communication, writing high-quality retrospectives, and tracking action items.

Soft skills
Strong sense of ownership and a risk-control mindset; data-driven; good at balancing reliability, speed, and cost.
Outstanding cross-departmental communication and influence, able to drive strategy implementation and cultural change.
Fluent in reading and writing both Chinese and English; able to handle overseas cloud/compliance communication (if overseas markets are involved).

Bonus points (preferred)
Experience in exchanges, matching, payment clearing and settlement, brokerage operations, or operating crypto wallets and chain nodes.
Experience in implementing anti-DDoS, WAF, bot management, rate limiting, and traffic governance systems.
Experience in compliance systems (SOC2, ISO 27001, PCI-DSS, SOX-class controls), security audits, and evidence retention.
Experience in multi-region GSLB, cross-cloud/multi-cloud architecture, chaos engineering, and organizing Game Days.
Go/Java optimization experience, and practical experience with messaging systems (Kafka/RocketMQ/Pulsar) and storage (TiDB/Vitess/Citus/TDSQL, etc.).
Experience in cost optimization and FinOps.
Post date: 4 December 2025
Publisher: