About KATIM

KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world's leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up and the discipline of a large business to make solutions and products work for our customers at scale.

Job Purpose (Specific to This Role)

The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM's AI infrastructure powering mission-critical, secure communications products. This role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. The role bridges applied machine learning, software engineering, and DevSecOps, ensuring that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.

You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted.
Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients. You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.

AI-Augmented Product Development Model (Context for the Role)

We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3-4x our size. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.

Core Principles
- Security is integrated into every decision, from architecture to deployment.
- Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
- Quality is measurable, enforced, and automated at every stage.
- All system behaviors, including AI-assisted outputs, must be traceable, reviewable, and explainable. We do not ship "black box" functionality.
- Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.

Key Responsibilities

AI MLOps Architecture & Governance (30%)
- Define the MLOps architecture and governance framework across products.
- Design secure, scalable AI platform blueprints covering the data, training, serving, and monitoring layers.
- Standardize model registries, artifact signing, and deployment processes for air-gapped and on-premises environments.
- Lead architectural designs and reviews for AI pipelines.
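At its simplest, artifact signing for air-gapped delivery means publishing a digest alongside each model artifact so the receiving site can verify integrity offline. A minimal sketch of the idea (the file naming and digest-only approach are illustrative assumptions; a production pipeline would use asymmetric signatures, e.g. Sigstore or GPG, rather than a bare hash):

```python
import hashlib
from pathlib import Path

def sign_artifact(path: Path) -> str:
    """Compute a SHA-256 digest for a model artifact and publish it
    next to the file, so an air-gapped site can verify the copy offline."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    path.with_suffix(path.suffix + ".sha256").write_text(digest)
    return digest

def verify_artifact(path: Path) -> bool:
    """Recompute the digest at the destination and compare it to the published one."""
    expected = path.with_suffix(path.suffix + ".sha256").read_text().strip()
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected
```

Any tampering with the artifact in transit then fails verification at the destination before the model can be promoted to a registry or serving layer.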
- Design and maintain LLM inference infrastructure.
- Manage model registries and versioning (MLflow, Weights & Biases).
- Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM).
- Optimize model performance and cost (quantization, caching, batching).
- Build and maintain vector databases (Pinecone, Weaviate, Chroma).
- Maintain awareness of hardware and inference optimization techniques.

Agent & Tool Development (25%)
- Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection).
- Build AI-assisted DevSecOps utilities that automatically enforce compliance, logging, and audit policies.
- Build tool integrations for LLM agents (function calling, APIs).
- Implement retrieval-augmented generation (RAG) pipelines.
- Create prompt management and versioning systems.
- Monitor and optimize agent performance.

CI/CT/CD Pipelines (20%)
- Build continuous integration pipelines for models and code.
- Implement continuous training (CT) workflows.
- Automate model deployment with rollback capabilities.
- Create staging and production deployment strategies.
- Integrate AI-assisted code review into CI/CD.
- Build a continuous evaluation loop.

Infrastructure & Automation (15%)
- Manage cloud infrastructure (Kubernetes, serverless).
- Implement Infrastructure as Code (Terraform, Pulumi).
- Build monitoring and observability systems (Prometheus, Grafana, Datadog).
- Automate operational tasks with AI agents.
- Ensure security and compliance (OWASP, SOC 2), including AI-specific security.

Developer Enablement (10%)
- Provide tools and libraries for engineers to adopt AI-augmented workflows securely.
- Document AI/ML best practices and patterns.
- Conduct training on MLOps tools and workflows.
- Support engineers with AI integration challenges.
- Maintain development environment parity.
- Lead AI privacy, governance, and compliance education.

Education and Minimum Qualification
- BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; Masters preferred.
- 8+ years in DevOps, SRE, or platform engineering.
- 5+ years of hands-on experience with ML/AI systems in production.
- Deep understanding of LLMs and their operational requirements.
- Experience building and maintaining CI/CD pipelines.
- Strong Linux/Unix systems knowledge.
- Cloud platform expertise (AWS, GCP, or Azure).
- Experience with container orchestration (Kubernetes).

Key Skills

MLOps & AI:
- LLM integration: OpenAI API, Anthropic API, Hugging Face, Azure OpenAI
- Model serving: TensorFlow Serving, TorchServe, vLLM, Ollama
- Experiment tracking: MLflow, Weights & Biases, Neptune.ai
- Model registries: MLflow, Kubeflow, AWS SageMaker
- Vector databases: Pinecone, Weaviate, Chroma, Milvus
- Agent frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
- Fine-tuning: LoRA, QLoRA, prompt tuning

Data Engineering:
- Pipelines: Airflow, Prefect, Dagster
- Processing: Spark, Dask, Ray
- Streaming: Kafka, Pulsar, Kinesis
- Data quality: Great Expectations, dbt
- Feature stores: Feast, Tecton

DevOps & Infrastructure:
- Containers: Docker, Kubernetes, Helm
- Cloud platforms: AWS (SageMaker, Lambda, ECS), GCP (Vertex AI, Cloud Run), or Azure (ML Studio)
- IaC: Terraform, Pulumi, CloudFormation
- CI/CD: GitHub Actions, GitLab CI, Jenkins, Argo CD
- Orchestration: Kubernetes operators, Kubeflow

Monitoring & Observability:
- Metrics: Prometheus, Grafana, CloudWatch
- Logging: ELK Stack, Loki, CloudWatch Logs
- Tracing: Jaeger, Zipkin, OpenTelemetry
- Alerting: PagerDuty, Opsgenie
- Model monitoring: Arize, Fiddler, Evidently

Programming:
- Python: primary language for ML/AI; libraries include NumPy, Pandas, PyTorch/TensorFlow, scikit-learn, and FastAPI or Flask for serving
- Go: for high-performance services and tooling
- Shell scripting: Bash and Python for automation
- SQL: advanced queries and optimization

AI-Assisted Operations:
- Autonomous agents for incident response
- AI-powered log analysis and anomaly detection
- Automated root cause analysis
- Intelligent alerting and noise reduction

Other Highly Desirable Skills:
- Experience with LLM fine-tuning and deployment at scale
- Background in data engineering or ML engineering
- Startup or high-growth environment experience
- Security certifications (CISSP, AWS Security)
- Contributions to open-source MLOps projects
- Experience with multi-cloud or hybrid-cloud environments
- Prior software engineering experience

Success Metrics
- Uptime: 99.9%+ availability for AI services
- Deployment frequency: daily or on-demand deployments
- Model performance: latency (p95 < 500 ms), accuracy tracking
- Cost efficiency: cost per inference, infrastructure utilization
- Developer velocity: time to deploy new models, AI feature adoption rate
- Incident response: MTTD (mean time to detect), MTTR (mean time to resolve)

#KATIM
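On the p95 latency target above: p95 is the value at or below which 95% of request latencies fall, so a single slow tail can breach it even when the median looks healthy. A minimal nearest-rank sketch of the computation (the sample values are invented for illustration; monitoring stacks such as Prometheus estimate this from histograms instead of raw samples):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    ranked = sorted(samples)
    # Nearest-rank method: ceil(pct/100 * n), converted to a 0-based index.
    rank = max(1, math.ceil(pct / 100 * len(ranked)))
    return ranked[rank - 1]

# Invented per-request latencies for one window, in milliseconds.
latencies_ms = [120, 95, 430, 210, 180, 640, 150, 300, 110, 490]
p95 = percentile(latencies_ms, 95)  # 640 ms for this sample, breaching the < 500 ms target
```

Here the median is well under the target but p95 is 640 ms, which is exactly the kind of tail regression this success metric is meant to surface.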