Principal AI Engineer

Alpheya

15 hours ago

Full-time

Remote friendly (Abu Dhabi, Abu Dhabi, United Arab Emirates)

Worldwide

About Alpheya

We are a B2B WealthTech startup based in Abu Dhabi and backed by BNY Mellon (America’s oldest bank and the first company to list on the NYSE) and Lunate (a new $100B AUM alternative asset management firm based in Abu Dhabi, UAE). The company has raised $300M to build a state-of-the-art wealth technology platform.

Our mission is to power and grow our clients’ Wealth franchises through differentiated experiences, financial solutions, and insights. Our digital wealth management platform will enable banks and other financial institutions in the Middle East to grow and further penetrate affluent, HNW, and UHNW investor segments.

While still leveraging the capabilities and knowledge of large organizations, our fintech is a startup with truly cross-functional and agile teams.

For more information, please visit www.alpheya.com

Role Overview

We're hiring a software engineer who builds production systems and who has spent the last few years applying that discipline to AI-powered products.

You will take validated AI prototypes and turn them into production-grade software systems. You’ll focus on reliability, observability, maintainability, and clear architecture for AI-powered features in a regulated environment.

You will also lead and mentor a group of data and software engineers to deliver reliably and raise the engineering bar.

This is not a DevOps role. You will partner closely with our DevOps/SRE team (which owns core infrastructure, Kubernetes, and Terraform) to ensure AI services are operable and meet agreed SLAs.

Responsibilities:

Productionising AI Features (core focus)

Own the AI API surface in production: contracts/schemas, versioning, backward compatibility, and behavior guarantees for downstream consumers
Take RAG/agent prototypes from notebook/PoC to production services: clean interfaces, robust runtime behavior, and safe rollout paths
Implement reliability patterns: timeouts, retries with backoff, idempotency, circuit breakers, rate limiting, graceful degradation, and fallbacks
Build observability end-to-end: structured logging, metrics, tracing (OpenTelemetry), and actionable dashboards/alerts
Own release quality: CI/CD for AI services, prompt/config versioning, regression tests, and staged deployments
Drive operational readiness: runbooks, on-call-friendly diagnostics, incident retros, and continuous hardening

Architecture & System Design (important gap to fill)

Design and evolve AI API contracts (endpoints/tool contracts), ensuring safe, stable interfaces and clear ownership boundaries
Design service boundaries and interfaces for AI capabilities (APIs, contracts, and dependencies)
Make pragmatic tradeoffs across latency, cost, quality, and compliance; document and communicate decisions
Define patterns for state, memory, and persistence in agentic workflows (including partial failure handling and recovery)
Establish integration patterns with existing platform services and data sources (without duplicating DevOps ownership)

Data & Retrieval Systems (as used by product features)

Build/operate ingestion and refresh pipelines that support product knowledge bases (freshness, lineage, auditability)
Implement retrieval quality monitoring (e.g., drift, relevance), caching strategies, and evaluation harnesses
Partner with data/analytics teams on data contracts, validation checks, and SLAs

Team Leadership & Engineering Standards

Lead and develop a team of data and software engineers. Set direction, review work, unblock people.
Run design reviews and code reviews that raise the bar without slowing delivery
Establish shared patterns and standards for production AI systems that the team can build on as it scales
Raise the engineering bar: code reviews, design reviews, and shared standards for production AI systems
Collaborate across AI Product Engineering, Data Science, DevOps/SRE, Security, and Product to keep ownership boundaries clean

Innovation in AI SDLC & Product Delivery

Own the evolution of our AI SDLC and AI stack: evaluate, pilot, and productionize tools/practices that measurably improve quality, reliability, delivery speed, latency, or cost (with clear success metrics and rollback paths)
Enable innovation by AI product engineers/data scientists through reusable frameworks, templates, and paved paths
Bring leading LLM engineering discipline into production
Translate new capabilities (agents/tooling) into stable, well-governed product APIs without compromising operability or compliance

You are a software engineer first. 7+ years building production backend systems, with strong opinions about API design, error handling, testing, and operability
Proven ability to turn ambiguous prototypes into reliable services with clear operational characteristics
Comfortable owning systems across the full lifecycle: design → build → launch → operate
TypeScript or Python at a production level: you write services, not scripts. Clean abstractions, proper error handling, tested code
You can lead engineers. You've mentored, set technical direction, and delivered through a team, not just as an individual contributor

Technical Skills

Strong production-grade Python (or similar backend language): API/service development, performance, testing discipline
Solid understanding of reliability engineering: resiliency patterns, SLOs/SLAs, capacity planning, and incident response
Observability expertise: OpenTelemetry, metrics/alerting, tracing, and debugging distributed systems
Practical experience with LLM application stacks (RAG/agents/tooling) and evaluation/testing approaches
SQL fluency for investigating system behavior and data issues
