Principal AI Engineer

Alpheya
15 hours ago
Full-time
Remote friendly (Abu Dhabi, Abu Dhabi, United Arab Emirates)
Worldwide

About Alpheya

We are a B2B WealthTech startup based in Abu Dhabi and backed by BNY Mellon (America’s oldest bank and the first company to list on the NYSE) and Lunate (a new $100B AUM alternative asset management firm based in Abu Dhabi, UAE). The company has raised $300M to build a state-of-the-art wealth technology platform.

Our mission is to power and grow our clients’ Wealth franchises through differentiated experiences, financial solutions, and insights. Our digital wealth management platform will enable banks and other financial institutions in the Middle East to grow and further penetrate affluent, HNW, and UHNW investor segments.

We are a startup with truly cross-functional, agile teams that still leverages the capabilities and knowledge of large organizations.

For more information, please visit www.alpheya.com

 

Role Overview

We're hiring a software engineer who builds production systems and who has spent the last few years applying that discipline to AI-powered products.

You will take validated AI prototypes and turn them into production-grade software systems. You’ll focus on reliability, observability, maintainability, and clear architecture for AI-powered features in a regulated environment.

You will also lead and mentor a group of data and software engineers, helping them deliver reliably and raising the engineering bar.

This is not a DevOps role. You will partner closely with our DevOps/SRE team (which owns core infrastructure, Kubernetes, and Terraform) to ensure AI services are operable and meet agreed SLAs.

Responsibilities:

  • Productionising AI Features (core focus)
    • Own the AI API surface in production: contracts/schemas, versioning, backward compatibility, and behaviour guarantees for downstream consumers
    • Take RAG/agent prototypes from notebook/PoC to production services: clean interfaces, robust runtime behavior, and safe rollout paths
    • Implement reliability patterns: timeouts, retries with backoff, idempotency, circuit breakers, rate limiting, graceful degradation, and fallbacks
    • Build observability end-to-end: structured logging, metrics, tracing (OpenTelemetry), and actionable dashboards/alerts
    • Own release quality: CI/CD for AI services, prompt/config versioning, regression tests, and staged deployments
    • Drive operational readiness: runbooks, on-call-friendly diagnostics, incident retros, and continuous hardening

  • Architecture & System Design (important gap to fill)
    • Design and evolve AI API contracts (endpoints/tool contracts), ensuring safe, stable interfaces and clear ownership boundaries
    • Design service boundaries and interfaces for AI capabilities (APIs, contracts, and dependencies)
    • Make pragmatic tradeoffs across latency, cost, quality, and compliance; document and communicate decisions
    • Define patterns for state, memory, and persistence in agentic workflows (including partial failure handling and recovery)
    • Establish integration patterns with existing platform services and data sources (without duplicating DevOps ownership)

  • Data & Retrieval Systems (as used by product features)
    • Build/operate ingestion and refresh pipelines that support product knowledge bases (freshness, lineage, auditability)
    • Implement retrieval quality monitoring (e.g., drift, relevance), caching strategies, and evaluation harnesses
    • Partner with data/analytics teams on data contracts, validation checks, and SLAs

  • Team Leadership & Engineering Standards
    • Lead and develop a team of data and software engineers. Set direction, review work, unblock people.
    • Run design reviews and code reviews that raise the engineering bar without slowing delivery
    • Establish shared patterns and standards for production AI systems that the team can scale on
    • Collaborate across AI Product Engineering, Data Science, DevOps/SRE, Security, and Product to keep ownership boundaries clean

  • Innovation in AI SDLC & Product Delivery
    • Own the evolution of our AI SDLC and AI stack: evaluate, pilot, and productionize tools and practices that measurably improve quality, reliability, delivery speed, latency, or cost, with clear success metrics and rollback paths
    • Enable innovation by AI product engineers and data scientists through reusable frameworks, templates, and paved paths
    • Bring leading LLM engineering discipline into production
    • Translate new capabilities (agents/tooling) into stable, well-governed product APIs without compromising operability or compliance

  • Requirements
    • You are a software engineer first, with 7+ years building production backend systems and strong opinions about API design, error handling, testing, and operability
    • Proven ability to turn ambiguous prototypes into reliable services with clear operational characteristics
    • Comfortable owning systems across the full lifecycle: design → build → launch → operate
    • TypeScript or Python at a production level: you write services, not scripts. Clean abstractions, proper error handling, tested code
    • You can lead engineers. You've mentored, set technical direction, and delivered through a team, not just as an individual contributor

  • Technical Skills
    • Strong production-grade Python (or similar backend language): API/service development, performance, testing discipline
    • Solid understanding of reliability engineering: resiliency patterns, SLOs/SLAs, capacity planning, and incident response
    • Observability expertise: OpenTelemetry, metrics/alerting, tracing, and debugging distributed systems
    • Practical experience with LLM application stacks (RAG/agents/tooling) and evaluation/testing approaches
    • SQL fluency for investigating system behavior and data issues