Staff Developer, AI Evaluation & Reliability

Caseware

On-site

Medellin

Caseware is one of Canada's original Fintech companies and has led the global audit and accounting software industry for over 30 years, with more than 500,000 users across 130 countries and products available in 16 languages. While you might not have heard of us (yet), over 36,000 accounting and audit professionals list Caseware as a skill on their LinkedIn profiles!

As we build the next generation of intelligent, cloud-based solutions for auditors, accountants, and financial professionals, agentic AI is a core pillar of our strategy. We are developing a reusable, enterprise-grade agentic AI platform that enables product teams across Caseware Cloud to safely, consistently, and efficiently deliver AI-powered capabilities in highly regulated environments.

We are looking for a Staff Developer – AI Evaluation & Reliability to raise the bar on the quality, trustworthiness, and operational reliability of our AI platform. This is a senior individual contributor role with broad technical influence and leadership expectations. You will own how our agentic systems are evaluated, validated, and governed in production, and help define the standards that product teams across Caseware rely on.

In this role, you will provide technical stewardship for evaluation frameworks, reliability mechanisms, and compliance-aligned controls that sit at the center of Caseware’s AI strategy. You’ll partner closely with Staff Engineers, Product Management, QA, Security, Data, and Infrastructure teams to ensure the platform scales reliably, meets enterprise and regulatory standards, and delivers measurable value to both product teams and customers.

📍 Location: This is a fully remote position located in Colombia.

Contact

Maira Russo - Senior Talent Acquisition Partner

What you will be doing

Own and evolve evaluation strategy for LLM- and agent-based systems, including golden datasets, rubric-based scoring, reference-free evaluations, regression testing, and A/B experimentation.
Benchmark and analyze foundation model performance within Caseware’s domain, identifying capability gaps, failure modes, and opportunities for improvement.
Lead the design and optimization of Retrieval-Augmented Generation (RAG) pipelines, including embeddings, retrieval strategies, reranking, and retrieval quality metrics.
Design and maintain feedback and evaluation pipelines that connect real-world user behavior to measurable improvements in agent performance.
Apply data science techniques to analyze agent behavior, diagnose reliability issues, detect drift, and surface systemic risks.
Define and implement guardrails for agentic systems, including schema validation, content filtering, tool governance, and policy enforcement.
Establish approval gates, audit trails, and controlled rollout mechanisms for AI and agent changes, including feature flags, staged deployments, and kill switches.
Partner with Security and Data teams to embed privacy-by-design practices, including PII detection and masking, data minimization, and retention controls.
Support and influence SOC 2 and ISO 27001-aligned controls across AI data flows, including access management, logging, and incident response.
Act as a Staff-level technical leader, mentoring other engineers, shaping best practices, and raising the overall bar for AI reliability and evaluation across the organization.

What you’ll bring

Strong data science foundation, including Python, SQL, statistics, and experiment design.
Deep hands-on experience with LLMs, prompting strategies, and agent reasoning patterns.
Practical expertise with embeddings, vector databases, retrieval metrics, and reranking approaches.
Proven experience designing or operating evaluation frameworks for generative AI or agentic systems, including automated and human-in-the-loop evaluation.
Strong understanding of AI reliability, safety, and governance, including guardrails, validation, monitoring, and change control.
Working knowledge of privacy engineering principles and familiarity with GDPR/CCPA concepts such as consent, purpose limitation, and data subject rights.
Experience operating in enterprise or regulated environments, including contributions to SOC 2 / ISO 27001-aligned systems and processes.
Ability to influence across teams, communicate clearly about complex AI trade-offs, and drive alignment without direct authority.
Strong English language communication and collaboration skills.

Nice to have

Experience with agent frameworks such as LangChain or similar.
Domain experience in finance, accounting, or other regulated industries (e.g., healthcare, legal).
Experience with AI safety or red-teaming, including prompt injection, data exfiltration, or tool misuse.
Familiarity with governed change management, including feature flags, staged rollouts, and kill switches.
Experience with agentic coding or autonomous development workflows.

Technology stack your team works with

Backend & Platform: TypeScript, NestJS, Python
Cloud & Infrastructure: AWS EKS, AWS Lambda, AWS Bedrock, AWS AgentCore
Search & Retrieval: AWS OpenSearch Serverless
Document & Data Processing: AWS Textract, DynamoDB, S3
AI Evaluation & Observability: LangFuse, LangSmith (or equivalent)
AI-assisted development tools: GitHub Copilot, AWS Kiro
Developer Tooling: GitHub, GitHub Actions, Nx Monorepo
Collaboration: Jira, Confluence, Microsoft Teams, Outlook

Perks & Benefits

"Contrato a término indefinido" (indefinite-term employment contract) with all the legal benefits
Prepaid Medicine
Life insurance and funeral assistance
Internet allowance
Home office stipend
Competitive compensation — above the market average
100% remote work environment and an excellent work-life balance
Opportunity to work for a growing global SaaS leader
A culture that promotes independence, innovation, trust, and accountability
Open space to be creative, innovative and strategize for the future
Mentorship by highly experienced professionals
Budget for training: we want you to grow
5 Personal Time Off days per year
Sick leave top-up to 100% of salary, paid by the employer from day 3 to day 90
Recognition Award: additional paid time off in recognition of the corresponding year of service
Upgraded vacation starting at 5 years of service

What's in it for you:

▪️Innovation is at our core. We work with cutting-edge technology in accounting and financial reporting, constantly pushing the boundaries to create impactful software solutions.

▪️We are committed to a collaborative culture, where your ideas are valued, and knowledge sharing is encouraged within a supportive, inclusive team.

▪️Work-life balance is important to us. We offer flexible work options, remote opportunities, and generous time-off policies to ensure a healthy work-life balance.

▪️We offer competitive compensation, including salary and comprehensive benefits such as health insurance and retirement plans.

▪️We are driven by impactful work. Your contributions directly affect how our clients manage financial processes and drive their success.

▪️Recognition and rewards matter to us. We celebrate hard work through recognition programs, performance bonuses, and opportunities for career growth.

▪️We embrace global opportunities. Work on international projects and collaborate with a diverse, global team.

About Caseware:

Caseware's cutting-edge software products are meticulously designed for accounting firms, corporations, and governments. Our teams are continually collaborating, innovating, and building upon our existing suite of products. With a customer-focused mindset, we are building technology that is shaping what the future of audits, financial reporting, and financial data analytics will look like.

Following a strategic investment from Hg Capital in 2020, Caseware is now in its next major growth phase as we double down on the people and products that have made Caseware so successful to date.

One of Caseware's core values is Many Voices, One Team and with that in mind, we're dedicated to building teams as diverse as our customers in an equitable and inclusive way. We welcome and encourage candidates of all backgrounds to apply. Should you require accommodations or have any questions at any point during the application or interview process, please e-mail our People Operations team at talent@caseware.com.

Background Check:

Candidates who receive an offer must successfully complete a background check through Certn.co, which typically includes an Identity Verification and a Criminal Record Check. Executives and Senior Managers will also undergo a Soft Credit Check. Candidates residing in the Netherlands and Germany are excluded from background checks via Certn.co.

Security and Fraud:

Caseware takes the security of candidates seriously. All legitimate communication from us will come from email addresses ending in @caseware.com and our open positions are always listed on reputable job boards and on our website https://jobs.lever.co/caseware. We will NEVER ask for payment or financial information from you. If you receive an unsolicited job offer, proceed with extreme caution.
