Large language models are rapidly transforming how we interact with software. Among the most practically impactful LLM-driven tools are AI coding assistants, which have seen explosive adoption as software engineers tap into their potential to dramatically accelerate development workflows. Coding assistants offer a range of prompt-driven capabilities, including generating new code, explaining existing code, debugging problems, suggesting improvements, and answering technical questions.
There are numerous coding assistant tools, interfaces and platforms available to developers, but most are powered by the same handful of leading options: Google's Gemini, OpenAI's GPT Codex and Anthropic's Claude Code. With advances happening at such a fast pace, it can be difficult for teams to choose which model is likely to work most effectively for the development job at hand.
Building a Benchmarking Framework with a Difference

Loomery is developing a framework that assesses real-world AI coding assistant model performance through a product engineering lens. A product engineer bridges the gap between building and shipping - they don't just write code, but also think deeply about user experience, business impact and product strategy, then take ownership of features from conception through to launch and iteration.

Benchmarks such as those published by SWE-bench and LiveBench provide valuable technical insights, but they generally either evaluate performance on tasks that tend to be academic in nature, or rely heavily on automation to judge the results of more realistic task outputs. We find these benchmarks difficult to translate into a clear human understanding of how a model behaves in the product-focused engineering scenarios we commonly encounter in our day-to-day work. One of the most frequently cited benchmarks, ARC-AGI, assesses each leading model as "a system that can efficiently acquire new skills outside of its training data" - a significant capability to evaluate, but one that is hard to connect to how AI coding tools perform in a typical software development project.

We are taking a different approach: put the models to the test on the kinds of everyday tasks and code repos that we commonly work with as engineers at Loomery. Our aim is to simulate the flow of realistic chunks of development work that we would typically deliver on client projects, such as adding new features to a website or mobile app, making UI improvements, or integrating a new API into an application.
Our Methodology

Our benchmarking methodology is designed to guide our engineering best practices for how we expertly harness coding assistants to maximise the product value we deliver to our clients. The challenges and repos on which we assess coding assistants are carefully selected to map directly to our day-to-day ways of working on product engineering deliverables. Challenges are divided into four 'T-shirt' sizes - S, M, L and XL - representing increasing levels of functionality, complexity and codebase footprint. We run the challenges on publicly available, open-source, real-world product repos that are representative of the engineering domains and tech stacks in which we typically develop products for our clients (e.g. mobile - Swift / iOS native; web - TypeScript & React; backend services - TypeScript & AWS).
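To make the shape of a challenge concrete, here is a minimal TypeScript sketch of how a challenge definition might be represented in a framework like this. The type names, fields and example values are illustrative assumptions for this post, not Loomery's actual framework code.

```typescript
// Illustrative sketch only - mirrors the methodology described above.
type TShirtSize = "S" | "M" | "L" | "XL";
type Domain = "ios-native" | "web" | "backend";

interface Challenge {
  id: string;
  title: string;                 // e.g. "Add a dark-mode toggle to the settings screen"
  size: TShirtSize;              // increasing functionality, complexity and codebase footprint
  domain: Domain;                // maps to the tech stacks we typically work in
  repoUrl: string;               // publicly available open-source product repo
  acceptanceCriteria: string[];  // what "done" looks like for the piece of work
}

// Hypothetical example challenge:
const exampleChallenge: Challenge = {
  id: "web-m-01",
  title: "Integrate a third-party payments API into the checkout flow",
  size: "M",
  domain: "web",
  repoUrl: "https://github.com/example/open-source-storefront",
  acceptanceCriteria: [
    "Checkout completes end-to-end against the sandbox API",
    "Errors are surfaced to the user with a clear retry path",
  ],
};
```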
We test two different approaches to working with the coding assistant:

- 'Pairing' - incremental use of targeted, short and frequent code-specific prompts.
- 'Autonomous' - providing the assistant with a full specification in one prompt, then optionally following up with course-correcting prompts as needed.

In parallel, we also test two different formats of prompt content:

- 'Contextualised' - longer prompts that provide detailed context about the task and required output, for example details about the codebase, instructions on how to analyse and generate code, quality criteria for the generated code, and a specified format for the coding assistant's reply in the terminal.
- 'Basic' - shorter, more concise prompts that are just long enough to clearly specify the desired outcome, leaving the coding assistant to make its own decisions about the approach it takes.

Performance is scored on a range of metrics grouped under the following categories, with specific reference examples of high and low scores defined for each metric, and an overall score determined for each category:

- Lead time
- Context handling
- Output quality
- Productivity
- Reliability
- Usability & UX
- Security & privacy
- Learning curve & adaptability
- Tool assistance
- ROI/Value
- Experimental metrics
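The sketch below shows how a single benchmark run and its score roll-up could be modelled against this matrix. The category names follow the list above; the 1-5 scale, the model identifiers and the unweighted mean are assumptions made for illustration rather than Loomery's actual rubric.

```typescript
// Illustrative sketch of the evaluation matrix and scoring roll-up.
type Approach = "pairing" | "autonomous";
type PromptFormat = "contextualised" | "basic";

type ScoreCategory =
  | "lead-time"
  | "context-handling"
  | "output-quality"
  | "productivity"
  | "reliability"
  | "usability-ux"
  | "security-privacy"
  | "learning-curve-adaptability"
  | "tool-assistance"
  | "roi-value"
  | "experimental";

interface RunResult {
  model: "gemini" | "gpt-codex" | "claude-code";
  challengeId: string;
  approach: Approach;
  promptFormat: PromptFormat;
  // Each metric is scored against predefined high/low reference examples,
  // then grouped into a per-category score (assumed 1-5 scale here).
  categoryScores: Record<ScoreCategory, number>;
}

// Simple unweighted mean across categories - an assumption; a real framework
// might weight categories differently.
function overallScore(run: RunResult): number {
  const scores = Object.values(run.categoryScores);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}
```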
Using Sub-Agents

Beyond the underlying models, each tool offers ways to make outputs more deterministic, such as adding an AGENTS.md file for project context or, in the case of Claude, adding SKILLS.md files specialising in specific domain areas. To keep our testing realistic to how we would use these tools on projects, we leveraged these techniques as far as each tool allowed, rather than relying on raw model performance alone.
Insights from our First Run

Pairing & Basic Prompt Format Come Out on Top

Applying the 'pairing' approach (incremental use of targeted, short and frequent code-specific prompts) and/or the 'basic' prompt format (shorter, more concise prompts) results in higher-quality output than applying their respective alternatives.
Domain-specific Findings

- GPT Codex performs best in the web app domain, scoring highest on challenges of all sizes.
- In the iOS native domain, each of the leading models has its place: Gemini performs best on small challenges, GPT Codex wins on medium-sized ones, and Claude Code gives the top results on large challenges.
- Claude Code generally did best on backend challenges, beating its rivals on all but the XL size, where GPT Codex took the crown.

Dive into the Details

Explore the detailed results in full here, including all the info on challenges, repos and evaluation criteria.
Building on the Foundations - Future Enhancements

Exploring Prompt Engineer Personas

Whilst our first run has measured performance through the lens of experienced engineers working in the engineering domains and tech stacks that they specialise in, one of our broader aims is to carry out benchmarking from the perspective of various personas, in order to inform our views on the effective use of coding assistants by specialised and non-specialised prompt engineers alike.
- A - Engineering domain expert: an engineer with a clear idea of what they want out of the model - best practices, patterns and frameworks. Strong idea of what GREAT looks like.
- B - Polyglot engineer: a software engineer, but not an engineering domain expert (e.g. an Android developer coding for the backend with Python). Strong grasp of general software best practices but little to no knowledge of domain-specific requirements and patterns. Solid idea of what GOOD looks like.
- C - Vibe coder: a non-engineer/non-coder who knows what they want in terms of product features but nothing when it comes to code and engineering practice. No barometer for engineering standards.

A Living Framework

This is v1 of our framework and results - we are seeking your feedback and collaboration as we further develop our methodology. Look out for future evaluation results and insights from Loomery as new models are released and LLM capabilities continue to progress.
What Next?

Our findings can help us make more informed and intentional decisions about the ways in which we use coding assistants and the specific tools we select for given tasks. Do they mean that all engineers should buy a GPT license today? Not necessarily, but for web development tasks, for example, the results indicate that it's a good place to start. Of course, the landscape is moving at lightning speed, and that's why we're going to re-run these tests on a regular basis. Sign up below to be notified when we publish new benchmarking results, and get in touch via the 'Let's talk' button below to let us know your thoughts!