Feb 3, 2025
Evaluating LLMs in web app generation
LMArena, a UC Berkeley research team, created WebDev Arena to evaluate the capabilities of large language models (LLMs) in web development. It’s a free open-source arena where two LLMs compete to build a web app. You can vote on which LLM performs better and view a leaderboard of the best models.
The challenges of LLM coding evals
LMArena faced several technical challenges in having multiple LLMs build web applications in real time.
When benchmarking LLMs on web development tasks, the generated code has to actually run. While getting LLMs to generate code is straightforward, running that code securely at scale presents several challenges:
Speed: For meaningful comparisons, the evaluator needs to see the two LLMs' outputs nearly simultaneously. When you're comparing output from two or more models at once, even small delays compound quickly and distort the voting results.
Code execution: Running coding evaluations means executing large amounts of generated code at once. Each LLM can produce extensive code, from simple DOM manipulations to complex React components, and running these side by side requires significant computational resources.
Isolation and security: Each piece of LLM-generated code needs to run in an isolated environment. When you're comparing models like Claude and GPT-4 simultaneously, their code can't interfere with each other or share any execution context, and none of it should have access to the underlying system.
LLM code execution
These challenges led LMArena to implement E2B as their execution environment. Under the hood, each sandbox is a small VM, allowing WebDev Arena to run code from multiple LLMs simultaneously while maintaining strict isolation and performance standards.
Key factors in this decision included:
Security first - E2B's isolated environments ensure each LLM's code runs separately and securely
Quick startup - E2B sandboxes start in ~150ms, essential for real-time model comparison
Reliability at scale - Running multiple sandboxes simultaneously for different LLM battles
The E2B sandboxes provide isolated cloud environments specifically designed for running AI-generated code. E2B is agnostic to the choice of tech stack, frameworks, and models, which lets WebDev Arena run code from different LLMs (e.g., Claude, GPT-4, Gemini, DeepSeek, and Qwen).
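As a rough illustration (not LMArena's actual code), spinning up one isolated sandbox per model with E2B's JavaScript SDK can look like the following; the model names, code strings, and timeout are placeholders:

```typescript
import { Sandbox } from 'e2b'

// Placeholder outputs from the two models in a single battle.
const battle = [
  { model: 'model-a', code: '/* app generated by model A */' },
  { model: 'model-b', code: '/* app generated by model B */' },
]

// One sandbox (a small VM) per model, created in parallel so both
// sides of the comparison are ready at nearly the same time.
const sandboxes = await Promise.all(
  battle.map(() => Sandbox.create({ timeoutMs: 5 * 60_000 }))
)

// ...write and run each model's code inside its own sandbox, then clean up.
await Promise.all(sandboxes.map((sbx) => sbx.kill()))
```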
“We have over 60,000 users who started over 300,000 E2B sandboxes” - Aryan Vichare, the co-creator of WebDev Arena
Implementing E2B
LMArena implemented E2B for WebDev Arena in November 2024. The integration involved working closely with E2B's team to ensure reliable execution of code from multiple LLMs.
The implementation consisted of handling sandbox creation and code execution. Here's how it works:
Sandbox Configuration and Creation
The team implemented a flexible sandbox system with configurable timeouts and templates.
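A minimal sketch of that configuration with the E2B JavaScript SDK; the template name and timeout values are illustrative, not WebDev Arena's actual settings:

```typescript
import { Sandbox } from 'e2b'

// The template name and timeouts below are placeholders.
const NEXTJS_TEMPLATE = 'nextjs-developer'

// A custom template pre-installs the runtime (e.g. Node + Next.js),
// so the sandbox boots ready to serve a generated app.
const sandbox = await Sandbox.create(NEXTJS_TEMPLATE, {
  timeoutMs: 10 * 60_000, // sandbox is automatically killed after 10 minutes
})

// The timeout can be extended later, e.g. while a user is still
// interacting with the generated app.
await sandbox.setTimeout(15 * 60_000)
```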
Dynamic Dependency Management
The system supports dynamic package installation based on LLM requirements (see the sketch after this list):
Automatic detection of additional dependencies
Custom installation commands for different environments
Real-time logging of dependency installation status
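A hedged sketch of what that can look like with E2B's commands API; the dependency-detection step, package names, and project path are placeholders, and the log callbacks are shown as one way to surface installation status in real time:

```typescript
import { Sandbox } from 'e2b'

const sandbox = await Sandbox.create()

// Hypothetical list of extra packages detected in the LLM's output,
// e.g. by scanning its import statements.
const extraDeps = ['zustand', 'framer-motion']

if (extraDeps.length > 0) {
  // Install inside the sandbox and stream the install logs back,
  // so the UI can report installation status while the app is prepared.
  await sandbox.commands.run(`npm install ${extraDeps.join(' ')}`, {
    cwd: '/home/user/app', // placeholder project path inside the sandbox
    onStdout: (line) => console.log('[install]', line),
    onStderr: (line) => console.error('[install]', line),
  })
}
```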
Code Execution Pipeline
The system handles two types of code execution (illustrated in the sketch after this list):
Code Interpreter Mode: For running and evaluating code with detailed output
Web Development Mode: For deploying and accessing web applications
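A hedged sketch of the two paths using E2B's code interpreter SDK; the file path, start command, and port are assumptions, and in practice each mode would likely use its own sandbox template rather than a single sandbox for both:

```typescript
import { Sandbox } from '@e2b/code-interpreter'

const sandbox = await Sandbox.create()

// Code Interpreter Mode: run a snippet and capture its output for evaluation.
const execution = await sandbox.runCode('print(sum(range(10)))')
console.log(execution.logs.stdout, execution.error ?? '')

// Web Development Mode: write the generated app, start it, and expose a URL.
const llmGeneratedPage = '/* page code from the model */' // placeholder
await sandbox.files.write('/home/user/app/pages/index.tsx', llmGeneratedPage)
await sandbox.commands.run('cd /home/user/app && npm run dev', {
  background: true, // keep the dev server running after this call returns
})
const previewUrl = `https://${sandbox.getHost(3000)}` // port 3000 is an assumption
console.log(previewUrl)
```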
“It took less than 2 hours to get E2B up and running” - Aryan Vichare, the co-creator of WebDev Arena
What's next
Since the launch of WebDev Arena with E2B's code execution layer, the platform has run over 50,000 model comparisons, with Claude 3.5 Sonnet at the top of the leaderboard, followed closely by DeepSeek-R1 (as of the publish date of this blog post). The team plans to expand the benchmark with more models and to run online general-purpose coding evaluations in the future.