How It Works: Agentic AI vs. Manual Grading
"We didn't just build a better test runner; we built a system that understands the 'intent' behind a student's submission. This is the move from manual spot-checks to autonomous evaluation."
What is Agentic AI?
Traditional grading methods—like manual spot-checks or brittle Selenium scripts—are passive. They require an educator to open every project, follow a rubric, and manually verify results. If a student uses a modern framework or a non-standard layout, the script fails, and the educator is back to square one.
Evals.sh uses Agentic AI. Our agents act as "Digital TAs." They don't rely on hardcoded selectors. Instead, they "look" at the project's DOM tree, analyze labels, and understand that a button labeled "Submit Assignment" is the correct target, regardless of how the student styled or nested it.
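Label-based targeting can be sketched in a few lines. The types and function names below are illustrative, not the actual Evals.sh API: the idea is simply to walk the DOM tree and match a clickable element by its visible text, however the student nested or styled it.

```typescript
// Hypothetical, simplified DOM node shape for illustration.
interface DomNode {
  tag: string;
  text?: string;
  children?: DomNode[];
}

// Collect the visible text of a node and all of its descendants.
function visibleText(node: DomNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(visibleText).join(" ");
  return (own + " " + nested).trim();
}

// Find the first clickable element whose label matches,
// regardless of how deeply the label text is nested.
function findByLabel(root: DomNode, label: string): DomNode | null {
  if (
    (root.tag === "button" || root.tag === "a") &&
    visibleText(root).toLowerCase().includes(label.toLowerCase())
  ) {
    return root;
  }
  for (const child of root.children ?? []) {
    const hit = findByLabel(child, label);
    if (hit) return hit;
  }
  return null;
}
```

A selector like `#submit-btn` breaks the moment a student renames an ID; matching on the accessible label survives restyling and re-nesting.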
The Autonomous Execution Loop
1. Perception: The agent scrapes the current page state, converting the DOM into a semantically dense manifest.
2. Reasoning: Given the rubric goal (e.g., "Login and upload a file"), the agent plans the next interactive step based on common UX patterns detected in the student's work.
3. Action: The agent performs the click, type, or scroll event and monitors the result.
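The three steps above can be sketched as a single loop. Everything here is a hedged simplification (the `Agent` interface and `runLoop` are illustrative names, not Evals.sh internals), but it shows the control flow: perceive, plan, act, repeat until the goal is met or a step budget runs out.

```typescript
// Illustrative types, not the real Evals.sh agent interface.
type PageState = { url: string; elements: string[] };
type Step = { action: "click" | "type" | "scroll"; target: string };

interface Agent {
  perceive(): PageState;                          // scrape DOM into a manifest
  plan(goal: string, state: PageState): Step | null; // pick the next step, or null if done
  act(step: Step): void;                          // perform the interaction
}

// Run the perception -> reasoning -> action loop until the plan is
// exhausted or the step budget is hit; returns the number of steps taken.
function runLoop(agent: Agent, goal: string, maxSteps = 20): number {
  let steps = 0;
  while (steps < maxSteps) {
    const state = agent.perceive();        // 1. Perception
    const step = agent.plan(goal, state);  // 2. Reasoning
    if (!step) break;                      // goal reached or no move left
    agent.act(step);                       // 3. Action
    steps++;
  }
  return steps;
}
```

The step budget matters in practice: it keeps an agent from looping forever on a broken submission.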
DOM-Aware Navigation
The biggest challenges in automated evaluation are the Shadow DOM and dynamically loaded content. Traditional tools often "hang" or fail if an element isn't immediately present.
Our agents are built with asynchronous retry logic and visual diffing. They understand when a page is in a "loading" state and will wait for relevant UI elements to settle. This makes Evals.sh 10x more resilient than standard scripts when testing student projects built with different frameworks (React, Vue, Svelte, or vanilla JS).
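The core of that retry behavior can be sketched as a polling wait. This is a minimal, illustrative version (the helper name and defaults are assumptions, not the shipped implementation): rather than failing on the first missing lookup, it re-checks until the element appears or a timeout elapses.

```typescript
// Poll for a value until it appears or the deadline passes.
// A sketch of retry-with-timeout, not the production wait logic.
async function waitFor<T>(
  lookup: () => T | null,
  timeoutMs = 5000,
  intervalMs = 100,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const result = lookup();
    if (result !== null) return result;                   // element has settled
    await new Promise((r) => setTimeout(r, intervalMs));  // page still loading
  }
  throw new Error("element did not appear before timeout");
}
```

Because the wait is generic over the lookup, the same primitive works whether the page is React, Vue, Svelte, or vanilla JS.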
Deterministic Scoring vs. AI Flakiness
A common critique of AI in grading is "hallucination"—the AI might give a different score each time. Evals.sh solves this by separating evaluation from scoring.
The agent performs the evaluation and records objective facts (e.g., "The user reached the success page"). Then, a deterministic scoring engine processes these facts against a fixed rubric. This ensures that every student is graded against the exact same technical standard.
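This separation is what makes the grade reproducible, and it can be sketched as a pure function. The rubric shape below is hypothetical, but the principle is the one described above: the agent records facts once, and scoring involves no model call and no randomness.

```typescript
// Objective facts recorded by the agent, e.g. { reachedSuccessPage: true }.
type Facts = Record<string, boolean>;

// Hypothetical rubric shape: each item awards points when a fact is true.
interface RubricItem {
  fact: string;   // which recorded fact this item checks
  points: number; // points awarded when the fact is true
}

// Pure function of facts + rubric: the same inputs always
// produce the same grade, run after run.
function score(facts: Facts, rubric: RubricItem[]): number {
  return rubric.reduce(
    (total, item) => total + (facts[item.fact] ? item.points : 0),
    0,
  );
}
```

Only the fact-gathering step involves AI; once the facts are on record, re-scoring a whole class against a revised rubric is a deterministic recomputation.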
Academic Integrity
We don't just check if the code runs. We scan for hardcoded solutions, plagiarism patterns, and exposed API keys, ensuring students actually built the logic themselves.
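One piece of such a scan, exposed-secret detection, can be sketched with pattern matching. The patterns below are examples only, not the production rule set, and the function name is illustrative: the point is that this check is static analysis on the source, independent of whether the code runs.

```typescript
// Example patterns for secrets that should never appear in a submission.
// Illustrative only; a real scanner would use a much larger rule set.
const SECRET_PATTERNS: RegExp[] = [
  /api[_-]?key\s*[:=]\s*['"][A-Za-z0-9_\-]{16,}['"]/i, // inline API key assignments
  /sk-[A-Za-z0-9]{20,}/,                               // common token prefix style
];

// Return a finding per line that matches any secret pattern.
function scanForSecrets(source: string): string[] {
  const findings: string[] = [];
  source.split("\n").forEach((line, i) => {
    for (const pattern of SECRET_PATTERNS) {
      if (pattern.test(line)) {
        findings.push(`line ${i + 1}: possible exposed secret`);
        break; // one finding per line is enough
      }
    }
  });
  return findings;
}
```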
Professor-Grade Scale
Whether you have 20 students or 200, we deploy agent clusters in parallel. Process an entire semester's final projects in the time it takes to drink a coffee.
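The parallel fan-out pattern behind that claim is straightforward to sketch. This is a generic bounded worker pool (the helper and its defaults are assumptions, not the Evals.sh infrastructure): submissions are evaluated concurrently up to a concurrency limit, and results come back in submission order.

```typescript
// Evaluate many items concurrently with a bounded number of workers.
// A sketch of parallel fan-out, not the production cluster scheduler.
async function evaluateAll<T, R>(
  items: T[],
  evaluate: (item: T) => Promise<R>,
  concurrency = 8,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;                 // claim the next index (single-threaded, so safe)
      results[i] = await evaluate(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, worker),
  );
  return results;
}
```

Bounding concurrency matters: unbounded fan-out over a few hundred student projects would exhaust browser and network resources rather than finish faster.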
Ready to Scale Your Grading?
Evals.sh isn't just a testing tool; it's a way for professors to focus on high-impact teaching while the objective work of verification happens automatically. By understanding the intent behind the code, we provide a level of oversight that was previously impossible without a massive team of TAs.