Recap on ML — AI Agents & Dev Tooling by qte77

An Open Agentic Coding Harness — the Loop, the Plugins, the Senses, the Eval

June 19, 2026

Five open Apache-2.0 repos turn Claude Code into an autonomous coding harness — a TDD loop in parallel git worktrees, a plugin marketplace, local voice + vision, recursive agent teams, and a deterministic evaluator. Here's how they fit together, with honest dates.

Building a Trustworthy Agent Loop for a Physical Lab

June 12, 2026

We build autonomous, self-evaluating agents — and we're building the hardest version of that idea, an open sub-$1k self-driving lab. This is the thesis, the plan, and the one piece that already ships. The AI scientist's missing half is the agent, not the model.

A $150 Pipetting Robot from a Stock 3D Printer

June 1, 2026

Turn a used Anycubic i3 Mega + a DLAB dPette+ electronic pipette into a 96-well disposable-tip pipetting robot. Marlin runs unmodified. Python drives. Apache-2.0.

GraphJudge — Measuring How Agents Collaborate

January 15, 2026

GraphJudge is an A2A-compliant multi-agent evaluation framework for the AgentBeats competition. It captures interaction traces as directed graphs and scores coordination quality through structural graph metrics, LLM-as-judge qualitative assessment, and text-similarity checks for reproducibility.

AI Agents-eval Papers Meta Review

August 9, 2025

A meta-review of 50+ agentic AI evaluation papers that identifies key dimensions — autonomy, multi-agent coordination, safety, and explainability — and critical gaps in standardization and long-term behavioral assessment.

AI Agents-eval Enhancement Recommendations

August 9, 2025

A prioritized roadmap of twelve proposed enhancements for the Agents-eval project, spanning multi-dimensional evaluation architecture, safety-first modules, predictive evaluation, and AgentOps integration.