# Project Plan Outline
## Week 1 starting 2025-03-31: Metric Development and CLI Enhancements
### Milestones
- Metric Development: Implement at least three new metrics for evaluating agentic AI systems.
- CLI Streaming: Enhance the CLI to stream Pydantic-AI output.
### Tasks and Sequence
- Research and Design New Metrics
    - Task Definition: Conduct a literature review and design three new metrics that are agnostic to specific use cases but measure core agentic capabilities.
    - Sequence: Before implementing any code changes.
    - Definition of Done: A detailed document outlining the metrics, their mathematical formulations, and how they will be integrated into the evaluation pipeline.
- Implement New Metrics
    - Task Definition: Write Python code to implement the new metrics, ensuring they are modular and easy to integrate with the existing evaluation logic (see the metric interface sketch after this task list).
    - Sequence: After completing the design document.
    - Definition of Done: Unit tests for each metric pass, and the metrics are successfully integrated into the evaluation pipeline.
- Enhance CLI for Streaming
    - Task Definition: Modify the CLI to stream Pydantic-AI output using asynchronous functions (see the streaming sketch after this task list).
    - Sequence: Concurrently with metric implementation.
    - Definition of Done: The CLI can stream output from Pydantic-AI models without blocking, and tests demonstrate successful streaming.
- Update Documentation
    - Task Definition: Update PRD.md and README.md to reflect new metrics and CLI enhancements.
    - Sequence: After completing metric implementation and CLI enhancements.
    - Definition of Done: PRD.md includes detailed descriptions of new metrics, and README.md provides instructions on how to use the enhanced CLI.
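
As a rough illustration of the modular shape the new metrics could take, the sketch below assumes a minimal `Metric` protocol, a hypothetical `EvalRecord` data class, and an example `ToolSelectionAccuracy` metric; the names, fields, and scoring rule are illustrative placeholders rather than the final design.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class EvalRecord:
    """Hypothetical record of a single agent run used for scoring."""
    expected_tools: list[str] = field(default_factory=list)
    called_tools: list[str] = field(default_factory=list)


class Metric(Protocol):
    """Minimal interface each new metric would implement."""
    name: str

    def score(self, record: EvalRecord) -> float:
        """Return a score in [0, 1] for one evaluation record."""
        ...


class ToolSelectionAccuracy:
    """Example metric: fraction of expected tools the agent actually called."""
    name = "tool_selection_accuracy"

    def score(self, record: EvalRecord) -> float:
        if not record.expected_tools:
            return 1.0  # nothing was expected, so nothing was missed
        hits = sum(tool in record.called_tools for tool in record.expected_tools)
        return hits / len(record.expected_tools)
```

Keeping each metric to a single `score` method should make it straightforward to register new metrics with the evaluation pipeline and to unit-test them in isolation.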
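For the CLI streaming task, the following is a minimal sketch of non-blocking output, assuming Pydantic-AI's `Agent.run_stream` / `stream_text` interface; the model string and prompt are placeholders, and the real CLI would layer argument parsing on top.

```python
import asyncio

from pydantic_ai import Agent

# Model string is a placeholder; any model supported by Pydantic-AI would work here.
agent = Agent("openai:gpt-4o")


async def stream_to_cli(prompt: str) -> None:
    """Print model output as it arrives instead of waiting for the full response."""
    async with agent.run_stream(prompt) as result:
        async for delta in result.stream_text(delta=True):
            print(delta, end="", flush=True)
    print()  # final newline once the stream completes


if __name__ == "__main__":
    asyncio.run(stream_to_cli("Summarise the latest evaluation run."))
```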
## Week 2 starting 2025-04-07: Streamlit GUI Enhancements and Testing
### Milestones
- Streamlit GUI Output: Enhance the Streamlit GUI to display streamed output from Pydantic-AI.
- Comprehensive Testing: Perform thorough testing of the entire system with new metrics and GUI enhancements.
### Tasks and Sequence
- Enhance Streamlit GUI
    - Task Definition: Modify the Streamlit GUI to display the streamed output from Pydantic-AI models (see the streaming-display sketch after this task list).
    - Sequence: Start of Week 2.
    - Definition of Done: The GUI can display streamed output without errors, and user interactions (e.g., selecting models, inputting queries) work as expected.
- Integrate New Metrics into GUI
    - Task Definition: Ensure the Streamlit GUI can display results from the new metrics (a metric-display sketch follows this task list).
    - Sequence: After enhancing the GUI for streamed output.
    - Definition of Done: The GUI displays metric results clearly, and users can easily interpret the output.
- Comprehensive System Testing
    - Task Definition: Perform end-to-end testing of the system, including the new metrics and GUI enhancements (see the test sketch at the end of this task list).
    - Sequence: After integrating the new metrics into the GUI.
    - Definition of Done: All tests pass, and the system behaves as expected across representative end-to-end scenarios.
- Finalize Documentation and Deployment
    - Task Definition: Update MkDocs documentation to reflect all changes and deploy it to GitHub Pages.
    - Sequence: After completing system testing.
    - Definition of Done: Documentation is updated, and the latest version is live on GitHub Pages.
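
For the Streamlit streaming task, a minimal sketch of incremental display is shown below; it assumes the same hypothetical agent and model string as the CLI sketch and uses Streamlit's `st.empty` placeholder pattern, with page layout kept deliberately bare.

```python
import asyncio

import streamlit as st
from pydantic_ai import Agent

agent = Agent("openai:gpt-4o")  # placeholder model name

st.title("Agent Evaluation Demo")
prompt = st.text_input("Query")


async def render_stream(user_prompt: str) -> None:
    """Append each streamed delta to a placeholder so the text appears incrementally."""
    placeholder = st.empty()
    text = ""
    async with agent.run_stream(user_prompt) as result:
        async for delta in result.stream_text(delta=True):
            text += delta
            placeholder.markdown(text)


if prompt:
    asyncio.run(render_stream(prompt))
```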
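For surfacing the new metrics in the GUI, the short sketch below uses Streamlit's `st.metric` and `st.dataframe`; the metric names and values are placeholders standing in for real evaluation-pipeline output.

```python
import pandas as pd
import streamlit as st

# Placeholder results; in practice these would come from the evaluation pipeline.
results = {
    "tool_selection_accuracy": 0.87,
    "plan_coherence": 0.74,
    "task_completion_rate": 0.91,
}

st.subheader("Metric Results")
cols = st.columns(len(results))
for col, (name, value) in zip(cols, results.items()):
    col.metric(label=name, value=f"{value:.2f}")

# Tabular view for comparing results side by side.
st.dataframe(pd.DataFrame([results]))
```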
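For the end-to-end testing task, one possible test shape is sketched below; it assumes pytest with the pytest-asyncio plugin, Pydantic-AI's `TestModel` and `Agent.override` helpers for stubbing the model, and hypothetical `app.agent` / `app.metrics` modules exposing the pieces from the earlier sketches.

```python
import pytest
from pydantic_ai.models.test import TestModel

from app.agent import agent  # hypothetical module exposing the shared Agent
from app.metrics import EvalRecord, ToolSelectionAccuracy  # hypothetical metrics module


@pytest.mark.asyncio
async def test_stream_and_score_end_to_end():
    """Stream a response from a stubbed model, then score a record with a new metric."""
    # Override the real model with Pydantic-AI's TestModel so no API key is needed.
    with agent.override(model=TestModel()):
        async with agent.run_stream("ping") as result:
            chunks = [delta async for delta in result.stream_text(delta=True)]
    assert "".join(chunks)  # some text was streamed

    metric = ToolSelectionAccuracy()
    record = EvalRecord(expected_tools=["search"], called_tools=["search"])
    assert metric.score(record) == 1.0
```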
## Additional Considerations
- Code Reviews: Schedule regular code reviews to ensure quality and adherence to project standards.
- Feedback Loop: Establish a feedback loop with stakeholders to gather input on the new metrics and GUI enhancements.