
Project Plan Outline

Week 1 starting 2025-03-31: Metric Development and CLI Enhancements

Milestones

  • Metric Development: Implement at least three new metrics for evaluating agentic AI systems.
  • CLI Streaming: Enhance the CLI to stream Pydantic-AI output.

Tasks and Sequence

  • Research and Design New Metrics
      • Task Definition: Conduct a literature review and design three new metrics that are agnostic to specific use cases but measure core agentic capabilities.
      • Sequence: Before implementing any code changes.
      • Definition of Done: A detailed document outlining the metrics, their mathematical formulations, and how they will be integrated into the evaluation pipeline.
  • Implement New Metrics
      • Task Definition: Write Python code to implement the new metrics, ensuring they are modular and easy to integrate with the existing evaluation logic (see the metric sketch after this list).
      • Sequence: After completing the design document.
      • Definition of Done: Unit tests for each metric pass (see the test sketch after this list), and the metrics are integrated into the evaluation pipeline.
  • Enhance CLI for Streaming
      • Task Definition: Modify the CLI to stream Pydantic-AI output using asynchronous functions (see the streaming sketch after this list).
      • Sequence: Concurrently with metric implementation.
      • Definition of Done: The CLI streams output from Pydantic-AI models without blocking, and tests demonstrate successful streaming.
  • Update Documentation
      • Task Definition: Update PRD.md and README.md to reflect the new metrics and CLI enhancements.
      • Sequence: After completing metric implementation and CLI enhancements.
      • Definition of Done: PRD.md includes detailed descriptions of the new metrics, and README.md provides instructions for using the enhanced CLI.
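
To make the metric implementation task concrete, here is a minimal sketch of one modular metric plus a small aggregation helper. The names (Metric, EvalCase, ToolCallAccuracy, evaluate) and the pipeline interface are illustrative assumptions, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class EvalCase:
    """One evaluated interaction: expected vs. observed tool calls (hypothetical shape)."""
    expected_tool_calls: list[str]
    observed_tool_calls: list[str]


class Metric(Protocol):
    """Minimal interface each new metric implements so the pipeline can treat them uniformly."""
    name: str

    def score(self, case: EvalCase) -> float:
        """Return a score in [0, 1] for a single evaluation case."""
        ...


class ToolCallAccuracy:
    """Illustrative metric: fraction of expected tool calls the agent actually made."""
    name = "tool_call_accuracy"

    def score(self, case: EvalCase) -> float:
        if not case.expected_tool_calls:
            return 1.0  # nothing expected, nothing to miss
        hits = sum(1 for call in case.expected_tool_calls if call in case.observed_tool_calls)
        return hits / len(case.expected_tool_calls)


def evaluate(metrics: list[Metric], cases: list[EvalCase]) -> dict[str, float]:
    """Average each metric over all cases (assumes cases is non-empty); the real pipeline may aggregate differently."""
    return {m.name: sum(m.score(c) for c in cases) / len(cases) for m in metrics}
```

Keeping each metric behind a single score() method is what makes the set modular: new metrics can be added to the pipeline without touching existing evaluation logic.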
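
For the unit-test Definition of Done, a short pytest sketch against the illustrative metric above; the metrics module path is hypothetical.

```python
import pytest

from metrics import EvalCase, ToolCallAccuracy  # hypothetical module path


def test_tool_call_accuracy_full_match():
    case = EvalCase(expected_tool_calls=["search", "summarize"],
                    observed_tool_calls=["search", "summarize"])
    assert ToolCallAccuracy().score(case) == pytest.approx(1.0)


def test_tool_call_accuracy_partial_match():
    case = EvalCase(expected_tool_calls=["search", "summarize"],
                    observed_tool_calls=["search"])
    assert ToolCallAccuracy().score(case) == pytest.approx(0.5)
```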
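
For the CLI streaming task, a minimal sketch of non-blocking output, assuming pydantic-ai's Agent.run_stream() / stream_text() streaming interface; the argument names and the --model default are placeholders, not the project's real CLI.

```python
import argparse
import asyncio

from pydantic_ai import Agent


async def stream_response(model: str, prompt: str) -> None:
    """Print the model's answer to stdout as it arrives instead of waiting for the full result."""
    agent = Agent(model)
    async with agent.run_stream(prompt) as result:
        # stream_text(delta=True) yields incremental chunks rather than the accumulated text
        async for chunk in result.stream_text(delta=True):
            print(chunk, end="", flush=True)
    print()


def main() -> None:
    parser = argparse.ArgumentParser(description="Stream a Pydantic-AI response (sketch)")
    parser.add_argument("prompt")
    parser.add_argument("--model", default="openai:gpt-4o")  # placeholder default
    args = parser.parse_args()
    asyncio.run(stream_response(args.model, args.prompt))


if __name__ == "__main__":
    main()
```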

Week 2 starting 2025-04-07: Streamlit GUI Enhancements and Testing

Milestones

  • Streamlit GUI Output: Enhance the Streamlit GUI to display streamed output from Pydantic-AI.
  • Comprehensive Testing: Perform thorough testing of the entire system with new metrics and GUI enhancements.

Tasks and Sequence

  • Enhance Streamlit GUI
      • Task Definition: Modify the Streamlit GUI to display the streamed output from Pydantic-AI models (see the GUI streaming sketch after this list).
      • Sequence: Start of Week 2.
      • Definition of Done: The GUI displays streamed output without errors, and user interactions (e.g., selecting models, entering queries) work as expected.
  • Integrate New Metrics into GUI
      • Task Definition: Ensure the Streamlit GUI can display results from the new metrics (see the metrics-display sketch after this list).
      • Sequence: After enhancing the GUI for streamed output.
      • Definition of Done: The GUI displays metric results clearly, and users can easily interpret the output.
  • Comprehensive System Testing
      • Task Definition: Perform end-to-end testing of the system, including the new metrics and GUI enhancements.
      • Sequence: After integrating the new metrics into the GUI.
      • Definition of Done: All tests pass without errors, and the system functions as expected across scenarios.
  • Finalize Documentation and Deployment
      • Task Definition: Update the MkDocs documentation to reflect all changes and deploy it to GitHub Pages.
      • Sequence: After completing system testing.
      • Definition of Done: Documentation is updated, and the latest version is live on GitHub Pages.
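
For the GUI streaming task, a minimal sketch that bridges pydantic-ai's async stream to the synchronous generator st.write_stream expects, so chunks render incrementally. It assumes a Streamlit version that provides st.write_stream, assumes no event loop is already running in the script thread, and the model options and widget labels are placeholders.

```python
import asyncio

import streamlit as st
from pydantic_ai import Agent


def stream_chunks(model: str, prompt: str):
    """Drive pydantic-ai's async stream on a private event loop and yield chunks synchronously."""

    async def agen():
        agent = Agent(model)
        async with agent.run_stream(prompt) as result:
            async for chunk in result.stream_text(delta=True):
                yield chunk

    loop = asyncio.new_event_loop()
    gen = agen()
    try:
        while True:
            try:
                # Pull one chunk at a time so st.write_stream can render incrementally
                yield loop.run_until_complete(gen.__anext__())
            except StopAsyncIteration:
                break
    finally:
        loop.run_until_complete(gen.aclose())
        loop.close()


st.title("Pydantic-AI streamed output")  # placeholder title
model = st.selectbox("Model", ["openai:gpt-4o", "anthropic:claude-3-5-sonnet-latest"])  # placeholder options
prompt = st.text_area("Query")

if st.button("Run") and prompt:
    # st.write_stream consumes the generator, rendering each chunk as it arrives
    st.write_stream(stream_chunks(model, prompt))
```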
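
For displaying metric results in the GUI, one possible layout using st.metric widgets plus a table view; the metric names and scores below are hypothetical placeholders, and the dict shape assumes the evaluate() helper from the Week 1 sketch.

```python
import pandas as pd
import streamlit as st

# Hypothetical output of the evaluation pipeline (see the Week 1 metric sketch)
metric_results = {
    "tool_call_accuracy": 0.87,
    "goal_completion": 0.92,
    "step_efficiency": 0.78,
}

st.subheader("Evaluation metrics")

# One st.metric widget per score, laid out side by side
cols = st.columns(len(metric_results))
for col, (name, value) in zip(cols, metric_results.items()):
    col.metric(label=name, value=f"{value:.2f}")

# A table view of the same scores for comparison or export
st.dataframe(pd.DataFrame([metric_results]).T.rename(columns={0: "score"}))
```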

Additional Considerations

  • Code Reviews: Schedule regular code reviews to ensure quality and adherence to project standards.
  • Feedback Loop: Establish a feedback loop with stakeholders to gather input on the new metrics and GUI enhancements.