Changelog

New updates and improvements to Baserun

March 12, 2024

Evaluate & improve your prompts with AI — automatically

  • Prompt wizard public launch: Accelerate prompt iteration with AI assistance.

  • Add Claude 3 support in Python/JS SDK

  • Add the JSON schema evaluator (sketched after this list)

  • Improve Dataset management usability

  • Add filters and delete action for offline evaluation runs
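
The JSON schema evaluator checks that a model's output is valid JSON and conforms to a schema you supply. The sketch below illustrates that idea with the `jsonschema` package; it is not the Baserun SDK's own API, and the schema shown is a hypothetical example.

```python
# Conceptual sketch of what a JSON schema evaluator checks; not the Baserun SDK API.
import json
from jsonschema import validate, ValidationError

SCHEMA = {  # hypothetical schema for a structured LLM answer
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["answer"],
}

def json_schema_eval(llm_output: str) -> bool:
    """Return True if the model output parses as JSON and matches SCHEMA."""
    try:
        validate(instance=json.loads(llm_output), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(json_schema_eval('{"answer": "42", "confidence": 0.9}'))  # True
print(json_schema_eval('{"confidence": "high"}'))               # False
```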

March 6, 2024

Claude-3 is now available in Baserun Playground

  • Add Claude-3 to the Baserun Playground.

  • Improve Llama models' performance in the Playground.

  • Add Gemini Pro support in the Python/JS SDK.

  • Add the Regex evaluator (sketched after this list).

  • Add the "Collect Input Variable" feature in the Python/JS SDK.

  • Make various UI improvements in the offline testing report view.
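
The Regex evaluator's pass/fail logic amounts to matching the model output against a regular expression. A minimal illustration in plain Python (not the Baserun SDK API):

```python
# Conceptual sketch of the Regex evaluator's pass/fail check; illustrative only.
import re

def regex_eval(llm_output: str, pattern: str) -> bool:
    """Pass if the model output matches the given regular expression."""
    return re.search(pattern, llm_output) is not None

# e.g. require the answer to contain an ISO-8601 date
print(regex_eval("The launch is scheduled for 2024-03-06.", r"\d{4}-\d{2}-\d{2}"))  # True
```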

March 1, 2024

Evaluation for Both Offline Testing and Production Monitoring

Developers can now create evaluators using custom code or a custom LLM prompt to grade testing results pre-release or to monitor production post-release.

  • Support both unit tests and end-to-end tests: as with traces, users can assess the output of a single LLM call or of the entire pipeline.


  • Comparing changes side-by-side: With our comparison report, developers can compare two branches before merging a PR to prevent regressions.


  • Maximum flexibility: Baserun supports automatic evaluators, human evaluators, feedback, and checks. Automatic evaluators include string match, fuzzy match, JSON validation, and Regex match. Model-graded evaluators include fact checking, closed QA, and security evaluations. Users can also define their own custom code evaluators or custom model-graded evaluators (see the sketch after this list). More evaluators to come!


  • Saving costs on evaluation: Baserun automatically caches your evaluation results to avoid running redundant assessments.


  • Prototype and evaluate prompts in the Baserun UI: Anybody (not just developers!) can prototype ideas in the Playground, test and assess prompts with our Evaluation features in the UI, and deploy changes to staging or production environments using the Prompt Directory feature.
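
As an example of the custom code evaluators mentioned above, the sketch below scores an output between 0 and 1 in plain Python. The function name, signature, and scoring heuristic are illustrative assumptions; how an evaluator is registered with Baserun depends on the SDK.

```python
# A minimal sketch of a custom code evaluator: a plain function that scores an
# LLM output from 0 to 1. Names and heuristic are illustrative, not the Baserun API.
def grounded_in_context(output: str, context: str) -> float:
    """Toy check: fraction of output sentences whose words all appear in the context."""
    context_words = set(context.lower().split())
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        1 for s in sentences
        if set(s.lower().split()) <= context_words
    )
    return supported / len(sentences)

score = grounded_in_context(
    output="Baserun caches evaluation results.",
    context="baserun automatically caches your evaluation results to avoid redundant runs",
)
print(f"groundedness: {score:.2f}")  # 1.00
```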

February 23, 2024

Custom model evaluator with the model of your choice & Prompt templating GA

  • New Offline Evaluation Reports: Many UI improvements have been made to how metrics and evaluation details are displayed.

  • Custom Model-Graded Evaluation: Users can now choose which OpenAI model, including custom fine-tuned OpenAI models, grades LLM outputs (a sketch follows this list).

  • Testing SDK Bug Fix: Fixed an issue with evaluator result synchronization.

  • Prompt Playground: Users can now load a pre-defined template in a playground session and save a version of the prompt template after editing and testing the new version.

  • JS SDK Now Supports Prompt Templates: Users can create or edit template versions in either the UI or code.
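
To illustrate the custom model-graded evaluation above: the grading model, including a fine-tuned one, is simply a model asked to judge an output. The sketch below uses the OpenAI Python client directly; it is not Baserun's evaluator configuration, and the grading prompt and model name are assumptions.

```python
# Sketch of model-graded evaluation with a chosen (possibly fine-tuned) OpenAI model.
# Not Baserun's own configuration; prompt and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

GRADER_MODEL = "gpt-4-turbo-preview"  # or a fine-tuned model id such as "ft:gpt-3.5-turbo:..."

def model_graded_eval(question: str, answer: str) -> str:
    """Ask the grader model to label an answer as PASS or FAIL."""
    response = client.chat.completions.create(
        model=GRADER_MODEL,
        messages=[
            {"role": "system", "content": "You grade answers. Reply with PASS or FAIL only."},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(model_graded_eval("What is 2 + 2?", "4"))
```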

February 16, 2024

Custom model & custom code evaluator

  • Create evaluations tailored to meet your app's specific objectives.

  • Playground performance improvements: you can now stream multiple tests in parallel.

  • Comparison view performance improvements.

  • New evaluator editor UI (Beta).

January 31, 2024

Manage, evaluate & deploy prompts without touching the code

Introducing the Prompt Directory:

Once a prompt template is registered through the SDK or UI, product owners and other non-technical stakeholders can use the Baserun UI to compare each version's performance, iterate in the Playground, run evaluations, and deploy new versions to production.

  • Observe all environments

  • Support back-testing and annotation

  • Collaborative workspace

  • Experiment in live products

This feature is currently in beta. Email us at hello@baserun.ai if you want to try it out now.

In addition to the Prompt Directory:

  • Add Gemini and new GPT-4 Turbo support in the Playground.

  • Add Intercom integration to let users message us anytime.

  • Improved comparing feature UX.

January 23, 2024

Expand Your Horizons with Baserun Python SDK 0.9.8 – Now Compatible with Open Source and BYO Models!

  • Monitoring: The Baserun Python SDK now supports automatically logging both Open Source and BYO models, requiring no additional code changes. Install, authenticate, and you're good to go.

  • Playground: Introducing a Brand New Chat Mode - Elevate your productivity with added thread tabs, hotkeys, and a rerun feature. Triple your iteration speed for more efficient workflows.


  • Prompt Wizard: Enhanced Prompt Improvement Suggestions - Now in Beta, this feature automatically evaluates and suggests improvements to your prompts. We've expanded to include chat template support and are seeking additional users for testing.


  • Comparing Feature: Bug Fixes and User Experience Improvements


  • Fine-Tune: Baserun now offers fine-tuning of open-source models in collaboration with OpenPipe and AI Hero. Users can export their trace data as training data in JSON format (an example follows this list); this format is compatible with OpenAI, OpenPipe, and AI Hero for fine-tuning purposes. Additionally, users can monitor and test their fine-tuned models using our comprehensive monitoring and playground features.


  • Eval & Guardrail: Advanced LLM Input Checks - Baserun now offers the ability to preemptively check LLM inputs for adversarial content to mitigate PII leakage or security risks before a completion is made.
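
For the fine-tuning export mentioned above, the widely used OpenAI chat fine-tuning format is one JSON object per line, each with a `messages` array. The sketch below writes one such record; Baserun's actual export may include additional fields, and the conversation content is invented for illustration.

```python
# Sketch of the OpenAI-compatible chat fine-tuning format (one JSON object per
# line in a .jsonl file). The example conversation is hypothetical.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'."},
    ]
}

with open("training_data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```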

January 15, 2024

Bring your own fine-tuned model to the Baserun Playground

  • Custom Model: Bring your own fine-tuned model to the Baserun Playground. SDK support will follow shortly.

  • Playground performance improvements and bug fixes: The Playground now loads five times faster. We've fixed multiple state-management and scrolling bugs and improved error handling.

  • Playground UI Improvements: In chat mode, each test case now appears in its own tab.

  • New Python SDK Version 0.9.8b1 in Beta: We added x-request-id to the LLM request metadata.

January 5, 2024

Evaluating with human feedback

One major challenge in building products to automate more complex, nuanced tasks is that defining "good" and "bad" performance becomes a major bottleneck. While you may be able to define a few examples of "bad" behavior (e.g. AVs should try to avoid collisions, customer service chatbots should not reference stale information, medical chatbots should not misdiagnose patients), most of the interactions users have with the system are far more subjective.

Introducing the Human Evaluation feature, which enables AI teams to define their evaluation criteria and collect feedback from human annotators.

This functionality can be utilized for data labeling purposes, as well as for preparing datasets for fine-tuning and creating testing sets for future use.

Step-by-step guide

December 14, 2023

Introduce Baserun SDK 1.0

Introducing Baserun SDK 1.0:

  • Capture and analyze user sessions, traces, and LLM requests to gain insights into user behavior and identify potential issues with just 2 lines of code (see the sketch after this list)!

  • Employ automatic and human evaluation techniques to assess user experience objectively and subjectively. Move your manual review processes out of Google Sheets and into a centralized tool for consistent, uniform notes.


  • Identify recurring patterns and root causes of issues to guide targeted improvements.

  • Craft "hillclimbing" test suites from production logs and examples you wrote to test specific scenarios and validate improvements. Build “regression” test suites to make sure you do not introduce unknown issues into your system.

  • Utilize built-in version-control, staging environments, and other features to ensure changes are deployed without introducing regressions.


  • Experiment with prompts, context, and more to optimize AI models and enhance product performance.
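
A minimal sketch of the two-line setup referenced above, assuming the SDK exposes a single `baserun.init()` entry point (check the SDK docs for the exact call); once initialized, LLM requests made through the OpenAI client are captured automatically.

```python
# Sketch of the two-line setup described above. The entry point (assumed here to
# be baserun.init()) may differ; see the SDK docs for the exact call.
import baserun                      # pip install baserun; set BASERUN_API_KEY
import openai

baserun.init()                      # assumed one-call initialization

client = openai.OpenAI()
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello to Baserun."}],
)
print(completion.choices[0].message.content)
```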

Read the full blog post here

A collaborative platform that enables engineers and product experts to build, monitor, and improve their AI.

© 2024 Mochi Labs, Inc. All rights reserved.