Responsible Evaluation of Machine Learning Systems

From June 2 to 6, 2025, the intensive course “Responsible Evaluation of Machine Learning Systems” was held at the National University of Córdoba (UNC). It was organized by Luciana Benotti, professor at the School of Mathematics, Astronomy, Physics and Computing (FAMAF-UNC) and leader of the AI ethics team at Fundación Vía Libre.

At the end of this article, you’ll find the full video recordings of the course sessions, presentation slides, and links to the repositories and resources used throughout the course.

The course was led by Luciana Ferrer, an independent researcher at the Institute for Computer Science (ICC), UBA-CONICET. She earned her PhD in Electrical Engineering from Stanford University in 2009 and her undergraduate degree in Electronic Engineering from the University of Buenos Aires in 2001. Her research focuses on machine learning applied to speech and language processing. She leads the speech processing group at the ICC, which is part of the Laboratory for Applied Artificial Intelligence (LIAA), working on topics such as pronunciation scoring for language learning, mental state recognition, detection of neurological diseases from speech, and uncertainty quantification for language models. She has published over 170 scientific papers with more than 7,700 citations.

Course Overview

“The evaluation protocol is the most important component of the development process for a machine learning system. If the protocol is flawed or does not reflect the needs of the target application, development decisions may be suboptimal and predictions of system performance may be inaccurate.”

This technical training focused on key aspects of evaluation protocols for AI systems, including data handling, performance metrics, and statistical significance. While some of the content applies broadly to any type of machine learning system, the course emphasized evaluation metrics, particularly in the context of classification tasks.
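One common way to report the statistical significance of a test-set metric is a bootstrap confidence interval. The sketch below is a minimal illustration written for this article, not code from the course materials; the function name `bootstrap_ci` and the toy data are our own.

```python
import numpy as np

def bootstrap_ci(labels, predictions, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a test-set metric."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        scores.append(metric(labels[idx], predictions[idx]))
    lower, upper = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(labels, predictions), (lower, upper)

# Toy example: error rate with a 95% bootstrap confidence interval.
error_rate = lambda y, yhat: float(np.mean(y != yhat))
y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
yhat = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0])
point, (low, high) = bootstrap_ci(y, yhat, error_rate)
print(f"error rate = {point:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only ten samples the resulting interval is very wide, which is exactly the kind of information a point estimate alone hides.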

Special attention was given to proper scoring rules (PSRs), a family of metrics developed decades ago for evaluating probabilistic classifiers: systems that make decisions based on posterior class probabilities, which are increasingly common today. The course covered two PSRs in detail, the Brier score and cross-entropy. Although these metrics are well studied in the statistics literature, they remain underused for evaluating machine learning systems. The course also addressed the expected cost, a generalization of the error rate that aligns with the Bayes risk when decisions are made using Bayesian decision theory.
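To make these quantities concrete, here is a small numerical sketch written for this article (the function names, the cost matrix, and the toy posteriors are our own, not taken from the course repositories). It computes the two proper scoring rules for a set of posterior probabilities, and the expected cost of the Bayes decisions derived from those posteriors under a user-defined cost matrix.

```python
import numpy as np

def cross_entropy(posteriors, labels):
    """Average negative log posterior assigned to the true class (the logarithmic scoring rule)."""
    return -np.mean(np.log(posteriors[np.arange(len(labels)), labels]))

def brier_score(posteriors, labels):
    """Average squared distance between the posterior vector and the one-hot true label."""
    onehot = np.eye(posteriors.shape[1])[labels]
    return np.mean(np.sum((posteriors - onehot) ** 2, axis=1))

def bayes_decisions(posteriors, cost_matrix):
    """Decisions that minimize the posterior expected cost for each sample."""
    # risk[n, j] = sum_i P(class i | x_n) * cost_matrix[i, j]
    risk = posteriors @ cost_matrix
    return np.argmin(risk, axis=1)

def expected_cost(decisions, labels, cost_matrix):
    """Average cost incurred, where cost_matrix[i, j] is the cost of deciding j when the truth is i."""
    return np.mean(cost_matrix[labels, decisions])

# Toy binary example with asymmetric costs: a miss (deciding 0 when the truth is 1)
# is ten times as costly as a false alarm. All numbers here are illustrative.
posteriors = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])
labels = np.array([0, 1, 1, 0])
cost_matrix = np.array([[0.0, 1.0],
                        [10.0, 0.0]])
decisions = bayes_decisions(posteriors, cost_matrix)
print("cross-entropy :", cross_entropy(posteriors, labels))
print("Brier score   :", brier_score(posteriors, labels))
print("expected cost :", expected_cost(decisions, labels, cost_matrix))
```

The scoring rules evaluate the posteriors directly, while the expected cost evaluates the hard decisions derived from them using Bayesian decision theory, as described above.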

Throughout the course, it was argued that PSRs and expected cost are the only necessary tools to properly evaluate classification systems, and that commonly used alternatives like F-score (for categorical decisions) and expected calibration error (for posterior probabilities) are either misleading or insufficient.

In addition to helping organize the course, Vía Libre’s team participated as attendees, as part of our ongoing training efforts. The knowledge and tools explored during the course will be integrated into our educational programs for critically engaging with artificial intelligence in teaching and learning environments.

Watch, pause, and replay—class by class

Below you’ll find the complete course videos, organized into five sessions, each with two segments. Slides and supporting materials used in each class are also available.

Monday, June 2

  • 01 - Introducción - Curso Luciana Ferrer
  • 02 - Datos de evaluación - Curso Luciana Ferrer

Tuesday, June 3

  • 03 - Teoría de decisión y función de costo - Curso Luciana Ferrer
  • 04 - EC vs otras métricas - Curso Luciana Ferrer

Wednesday, June 4

  • 05 - Proper scoring rules - Curso Luciana Ferrer
  • 06 - Calibración - Curso Luciana Ferrer

Thursday, June 5


Friday, June 6

  • 07 - Intervalos de confianza - Curso Luciana Ferrer

Core references for the course:

The course drew primarily on the following articles:

  • L. Ferrer, No Need for Ad-hoc Substitutes: The Expected Cost is a Principled All-purpose Classification Metric. TMLR, 2025
  • L. Ferrer and D. Ramos, Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration. TMLR, 2025

Repositories mentioned during the course


Note on Notation:

Please note that the notation used in the course slides differs from that in the referenced papers. In the slides, K denotes the number of samples and N (sometimes C) the number of classes, while in the papers this is reversed. The change was requested by a reviewer during the paper review process, and the course materials could not be updated in time. We appreciate your understanding.
