Computer science and informatics

Asst. Prof. Dr. Tome Eftimov

Tome Eftimov is a researcher at the Jožef Stefan Institute and an assistant professor at the Jožef Stefan International Postgraduate School and the University of Ljubljana. He received his PhD in 2018. His research focuses on statistical data analysis, optimization, NLP, machine learning, and AutoML. He has published over 160 papers and serves as Vice-Chair of the IEEE Task Force on Automated Algorithm Design and as an editor of Evolutionary Computation.

Research programme: Computer Structures and Systems
Training topic: Trustworthy Benchmarking and Explainable Evaluation of Large Language Models Across Tasks and Prompting Regimes

The programme “Trustworthy Benchmarking and Explainable Evaluation of Large Language Models Across Tasks and Prompting Regimes” addresses a central challenge in modern AI: how to evaluate large language models (LLMs) in a reliable, reproducible, and explainable manner across diverse application settings.

Although LLMs achieve impressive performance on many tasks, their outputs are highly sensitive to prompt formulation, task selection, domain characteristics, and evaluation metrics. Current benchmarking practices often rely on a single aggregated score or on leaderboard rankings, which provide limited insight into when, why, and under which conditions a model performs well or fails.

This programme develops a methodological framework for trustworthy benchmarking that includes:

  • systematic analysis of prompting regimes (e.g., zero-shot, few-shot, chain-of-thought),
  • multi-task evaluation across classification, generation, and reasoning tasks,
  • statistically sound model comparisons,
  • assessment of robustness, stability, and generalization.
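To illustrate what a statistically sound model comparison might involve, the following minimal sketch applies a two-sided sign test to paired per-task scores of two models. All numbers, the helper `sign_test`, and the ten-task setup are hypothetical, chosen only to show the idea of testing significance rather than comparing single aggregate scores.

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test on paired per-task scores (ties are dropped)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)          # tasks where model A is better
    k = max(wins, n - wins)
    # Exact two-sided binomial p-value under the null of no difference.
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return wins, n, min(p, 1.0)

# Hypothetical accuracies of two models on ten benchmark tasks.
model_a = [0.81, 0.74, 0.66, 0.90, 0.58, 0.77, 0.69, 0.85, 0.72, 0.63]
model_b = [0.78, 0.71, 0.68, 0.86, 0.55, 0.73, 0.64, 0.80, 0.70, 0.60]

wins, n, p = sign_test(model_a, model_b)
print(f"A wins {wins}/{n} tasks, sign-test p = {p:.4f}")
```

A paired nonparametric test like this (or a Wilcoxon signed-rank test in practice) makes explicit whether an observed ranking difference is consistent across tasks or could plausibly arise by chance.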

A core objective is to move beyond simple model ranking and toward structured performance mapping across tasks and prompting strategies. The programme focuses on identifying and quantifying the factors that drive performance differences, such as prompt length, instruction structure, domain shift, and task complexity.
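A structured performance map of this kind can be sketched as a simple task-by-regime table. The records below are illustrative placeholders, not real benchmark results; the point is that aggregating per (task, regime) cell, rather than into one global score, exposes where each prompting strategy helps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation records: (task, prompting regime, score).
records = [
    ("classification", "zero-shot", 0.71), ("classification", "few-shot", 0.79),
    ("classification", "chain-of-thought", 0.80),
    ("generation", "zero-shot", 0.64), ("generation", "few-shot", 0.70),
    ("generation", "chain-of-thought", 0.69),
    ("reasoning", "zero-shot", 0.52), ("reasoning", "few-shot", 0.61),
    ("reasoning", "chain-of-thought", 0.74),
]

# Build a task x regime performance map rather than one aggregate score.
cells = defaultdict(list)
for task, regime, score in records:
    cells[(task, regime)].append(score)
perf_map = {key: mean(vals) for key, vals in cells.items()}

# The map makes regime-dependent behaviour visible per task.
best_regime = {}
for task in sorted({t for t, _ in perf_map}):
    regimes = [r for t, r in perf_map if t == task]
    best_regime[task] = max(regimes, key=lambda r: perf_map[(task, r)])
    print(f"{task}: best regime = {best_regime[task]}")
```

In this toy map, chain-of-thought prompting helps most on reasoning while few-shot suffices for generation, which is exactly the kind of condition-level insight a single leaderboard score hides.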

By integrating statistical rigor with explainable AI techniques, the programme aims to strengthen transparency, reproducibility, and trust in LLM evaluation. The outcomes will support evidence-based model selection and responsible deployment of LLMs in research, industry, and public-sector applications.