Computer science and informatics

Asst. Prof. Dr. Tome Eftimov

Tome Eftimov is a researcher at the Jožef Stefan Institute and an assistant professor at the Jožef Stefan International Postgraduate School and the University of Ljubljana. He received his PhD in 2018. His research focuses on statistical data analysis, optimization, NLP, machine learning, and AutoML. He has published over 160 papers and serves as Vice-Chair of the IEEE Task Force on Automated Algorithm Design and as an editor of Evolutionary Computation.

Research programme: Computer Structures and Systems
Training topic: Trustworthy Benchmarking and Explainable Evaluation of Large Language Models Across Tasks and Prompting Regimes

The programme “Trustworthy Benchmarking and Explainable Evaluation of Large Language Models Across Tasks and Prompting Regimes” addresses a central challenge in modern AI: how to evaluate large language models (LLMs) in a reliable, reproducible, and explainable manner across diverse application settings.

Although LLMs achieve impressive performance on many tasks, their outputs are highly sensitive to prompt formulation, task selection, domain characteristics, and evaluation metrics. Current benchmarking practices often rely on a single aggregated score or on leaderboard rankings, which provide limited insight into when, why, and under which conditions a model performs well or fails.

This programme develops a methodological framework for trustworthy benchmarking that includes:

  • systematic analysis of prompting regimes (e.g., zero-shot, few-shot, chain-of-thought),
  • multi-task evaluation across classification, generation, and reasoning tasks,
  • statistically sound model comparisons,
  • assessment of robustness, stability, and generalization.
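To illustrate what a statistically sound model comparison might involve, the following minimal sketch applies a two-sided sign test to paired per-task scores of two models. All numbers, the helper `sign_test`, and the ten-task setup are hypothetical, chosen only to show the idea of testing significance rather than comparing single aggregate scores.

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test on paired per-task scores (ties are dropped)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)          # tasks where model A is better
    k = max(wins, n - wins)
    # Exact two-sided binomial p-value under the null of no difference.
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return wins, n, min(p, 1.0)

# Hypothetical accuracies of two models on ten benchmark tasks.
model_a = [0.81, 0.74, 0.66, 0.90, 0.58, 0.77, 0.69, 0.85, 0.72, 0.63]
model_b = [0.78, 0.71, 0.68, 0.86, 0.55, 0.73, 0.64, 0.80, 0.70, 0.60]

wins, n, p = sign_test(model_a, model_b)
print(f"A wins {wins}/{n} tasks, sign-test p = {p:.4f}")
```

A paired nonparametric test like this (or a Wilcoxon signed-rank test in practice) makes explicit whether an observed ranking difference is consistent across tasks or could plausibly arise by chance.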

A core objective is to move beyond simple model ranking and toward structured performance mapping across tasks and prompting strategies. The programme focuses on identifying and quantifying the factors that drive performance differences, such as prompt length, instruction structure, domain shift, and task complexity.
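A structured performance map of this kind can be sketched as a simple task-by-regime table. The records below are illustrative placeholders, not real benchmark results; the point is that aggregating per (task, regime) cell, rather than into one global score, exposes where each prompting strategy helps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation records: (task, prompting regime, score).
records = [
    ("classification", "zero-shot", 0.71), ("classification", "few-shot", 0.79),
    ("classification", "chain-of-thought", 0.80),
    ("generation", "zero-shot", 0.64), ("generation", "few-shot", 0.70),
    ("generation", "chain-of-thought", 0.69),
    ("reasoning", "zero-shot", 0.52), ("reasoning", "few-shot", 0.61),
    ("reasoning", "chain-of-thought", 0.74),
]

# Build a task x regime performance map rather than one aggregate score.
cells = defaultdict(list)
for task, regime, score in records:
    cells[(task, regime)].append(score)
perf_map = {key: mean(vals) for key, vals in cells.items()}

# The map makes regime-dependent behaviour visible per task.
best_regime = {}
for task in sorted({t for t, _ in perf_map}):
    regimes = [r for t, r in perf_map if t == task]
    best_regime[task] = max(regimes, key=lambda r: perf_map[(task, r)])
    print(f"{task}: best regime = {best_regime[task]}")
```

In this toy map, chain-of-thought prompting helps most on reasoning while few-shot suffices for generation, which is exactly the kind of condition-level insight a single leaderboard score hides.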

By integrating statistical rigor with explainable AI techniques, the programme aims to strengthen transparency, reproducibility, and trust in LLM evaluation. The outcomes will support evidence-based model selection and responsible deployment of LLMs in research, industry, and public-sector applications.