DataScience

The Spark Sessions are a recurring event designed to facilitate the exchange of ideas and inspiration. The event consists of several snappy presentations, followed by a casual gathering with refreshments.

Interested in giving a talk? Apply here!

Our Spark Sessions Partners

Upcoming: Spark Sessions 004

When? June 10 2025 at 18:00.

Where? Faculty of Computer and Information Science, Večna pot 113, Ljubljana.

What?

On the Surprising Benefits of Vibe-Coding for Interdisciplinary Collaboration by Boshko Koloski (Young Researcher, Jozef Stefan Institute)

Stepping midstream, inter-disciplinary project—where documentation is sparse and data are disorganized—poses a significant challenge. Thanks to recent advances in LLMs, live “vibe-coding” workshops now offer a practical way to bridge those gaps. In this talk, I will present case studies from interdisciplinary collabs which real-time LLM-driven coding enabled teams to explore, prototype, and co-construct understanding on the fly, even under tight deadlines and uncertain conditions..*

Automated Assignment Grading with Large Language Models: Insights From a Bioinformatics Course by Pavlin Poličar (assistant, UL FRI)

Are LLMs ready to replace TAs and grade student assignments? In a blind study conducted at UL-FRI, we tested whether LLMs are able to grade student submissions and generate written feedback with the same accuracy and quality as human TAs. We compared six LLMs—both commercial and open-source—and found that, with the right prompts, LLMs can achieve human-level performance both in terms of grading and feedback.

A General Approach to Visualizing Uncertainty in Statistical Graphics by Bernarda Petek (PhD student and researcher, UL FRI)

Quantifying and communicating uncertainty is integral to scientific communication. Despite this, it’s often neglected. Without general methods or guidelines, practitioners must rely on niche, domain-specific techniques or create ad hoc solutions that are time-consuming, complex, and error-prone. All this narrows how uncertainty is taught and understood. In this talk, I will present our new general approach to visualizing uncertainty and bootplot, our open-source Python implementation.

Automating Value Investing: A Data Scientist’s Journey into Finance by Lidija Jovanovska (Senior Data Scientist, Sportradar)

Explore the intersection of data analysis and personal finance. This talk illustrates the use of Python for analyzing financial data, enabling programmatic stock screening and informed personal finance decisions.

Trials & tribulations of pre-flight scoring in online advertising by Vid Stropnik (Data Scientist, Celtra)

Discover how Celtra built its first pre-flight ad scoring model—turning ad design into a data-powered prediction game! From wrangling Meta’s graph API chaos to debating regressors vs. classifiers, we’ll explore the messiness of creative performance, why “good” is all relative, and what it takes to forecast ad success before it ever goes live.

Past Event: Spark Sessions 003

When? Feb 25 2025 at 18:00.

Where? Faculty of Computer and Information Science, Večna pot 113, Ljubljana.

What?

Game, Set, Match: Neural Networks in Tennis Video Analysis by Ivan Ivashnev (Senior Computer Vision Engineer, Sportradar) In this presentation, we’ll explore how AI and neural networks revolutionize tennis video analysis. We’ll dive into the models we use, our approach to video processing, and the insights we generate. From identifying player movements to extracting performance metrics, discover how technology is shaping the future of sports analytics.

ML in High Efficiency Production by Luka Androjna ( Cast AI, Senior Data Scientist / ML Guild Master) Going from an experimentation and model validation environment to using models in production is not a trivial task, especially when other constraints come into picture as well, like access to data, limited resources available for inference, latency, deployment method, etc. This talk will give a brief overview of such constraints and explain how they affect our choice in the modelling stage.

Increasing forecast accuracy via statistical inference by Živa Stepančič (Quantitative analyst, GEN-I) We will present the challenge of forecasting electricity demand in Sl energy market and how to strengthen our belief in model predictions. One can increase forecasting probability of energy demand by building a new prediction model, using ensemble models, using domain knowledge or statistical corrections. We are testing the last approach by determining the expected demand at different weather events through modeling prediction bias of weather variables and its effects on energy demand forecast.

TabPFN: Approximating Bayesian Inference with Transformers by Valter Hudovernik (Data Science Student at FRI) TabPFN (Tabular Prior-Data Fitted Network) is a transformer-based foundation model designed for tabular data classification. Trained to approximate Bayesian inference on millions of synthetic datasets, it leverages in-context during inference, enabling fast predictions without retraining. TabPFN outperforms or matches traditional models in accuracy and efficiency on smaller datasets. In this talk, we’ll explore its capabilities and discuss how to integrate it into data science workflows.

Pie Charts: An Apology by Erik Štrumbelj (Researcher at FRI, University of Ljubljana) For as long as I can remember, it has been my mission to point out that pie charts, while engaging, are terrible as statistical plots. To finally put the matter to rest, I studied 100 years of empirical results on pie charts. And while they are indeed very flawed in many ways, they also have some surprising qualities. I’ll share these, along with other insights into visualizing parts of a whole.

Past Event: Spark Sessions 002

When? Dec 10 2024 at 18:00.

Where? Faculty of Computer and Information Science, Večna pot 113, Ljubljana.

What?

Intro: Is There a Data Engineering Minimum for a Professional Data Scientist? by Erik Štrumbelj

Seeing Through Data: Eye Tracking Insights into Graph Comprehension by Leon Hvastja (Data Science Student, University of Ljubljana)

Eye-tracking technology has gained traction over the past decade, mainly due to its increased availability. Traditionally, graphs were evaluated through performance metrics, but eye tracking offers a deeper understanding of the underlying cognitive processes involved. This technology has broad applications, including in education to enhance learning, in understanding disabilities, building cognitive models of perception, and understanding the differences between expert and non-expert users.

Machine Learning and AI in Analysis of Neuroimaging Data by Grega Repovš (Professor, Department of Psychology, University of Ljubljana)

Neuroimaging processing and analysis works on large amounts of multimodal data in spatial and temporal domains resulting in a wide range of features at mulitiple levels of observation. Machine learning and AI are becoming important tools in optimizing data preprocessing, enabling novel analytical approaches, relating neuroimaging features to individuals’ cognition and behavior, and supporting diagnosis and treatment of brain diseases.

Blazing Fast Computation with JAX by David Nabergoj (Young Researcher, University of Ljubljana)

JAX is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning. We’ll see what makes it so much faster than PyTorch/Tensorflow, and how it achieves an over 4000x speed-up in reinforcement learning applications.

What’s It Like to Crawl 100M Web Pages Every Week by Roq Xever (CEO and Founder, PredictLeads)

I’ll talk about the infrastructure, costs, and challenges we face while crawling the web for information on over 70 million companies.

Using Knowledge Graphs to Improve Proprietary LLM-based Text Embeddings by Boshko Koloski (Young Researcher, Jozef Stefan Institute)

Semantic knowledge bases hold vast factual knowledge, but dense text representations often underutilize these resources. This work shows that augmenting LLM-based embeddings with knowledge base information improves text classification. Using AutoML and low-dimensional projections via matrix factorization, we achieve faster more accurate classifiers with minimal performance loss, validated on five LLMs and six datasets.

DATA_FAIR by Dunja Rosina (DATA_CONFERENCE organizer and initiator)

DATA_FAIR is a conference dedicated to fostering an inclusive environment for knowledge exchange, networking, and upskilling in data engineering and data science. I’ll briefly explain our purpose, what to expect in the next edition of the conference, and kindy invite you all to join us on Feb 13 2025.

Past Event: Spark Sessions 001

When? Oct 22 2024 at 18:00.

Where? Faculty of Computer and Information Science, Večna pot 113, Ljubljana (lecture room: TBA).

What? (click for a summary)

LLMDataForge: Framework that Leverages Large Language Models (LLMs) to Generate High-quality Datasets Tailored to Your Needs by Gal Petkovšek (Data Scientist at Medius) and Tadej Justin (Chief Data Scientist at Medius)
This talk explores generating high-quality synthetic datasets using LLMs, utilising the LLMDataForge framework for filtering and prompt adjustments to enhance training smaller models for NLP tasks.
One Transformation to Rule Them All: Automated Search for Feature Transformations at Scale by Mark Žnidar (Data Science Intern at Outbrain)
Automated feature search for field-aware factorization models, generating and evaluating many interactions. This method boosts efficiency and accuracy, enabling scalable, robust AutoML model search.
Generating Synthetic Relational Data by Valter Hudovernik (Data Science Student at FRI)
An introduction to the emerging field of relational data generation, focusing on the strengths and limitations of current methods in preserving the characteristics of the original datasets.
Weighing in on Evaluating LLM System Performance by Greta Gašparac (Data Scientist at Sportradar)
Since the release of ChatGPT, there has been a surge of interest in various LLM-powered solutions across industries. However, discussions on evaluating these systems deserve equal, if not greater, attention. In this talk, we explore the critical aspects of LLM system evaluation through a practical example of developing a customer support AI assistant.
Next-Generation AI for Intelligent Waste Management: Leveraging LLMOps and Semantic Entity Linking by dr. Stevanče Nikoloski (Head of Data & AI at Result)
We present a next-gen intelligent waste management solution powered by AI, focusing on semantic entity linking. Using LLMOps, we enhance data collection and fine-tune our LLMs (Mistral, Zephyr, Llama2, Falcon), ensuring local infrastructure compatibility and data privacy. The architecture integrates vector databases with advanced chunking (map-reduce, refine) for applications like summarization, chatbots, and entity linking, enabling continuous improvement and adaptability in waste management.
Value-Based Conversion Tracking in Online Controlled Experiments: Frequentist vs. Bayesian Approach by Aljiša Vodopija (Data Scientist at Outbrain)
This talk explores value-based conversion tracking in online controlled experiments, contrasting frequentist and Bayesian approaches to optimize revenue from differently valued conversions.

Don't miss out on all the latest news and events!