Project Competition
The DataScience@UL-FRI Project Competition is an annual competition whose main goal is to connect students, academics, and industry. Over the course of a semester, teams of students work on real-world data science problems, guided by co-supervisors from academia and industry. The best teams are invited to present their work at the competition finals. The competition allows companies to showcase their challenges to academics and gives students the opportunity to gain hands-on experience in a structured but fun way.
All students interested in competing in the 2025 edition of the DataScience@UL-FRI Project Competition are invited to the event’s introductory meeting, which will take place on 17. 1. 2025 at 12:00 over Zoom (https://uni-lj-si.zoom.us/j/96173479617).
Topics for Project Competition 2025
ZCAM
Tampered image detection
As more and more businesses move to the digital space, it is becoming increasingly important to have robust techniques for validating the documents and images that are sent via digital channels. The most important task is detecting whether these documents and images (e.g., ID photos, insurance damage photos, …) are genuine and have not been tampered with. By tampering we mean that the original document or image was later altered with a piece of software (e.g., Photoshop) and no longer fully represents reality. This is especially important in highly sensitive industries such as insurance, where clients claiming insurance might want to trick the insurance company (Zurich Insurance in our case) by providing fraudulent, tampered images of the damage.
The goal of this task is thus to develop a machine learning approach for tampered image detection in order to prevent such fraud. In the first phase of the project, ZCAM engineers will present the background along with the tools and approaches they have already identified as promising. You will then expand this set of tools by performing an additional search of the literature. Next, you will evaluate the approaches and find those that are suitable for their use case (perform well and allow commercial use). At the core of the approach are computer vision techniques for identifying and verifying the authenticity of digital images, i.e., determining whether they have been manipulated or altered in any way.
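As a concrete illustration of one classical technique from this space, the sketch below shows Error Level Analysis (ELA), which compares an image against a re-compressed copy of itself; it is offered only as an assumed starting point, not one of ZCAM’s pre-selected tools, and the file names are placeholders.

```python
# A minimal sketch of Error Level Analysis (ELA); "claim_photo.jpg" and the
# JPEG quality setting are illustrative assumptions, not ZCAM defaults.
import io

from PIL import Image, ImageChops

def ela_map(path: str, quality: int = 90) -> Image.Image:
    """Difference between an image and a re-compressed copy of itself;
    locally edited regions often recompress differently and stand out."""
    original = Image.open(path).convert("RGB")

    # Re-save the image as JPEG at a fixed quality into an in-memory buffer.
    buffer = io.BytesIO()
    original.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    resaved = Image.open(buffer).convert("RGB")

    # The per-pixel absolute difference highlights inconsistent compression history.
    return ImageChops.difference(original, resaved)

if __name__ == "__main__":
    ela_map("claim_photo.jpg").save("claim_photo_ela.png")
```

The resulting difference map can be inspected visually or used as an additional input channel for a downstream classifier.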
Keywords: computer vision, fraud detection, insurance, image tampering
Getting reliable confidence scores for transformers
At ZCAM, they are currently working on several projects aimed at automating various back-end processes using transformer-based models. These processes primarily involve operations on diverse document types. Despite automation, human involvement remains integral as a validator of the models’ outputs. By improving their ability to assess the quality and reliability of model predictions, they could significantly accelerate these workflows and, in some cases, potentially eliminate the need for human intervention entirely. The primary focus of this project is to explore methods for obtaining reliable confidence scores in transformer-based models, with applications tailored to ZCAM’s specific challenges. Key research questions include:
- How can we derive robust probabilistic confidence scores that accurately reflect a model’s certainty in its predictions, especially for text-based inputs? Current confidence measures often fail to align with real-world reliability, necessitating the exploration of innovative techniques.
- Transformer models frequently overestimate their prediction quality. What strategies or calibration techniques can ensure that confidence scores genuinely reflect the likelihood of correctness?
- In scenarios where no clear or correct answer exists, how should models convey uncertainty? This aspect is critical for fostering trust in model outputs, particularly in high-stakes applications.
- For tasks such as multi-label classification or text generation, how can we develop a single, interpretable confidence score that captures the overall reliability of the model’s outputs? This is especially relevant for complex tasks requiring nuanced judgment.
The outcomes of this project will provide valuable insights into the trustworthiness of model predictions, enabling ZCAM to optimize workflows further and potentially reduce human involvement in processes where automation reliability is sufficiently high. Ultimately, this research aligns with their broader goal of leveraging AI to enhance operational efficiency while ensuring reliability and transparency.
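As a hedged starting point for the calibration questions above, the sketch below shows temperature scaling, a standard post-hoc calibration method; the validation logits and labels are assumed to come from an already trained transformer classifier, and this is a baseline rather than ZCAM’s chosen approach.

```python
# A minimal sketch of temperature scaling; val_logits/val_labels are assumed
# to come from an already trained transformer classifier's validation set.
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor,
                    lr: float = 0.01, steps: int = 200) -> float:
    """Learn a single scalar T > 0 so that softmax(logits / T) is better calibrated."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log(T) to keep T positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

def calibrated_confidence(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Probability of the predicted class after rescaling by the learned temperature."""
    return F.softmax(logits / temperature, dim=-1).max(dim=-1).values
```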
Keywords: deep learning, transformers, confidence scores, model uncertainty
In516ht
Demand forecasting for one of the leading furniture providers in the Gulf region
Alghanim Industries, one of the largest privately-owned companies in the Persian Gulf region, operates Safat Home, a premier home furnishings retailer founded in 2004. Safat Home offers a diverse range of furniture, décor, garden accessories, sanitaryware, and tiles, catering to evolving customer preferences. In this project, participants will develop machine learning algorithms for demand forecasting, leveraging datasets that reflect key business factors:
- Price & Promotions: Assess the impact of pricing strategies and promotional campaigns.
- Seasonality: Account for demand fluctuations during key periods such as Ramadan, year-end, and back-to-school seasons.
- Product Lifespan: Analyze demand trends based on product age in Safat Home’s frequently refreshed catalog.
- Product Attributes: Incorporate attributes like color, fabric type, style, and patterns that influence customer choices.
This project offers a real-world opportunity to explore the dynamics of demand prediction in the home furnishings industry, enabling data-driven decision making.
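One plausible baseline that already touches the factors listed above is a feature-based regressor on lagged demand, sketched below; the column names (sku, week, units_sold, price, promo_flag, product_age_weeks) are hypothetical and stand in for the actual Safat Home schema.

```python
# A minimal sketch of a lag-feature forecasting baseline; all column names
# (sku, week, units_sold, price, promo_flag, product_age_weeks) are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ["price", "promo_flag", "product_age_weeks", "lag_1", "lag_4", "week_of_year"]

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lagged demand and calendar features per SKU."""
    df = df.sort_values(["sku", "week"]).copy()
    df["lag_1"] = df.groupby("sku")["units_sold"].shift(1)   # demand one week ago
    df["lag_4"] = df.groupby("sku")["units_sold"].shift(4)   # demand four weeks ago
    df["week_of_year"] = pd.to_datetime(df["week"]).dt.isocalendar().week.astype(int)
    return df.dropna(subset=["lag_1", "lag_4"])

def train(df: pd.DataFrame) -> GradientBoostingRegressor:
    model = GradientBoostingRegressor(random_state=0)
    model.fit(df[FEATURES], df["units_sold"])
    return model
```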
Keywords: data-driven decision making, time-series forecasting, demand forecasting, pricing strategies
Developing an ML-driven recommendation solution for iGaming
Games Global is a leading provider in the iGaming industry, specializing in online casino games and advanced technology solutions. As part of this data science competition, the objective is to design and develop machine learning (ML) algorithms to recommend games tailored to specific operators and markets. The core task in this project is the creation of a system that leverages data about the unique characteristics of specific operators (e.g., William Hill) and markets to provide tailored recommendations of the most suitable games. There are many possible upgrades to the developed recommender system:
- Forecasting how well each recommended game is likely to perform, considering the markets where the game is live and the attributes of similar games.
- Providing explanations of why recommended games are expected to succeed.
- Integrating the developed machine learning model with a user-friendly application (such as a Streamlit app).
To summarize, your final result is a recommender system that accurately identifies the most suitable games for a given operator or market and provides actionable insights to enhance decision-making and optimize game portfolios.
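One plausible starting point is a collaborative-filtering baseline over an operator-by-game interaction matrix, as sketched below; the matrix and its contents (e.g., revenue share per game) are assumptions about how the Games Global data could be aggregated.

```python
# A minimal sketch of a matrix-factorization baseline; the operator-by-game
# interaction matrix and its contents (e.g., revenue share) are assumptions
# about how the Games Global data could be aggregated.
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

def recommend(interactions: pd.DataFrame, operator: str, top_n: int = 10) -> list:
    """interactions: rows = operators, columns = games, values = observed performance."""
    svd = TruncatedSVD(n_components=20, random_state=0)
    operator_factors = svd.fit_transform(interactions.values)   # latent operator profiles
    game_factors = svd.components_                               # latent game profiles

    # Reconstructed scores estimate how well each game may fit each operator.
    scores = operator_factors @ game_factors
    row = interactions.index.get_loc(operator)
    already_live = interactions.iloc[row].to_numpy() > 0
    ranked = np.argsort(scores[row])[::-1]
    return [interactions.columns[i] for i in ranked if not already_live[i]][:top_n]
```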
Keywords: machine learning, recommender systems, iGaming, game recommendations
Medius
Unveiling transportation modes with mobile positioning data
Medius is an award-winning Slovenian software engineering company & technology solution provider that helps companies achieve higher business impact by implementing innovative data-driven technologies that cannot be bought off the shelf. They use state-of-the-art machine learning and data science approaches to solve tasks such as customer segmentation, time series forecasting, data anonymisation, and others. Their challenge this year focuses on detecting transportation modes with mobile positioning data (MPD). The dataset you will be given covers approximately one-third of all mobile network traffic in Slovenia. The data is fully anonymized and in a log-like format, with three parameters: the timestamp of an event, the device ID (changed daily), and the geo-localized cone where the device was most likely present.
Participants may also be provided with external data, such as:
- Maps/Geographical Data: Including road networks, public transport routes, and points of interest (POIs).
- Weather Data: Historical weather conditions for the timestamped events.
Participants will have to identify the dominant mode of transportation for each user during a specified period. Transportation modes may include: on foot, bike, moped, car, bus, truck, train. This challenge tests participants’ skills in preprocessing real-world noisy data and designing robust ML pipelines to predict user behaviour, with potential applications in urban planning and personalized services.
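A common first step is to turn the raw event log into per-device movement features, above all the speed between consecutive observations, and to train a standard classifier on top. The sketch below assumes the geo-localized cones have already been mapped to approximate coordinates and uses hypothetical column names (device_id, timestamp, lat, lon).

```python
# A minimal sketch of speed-based feature engineering for mode detection;
# the column names (device_id, timestamp, lat, lon) assume the geo-localized
# cones have already been mapped to approximate coordinates.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def device_features(events: pd.DataFrame) -> pd.DataFrame:
    """Summarize each device by speed statistics between consecutive events."""
    events = events.sort_values(["device_id", "timestamp"]).copy()
    grouped = events.groupby("device_id")
    dist_km = haversine_km(grouped["lat"].shift(), grouped["lon"].shift(),
                           events["lat"], events["lon"])
    dt_hours = (events["timestamp"] - grouped["timestamp"].shift()).dt.total_seconds() / 3600
    events["speed_kmh"] = dist_km / dt_hours
    return events.groupby("device_id")["speed_kmh"].agg(["mean", "median", "max", "std"])

def train_mode_classifier(features: pd.DataFrame, modes: pd.Series) -> RandomForestClassifier:
    """Map per-device speed summaries to labelled transportation modes."""
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(features.fillna(0), modes)
```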
Keywords: transportation mode detection, call detail records (CDR), geo-localization, mobility analysis, spatio-temporal data, feature engineering, handling noisy data, urban planning
Outbrain
Graph neural networks for conversion rate estimation
Online advertisements traditionally focus on estimating the probability of a user clicking on an ad (CTR prediction). However, recent industry trends have shifted towards predicting the follow-up actions of a user after a click, known as conversions. These conversions, which can vary widely (e.g., landing on a page, keeping the session open for a specified time, filling out a questionnaire, adding an item to a basket, etc.), occur on the advertiser’s web page and make the conversion prediction (CVR prediction) problem particularly challenging.
To tackle this complexity, we propose modeling the data as a graph, where publishers (websites or mobile apps) and conversions are represented as nodes, and the performance of conversions on a publisher is represented with a weighted link. Our project aims to leverage Graph Neural Networks (GNNs) to predict conversion performance on specific publishers. By utilizing GNNs, we seek to capture complex relationships within the data and provide accurate predictions of conversion performance for various publishers.
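A minimal sketch of how such a model could look in PyTorch Geometric is given below; it assumes publisher and conversion nodes are merged into a single node set with a shared feature matrix, and the dot-product edge scorer and all tensor names are illustrative choices rather than the project’s prescribed design.

```python
# A minimal sketch of an edge-scoring GNN in PyTorch Geometric; it assumes
# publisher and conversion nodes are merged into one node set with feature
# matrix x and connectivity edge_index, and that observed conversion
# performance provides a regression target for (publisher, conversion) pairs.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class ConversionGNN(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)

    def forward(self, x, edge_index, pairs):
        # Two rounds of message passing produce node embeddings.
        h = F.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index)
        # Score each (publisher, conversion) pair by the dot product of embeddings.
        src, dst = pairs
        return (h[src] * h[dst]).sum(dim=-1)
```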
Keywords: digital advertising, deep learning, conversion prediction, big data, graph neural networks
Leveraging batch size for different model architectures and optimizers
Batch size corresponds to the number of data instances considered simultaneously during training. In a real-life setting, batch size determines training speed while at the same time offering interesting trade-offs in terms of predictive performance. Understanding scaling laws for batch size, and its possible relations with other hyperparameters, is an under-explored area. The findings of such research would enable the analytic (closed-form) determination of optimal batch sizes for a given use case/model and potentially help study general model behavior. One such use case would be to (greatly) reduce model complexity (the number of parameters) while preserving performance, all by setting the right batch size and other hyperparameters.
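As a toy illustration of the experimental building block such a study needs, the sketch below sweeps batch size for a small MLP on synthetic data; the model, the data, and the grid of values are stand-ins for whatever architectures, optimizers, and tasks the project eventually targets.

```python
# A toy sketch of a batch-size sweep; the small MLP and the synthetic data are
# stand-ins for whatever architectures, optimizers, and tasks the study targets.
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for batch_size in [32, 128, 512, 2048]:
    clf = MLPClassifier(hidden_layer_sizes=(64,), batch_size=batch_size,
                        max_iter=30, random_state=0)
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    # Accuracy vs. wall-clock time across batch sizes (and, in the full study,
    # across model sizes and optimizers) is the raw material for a scaling law.
    print(batch_size, round(time.perf_counter() - start, 1), clf.score(X_te, y_te))
```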
Keywords: digital advertising, deep learning, big data, batch size, hyperparameter optimization
PredictLeads
Finding the most similar logo
Identifying a company’s presence on their own or another company’s website is an important component of collecting business information. Companies often appear with their logo, so the logo is an effective identifier if we match it correctly. However, different companies have similar logos, logos come in different variants, and logo images can have different sizes and formats, raster or vector, all of which makes automating logo identification more difficult.
The main goal of the project is to implement one or more approaches to estimating logo similarity and a framework for evaluating them. Because logo identification is a very frequent task, computation and memory constraints also have to be taken into account when designing a solution. The students will be provided with a partially labelled dataset of ~100M logo images.
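One candidate approach is embedding-based similarity with a pretrained vision backbone, as sketched below; the file names are placeholders, and at the scale of ~100M logos the pairwise comparison shown would have to be replaced by an approximate nearest-neighbour index.

```python
# A minimal sketch of embedding-based logo similarity with a pretrained backbone;
# the file names are placeholders, and at ~100M logos the pairwise comparison
# shown here would be replaced by an approximate nearest-neighbour index.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
backbone = models.resnet18(weights=weights)
backbone.fc = torch.nn.Identity()        # keep the 512-d penultimate features
backbone.eval()
preprocess = weights.transforms()        # resize/normalize as the backbone expects

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGB")
    return F.normalize(backbone(preprocess(img).unsqueeze(0)), dim=-1)

# Cosine similarity of L2-normalized embeddings serves as the similarity score.
score = (embed("logo_a.png") * embed("logo_b.png")).sum().item()
print(score)
```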
Keywords: predictive machine learning, computer vision, similarity search
Classifying companies by industry
A company’s industry classification is an important piece of information, for example, when searching for potential customers and competitors. One commonly used standard is the North American Industry Classification System (NAICS): https://www.naics.com/search/. Unfortunately, it is only available for a limited number of companies, so an efficient and effective way of categorizing the millions of other companies would be very beneficial.
The main goal of the project is to implement one or more approaches to assigning a NAICS code to a company, based on the structured and unstructured information provided on their website (page texts, jobs, news). The proposed approaches also have to be thoroughly evaluated, providing insights into parts of the classification hierarchy where they work well and parts where they fail. The students will be provided with a dataset of at least 30k company websites with industry labels.
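A simple baseline to benchmark more elaborate approaches against could be a bag-of-words classifier over the concatenated website text, as sketched below; the texts and labels are assumptions about how the provided dataset could be flattened into one string and one NAICS code per company.

```python
# A minimal sketch of a bag-of-words baseline for NAICS classification; `texts`
# (one concatenated string of page/job/news content per company) and `labels`
# are assumptions about how the provided dataset could be flattened.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_naics_baseline(texts, labels):
    """Fit a TF-IDF + logistic-regression classifier mapping website text to NAICS codes."""
    model = make_pipeline(
        TfidfVectorizer(max_features=100_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model
```

A hierarchical variant could first predict the 2-digit NAICS sector and then refine within it, which also makes the requested per-level error analysis more natural.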
Keywords: predictive machine learning, taxonomy/hierarchy classification, natural language processing
DS@FRI
Who’s a (the) good doggy? Identifying dogs from video
Video-based re-identification of people is an important practical task and area of computer vision. However, relatively little has been done for animals and even less for dogs in particular. Can we apply similar solutions to re-identify dogs from video? Do we need a different approach? Can it even be done?
Our task will be to review the field and implement a robust solution, hopefully from components that are already readily available. As part of the project, we’ll also aim to compile and publish a (world first?) benchmark dataset for dog re-identification from video.
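If the person re-identification recipe does transfer, a first baseline might look like the sketch below: embed sampled frames of a dog track with an off-the-shelf backbone, average them into a single track descriptor, and match against a gallery by cosine similarity; every component here is an assumption to be validated during the literature review.

```python
# A minimal sketch of track-level matching borrowed from person re-identification;
# the per-frame embeddings are assumed to come from some off-the-shelf backbone,
# and the gallery is a hypothetical mapping from dog IDs to track descriptors.
import numpy as np

def track_descriptor(frame_embeddings: np.ndarray) -> np.ndarray:
    """Average L2-normalized per-frame embeddings into one descriptor per track."""
    normed = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    mean = normed.mean(axis=0)
    return mean / np.linalg.norm(mean)

def identify(query: np.ndarray, gallery: dict) -> str:
    """Return the gallery dog whose descriptor is most similar (cosine) to the query."""
    return max(gallery, key=lambda dog_id: float(query @ gallery[dog_id]))
```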
Keywords: machine learning, computer vision, video, puppies
Past Competitions and Awardees
- DataScience@UL-FRI Project Competition 2024
- DataScience@UL-FRI Project Competition 2023
- DataScience@UL-FRI Project Competition 2022
- DataScience@UL-FRI Project Competition 2021
- DataScience@UL-FRI Project Competition 2020