Project Competition

The DataScience@UL-FRI Project Competition is an annual competition whose main goal is to connect students, academics, and industry. Over the course of a semester, teams of students work on real-world data science problems, guided by co-supervisors from academia and industry. The best teams are invited to present their work at the competition finals. The competition allows companies to showcase their challenges to academics and gives students the opportunity to gain hands-on experience in a structured but fun way.

Who can compete and why you should compete

The competition is open to all students; it is not limited to students enrolled at the Faculty of Computer and Information Science, University of Ljubljana (FRI), nor to University of Ljubljana (UL) students in general. In other words, if you are a student, you can compete, regardless of where you study.

First and foremost, through the competition you can gain a tremendous amount of invaluable knowledge. You will work on a real-world problem that a top data science company in the region is facing. To get to a good solution, your work will be supervised by top experts from both industry and academia. Secondly, by competing you can earn credit points for elective courses (Computer Science in Practice or Extracurricular Professional Activity). Do not worry, if you did not enroll in these courses this year, you can do so next year and earn credits then! Note that credit points can be earned only if you are enrolled at FRI or if your UL study program allows you to take these electives. Last but not least, by doing well you can win a nice prize!

Signup instructions

For FRI Data Science students

A team counts as a FRI Data Science team if it has at least one member from that study program. You can sign up as a team of 3, a team of 2, or solo. We will try to add solo applicants to other teams to reduce the number of groups and the mentoring workload. A team’s application should include the names, emails, and student IDs of all team members. One member of the team should submit the application to jure.demsar@fri.uni-lj.si by February 2, 2024.

For non-FRI Data Science students

Teams that do not have any FRI Data Science students can also have at most 3 members. Just like in the previous case, we will try to add solo applicants to other teams to reduce the number of groups and the mentoring workload. A team’s application should include the names, emails, and student IDs of all team members. For non-FRI Data Science teams, a mandatory attachment to the application is a short (max 1 page) motivation letter. The letter should describe why you are applying and why you think you will do well in the competition. This is just a safety net to ensure that we receive only serious applications. A lot of work is put into the competition, both on the FRI side and on the industry side, so we try to eliminate applications from teams that sign up and then never do anything or drop out of the competition after a week or two. One member of the team should submit the application to jure.demsar@fri.uni-lj.si by February 2, 2024.

Prizes

All enrolled teams (with the exception of the 2nd year FRI Data Science students) compete for the three main prizes:

  • First place: 3000€,

  • Second place: 2100€,

  • Third place: 1500€.

Furthermore, every team can also win at most one special prize. These prizes are:

  • Best team composed only of undergraduate students: 1000€,

  • Best team composed of students that are not enrolled in a FRI program: 1000€,

  • Best team composed of non-Data Science students: 1000€,

  • Best team composed of 2nd year Data Science students: 1000€.

Timeline

  1. Info meeting: 12. 1. 2024, 13:00, https://uni-lj-si.zoom.us/j/99236444380.

  2. Application submission deadline: 2. 2. 2024.

  3. Competition start: 19. 2. 2024.

  4. Interim report: April 2024.

  5. Final report and a short presentation: May 2024.

Topics for the FRI Data Science Project Competition 2024

Zurich Insurance / ZCAM
A cross-market recommendation system in insurance

Recommendation systems are at the core of many services we use every day (YouTube, Netflix …). Their goal is to analyze clients’ historical data and behavior in order to recommend new content that a user might like. In other words, they personalize the user’s experience in order to serve content that a particular user is more likely to find interesting.

Zurich is one of the largest insurance companies and offers a plethora of products. The goal of this task is to utilize modern machine learning techniques (such as transfer learning) across different datasets and multiple markets in order to develop a recommendation system that will predict customers’ needs and generate product suggestions tailored to each user.
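
To give a feel for the core idea (a generic illustration, not Zurich’s actual setup), a minimal item-based collaborative filtering sketch over a made-up product-holding matrix could look like this:

```python
import numpy as np

# Toy customer x product matrix (1 = customer holds the product); the
# customers and the four hypothetical insurance products are made up.
holdings = np.array([
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
])

# Cosine similarity between products (columns).
norms = np.linalg.norm(holdings, axis=0)
sim = (holdings.T @ holdings) / np.outer(norms, norms)

def recommend(customer: int, top_k: int = 2) -> np.ndarray:
    """Score unheld products by similarity to the products the customer holds."""
    owned = holdings[customer]
    scores = sim @ owned
    scores[owned == 1] = -np.inf  # never recommend what the customer already has
    return np.argsort(scores)[::-1][:top_k]

print(recommend(0))  # indices of products to suggest to customer 0
```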

Keywords: recommendation systems, machine learning, insurance

Model interpretability

As the capabilities of machine learning models progress, so does their complexity. As a result, many models are more or less black boxes which tell us practically nothing about their reasoning, however amazing it might be. This is suboptimal, as in many cases understanding a model’s decisions or predictions can be extremely useful. Zurich faces a similar problem with their models. Your goal here will be to study the modern literature on explainable AI to find approaches that could be useful for Zurich’s use cases. In the second part of this task you will then evaluate how useful these approaches really are.
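
As a taste of what such approaches look like in practice, here is a minimal post-hoc explanation sketch using the SHAP library, one of the methods you are likely to meet in the literature (the model and data are stand-ins, not Zurich’s):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data and black-box model.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# Model-agnostic explanation of individual predictions.
explainer = shap.Explainer(model.predict, X[:100])  # background sample
shap_values = explainer(X[:10])                     # explain 10 predictions
print(shap_values.values.shape)                     # (10, 8) feature attributions
```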

Keywords: explainable AI, complex models, machine learning, black box approaches

Siemens
Tabular agents

Tabular agents represent a significant innovation for large enterprises like Siemens, which generate and manage massive volumes of tabular data through diverse internal systems such as ERPs and CRMs. These agents stand at the forefront of addressing the challenge of querying and interpreting this data efficiently using natural language. Several approaches and best practices have emerged recently to optimize these tabular agents for correctness, accuracy, and relevance in their responses.

This project aims to review the state-of-the-art in LLM-based tabular agents by conducting a comparative analysis of the performance and limitations of different LLMs, both open-source and proprietary, as well as various development frameworks, while also taking into account the size and complexity of different datasets and schemas.
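
For orientation, most tabular agents boil down to a text-to-SQL loop: prompt an LLM with the schema and the question, then execute the generated query. A minimal sketch, where `complete` is a stand-in for whichever LLM you end up benchmarking and the schema is made up:

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, created DATE);"

PROMPT = """You are a SQL assistant. Given this schema:
{schema}
Write a single SQLite query answering: {question}
Return only SQL."""

def complete(prompt: str) -> str:
    # Stand-in for the actual LLM call; returns a canned answer so the
    # sketch runs end-to-end. Replace with the model under evaluation.
    return "SELECT customer, SUM(amount) FROM orders GROUP BY customer;"

def answer(question: str, db: sqlite3.Connection) -> list:
    sql = complete(PROMPT.format(schema=SCHEMA, question=question))
    return db.execute(sql).fetchall()  # in practice: validate/sandbox the SQL first

db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
db.execute("INSERT INTO orders VALUES (1, 'ACME', 99.5, '2024-01-15')")
print(answer("What has each customer spent in total?", db))
```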

Keywords: generative AI, LLM, tabular agents

Hybrid AI

Recent advances in generative AI and LLMs have shown that deep neural networks may be capable of simple reasoning. However, these models usually fail to tackle complex logical reasoning, to determine causal relations, and to generalize. The AI research community is split into two factions: those who believe that Artificial General Intelligence (AGI) can be reached through the sole progress of data-driven approaches (e.g., deep neural networks) and those who advocate the necessity of Hybrid AI, i.e., the combination of data-driven AI (machine learning) and symbolic AI (logic rules, constraints, and reasoning).

In the first phase of the project, you will survey emerging approaches in the domain of neuro-symbolic AI – e.g., logic tensor networks (LTN), logical neural networks (LNN), graph neural networks (GNN) … – with a focus on practical applications and use cases. In the second phase, you will compare data-driven and hybrid approaches on a use case provided by Siemens (e.g., recommender systems, semantic image interpretation …). Alternatively, based on the literature review, students can propose their own use case against which both approaches will be evaluated.
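
To make the “machine learning + symbolic AI” combination concrete, one recurring neuro-symbolic pattern is to turn a logic rule into a differentiable penalty added to the training loss. A minimal PyTorch sketch with a toy rule and network (not a Siemens use case):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

def rule_penalty(x, probs):
    # Toy rule: "if feature 0 is positive, class 1 should be likely",
    # encoded as a soft constraint penalizing P(class 0) where the rule applies.
    applies = (x[:, 0] > 0).float()
    return (applies * probs[:, 0]).mean()

x = torch.randn(32, 4)
y = torch.randint(0, 2, (32,))

logits = net(x)
probs = torch.softmax(logits, dim=1)
loss = nn.functional.cross_entropy(logits, y) + 0.5 * rule_penalty(x, probs)
loss.backward()  # the symbolic rule now shapes the gradients too
```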

Keywords: hybrid AI, neuro-symbolic AI, research

Outbrain
Efficient data sub-sampling

Efficiently conducting experiments with real-scale click-through-rate data can be cumbersome and time-consuming due to the sheer data sizes commonly encountered in the industry. A promising research direction in this area is to identify heuristic approaches that reduce the size of the data as much as possible while ensuring the information is preserved. Successful data pruning/simplification can have a profound impact on the time required to evaluate new models and/or features, and thus on the efficiency of whole research cycles.

Thus, this task revolves around approaches for efficient and informed data downsampling/simplification. The purpose of the task is to investigate whether recent promising approaches can be applied to real-life data (such as Outbrain’s use case). The goal, therefore, is to identify one or more approaches that are simple enough to operate at scale (preferably SQL-compatible) and can be shown to preserve the performance of a downstream learning algorithm such as DeepFM or similar.
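
One simple, SQL-friendly baseline to compare the literature against is deterministic hash-based sampling of users, which keeps each retained user’s click history intact. A pandas sketch (the column names are assumptions):

```python
import hashlib
import pandas as pd

def keep_user(user_id: str, rate: float = 0.1) -> bool:
    """Deterministically keep ~rate of users; the same idea maps to SQL as,
    e.g., MOD(FARM_FINGERPRINT(user_id), 100) < 10 in BigQuery."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return (h % 1000) < rate * 1000

# Toy click log standing in for the real data.
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "clicked": [0, 1, 0, 1],
})
sample = df[df["user_id"].map(keep_user)]  # downsampled, reproducible subset
```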

Keywords: deep learning, sampling, digital advertisement

State-of-the-art CTR and CVR Architectures

Outbrain uses state-of-the-art deep machine learning models to predict online actions, such as whether a website visitor will click on a displayed ad, whether a frequent website user will create an account, or whether an online shopper will purchase a suggested product. Accurately predicting the probabilities of these actions is crucial, as they determine the value of each specific ad impression. As a result, machine learning models are at the core of Outbrain’s business. The choice of model architecture, in addition to proper optimization and training, is paramount to successful predictions. New and exciting prediction models are being unveiled on a monthly basis, utilizing developments in transformers, interest modeling, capsule networks, and other deep-learning areas.

The task of the students is to research, implement, develop, and test various novel prediction algorithms and compare their performance against existing state-of-the-art models. Students will receive a large anonymized dataset resembling production data and access to Outbrain’s framework for training TensorFlow models incrementally. The main focus will be on state-of-the-art architectures, such as deep&cross upgrades/variants, transformers, and multi-task approaches.
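
For orientation, the cross layer at the heart of the deep&cross family computes x_{l+1} = x0 ⊙ (W xl + b) + xl, explicitly crossing features with the original input at every depth. A minimal Keras sketch (the feature dimension is made up):

```python
import tensorflow as tf

class CrossLayer(tf.keras.layers.Layer):
    """One cross layer: x_{l+1} = x0 * (xl @ W + b) + xl."""

    def build(self, input_shape):
        dim = input_shape[0][-1]
        self.w = self.add_weight(shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(shape=(dim,), initializer="zeros")

    def call(self, inputs):
        x0, xl = inputs
        return x0 * (tf.matmul(xl, self.w) + self.b) + xl

inputs = tf.keras.Input(shape=(16,))   # made-up feature dimension
x = CrossLayer()([inputs, inputs])     # first crossing: x0 with itself
x = CrossLayer()([inputs, x])          # deeper crossing, still anchored at x0
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```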

Keywords: machine learning, deep learning, binary classification

Medius
Large and Deep models for Anomaly Detection

Medius is a custom software development company with a main focus on AI projects. They use state-of-the-art machine learning to solve tasks such as customer segmentation, time-series forecasting, and data anonymization. This challenge comes from a collaboration with their client Hisense. The challenge revolves around minimizing maintenance by utilizing anomaly detection on an industrial process line. The process line is composed of three wall presses equipped with various sensors monitoring temperature, voltage, vibrations, and more. You will be provided with a detailed, non-anonymized dataset about the pressing process and a full explanation of the workings of each sensor. Your tasks will be the following:

  • Generation of synthetic data for augmented model training.
  • Implementation of state-of-the-art self-supervised, semi-supervised, or unsupervised machine learning methods for anomaly detection (see the sketch after this list).
  • Evaluation on multiple labeled and unlabeled datasets.
  • Detailed explainability and sensor importance on different clusters of anomalies.
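
As one possible starting point for the implementation task above, a minimal LSTM autoencoder flags windows whose reconstruction error is unusually high; the window length, sensor count, and training data below are made up:

```python
import numpy as np
import tensorflow as tf

WINDOW, SENSORS = 64, 8  # made-up window length and sensor count

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, SENSORS)),
    tf.keras.layers.LSTM(32),                          # encode window to a vector
    tf.keras.layers.RepeatVector(WINDOW),
    tf.keras.layers.LSTM(32, return_sequences=True),   # decode back to a sequence
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(SENSORS)),
])
model.compile(optimizer="adam", loss="mse")

# Stand-in for windows of normal operation; train to reconstruct them.
normal = np.random.randn(1000, WINDOW, SENSORS).astype("float32")
model.fit(normal, normal, epochs=5, verbose=0)

def anomaly_score(windows):
    """High reconstruction error = the window does not look like normal operation."""
    recon = model.predict(windows, verbose=0)
    return ((windows - recon) ** 2).mean(axis=(1, 2))
```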

Keywords: multivariate time-series, anomaly detection, transformers, recurrent auto-encoders and GANs, self-supervised learning

In516ht
Creative/image similarity

This topic is a product of In516ht’s collaboration with Celtra. Celtra owns a platform for creating and publishing creatives (ads). Their main objective is to create scalable, top-performing creatives across different dimensions like channels, sizes, and content. Celtra’s tool integrates with different social channels like Meta, Google, etc., but clients can also export the creatives and upload them themselves, outside of Celtra’s toolkit. Such a workflow is not optimal for Celtra, as they lose the connection between the creative and the location where the ad was published (its social channel). Celtra can still get information about the creative’s performance, but there is no link to a creative inside their toolkit, even if the creative was created via their system. Because of this, the creatives inside their system might lack crucial data about their performance.

To overcome this, they would like to develop an algorithm that can inspect a creative that was published outside of their toolkit and link it with the creative inside the toolkit if the two are the same (or similar enough). In other words, given an image (creative, ad), your goal is to find out which Celtra creative that image is linked to.
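
A natural baseline (an assumption on our part, not Celtra’s prescribed method) is to embed every creative with a pretrained vision backbone and match by cosine similarity:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained backbone as a feature extractor (classification head removed).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(backbone(x), dim=1)

def best_match(external: str, toolkit: list[str]) -> str:
    """Cosine similarity between an external ad and each creative in the toolkit."""
    q = embed(external)
    scores = [(p, (q @ embed(p).T).item()) for p in toolkit]
    return max(scores, key=lambda s: s[1])[0]
```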

Keywords: computer vision, image similarity, advertisement

Chatbot for semantic video content search and recommendation

This topic is a product of In516ht’s collaboration with United.Group. United.Group is a streaming service provider that is present in 7 different countries with several different languages. They want to improve the way people of different demographics and different skill levels interact with the EON (TV box) platform when searching for content. The demographic is truly heterogeneous, from a Greek grandma searching for her favorite soap opera to a zoomer from Zagreb who can’t decide what movie to watch on a Saturday night.

The goal of this project is to use the data and metadata about the content that is available for streaming through their platform and build a chatbot that users can turn to for content they might like. In other words, the goal of this project is to prepare a natural language interface where users can search for content with queries such as: “I want an action movie with robots.” or “Do you have any old movies with Elizabeth Taylor?”. The initial solution will most likely focus on a single language; however, supporting multiple languages is a big bonus.
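
The retrieval core of such a chatbot can be sketched with off-the-shelf multilingual sentence embeddings; the catalog entries below are made up:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

catalog = [
    "The Terminator (1984): a cyborg assassin is sent back in time ...",
    "Cleopatra (1963): historical drama starring Elizabeth Taylor ...",
]
catalog_emb = model.encode(catalog, convert_to_tensor=True)

def search(query: str, top_k: int = 3):
    """Return the catalog entries semantically closest to the user's query."""
    q = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q, catalog_emb, top_k=top_k)[0]
    return [(catalog[h["corpus_id"]], h["score"]) for h in hits]

print(search("Do you have any old movies with Elizabeth Taylor?"))
```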

Keywords: large language models, chatbot, recommendation system, search system, streaming

Transactional monitoring for anti-money laundering

Mukuru (South Africa) is one of the leading payment providers used for remittance services. Remittance is the process of sending money across borders, usually by migrant workers to their families or friends in their home countries. Criminals may use remittance channels to transfer illicit funds, evade taxes, or support illegal activities (a.k.a. money laundering and terrorist financing). To combat this risk, remittance providers like Mukuru need to comply with various anti-money laundering (AML) regulations and standards, which require them to conduct customer due diligence, record-keeping, reporting, and transaction monitoring.

Transaction monitoring is the process of screening and analyzing transactions for any indicators of suspicious or unusual behavior, such as:

  • Large or frequent transactions that are not typical for a customer or a corridor.
  • Transactions to or from high-risk countries or regions with weak AML controls, sanctions, or political instability.
  • Transactions that involve multiple accounts, intermediaries, or beneficiaries, or that are structured to avoid detection or reporting.
  • Transactions that do not match the customer’s profile, occupation, income, or expected activity.

The goal of this task is to develop a data-science-enabled transaction monitoring solution that can help Mukuru automate and enhance their AML compliance. Specifically, the goal is to develop a machine learning model that can analyze and flag transactions or customer behaviour that may indicate money laundering or other financial crimes, based on predefined and adaptive rules, scenarios, and thresholds. Key emphasis should be placed on models built on adaptive rules, as those may capture scenarios of nefarious behavior that we do not currently search for, i.e., a model that can learn from historical data and identify patterns or outliers that deviate from normal behavior.
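
As a minimal example of the “adaptive rules” direction, an Isolation Forest can flag transactions that deviate from historical behavior without any hand-written thresholds; the features below are assumptions about what the data might contain:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy transaction features standing in for the real data.
transactions = pd.DataFrame({
    "amount": [120.0, 95.0, 15000.0, 110.0, 80.0],
    "tx_per_week": [2, 3, 14, 2, 1],
    "corridor_risk": [0.1, 0.1, 0.9, 0.2, 0.1],  # e.g. destination-country risk score
})

model = IsolationForest(contamination=0.01, random_state=0).fit(transactions)
transactions["suspicious"] = model.predict(transactions) == -1  # -1 marks outliers
print(transactions[transactions["suspicious"]])
```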

Keywords: anomaly detection, anti money laundering, time-series analysis, fraud detection

CREAPRO
Forecasting traffic density on selected motorway sections

Forecasting trends and density in motorway traffic is crucial for smart and informed infrastructure planning, as well as for a number of other practical applications (maps, navigation …). This is a typical example of a problem from the interesting field of time-series forecasting.

Your challenge in this project will be to forecast traffic density on selected motorway sections based on historical data. The core data consists of information about counts (or frequencies) of cars, average velocity, vehicle type, etc. However, including other attributes and external data, such as weather, holidays, and vacation periods, should bring a significant performance boost to your models.
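
A simple feature-based baseline to start from might look like the sketch below; the file and column names are assumptions about the provided data:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("traffic_counts.csv", parse_dates=["timestamp"])  # hypothetical file
df = df.sort_values("timestamp")

# Lagged counts plus simple calendar features (extend with weather, holidays, ...).
df["lag_1h"] = df["vehicle_count"].shift(1)
df["lag_24h"] = df["vehicle_count"].shift(24)
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.weekday
df = df.dropna()

features = ["lag_1h", "lag_24h", "hour", "weekday"]
train, test = df.iloc[:-168], df.iloc[-168:]  # hold out the last week (hourly data)
model = GradientBoostingRegressor().fit(train[features], train["vehicle_count"])
print(model.score(test[features], test["vehicle_count"]))
```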

Keywords: machine learning, time-series forecasting, traffic forecasting

Stock optimization of materials

Stock optimization of materials is the process of managing and regulating the supply, storage, and distribution of stock to ensure that the right products are available at the right time, while reducing inventory costs and minimizing the risk of excess materials. The goal of stock optimization is to help companies avoid stockouts and overstocking and minimize inventory holding and shortage costs.

Therefore, your job will be to collect data on the materials you want to optimize, clean it, and conduct exploratory data analysis. Furthermore, you will develop a model using machine learning algorithms to predict future demand for the materials and use optimization techniques to determine the optimal stock level for each material.
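
Once you have a demand forecast, a classical way to turn it into a stock level is the newsvendor quantile: stock up to the demand quantile given by the ratio of shortage cost to total cost. A small sketch with made-up costs and a stand-in forecast distribution:

```python
import numpy as np

# Made-up per-unit economics for one material.
shortage_cost = 9.0  # cost of missing one unit of demand
holding_cost = 1.0   # cost of one unit of excess stock

# Critical ratio: stock up to this quantile of forecast demand.
critical_ratio = shortage_cost / (shortage_cost + holding_cost)  # 0.9

# Stand-in for a probabilistic demand forecast (e.g. samples from your model).
demand_samples = np.random.normal(loc=500, scale=60, size=10_000)

optimal_stock = np.quantile(demand_samples, critical_ratio)
print(f"stock up to ~{optimal_stock:.0f} units")
```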

Keywords: machine learning, time-series forecasting, stock optimization

PC7
Recovering relational structure from a set of files

Users often keep their data in Excel, CSV, and other formats that do not contain or even support explicit information about the data (type, bounds, support, etc.) or the relations in it (between different columns, tables, etc.). The main question this project aims to answer is to what extent we can automatically recover this information.

First, we will survey what has already been done. Next, we will use or implement the most promising approach(es) and evaluate them. Finally, as a challenge, we will try to improve on existing approaches in at least one aspect. Relevant datasets will be provided by PC7.
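
As a flavor of one sub-problem, a naive baseline for relation recovery is to detect candidate foreign keys as approximate inclusion dependencies between columns:

```python
import pandas as pd

def candidate_foreign_keys(tables: dict[str, pd.DataFrame], min_overlap: float = 0.95):
    """Yield (child_table.col, parent_table.col) pairs where almost all child
    values appear among a parent column's unique-looking values."""
    for child_name, child in tables.items():
        for parent_name, parent in tables.items():
            if child_name == parent_name:
                continue
            for ccol in child.columns:
                for pcol in parent.columns:
                    pvals = parent[pcol].dropna()
                    if pvals.is_unique:  # looks like a key column
                        cvals = child[ccol].dropna()
                        if len(cvals) and cvals.isin(pvals).mean() >= min_overlap:
                            yield (f"{child_name}.{ccol}", f"{parent_name}.{pcol}")
```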

Keywords: machine learning, relational data, natural language processing

One-size-fits-all relational database predictor

One of the most important practical problems in machine learning is being able to predict missing or future observations in relational data without much or any feature engineering and with relatively little data. Recently, with the development of graph neural networks and commercial tools such as Kumo.ai, it appears that we are getting very close to a viable solution.

The main goal of this project is to verify if this is indeed the case. In the first part you will have to identify plausible solutions. Next, you will implement and evaluate a proof-of-concept machine learning model that tackles this problem. Relevant datasets will be provided by PC7.
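
Whatever solutions you identify, the first step most GNN-based approaches share is turning relational tables into a heterogeneous graph. A sketch with toy tables, assuming torch_geometric (the library behind many such solutions) is available:

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()
data["customer"].x = torch.randn(3, 8)  # 3 customers, 8 features each
data["order"].x = torch.randn(5, 4)     # 5 orders, 4 features each

# A foreign key order.customer_id -> customer.id becomes typed edges.
customer_of_order = torch.tensor([0, 0, 1, 2, 2])
data["customer", "places", "order"].edge_index = torch.stack(
    [customer_of_order, torch.arange(5)]
)
print(data)  # ready for a heterogeneous GNN to predict missing values
```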

Keywords: machine learning, relational data, graph neural networks

DS@UL-FRI
Tampered image detection

As more and more business moves to the digital space and no longer requires our physical presence, it is becoming very important to have robust techniques for validating the various documents and images that get sent via digital channels. The most important thing is detecting whether these documents and images (e.g., ID photos) are genuine and have not been tampered with. By tampering we mean that the original document or image was later changed with the help of a piece of software (e.g., Photoshop). The goal of this task is to develop a machine learning approach for tampered image detection in order to prevent fraud. A core part of the approach is identifying and verifying the authenticity of digital images to determine if they have been manipulated or altered in any way.
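
One classical, easy-to-reproduce baseline (a suggestion, not the prescribed method) is error level analysis: resave a JPEG at a fixed quality and inspect where the recompression error is locally inconsistent, which often highlights edited regions. A PIL sketch:

```python
import io
from PIL import Image, ImageChops

def ela(path: str, quality: int = 90) -> Image.Image:
    """Error level analysis: tampered regions often recompress differently."""
    original = Image.open(path).convert("RGB")
    buf = io.BytesIO()
    original.save(buf, "JPEG", quality=quality)
    resaved = Image.open(buf)
    diff = ImageChops.difference(original, resaved)
    # Stretch the (small) differences so they are visible for inspection.
    extrema = max(ch[1] for ch in diff.getextrema()) or 1
    return diff.point(lambda p: min(255, p * 255 // extrema))

# Hypothetical input file; bright patches in the output warrant a closer look.
ela("suspect_id_photo.jpg").save("ela_map.png")
```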

Keywords: computer vision, fraud detection, insurance, image tampering

Past Competitions and Awardees

