Project Competition

The DataScience@UL-FRI Project Competition is an annual competition whose main goal is to connect students, academics, and industry. Over the course of a semester, teams of students work on real-world data science problems, guided by co-supervisors from academia and industry. The best teams are invited to present their work at the competition finals. The competition allows companies to showcase their challenges to academics and gives students the opportunity to gain hands-on experience in a structured but fun way.

Interested? Tune into the Project Competition 2021 introductory event on Zoom, which will take place on 15 January 2021 at 9:00. If you have any questions, feel free to join the official Slack workspace - users with a faculty or student email can join without an invitation.


Who can apply?


Master’s and undergraduate students can apply. There are no restrictions on team members’ education or prior experience. However, the number of teams we can accept is limited, and teams with a demonstrated ability to perform well in the competition will be given priority.


Why should we apply?


First and foremost, the projects are a great opportunity to gain hands-on experience working on real-world problems with professionals from industry and experts from academia.

Successful participation in the competition can also count towards your coursework (through the courses Računalništvo v praksi and Obštudijska dejavnost), and you might choose to extend the project work into a Master’s or Bachelor’s thesis.

And last but not least - there are prizes!


How can we apply?


Note that the application process is slightly different for teams that consist only of students from outside the Data Science Master’s programme. After the application stage, the competition is the same for all participants.

Submit your application by February 5th, 23:59. The application must include:

  • A list of 1-3 team members with contact information. Single-member teams will most likely be merged with other teams.
  • Topic bidding: Rank all topics listed below according to preference from 1 (most preferred) to N (least preferred), where N is the total number of topics. We will take these rankings into account when assigning topics to teams.
  • Only for teams with no Data Science Master’s students: a short motivational letter (max 1 page) describing your motivation for participating in the competition.

Submit the application to jure.demsar@fri.uni-lj.si. Only one member of the team should make the submission!


Project topics


(1) Explaining why certain ads require corrections (Celtra)

In practice, it often happens that already published ads require several corrections before advertisers are happy with them. The goal of this project is to find out whether we can explain why some ads require these corrections and others do not. You will be given a dataset containing information about ads. Besides the number of correction passes executed on each ad, the dataset will include other ad data: size, language, distribution channel, text, image … Your goal will be to explain how these features influence the number of correction passes. You will have to use a machine learning algorithm whose predictions can be explained; the final goal of the task is to visualize and present your findings in a way that is easy to understand for people who are not data science experts.

Keywords: explainable AI, advertising, visualizations
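
An inherently interpretable model, such as a shallow decision tree, is one possible starting point, since its rules can be printed and inspected directly. Below is a minimal sketch using scikit-learn; the file name, feature names, and target column are hypothetical placeholders for the dataset Celtra will provide.

    # Minimal sketch of an interpretable baseline; all names are hypothetical.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor, export_text

    ads = pd.read_csv("ads.csv")  # hypothetical stand-in for the provided data
    features = ["width", "height", "language", "channel", "text_length"]  # assumed columns
    X = pd.get_dummies(ads[features])  # one-hot encode categorical features
    y = ads["correction_passes"]       # assumed target column

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    tree = DecisionTreeRegressor(max_depth=3)  # kept shallow so the rules stay readable
    tree.fit(X_train, y_train)

    print("R^2 on held-out ads:", tree.score(X_test, y_test))
    print(export_text(tree, feature_names=list(X.columns)))  # human-readable rules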

(2) Predicting the influence of the COVID pandemic on the advertisement industry (Celtra)

The goal of this task is to figure out whether we can use the existing dynamics of the COVID pandemic and their influence on the advertisement industry to predict what will happen in the future. You will be given a dataset that is representative of the advertisement industry’s dynamics during the COVID pandemic on a global scale. This dataset contains information about the number of produced and published ads per day, week, month … Your task is to build a model capable of predicting future dynamics in the advertisement industry. To train the model you will have to use publicly accessible data (e.g. Twitter sentiment, news keywords, trends in the pandemic …). Since you are working with time-dependent data, you will have to acknowledge the limitations that arise in time series modelling.

Keywords: COVID, advertising, time series prediction
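
A minimal sketch of such a setup is shown below: daily ad counts joined with one public covariate and a couple of lagged features, evaluated on a time-ordered split. All file and column names are hypothetical.

    # Minimal sketch: forecasting daily ad counts from lags plus a public signal.
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    ads = pd.read_csv("daily_ads.csv", parse_dates=["date"])      # hypothetical file
    cases = pd.read_csv("covid_cases.csv", parse_dates=["date"])  # hypothetical public data
    df = ads.merge(cases, on="date").sort_values("date")

    for lag in (1, 7):  # yesterday's and last week's counts as features
        df[f"ads_lag{lag}"] = df["published_ads"].shift(lag)
    df = df.dropna()

    split = int(len(df) * 0.8)  # time-ordered split - never shuffle time series
    train, test = df.iloc[:split], df.iloc[split:]
    cols = ["ads_lag1", "ads_lag7", "cases"]

    model = LinearRegression().fit(train[cols], train["published_ads"])
    print("R^2 on the held-out tail:", model.score(test[cols], test["published_ads"]))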

(3) Recommending legal documents (Lexpera)

Lexpera offers a legal information portal that provides users with access to tens of thousands of legal documents. Finding relevant information and new, interesting reads can be messy given such a large document collection. That is why Lexpera would like to provide users with a document selection tailored specifically to them. The goal of this project is to explore recent deep learning and graph neural network approaches to recommendation systems and evaluate their performance on this specific domain. To do so, students will be provided with a large anonymized dataset of users, documents, and their interactions, collected over the course of the last year.

Keywords: recommendation system, legal documents, graph neural networks
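
Before moving to graph neural networks, a simple collaborative-filtering baseline is useful for comparison. Below is a minimal sketch of a BPR-style matrix factorization in PyTorch; synthetic random pairs stand in for the real anonymized user-document interactions.

    # Minimal sketch: matrix-factorization baseline on synthetic interactions.
    import torch

    n_users, n_docs, dim = 1000, 5000, 32
    users = torch.randint(0, n_users, (10000,))  # synthetic positive (user, doc) pairs
    docs = torch.randint(0, n_docs, (10000,))

    user_emb = torch.nn.Embedding(n_users, dim)
    doc_emb = torch.nn.Embedding(n_docs, dim)
    opt = torch.optim.Adam(list(user_emb.parameters()) + list(doc_emb.parameters()), lr=0.01)

    for epoch in range(5):
        neg_docs = torch.randint(0, n_docs, docs.shape)  # random negative documents
        pos = (user_emb(users) * doc_emb(docs)).sum(-1)
        neg = (user_emb(users) * doc_emb(neg_docs)).sum(-1)
        loss = -torch.nn.functional.logsigmoid(pos - neg).mean()  # rank observed pairs higher
        opt.zero_grad(); loss.backward(); opt.step()
        print(epoch, loss.item())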

(4) Time series forecasting with exogenous variables (Siemens)

There are many methods for time series forecasting; however, far fewer are designed to allow for exogenous variables. The objective of the project is an updated survey of time series forecasting with a focus on methods that allow for exogenous variables. Empirical validation (on toy and real-world data) should be used to illustrate the advantages and disadvantages of the surveyed methods.

Keywords: time series, forecasting, survey, machine learning, statistics
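
One classical method such a survey would likely cover is SARIMAX, which extends ARIMA with exogenous regressors. A minimal sketch on synthetic data, using statsmodels:

    # Minimal sketch: SARIMAX with one exogenous variable, on synthetic data.
    import numpy as np
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    rng = np.random.default_rng(0)
    exog = rng.normal(size=200)  # synthetic exogenous driver
    y = 0.8 * exog + np.sin(np.arange(200) / 10) + rng.normal(scale=0.1, size=200)

    results = SARIMAX(y[:150], exog=exog[:150], order=(1, 0, 1)).fit(disp=False)
    forecast = results.forecast(steps=50, exog=exog[150:])  # future exog must be supplied
    print(forecast[:5])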

(5) Event detection in audio signals (Siemens)

Recording and analyzing audio from the environment is an important and cheap methodology for detecting events. An event can be defined as an unexpected sound from the environment, a machine, or a human (unsupervised learning), or as a predefined sound pattern (supervised learning). It can be an abrupt short-term sound (e.g., something falling on the floor) or a long-term variation (e.g., a motor slowly deteriorating). There are several methodologies for the analysis of a recorded stream of audio samples (a wav file), and both Python and R provide a large set of libraries for this purpose. Some methodologies rely on the extraction of features directly from the waveform in the time domain. Other methodologies require a representation of the signal in the frequency domain, hence they first apply a Short-Time Fourier Transform (STFT) to the signal and consider each frequency component as a feature, or as an independent time series. This task consists of a review of the available methodologies. Example sounds for practicing and testing can be easily generated using off-the-shelf microphones/smartphones.

Keywords: digital signal processing, audio signal, event detection
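
To illustrate the frequency-domain route, the sketch below computes an STFT with SciPy and flags frames whose energy rises far above the average; a synthetic signal stands in for a real recording (which could instead be loaded with scipy.io.wavfile.read).

    # Minimal sketch: energy-threshold event detection on a synthetic signal.
    import numpy as np
    from scipy.signal import stft

    fs = 16000
    t = np.arange(fs * 2) / fs  # two seconds of near-silence
    x = 0.01 * np.random.randn(t.size)
    x[fs : fs + 800] += np.sin(2 * np.pi * 1000 * t[:800])  # a 50 ms, 1 kHz "event"

    f, times, Z = stft(x, fs=fs, nperseg=512)
    energy = (np.abs(Z) ** 2).sum(axis=0)  # total energy per time frame
    threshold = energy.mean() + 5 * energy.std()
    events = times[energy > threshold]
    print("event detected around t =", events.min(), "s")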

(6) Automatic music composition with MuseNet (Siemens)

MuseNet is a model developed by OpenAI which, given a primer (a short sequence of notes), can predict the most suitable next notes to complete a piece of music (once style and instruments are specified). MuseNet is based on the GPT-2 transformer model, which is mainly used for natural language processing. The task consists of providing an overview of how the transformer model works in general and its advantages and disadvantages compared to recurrent neural network (RNN) and long short-term memory (LSTM) based models. The team shall also explain how the MuseNet concept adapts the transformer to be applied directly to music (symbolically represented in MIDI format) rather than to text.

Keywords: music composition, MuseNet, transformer model
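
The core of every transformer, including GPT-2 and MuseNet, is scaled dot-product self-attention: unlike an RNN or LSTM, which consumes the sequence step by step, it lets every position look at the entire context at once. A minimal single-head sketch in NumPy:

    # Minimal sketch: single-head scaled dot-product self-attention in NumPy.
    import numpy as np

    def attention(Q, K, V):
        # Each output is a weighted mix of all value vectors; the weights measure
        # how well a position's query matches every key.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V

    tokens = np.random.randn(8, 16)  # 8 tokens (e.g. MIDI note events), dimension 16
    Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
    out = attention(tokens @ Wq, tokens @ Wk, tokens @ Wv)
    print(out.shape)  # (8, 16): every token now carries context from the whole sequence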

(7) Evaluating model tradeoffs for click prediction (Zemanta)

At Zemanta, models for predicting clicks are trained on millions of ad impressions every hour. To be able to process large amounts of data, their models have to be fast. Of course, they also want their models to make good predictions, which often means increasing their complexity. The goal of this project is to research the tradeoff between model quality and speed, and to find the best model that is still fast enough for such a use case. The starting point will be factorization machines (FM) and DeepFM (FM + deep neural network) models. Some of the models we would like to evaluate are field-aware FMs and field-weighted FMs, along with their DNN-enhanced versions - of course, students are encouraged to also propose and test a few models of their own. Students will receive a large anonymized dataset resembling production data and, for easier implementation, access to a framework for training TensorFlow models incrementally (with FM and DeepFM code for reference).

Keywords: incremental learning, classification, machine learning, statistics, big data
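
For reference, the factorization machine prediction itself is cheap to compute: the pairwise feature interactions reduce to an O(k·n) expression. A minimal NumPy sketch with synthetic, untrained weights; the real models would be trained incrementally in TensorFlow:

    # Minimal sketch: the FM prediction equation with the O(k*n) reformulation
    #   sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 100, 8  # n features, k latent factors
    w0, w = 0.1, rng.normal(size=n)  # global bias and linear weights (untrained)
    V = rng.normal(scale=0.1, size=(n, k))  # latent factor matrix (untrained)
    x = rng.integers(0, 2, size=n).astype(float)  # a sparse binary feature vector

    linear = w0 + w @ x
    interactions = 0.5 * (((x @ V) ** 2).sum() - ((x ** 2) @ (V ** 2)).sum())
    print("P(click) =", 1 / (1 + np.exp(-(linear + interactions))))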

(8) Clustering web users based on mouse movements (Zemanta)

When we navigate the web, our behaviour exhibits a lot of patterns - how fast we move the mouse, where we pause to read, whether we follow the text with the cursor as we read, whether we rage-click, etc. You will be given anonymized sessions with a time series dataset of mouse movements and some supporting information. This is a research project whose goal is to see what can be derived from mouse movements. Students will get an opportunity to work with time series data, do a lot of data mining and preprocessing, construct interesting features, use them for clustering, apply other unsupervised learning methods, and create useful visualizations. The project is designed to be quite open-ended, and students will closely collaborate with Zemanta’s data scientists.

Keywords: clustering, unsupervised learning, data mining, statistics, analytics
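
A minimal sketch of the pipeline - per-session features extracted from raw (time, x, y) traces, followed by k-means - is shown below; random walks stand in for the real anonymized sessions.

    # Minimal sketch: session-level features from mouse traces, then clustering.
    import numpy as np
    from sklearn.cluster import KMeans

    def session_features(trace):
        # trace: array of shape (n_points, 3) holding time, x, y
        dt = np.diff(trace[:, 0])
        dist = np.hypot(np.diff(trace[:, 1]), np.diff(trace[:, 2]))
        speed = dist / np.clip(dt, 1e-6, None)
        pauses = (dt > 0.5).sum()  # pauses longer than half a second
        return [speed.mean(), speed.std(), dist.sum(), pauses]

    rng = np.random.default_rng(0)
    sessions = [np.cumsum(rng.normal(size=(200, 3)) ** 2, axis=0) for _ in range(50)]
    X = np.array([session_features(s) for s in sessions])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(np.bincount(labels))  # sessions per behaviour cluster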

(9) How good is my plot? (DS@FRI)

Statistical graphs, such as bar charts, box plots, and scatterplots, are staples of a data scientist’s work. Can we build a machine learning model that “looks” at an image of a graph and provides feedback on its quality? The idea is to scrape the internet for thousands of different statistical graphs, utilize Amazon Mechanical Turk for annotation, and use deep learning to learn the relationship between a graph and its properties, such as type, aesthetic quality, labeling of axes, sizing, etc. Identifying relevant and quantifiable properties of statistical graphs is an important subtask, and the data we gather will allow us to determine whether there even is a consensus about what makes a good graph. And, of course, answer the most important question of all - are pie charts bad?

Keywords: visualizations, graphs, image recognition/analysis, crowdsourcing
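
One plausible modelling route is transfer learning: fine-tuning a pretrained image network on the annotated plot images. The sketch below assumes a hypothetical plots/ directory with one subfolder per quality label, and torchvision >= 0.13.

    # Minimal sketch: fine-tune a pretrained CNN to grade plot images.
    import torch
    from torchvision import datasets, models, transforms

    tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    data = datasets.ImageFolder("plots", transform=tf)  # e.g. plots/good/, plots/bad/
    loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

    model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained backbone
    model.fc = torch.nn.Linear(model.fc.in_features, len(data.classes))
    opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # train only the new head

    for images, labels in loader:  # one epoch over the annotated plots
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        opt.zero_grad(); loss.backward(); opt.step()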


Past Competitions and Awardees

