The DataScience@UL-FRI Project Competition is an annual competition whose main goal is to connect students, academics, and industry. Over the course of a semester, teams of students work on real-world data science problems, guided by co-supervisors from academia and industry. The best teams are invited to present their work at the competition finals. The competition allows companies to showcase their challenges to academics and gives students the opportunity to gain hands-on experience in a structured but fun way.
Who can apply?
Master’s and/or undergraduate students can apply. There are no restrictions on the education and prior experience of team members. However, the number of teams we can accept is limited, and priority will be given to teams that demonstrate the ability to perform well in the competition.
Why should we apply?
First and foremost, the projects are a great opportunity to gain hands-on experience working on real-world problems with professionals from industry and experts from academia.
Successful participation in the competition can also count towards your coursework (courses Računalništvo v praksi and Obštudijska dejavnost) and you might choose to extend the project work into a Master’s or Bachelor’s thesis.
And last but not least: there are prizes!
- 1st place: 3000€
- 2nd place: 2100€
- 3rd place: 1500€
- Best non-DS Master’s team: 1000€
- Best non-FRI team: 1000€
- Best undergrad team: 1000€
How can we apply?
This year’s topics can be found below.
Data science students
Submit your application before the 18th of February, 23:59, to email@example.com. It should include:
- names and IDs of team members (1-3 members),
- topic ranking (1 - best topic, 10 - worst topic).
Submit only one application per team.
Non-data science students
This procedure applies only if none of the team members are data science students. Submit your application before the 18th of February, 23:59, to firstname.lastname@example.org. It should include:
- names and IDs of team members (1-3 members),
- topic ranking (1 - best topic, 10 - worst topic),
- a short motivational letter (max one A4 page).
Submit only one application per team.
This year’s list of topics
Interpretable prediction of ad campaign complexity (Celtra)
Celtra offers its clients (think advertising giants like Nike, Adidas, CNN, etc.) software that they use to produce advertising campaigns. A value proposition of Celtra’s software is that marketing campaign production is more efficient with Celtra. To back up this claim, we measure campaign production effort as time spent in the software. Your task is to turn this measured time into a prediction. You will be working with two datasets. One is a log of sessionized user events, from which you will derive the target variable (time spent in production). The other is a tabular description of produced marketing campaigns, which will serve as your independent variables. These include campaign features, e.g. the advertising channel (Facebook, Google, etc.), the number of ad creatives in the campaign, etc. Your task will be to build a machine learning model that takes a campaign description as input and outputs the expected time spent in Celtra’s software to produce that campaign. There are three core practical challenges behind this project:
- Explainability – the regression model needs to be interpretable, since it will be used extensively when communicating with clients; we need to be able to say why a certain campaign is predicted to take more or less time to produce.
- Engineering – you will face real industry engineering challenges when working with user session data.
- Limited features – to be deployed into production, the model needs to be able to make predictions from only a small subset of all the campaign features. This way, the model can be used as part of sales and marketing (prospects can enter the data themselves).
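As a rough illustration of the explainability requirement, an ordinary least-squares model is the classic interpretable baseline: each coefficient directly answers "why is this campaign predicted to take more or less time?". The feature names and all numbers below are invented for the sketch, not Celtra's actual data.

```python
import numpy as np

# Hypothetical campaign features: [num_creatives, num_channels, has_video]
# and target: production time in hours. All values are made up for illustration.
X = np.array([
    [ 2, 1, 0],
    [ 5, 2, 1],
    [ 3, 1, 0],
    [10, 3, 1],
    [ 4, 2, 0],
    [ 8, 2, 1],
], dtype=float)
y = np.array([4.0, 12.0, 6.0, 25.0, 8.0, 19.0])

# Fit ordinary least squares with an intercept term.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

feature_names = ["intercept", "num_creatives", "num_channels", "has_video"]
for name, c in zip(feature_names, coef):
    print(f"{name}: {c:+.2f} hours")

def predict(x):
    """Predicted production time for a (partial) campaign description."""
    return coef[0] + coef[1:] @ np.asarray(x, dtype=float)
```

A linear model is only one way to satisfy the interpretability constraint; tree ensembles with post-hoc explanation methods are a common, more flexible alternative.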
Keywords: supervised learning, regression, explainable AI, advertising
Using computer vision for labelling ad creatives (Celtra)
Celtra offers its clients (think advertising giants like Nike, Adidas, CNN, etc.) software that they use to produce advertising campaigns. Your task will be to develop a multilabel classification model that takes an ad creative (image, animated creative, or video) as input and outputs the features of the ad creative as labels. The labels will include object detection labels (what is in the image, e.g. human actor, shoes, logo, hammer, plant), as well as more nuanced and harder-to-classify labels, such as the mood of an actor (sad, smiling, etc.), design features (dominant colors, types of color palettes, contrasts, etc.), the share of ad real estate covered by a label, etc. The value of the multilabel classifier is manifold. It builds data that can later be used for additional analytics (e.g. the share of ads per label), product improvement (offering search-by-label to clients), new product development, etc. There are three core practical challenges behind this project:
- Partially labelled data – only a minority of the ads in the dataset you will be given are labelled. This will be a semi-supervised learning task.
- High data variance – there will be great visual differences between the ad images. For one, the advertised product type will change based on the advertiser (it won’t always be Nike vs. Adidas shoes). Also, each advertiser will produce their own specific ad designs (e.g. logos, brand-specific color palettes, etc.).
- Input format & size – data will be in multiple formats (video/image) and sizes (300x600, 900x1200, …), which results in interesting engineering challenges.
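One standard semi-supervised recipe that fits the partially-labelled setting is pseudo-labelling: train on the labelled minority, label the confident unlabelled points, and retrain. The toy sketch below uses 2-D points and a nearest-centroid classifier as a stand-in for image embeddings and a real model; all data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for image embeddings: two clusters in 2-D feature space.
labeled_X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
labeled_y = np.array([0, 0, 1, 1])
unlabeled_X = np.vstack([
    rng.normal(0, 0.3, (20, 2)),   # unlabelled points near class 0
    rng.normal(5, 0.3, (20, 2)),   # unlabelled points near class 1
])

def centroids(X, y):
    return np.stack([X[y == k].mean(axis=0) for k in (0, 1)])

# Step 1: fit a simple nearest-centroid classifier on the labelled minority.
c = centroids(labeled_X, labeled_y)

# Step 2: pseudo-label the unlabelled points we are confident about
# (here, confidence = distance to the nearest centroid).
d = np.linalg.norm(unlabeled_X[:, None, :] - c[None, :, :], axis=2)
pseudo_y = d.argmin(axis=1)
confident = d.min(axis=1) < 1.0  # crude confidence threshold

# Step 3: retrain on labelled + confidently pseudo-labelled data.
X_all = np.vstack([labeled_X, unlabeled_X[confident]])
y_all = np.concatenate([labeled_y, pseudo_y[confident]])
c_refined = centroids(X_all, y_all)

def classify(x):
    return int(np.linalg.norm(c_refined - x, axis=1).argmin())
```

The same loop transfers to the real task by swapping the centroid classifier for a deep multilabel model and applying the confidence threshold per label.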
Keywords: computer vision, semi-supervised learning, multi-label classification, multimedia, advertising
Clustering Siemens customers and forecasting their purchasing trends (Siemens)
Siemens offers a wide variety of products to a wide variety of customers. The core of this project is a dataset that contains information about Siemens’ customers and their purchases of various Siemens products over the last year or so. The core dataset can be enriched by using external (exogenous) variables, such as customer’s industry, role, size, etc.
This project has two goals. The first one is to find a good way to cluster Siemens’ customers into meaningful groups based on their shopping habits and interests in various products. The second goal is to build forecasting models that are capable of predicting what kind of sales Siemens can expect in the future. With these models, we should be able to answer questions like:
- How many products will customer X buy in the next month?
- How many units of product Y will we sell in the next week?
- How many units from product group Z (meaningful product grouping will be part of the dataset) will customer X buy in the next two months?
Note that the two goals are connected, as a meaningful clustering from the first goal can be used to empower the solution to the second.
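The connection between the two goals can be sketched in a few lines: group customers by volume, then pool information within each group when forecasting. The grouping below is a deliberate stand-in for real clustering, and all purchase numbers are invented.

```python
import numpy as np

# Hypothetical monthly purchase counts (rows: customers, cols: months).
purchases = np.array([
    [10, 12, 11, 13],   # steady high-volume customers
    [ 9, 11, 10, 12],
    [ 1,  0,  2,  1],   # sporadic low-volume customers
    [ 0,  2,  1,  1],
], dtype=float)

# Goal 1 (toy version): split customers into high/low-volume groups by
# thresholding average monthly volume -- a stand-in for real clustering.
avg = purchases.mean(axis=1)
cluster = (avg > avg.mean()).astype(int)   # 1 = high volume, 0 = low volume

# Goal 2: forecast next month per customer by applying the cluster-level
# month-over-month trend to the customer's own recent level. Pooling the
# trend over the cluster is where the clustering empowers the forecast.
forecasts = np.empty(len(purchases))
for k in (0, 1):
    members = purchases[cluster == k]
    trend = members[:, -1].mean() / max(members[:, -2].mean(), 1e-9)
    forecasts[cluster == k] = purchases[cluster == k, -1] * trend
```

A real solution would use proper clustering and time-series models, but the structure (cluster first, then share statistical strength within clusters) stays the same.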
Keywords: clustering, customer behavior, time series, forecasting
A recommendation system for Siemens customers (Siemens)
Recommendation systems are currently one of the hottest topics in machine learning. Today a good recommendation system is almost a must-have for any company that offers a variety of services to its customers, be it video streaming platforms, online music providers, gaming platforms, or online shops.
The core of this project is a dataset that contains information about Siemens’ customers and their purchases of various Siemens products over the last year or so. The core dataset can be enriched by using external (exogenous) variables, such as customer’s industry, role, size, etc. The goal of this project is to develop a recommendation system that would empower Siemens’ process of cross-selling by offering their customers products and product groups they are interested in and thus more likely to buy.
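A minimal baseline for purchase-history recommendation is item co-occurrence: recommend products that are frequently bought by customers with similar baskets. The customers and product names below are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories: customer -> set of product IDs.
history = {
    "c1": {"plc", "hmi", "relay"},
    "c2": {"plc", "hmi"},
    "c3": {"plc", "relay"},
    "c4": {"hmi", "sensor"},
}

# Count how often each ordered pair of products is bought by the same customer.
co = defaultdict(int)
for items in history.values():
    for a, b in combinations(sorted(items), 2):
        co[(a, b)] += 1
        co[(b, a)] += 1

def recommend(customer, top_n=2):
    """Rank products the customer does not yet own by co-occurrence score."""
    owned = history[customer]
    scores = defaultdict(int)
    for item in owned:
        for (a, b), n in co.items():
            if a == item and b not in owned:
                scores[b] += n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Matrix factorization or neural recommenders would be natural next steps; a simple co-occurrence baseline like this is mainly useful as the yardstick they must beat.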
Keywords: recommendation systems, purchase pattern analysis, customer behavior, shopping habits
Multi-task learning in real-time bidding (Zemanta)
Virtual ad space is nowadays sold mostly through real-time auctions/bidding. These auctions happen in real time, while your browser is still loading the webpage. Companies that participate in these auctions, such as Zemanta, commonly use different deep learning models for a range of prediction tasks, for example estimating ad viewability and predicting clicks or post-click events (conversions). These prediction tasks are traditionally implemented as independent models; a promising alternative is to model them as auxiliary tasks of a single multi-task ML model. The intuitive advantage of the multi-task approach is knowledge sharing between predictions, which could lead to performance increases.
The goal of the project is to design, implement and evaluate multi-task learning models for the above-mentioned tasks and compare them to task-independent models. For this purpose, Zemanta will provide a large anonymized dataset resembling our production data and access to our framework for training TensorFlow models (such as FM and DeepFM) incrementally.
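The simplest multi-task architecture is the "shared bottom": one representation feeds several task-specific heads. The sketch below shows only the forward pass with random weights, in plain numpy rather than TensorFlow, to make the structure explicit; layer sizes and task names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Shared-bottom multi-task architecture: one representation is computed once
# and reused by every task head -- this reuse is where knowledge sharing
# between the prediction tasks comes from.
n_features, n_hidden = 16, 8
W_shared = rng.normal(0, 0.1, (n_features, n_hidden))
W_click  = rng.normal(0, 0.1, (n_hidden, 1))   # click-probability head
W_view   = rng.normal(0, 0.1, (n_hidden, 1))   # viewability head

def forward(x):
    h = relu(x @ W_shared)           # shared representation
    p_click = sigmoid(h @ W_click)   # task 1: click prediction
    p_view = sigmoid(h @ W_view)     # task 2: viewability prediction
    return p_click, p_view

x = rng.normal(0, 1, (4, n_features))  # a batch of 4 auction requests
p_click, p_view = forward(x)
```

In training, the per-task losses are summed (possibly with weights), so gradients from every task update the shared weights; comparing this against independent models is exactly the evaluation the project asks for.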
Keywords: multi-task learning, deep learning, incremental learning, knowledge sharing, transfer learning
Using predictive uncertainty to improve click prediction models (Zemanta)
Estimating the predictive uncertainty of deep models is a popular and important topic in machine learning research. Models that can both make predictions and quantify their uncertainty in those predictions are useful in many tasks, as they can handle issues such as dataset shift very well.
Zemanta uses deep learning models to predict click probabilities on ads at a large scale. The goal of this project is to research models that can estimate their predictive uncertainty (e.g. Monte Carlo Dropout or Deep Ensembles) and see how using them improves model performance in settings where data might be corrupted or missing. The implemented models should be fast enough to compute thousands of predictions per second and able to train online and incrementally.
Students will receive a large anonymized dataset resembling our production data and access to our framework for training TensorFlow models (such as FM and DeepFM) incrementally.
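The core idea of Monte Carlo Dropout is small enough to sketch: keep dropout active at prediction time and aggregate many stochastic forward passes into a mean prediction and a spread. The tiny "network" below uses untrained random weights purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(7)

# A tiny fixed "network": one hidden layer with weights we pretend were trained.
W1 = rng.normal(0, 0.5, (8, 16))
W2 = rng.normal(0, 0.5, (16, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mc_dropout_predict(x, n_samples=100, p_drop=0.2):
    """Monte Carlo Dropout: dropout stays ACTIVE at prediction time, and the
    spread of the stochastic predictions estimates predictive uncertainty."""
    preds = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)
        mask = rng.random(h.shape) >= p_drop       # random dropout mask
        h = h * mask / (1.0 - p_drop)              # inverted dropout scaling
        preds.append(sigmoid(h @ W2))
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)   # prediction, uncertainty

x = rng.normal(0, 1, (3, 8))                        # 3 hypothetical ad requests
mean, std = mc_dropout_predict(x)
```

Deep Ensembles follow the same aggregation pattern, but the samples come from independently trained networks instead of dropout masks; the project's speed requirement is what makes the choice between the two non-trivial.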
Keywords: incremental learning, uncertainty, classification, deep learning, big data
Supervised and unsupervised calibration of smart light sensor measurements (Garex)
One of Garex’s core projects is a system for smart lighting of the sustainable cities of the future. Given access to historical data for a group of geographically close sensors and the ground truth, can we calibrate future measurements to improve their accuracy? For example, given temperature measurements from several smart lights on the same street and gold-standard temperature measurements at that location, can we produce a model that corrects future sensor measurements so that they are closer to the ground truth when the ground truth is not available? And if ground truth is not available at all, can we at least detect outliers or improve measurements by smoothing over all sensors? In this project we will emphasize simplicity over complexity and robustness over incremental improvements in accuracy. The end goal is a compact solution in Python that performs only two tasks: (1) learning a calibration model for a group of sensors and (2) calibrating new observations given a learned model.
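In the spirit of the simplicity-first requirement, both questions above have very small baselines: an affine correction fitted against the gold standard for the supervised case, and median-based outlier detection across neighbouring sensors for the unsupervised case. All readings below are invented.

```python
import numpy as np

# Hypothetical data: raw temperature readings from one smart-light sensor
# alongside gold-standard reference measurements at the same location.
raw = np.array([14.2, 18.9, 21.4, 25.1, 29.8])
truth = np.array([15.0, 19.5, 22.0, 25.5, 30.0])

# Supervised case: learn a simple affine calibration, truth ≈ a * raw + b.
A = np.vstack([raw, np.ones_like(raw)]).T
(a, b), *_ = np.linalg.lstsq(A, truth, rcond=None)

def calibrate(x):
    return a * x + b

# Unsupervised case: no ground truth, so lean on nearby sensors instead --
# flag a reading as an outlier if it deviates far from the street-wide median.
street = np.array([20.1, 19.8, 20.3, 27.5, 20.0])  # one sensor misbehaving
outliers = np.abs(street - np.median(street)) > 3.0
```

The 3.0-degree threshold is a placeholder; in practice it would be derived from the observed spread of the sensor group.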
Keywords: sensor calibration, supervised and unsupervised learning, estimating measurement error
Automated detection of similarity between business metrics (Databox)
The number of software tools required to run businesses is growing; as a result, companies generate more and more data in more and more places. Databox’s mission is to make it as easy as possible for everyone in a company to monitor, analyze, and improve performance in one spot, on any device, through the Databox platform.
Every day the Databox platform processes thousands of business analytics metrics for tens of thousands of clients. An important question is: Are there relationships between these metrics and do they depend on client meta-data (industry type, employee count, etc.)? The goal of this project is to develop an algorithm that will automatically find strong relationships between time series of metrics. We will start with simple relationships like correlation and explore more complicated ones like dynamic time warping. Additional challenges are the inclusion of meta-data and dealing with the granularity of the data.
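Dynamic time warping, mentioned above as the step beyond correlation, is worth a small sketch: unlike correlation, it aligns the two series before comparing them, so a metric that lags another by a few periods still registers as similar. The two example metrics are invented.

```python
import math

def dtw(a, b):
    """Dynamic time warping distance between two series, O(len(a)*len(b))."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

# Two metrics with the same shape but shifted in time: plain correlation
# would penalize the lag, while DTW aligns the peaks before comparing.
visits = [0, 1, 5, 1, 0, 0]
signups = [0, 0, 1, 5, 1, 0]
```

Here `dtw(visits, signups)` is 0 despite the one-period lag. At Databox scale the quadratic cost matters, so windowed or lower-bounded DTW variants would be part of the engineering challenge.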
Keywords: business metrics, big data, metric relationships, correlation
Insurance customer segmentation (In516ht with Zavarovalnica Triglav)
Understanding one’s customers is key for any business, and a common technique companies apply to get additional information about their customers is to segment them into meaningful groups. The goal of this project is to use both standard/baseline and more modern, state-of-the-art unsupervised learning techniques to segment the customers of an insurance company based on their properties. The project includes the preliminary phase of deciding which data to use in the segmentation. The segmentation(s) produced have to be visualized and summarized for business experts, to assist them in producing a meaningful practical interpretation of the segments.
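The classic baseline segmentation technique is k-means, which the project would compare modern methods against. A from-scratch sketch on synthetic two-feature customers (the features and values are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical customer features: [annual premium (k€), claims per year].
X = np.vstack([
    rng.normal([1.0, 0.2], 0.1, (20, 2)),   # low-premium, low-claims segment
    rng.normal([5.0, 2.0], 0.3, (20, 2)),   # high-premium, high-claims segment
])

def kmeans(X, k=2, n_iter=20):
    # Initialize centers at random data points, then alternate assignment
    # and center updates (Lloyd's algorithm).
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.stack([
            X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(X)
```

The resulting `centers` are what gets summarized for business experts ("segment 1: low premium, few claims"); the visualization and interpretation work the project description asks for starts from exactly such summaries.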
Keywords: unsupervised learning, customer segmentation, business insights, insurance
Automated customer email classification (In516ht with Zavarovalnica Triglav)
Responding to customer emails is an important part of any service. Often emails must first be assigned to a category and sometimes even subcategories before we can determine who should respond and how. Given a data set of thousands of categorized emails, can you use modern NLP approaches and machine learning to categorize these emails automatically? Which approach gives the best results, how good are the automated predictions, and are some categories more difficult to identify than others? Answering these and other similar questions is the main goal of this project.
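A natural first baseline for email categorization, before any modern NLP, is a bag-of-words naive Bayes classifier. The sketch below trains one from scratch on four invented emails and two invented categories:

```python
import math
from collections import Counter, defaultdict

# A toy stand-in for the categorized email dataset (texts are invented).
emails = [
    ("claim for my damaged car bumper", "claims"),
    ("how do i file a claim after the storm", "claims"),
    ("question about my policy renewal price", "billing"),
    ("invoice and renewal payment question", "billing"),
]

# Train a multinomial naive Bayes classifier on word counts.
word_counts = defaultdict(Counter)
cat_counts = Counter()
vocab = set()
for text, cat in emails:
    words = text.split()
    word_counts[cat].update(words)
    cat_counts[cat] += 1
    vocab.update(words)

def classify(text):
    best_cat, best_score = None, -math.inf
    for cat in cat_counts:
        # log P(cat) + sum of log P(word | cat), with Laplace smoothing.
        score = math.log(cat_counts[cat] / len(emails))
        total = sum(word_counts[cat].values())
        for w in text.split():
            score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat
```

Transformer-based classifiers would likely answer the "which approach gives the best results" question, but a baseline like this makes the per-category difficulty comparison meaningful.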
Keywords: natural language processing, email categorization, customer support, supervised learning
Stock optimization based on sales forecasting (CREApro)
CREApro is a data science company that offers its expertise to external collaborators from various industries, with the main goal of modernizing their business processes to allow for data-empowered decision making. A common and important task in data-driven decision making is sales forecasting. Sales forecasting is often an already established process in most retail/production companies, as it allows them to plan their future activities proactively and in advance, and as such it serves as a basis for many important business decisions.
One of the biggest and most important challenges where sales forecasting comes into play is stock optimization. Here, data scientists try to find an optimal balance between stock value and out-of-stock situations. Both holding too many goods in stock and running out of stock can be very costly for companies, so in an ideal world a company would keep as few goods as possible in stock, but just enough to never run out. The goal of this project is to use the provided sales data to develop an accurate model for forecasting sales and to use these forecasts, in combination with stock data, to optimize stock reordering.
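The forecast-then-reorder pipeline can be sketched with a deliberately basic moving-average forecast and a textbook safety-stock rule; all sales figures and parameters below are invented, and a real project would compare much stronger forecasting models.

```python
import numpy as np

# Hypothetical weekly unit sales of one product.
sales = np.array([50, 55, 48, 60, 52, 58, 54, 57], dtype=float)

# Forecast next week's demand with a simple 4-week moving average.
forecast = sales[-4:].mean()

# Reorder rule: cover forecast demand over the supplier lead time, plus
# safety stock sized from demand variability, then order the shortfall.
lead_time_weeks = 2
safety_factor = 1.65               # ~95% service level under a normality assumption
sigma = sales.std(ddof=1)
reorder_point = (forecast * lead_time_weeks
                 + safety_factor * sigma * np.sqrt(lead_time_weeks))

current_stock = 80.0
order_qty = max(0.0, reorder_point - current_stock)
```

The tension the project describes lives in the two terms of `reorder_point`: the forecast term protects against expected demand, the safety-stock term against forecast error, and both inflate stock value if set too generously.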
Keywords: sales forecasting, time series, optimization algorithms, purchasing, stock optimization