Projects

Michalis Xydas

DeepSqueeze: Deep Semantic Compression for Tabular Data

DeepSqueeze is a deep learning technique for table compression. It exploits correlations between columns that traditional compression tools, such as gzip and Parquet, do not take into account. The authors employ an autoencoder architecture to capture these correlations and project each row into a lower-dimensional space. We reverse-engineer this implementation and achieve similar results, confirming its reproducibility on the two datasets the authors used. We then make additions and improvements, such as experimenting on a new real-world dataset, improving the materialization step, and tuning the architecture's hyperparameters, in an attempt to improve the compression ratio, our main evaluation metric.

Keywords: table compression, dimensionality reduction, autoencoders, reproducibility
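
As a rough illustration of the core idea, here is a minimal sketch of a row autoencoder in PyTorch; the layer sizes, code dimension, and training loop are illustrative rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

# Minimal autoencoder in the spirit of DeepSqueeze: each table row is
# compressed to a low-dimensional code; hyperparameters are illustrative.
class RowAutoencoder(nn.Module):
    def __init__(self, n_cols: int, code_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_cols, 64), nn.ReLU(),
            nn.Linear(64, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 64), nn.ReLU(),
            nn.Linear(64, n_cols),
        )

    def forward(self, rows):
        code = self.encoder(rows)         # compressed representation
        return self.decoder(code), code  # reconstruction + code

model = RowAutoencoder(n_cols=10)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
rows = torch.rand(1000, 10)  # stand-in for a min-max scaled table

for _ in range(100):
    recon, _ = model(rows)
    loss = nn.functional.mse_loss(recon, rows)  # reconstruction error
    optim.zero_grad()
    loss.backward()
    optim.step()

# To compress, one stores the codes plus whatever residuals are needed to
# reconstruct each row within the error bound (the materialization step).
```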

Christina Borovilou

Learning Multi-dimensional Indexes

Even today, systems do not take full advantage of their capabilities: indexing methods are universal, independent of the nature of the database. Flood is a multi-dimensional index that learns both the characteristics of the database and the users' needs over it. By adapting the indexing procedure to these parameters, we can exploit these insights and make queries fly!
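
As a rough sketch of what such a learned layout can look like (illustrative, not Flood's actual code): grid one dimension, sort within each cell by the other, and treat the number of cells as the knob tuned to the data and workload.

```python
import bisect
import random

# Toy Flood-style layout for 2-d points in [0, 1): grid the x dimension,
# sort each cell by y. The cell count is what a learned index would tune
# to the query workload.
class GridIndex:
    def __init__(self, points, cols):
        self.cols = cols
        self.cells = [[] for _ in range(cols)]
        for x, y in points:
            self.cells[min(int(x * cols), cols - 1)].append((x, y))
        for cell in self.cells:
            cell.sort(key=lambda p: p[1])  # sort within cell by y

    def range_query(self, x0, x1, y0, y1):
        out = []
        for c in range(int(x0 * self.cols),
                       min(int(x1 * self.cols) + 1, self.cols)):
            ys = [p[1] for p in self.cells[c]]
            lo, hi = bisect.bisect_left(ys, y0), bisect.bisect_right(ys, y1)
            out += [p for p in self.cells[c][lo:hi] if x0 <= p[0] <= x1]
        return out

pts = [(random.random(), random.random()) for _ in range(10000)]
idx = GridIndex(pts, cols=32)  # cols would be learned from the workload
print(len(idx.range_query(0.1, 0.2, 0.3, 0.4)))
```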

Alexandros Kalimeris

Optimal Histograms with Outliers

Histograms are a traditional and well-studied way of summarizing data. They have been used extensively in many applications that require frequency estimates, such as query plan cost estimation. In my project, I will review the paper "Optimal Histograms with Outliers" by Rachel Behar and Sara Cohen, which examines the construction of v-optimal histograms while allowing for the deletion of possible outliers in the dataset. After thoroughly studying the paper, I will recreate the most important experiments presented in it, as well as perform the same experiments on new datasets. All the algorithms described in the paper and used in the experiments have been implemented in a compact Python library.
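
For reference, the baseline dynamic program that v-optimal histogram construction rests on is sketched below; the paper's algorithms extend it by also choosing a bounded number of values to delete as outliers. This is a minimal sketch, not the library's code.

```python
import math

def sse(prefix, prefix_sq, i, j):
    """Sum of squared errors of values i..j (1-indexed prefix sums)."""
    n = j - i + 1
    s = prefix[j] - prefix[i - 1]
    sq = prefix_sq[j] - prefix_sq[i - 1]
    return sq - s * s / n

def v_optimal(values, k):
    """Classic DP: minimum total bucket error with k buckets, no outliers."""
    n = len(values)
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, v in enumerate(values, 1):
        prefix[i] = prefix[i - 1] + v
        prefix_sq[i] = prefix_sq[i - 1] + v * v
    # dp[b][i] = min error of covering the first i values with b buckets
    dp = [[math.inf] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for b in range(1, k + 1):
        for i in range(b, n + 1):
            for j in range(b - 1, i):
                dp[b][i] = min(dp[b][i],
                               dp[b - 1][j] + sse(prefix, prefix_sq, j + 1, i))
    return dp[k][n]

print(v_optimal([1, 1, 2, 9, 9, 10, 50], k=3))
```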

Katerina Xagorari

DeepDB: Learn from Data, not from Queries!

DeepDB is a powerful, purely data-driven database extension that learns important characteristics of the data stored inside the database. Once it has learned the characteristics it deems important, it builds tree-like structures, Relational Sum-Product Networks (RSPNs), that can handle many operations that would otherwise be done in the actual database, from approximate and exact query answering to cardinality estimation and other machine learning tasks. To take this work a step further, I retain their offline approach and create indexes based on the data themselves. Indexes, when used correctly, are an extremely powerful tool that can accelerate data retrieval immensely. To build these indexes, I not only utilised the actual DeepDB tool but also tried to locate inherent column correlations among the database tables. This work was inspired by the Hermit indexing algorithm.
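
As a small illustration of the correlation-hunting step (not DeepDB's or Hermit's code), the sketch below scans a toy table for strongly correlated numeric column pairs, the kind of relationship a Hermit-style index exploits by accessing one column through another.

```python
import pandas as pd

# Illustrative toy table: receiptdate tracks shipdate almost exactly,
# while price is unrelated to either.
df = pd.DataFrame({"shipdate": range(1000),
                   "receiptdate": [d + 3 for d in range(1000)],
                   "price": [(-1) ** d for d in range(1000)]})

corr = df.corr().abs()
pairs = [(a, b) for a in corr.columns for b in corr.columns
         if a < b and corr.loc[a, b] > 0.9]
print(pairs)  # candidate (host, target) columns for a correlation index
```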

Maria-Sofia Georgaki

Qd-tree: Learning Data Layouts for Big Data Analytics

Due to the huge volume of data produced by modern applications and the need to process and analyze it, partitioning has become necessary. The query-data routing tree (qd-tree) is a new framework that partitions data while taking into account the number of blocks skipped, a metric that directly relates to I/O cost and query efficiency. Given a specific set of queries, a qd-tree is constructed so as to optimize this metric. The qd-tree is then used both to partition the data and to route queries so they take advantage of the resulting partitioning. One of the proposed ways to construct such trees is deep reinforcement learning, which gradually produces better trees while using deep networks to cope with high dimensionality.
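
To make the routing idea concrete, here is a toy sketch (not the paper's implementation): each internal node holds a cut predicate, each leaf corresponds to one storage block, and the reinforcement learning agent's job is to choose cuts that let the workload skip as many blocks as possible.

```python
# Toy qd-tree routing: rows satisfying a node's predicate go left,
# the rest go right; each leaf becomes one storage block.
class Node:
    def __init__(self, pred=None, left=None, right=None, block_id=None):
        self.pred, self.left, self.right, self.block_id = pred, left, right, block_id

def route(node, row):
    if node.block_id is not None:  # leaf: this row's block
        return node.block_id
    return route(node.left if node.pred(row) else node.right, row)

# Example cuts on an (x, y) table; the cut choices are what the RL agent
# would optimize against the query workload.
tree = Node(pred=lambda r: r["x"] < 50,
            left=Node(block_id=0),
            right=Node(pred=lambda r: r["y"] < 10,
                       left=Node(block_id=1), right=Node(block_id=2)))
print(route(tree, {"x": 70, "y": 3}))  # -> 1
```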

Dimitris Reppas

SQL Query Completion for Data Exploration

Nowadays it is increasingly common to have several databases available for one application domain, some with hundreds of tables and/or attributes. Reaching the data of interest in such cases by writing SQL queries can be challenging, especially for a new data scientist. It is therefore clear that exploratory analysis of the data is crucial and should always precede posing queries to databases that store large-scale data. Since Jupyter Notebook is a web-based environment that enables interactive data analysis and visualization, and influenced by [1] and [2], we developed a notebook that brings together effective tools for data exploration. The first part of the project presents the visualization tools used in the notebook for exploratory data analysis (EDA). With these tools, one can visually analyze the numerical and categorical attributes of a query result in order to understand the data and the relationships between the attributes. The second part of the project concerns a library called DataComPy, which can be used to compare two queries and find similarities and overlaps; this tool is also included in the notebook to further improve the exploratory procedure. Finally, the main focus of the project is the description and implementation of the notion of 'SQL query completion', presented in detail in [1]. Based on this idea, the main contribution of this project is first to implement the notion and second to release an improved approach; as will be shown, our approach improves on the original 'SQL query completion', and this tool rounds out the notebook as well. The combination of the aforementioned tools creates a powerful notebook for data exploration that helps a user in query formulation. It is worth mentioning that the databases used in the experiments evaluating the notebook are nba_salary.sqlite and database.sqlite.
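
For illustration, here is a minimal sketch of the query-comparison step with DataComPy; the table and column names below are hypothetical, while the database file is the nba_salary.sqlite used in our experiments.

```python
import sqlite3
import pandas as pd
import datacompy

# Run two candidate queries and compare their result sets.
# (Table and column names are hypothetical stand-ins.)
conn = sqlite3.connect("nba_salary.sqlite")
df1 = pd.read_sql_query(
    "SELECT player, salary FROM salaries WHERE salary > 5e6", conn)
df2 = pd.read_sql_query(
    "SELECT player, salary FROM salaries WHERE salary > 1e7", conn)

compare = datacompy.Compare(df1, df2, join_columns="player")
print(compare.report())  # rows unique to each result, overlaps, mismatches
```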

Anna Mitsopoulou

Duoquest: A Dual-Specification System for Expressive SQL Queries

Querying a relational database is difficult for users without knowledge of the SQL language. Many systems nowadays try to overcome this difficulty and provide alternatives for such users. One of them is Duoquest, a system that consumes both a natural language query and an example table to produce candidate SQL translations, achieving a significant improvement over a state-of-the-art natural language interface. In this work we take a close look at the reasons for this improvement and try to achieve even better results without additional user effort.
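
The dual-specification idea can be illustrated with a toy check (not Duoquest's actual implementation): a candidate translation is kept only if its result set contains the tuples of the user-supplied example table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp(name TEXT, dept TEXT, salary INT)")
conn.executemany("INSERT INTO emp VALUES (?,?,?)",
                 [("ann", "hr", 50), ("bob", "it", 70), ("cat", "it", 90)])

example_table = {("bob",), ("cat",)}  # tuples the user expects to see

def consistent(sql: str) -> bool:
    rows = set(conn.execute(sql).fetchall())
    return example_table <= rows  # every example tuple must appear

candidates = ["SELECT name FROM emp WHERE dept = 'it'",
              "SELECT name FROM emp WHERE salary > 80"]
print([c for c in candidates if consistent(c)])  # keeps the first only
```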

Stelios Kotzastratis

Dynamic Query Refinement for Interactive Data Exploration

Efficient data querying and exploration are often hindered by the user's poor awareness of the dataset or by overly strict queries. Instead, a query can be dynamically refined when needed to return an adequate number of results based on the user's needs. An overflow of returned data is avoided by carefully restricting the query to the best results, while a scarce result set can be enriched by relaxing the constraints expressed in the query. Following the work of Kalinin et al. on Searchlight, a system that implements such refinement in the context of constraint programming queries, we further explore this idea by replacing their underlying SciDB system with PostgreSQL, offering an open-source implementation of the algorithm described in the original paper.
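
As a toy illustration of the relax-and-restrict loop over PostgreSQL (not the paper's constraint-programming algorithm), the sketch below widens a numeric predicate until enough rows are returned and caps overflow with a LIMIT; the connection parameters, table, and column names are hypothetical.

```python
import psycopg2

conn = psycopg2.connect(dbname="exploration")  # hypothetical database

def refine(cur, max_price, target=10, relax_step=1.2, attempts=5):
    rows = []
    for _ in range(attempts):
        cur.execute("SELECT * FROM listings WHERE price <= %s "
                    "ORDER BY price LIMIT %s", (max_price, target))
        rows = cur.fetchall()
        if len(rows) >= target:   # enough results: restricted to the best
            return rows
        max_price *= relax_step   # too few results: relax the constraint
    return rows

with conn.cursor() as cur:
    print(len(refine(cur, max_price=100.0)))
```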

Antonis Papadakis

Facilitating SQL Query Composition and Analysis

One of the more challenging tasks in the broader technology community is communicating efficiently with a system, and in particular with a database. Human-database communication is very frequently achieved through SQL queries. Even though SQL queries can be rather simple to formulate and easy to use for those familiar with computer science, inexperienced users may require several cycles of tuning and execution to reach the desired output. In this work, based on the paper by Zainab Zolaktaf et al., we examine a subset of methods that can accelerate and improve this interaction by providing insights about SQL queries prior to execution. We reach this goal by predicting a range of query properties, such as query answer size, using machine learning techniques and without relying on database statistics. Preliminary results from experiments on well-known public query workloads are encouraging and suggest that data-driven methods can become an added tool for easing database-human interaction by facilitating query composition and analysis.
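
A minimal sketch of the data-driven idea: featurize a query's text and fit a regressor on observed answer sizes. The features and toy workload below are illustrative, not the paper's exact feature set.

```python
from sklearn.ensemble import RandomForestRegressor

def featurize(sql: str):
    # Simple syntactic features; no database statistics are consulted.
    s = sql.lower()
    return [len(s), s.count("join"), s.count("where"),
            s.count("and"), s.count("or"), s.count("group by")]

train_sql = ["select * from t",
             "select * from t where a > 1",
             "select * from t join u on t.id = u.id where a > 1 and b < 2"]
train_sizes = [1000, 400, 50]  # answer sizes observed in the workload log

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit([featurize(q) for q in train_sql], train_sizes)
print(model.predict([featurize("select * from t where a > 1 or b < 2")]))
```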

Giannis Misios

Recurrent Neural Networks for Dynamic User Intent Prediction in Human-Database Interaction

In this paper we propose a method to capture the user's intent during a user-database interaction. User intent modeling aims to improve the interaction between the user and the database by improving the quality of search. Since user intent has been shown to change dynamically throughout the interaction, we propose the use of RNNs, which can exhibit temporal dynamic behavior. We propose an SQL-specific embedding vector, as well as the use of active learning in addition to full and incremental training. Active learning is a training method that seeks the sweet spot between the amount of training data (and time) and prediction accuracy. Experiments are conducted on a dataset constructed from query logs of the SDSS database (https://www.sdss.org/).
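
A minimal sketch of the model shape we have in mind: an LSTM consumes the SQL-specific embeddings of the queries issued so far in a session and predicts the next intent class. Dimensions and the number of intent classes are illustrative.

```python
import torch
import torch.nn as nn

class IntentRNN(nn.Module):
    def __init__(self, emb_dim=64, hidden=128, n_intents=10):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_intents)

    def forward(self, query_embs):    # (batch, session_len, emb_dim)
        out, _ = self.lstm(query_embs)
        return self.head(out[:, -1])  # intent after the latest query

model = IntentRNN()
session = torch.randn(1, 5, 64)  # stand-in for 5 SQL-specific embeddings
print(model(session).softmax(-1))
```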

Markos Iliakis

Evidence-based Factual Error Correction with Fusion in Decoder

This work extends the corrector of an evidence-based factual error correction system (Thorne et al., 2021) by adding the Fusion-in-Decoder mechanism (Izacard et al., 2021) to its T5 neural network, thus achieving efficient aggregation of evidence across multiple passages and combination of the inputs. The performance of the two systems is compared using the SARI score.
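
To make the mechanism concrete, here is a minimal, untrained sketch of Fusion-in-Decoder wiring on a vanilla t5-small from Hugging Face transformers: each (claim, evidence) pair is encoded separately, and the encoder states are concatenated so the decoder attends over all passages jointly. The prompt format is an assumption, and an off-the-shelf t5-small will not produce a meaningful correction without fine-tuning.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

claim = "correct: The Eiffel Tower is in Berlin."
passages = ["The Eiffel Tower is located in Paris.",
            "It was completed in 1889."]

# Encode each (claim, passage) pair independently...
enc = tok([f"{claim} evidence: {p}" for p in passages],
          return_tensors="pt", padding=True)
hidden = model.encoder(**enc).last_hidden_state  # (n_passages, len, d)

# ...then fuse: concatenate all encoder states into one long sequence.
fused = BaseModelOutput(
    last_hidden_state=hidden.reshape(1, -1, hidden.size(-1)))
mask = enc.attention_mask.reshape(1, -1)
out = model.generate(encoder_outputs=fused, attention_mask=mask,
                     max_length=32)
print(tok.decode(out[0], skip_special_tokens=True))
```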

Giorgos Katsogiannis

Creating Locally Aligned Embeddings for Robust Schema Linking in Text-to-SQL Translation

Database systems hold vast amounts of information that have become necessary for many human activities, from biomedical research to business planning and organization. This data, however, remains inaccessible to users without technical training. For this reason, great efforts are being made to create natural language interfaces for database systems, enabling all users to access and use this valuable data. One of the main challenges of the text-to-SQL problem is schema linking: the process of identifying the schema entities to which the user's utterance refers. In this work we build on a previous approach for creating table embeddings to present a novel method for creating locally aligned embeddings for schema linking. We automatically create a dataset for training and testing our method, based on the WikiSQL dataset, and evaluate different approaches using locally aligned embeddings.
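
The linking mechanism itself can be sketched in a few lines: each question token is matched to the schema column whose embedding is most similar. The hand-set vectors below are stand-ins for the learned, locally aligned embeddings.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: "oldest" is placed near the "age" column on purpose.
emb = {
    "oldest":  np.array([0.9, 0.1, 0.1]),
    "name":    np.array([0.1, 0.9, 0.1]),
    "age":     np.array([1.0, 0.2, 0.0]),
    "country": np.array([0.0, 0.1, 1.0]),
}

question = ["oldest", "name"]
columns = ["age", "name", "country"]
for token in question:
    best = max(columns, key=lambda c: cos(emb[token], emb[c]))
    print(token, "->", best)  # links "oldest" to age, "name" to name
```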

Valerios Stais

Manually Detecting Errors for Data Cleaning Using Adaptive Crowdsourcing Strategies

VChecker is a tool made to optimize the crowdsourcing process when one has sets of questions that involve different data values. In this review I reproduce the authors' code and repeat their experiments to validate their results. Additionally, I introduce a simple method for simulating crowdsourcing in silico for research purposes, and use VChecker's results to pinpoint the hyperparameter values that most closely simulate a real-world environment.
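
A minimal sketch of the in-silico simulation, under the assumption that each worker answers independently with a fixed accuracy; that accuracy is the hyperparameter tuned against VChecker's real-world results.

```python
import random

# Each simulated worker answers a yes/no question correctly with
# probability `accuracy`; a majority vote aggregates the answers.
def simulate(truth: bool, n_workers=5, accuracy=0.8,
             rng=random.Random(0)):
    votes = [truth if rng.random() < accuracy else not truth
             for _ in range(n_workers)]
    return sum(votes) > n_workers / 2  # majority decision

trials = [simulate(truth=True) for _ in range(1000)]
print(sum(trials) / len(trials))  # fraction of correct majority votes
```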

Alexandros Zerntev

A Learning Based Approach to Predict Shortest-Path Distances

Computing shortest-path distances on road networks is a challenging task, especially in a Big Data era with millions of nodes per area. Traditional methods traverse the graph and compute the needed shortest distances, but their time complexity makes them too slow for live applications that require instant answers. Another approach is to store the distances between all pairs of nodes, but then the problem is space complexity: for large graphs this may exceed terabytes of storage. To solve this problem, the authors developed an approach that combines small space requirements with high accuracy by predicting the distance between two nodes with a multilayer perceptron (MLP). To do so, they first map each node to an embedding vector and then learn a distance function that predicts the distance between two nodes with high accuracy.
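
A minimal sketch of the model in PyTorch, assuming node embeddings learned jointly with the network; the embedding size, layer widths, and stand-in distances are illustrative.

```python
import torch
import torch.nn as nn

# An MLP takes the embeddings of a source and a target node and
# predicts their shortest-path distance.
class DistanceMLP(nn.Module):
    def __init__(self, emb_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, 1),
        )

    def forward(self, src_emb, dst_emb):
        return self.net(torch.cat([src_emb, dst_emb], dim=-1)).squeeze(-1)

n_nodes, emb_dim = 1000, 16
embeddings = nn.Embedding(n_nodes, emb_dim)  # trained jointly with the MLP
model = DistanceMLP(emb_dim)
src = torch.randint(0, n_nodes, (32,))
dst = torch.randint(0, n_nodes, (32,))
true_dist = torch.rand(32) * 100  # stand-in for precomputed distances
pred = model(embeddings(src), embeddings(dst))
print(nn.functional.mse_loss(pred, true_dist).item())
```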

Aliki Giovitsa

Dimitris Vogiatzis

Natasha Papathoma

Restaurants in Europe: A Visualization Analysis

Abstract

A visualization analysis exploring restaurant details. By comparing common features of and differences between restaurants in Europe, we aim to examine whether there are specific features that make a restaurant successful, as well as what it takes for a restaurant to be appreciated by visitors.

Details

Travel, Tourism & Hospitality is one of the most profitable industries globally, generating almost $1.4 trillion annually (2019). Restaurants and food services in general are a big part of this sector, and the consumer food service market in Western Europe alone was valued at 427 billion euros (2016).

This project aims to explore the common and differing features of restaurants and to identify whether any of them make a restaurant successful and appreciated by its visitors. The target audience of this analysis is mainly business owners and researchers who would like to implement potentially high-value features in their own business. Moreover, ordinary travelers might also find it useful for identifying hidden gems that have one or more of the high-value features.

Dataset

The dataset used for this project is hosted on Kaggle and was extracted from TripAdvisor, the most popular travel website globally.

TripAdvisor European restaurants

Acknowledgements

Data was retrieved from the publicly available website https://tripadvisor.com/. All restaurants from the main European countries were scraped in early May 2021.


Antonis Papadakis

Athanasios Polydoros

Christina Borovilou

Visualizing World Happiness

Abstract

In this demonstration we gather, join, compare, and visualize important social parameters that provide insight into global inequalities in various fields. We then investigate possible connections between global happiness and the aforementioned factors. To achieve this, we use the World Happiness Report (augmented by other data sources) to approach happiness as a set of different data features, grouping them using Maslow's hierarchy of needs, a motivational theory in psychology comprising a five-tier model of human needs (physiological, safety, love and belonging, esteem, and self-actualization).

Datasets

Due to the nature of the project, we combined datasets from different sources to cover the spectrum of all human needs:

Kaggle - World Happiness Report Dataset | Women in Parliament percentages (Gapminder) | Food and Agriculture Organization of the United Nations | International Labour Organization - Unemployment | World Bank - Income of richest 10% | United Nations Office on Drugs and Crime - International Homicide Statistics Database | ActionAid against hunger