DeepSqueeze: Deep Semantic Compression for Tabular Data
DeepSqueeze is a deep-learning technique for table compression. It exploits correlations between columns that traditional compression tools, such as gzip, and storage formats, such as Parquet, do not take into account. The authors employ an autoencoder architecture to capture these correlations and project each row into a lower-dimensional space. We reverse-engineer this implementation, achieving similar results and confirming its reproducibility on the two datasets the authors used. We then make additions and improvements, such as experimenting on a new real-world dataset, improving the materialization step, and tuning the architecture hyperparameters, in an attempt to reproduce and improve the compression ratio, our main evaluation metric.
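As a minimal illustration of the idea (not the paper's actual nonlinear autoencoder), the sketch below uses a linear projection obtained via SVD, the optimal linear autoencoder, to compress the rows of a synthetic table with a redundant column; all data and dimensions are invented for illustration.

```python
import numpy as np

# Minimal sketch of semantic compression: project each row to fewer
# dimensions, store the codes plus the decoder, reconstruct on demand.
rng = np.random.default_rng(0)

# Synthetic table with correlated columns: col2 = 2*col0 + col1.
base = rng.normal(size=(1000, 2))
table = np.column_stack([base[:, 0], base[:, 1], 2 * base[:, 0] + base[:, 1]])

# "Encoder": project each centered row onto the top-k right singular vectors.
k = 2
mean = table.mean(axis=0)
_, _, vt = np.linalg.svd(table - mean, full_matrices=False)
codes = (table - mean) @ vt[:k].T          # n x k compressed representation

# "Decoder": reconstruct rows (a lossless mode would also store residuals).
recon = codes @ vt[:k] + mean
max_err = np.abs(recon - table).max()
print(f"stored {codes.size} floats instead of {table.size}; max error {max_err:.2e}")
```

Because the third column is a linear combination of the first two, the two-dimensional codes reconstruct the table essentially exactly, which is the correlation-exploiting effect the paper's autoencoder captures for nonlinear dependencies as well.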
Keywords: table compression, dimensionality reduction, autoencoders, reproducibility
Learning Multi-dimensional Indexes
Even today, systems do not take full advantage of their capabilities: indexing methods are applied universally, independently of the nature of the database.
By building Flood, a multi-dimensional index that recognizes the characteristics of the database as well as the users' needs over it, we can exploit these insights, adapting the indexing procedure to these parameters and making queries fly!
Optimal Histograms with Outliers
Histograms are a traditional and well-studied way of summarizing data. They have been used extensively in many applications that require frequency estimates, such as query-plan cost estimation. In my project, I review the paper "Optimal Histograms with Outliers" by Rachel Behar and Sara Cohen, which examines the construction of v-optimal histograms while allowing for the deletion of possible outliers in the dataset. After thoroughly studying the paper, I recreate its most important experiments and also perform the same experiments on new datasets. All the algorithms described in the paper and used in the experiments have been implemented in a compact Python library.
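For context, the following is a minimal dynamic-programming sketch of a plain v-optimal histogram, without the outlier-deletion extension the paper studies; the frequency vector and bucket count are illustrative.

```python
import numpy as np

def sse(prefix, prefix_sq, i, j):
    """Squared error of one bucket covering values[i:j], via prefix sums."""
    n = j - i
    s = prefix[j] - prefix[i]
    return (prefix_sq[j] - prefix_sq[i]) - s * s / n

def v_optimal(values, B):
    """Minimum total within-bucket squared error over all B-bucket splits."""
    vals = np.asarray(values, dtype=float)
    n = len(vals)
    prefix = np.concatenate([[0.0], np.cumsum(vals)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(vals ** 2)])
    dp = np.full((n + 1, B + 1), np.inf)   # dp[j][b]: best error for vals[:j], b buckets
    dp[0, 0] = 0.0
    for j in range(1, n + 1):
        for b in range(1, min(B, j) + 1):
            for i in range(b - 1, j):      # last bucket covers vals[i:j]
                cand = dp[i, b - 1] + sse(prefix, prefix_sq, i, j)
                if cand < dp[j, b]:
                    dp[j, b] = cand
    return dp[n, B]

freqs = [1, 1, 1, 9, 9, 9]                 # two flat regions
print(v_optimal(freqs, 2))                 # one bucket per region -> 0.0
```

The outlier-aware variant additionally decides which points to drop before bucketing, which enlarges the DP state.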
DeepDB: Learn from Data, not from Queries!
DeepDB is a powerful, purely data-driven database extension that learns important characteristics of the data stored inside the database. Once it has learned the characteristics it deems important, it builds tree-like structures, RSPNs, that can handle operations that would otherwise be done in the actual database, from approximate and exact query answering to cardinality estimation and other machine learning tasks. To take this work a step further, I decided to retain its offline adaptation and create indexes based on the data themselves. Indexes, when used correctly, can accelerate data retrieval in a database immensely. To build these indexes, I not only utilised the actual DeepDB tool but also tried to locate inherent column correlations among the database tables. This work was inspired by the Hermit indexing algorithm.
Qd-tree: Learning Data Layouts for Big Data Analytics
Due to the huge volume of data produced by modern applications and the need to process and analyze it, partitioning has become necessary. Query-data routing trees (qd-trees) are a new framework for partitioning data that takes into account the number of blocks skipped, a metric directly related to I/O cost and query efficiency. Given a specific set of queries, a qd-tree is constructed to optimize this metric. A qd-tree is used both for partitioning the data and for routing queries so that they take advantage of the completed partitioning. One of the proposed ways to construct such trees is deep reinforcement learning, which gradually produces better trees using reinforcement learning while leveraging deep learning to address high dimensionality.
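A toy sketch of the metric being optimized: each block keeps per-column min/max statistics, and a query can skip any block whose range cannot intersect its predicate. The blocks and values below are invented for illustration.

```python
# Each block stores min/max for a single (hypothetical) column "x".
blocks = [
    {"min": 0,  "max": 9},
    {"min": 10, "max": 19},
    {"min": 20, "max": 29},
]

def blocks_skipped(lo, hi):
    """Count blocks that cannot contain any row with lo <= x <= hi."""
    return sum(1 for b in blocks if b["max"] < lo or b["min"] > hi)

print(blocks_skipped(12, 15))  # only the middle block overlaps -> 2 skipped
```

A qd-tree's cut points determine these block boundaries, so choosing cuts that maximize skipped blocks over the given workload directly reduces I/O.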
SQL Query Completion for Data Exploration
Nowadays it is increasingly common to have several databases available for one application domain, some with hundreds of tables and/or attributes. Reaching the data of interest in these cases by writing SQL queries can be challenging, especially for a new data scientist. Exploratory analysis of the data is therefore crucial and should always precede posing queries to databases that store large-scale data. Since Jupyter Notebook is a web-based environment that enables interactive data analysis and visualization, and influenced by prior work, we develop a notebook that brings together effective tools for data exploration. The first part of the project presents the visualization tools used in the notebook for exploratory data analysis (EDA). With these tools, one can visually analyze the numerical and categorical attributes of a query result in order to understand the data and the relationships between the attributes. The second part of the project concerns the DataComPy library, which can be used to compare two query results in order to find similarities and overlaps; this tool is included in the developed notebook as well, to further improve the exploratory procedure. Finally, the main focus of this project lies in the presentation and implementation of the notion of 'SQL query completion'. Based on this idea, the main contribution of this project is, first, to implement this notion and, second, to release an improved approach. As will be shown, our approach is an improved version of SQL query completion, and this tool is part of the notebook as well. The combination of the aforementioned tools creates a powerful notebook for data exploration that helps users with query formulation.
The databases used in the experiments of this project to evaluate the notebook are nba_salary.sqlite and database.sqlite.
Duoquest: A Dual-Specification System for Expressive SQL Queries
Querying a relational database is difficult for users without knowledge of the SQL language. Many systems nowadays try to overcome this difficulty and provide alternatives for those users. One of them is Duoquest, a system that consumes both a natural-language query and a table to produce candidate SQL translations, achieving a significant improvement over a state-of-the-art natural language interface. In this work we take a close look at the reasons for this improvement and try to achieve even better results without additional user effort.
Dynamic Query Refinement for Interactive Data Exploration
Efficient data querying and exploration are often hindered by the user's poor awareness of the dataset or by overly strict queries. Instead, a query can be dynamically refined when needed to provide an adequate number of results based on user needs. An overflow of returned data is avoided by carefully restricting the query to the best results, while a scarce result set can be enriched by relaxing specific constraints expressed in the query. We follow the work of Kalinin et al., who implement such refinement in the context of constraint-programming queries with Searchlight. We further explore this idea by replacing their proposed underlying SciDB system with PostgreSQL, offering an open-source implementation of the algorithm described in the original paper.
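The relax/restrict idea can be illustrated with a toy loop (not the paper's constraint-programming algorithm): a range predicate is widened when it returns too few rows and narrowed when it returns too many, until the cardinality lands in a target window. All data and thresholds here are invented.

```python
# Toy table and an acceptable window for the result-set size.
data = list(range(100))
target_lo, target_hi = 5, 15

def run(lo, hi):
    """Evaluate the range query lo <= x <= hi over the toy table."""
    return [x for x in data if lo <= x <= hi]

lo, hi = 40, 42                        # overly strict initial query
while len(run(lo, hi)) < target_lo:    # relax: too few results
    lo, hi = lo - 1, hi + 1
while len(run(lo, hi)) > target_hi:    # restrict: too many results
    hi -= 1
print(len(run(lo, hi)), (lo, hi))
```

A real system would relax or tighten the semantically most appropriate constraints rather than a single numeric range, but the feedback loop is the same.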
Facilitating SQL Query Composition and Analysis
One of the more challenging tasks in the broader technology community is communicating efficiently with a system, and in particular with a database. Database-to-person communication is very frequently achieved through SQL queries. Even though SQL queries can be rather simple to formulate and easy to use for users familiar with computer science, inexperienced users may require several cycles of tuning and execution to reach the desired output. In this work, based on the paper by Zainab Zolaktaf et al., we examine a subset of methods that can accelerate and improve this interaction by providing insights about SQL queries prior to execution. We reach this goal by predicting a range of query properties, such as query answer size, using machine learning techniques and without relying on database statistics. Preliminary results from experiments on well-known public query workloads are encouraging and suggest that data-driven methods can become an added tool for easing database-human interaction by facilitating query composition and analysis.
Recurrent Neural Networks for Dynamic User Intent Prediction in Human-Database Interaction
In this paper we propose a method to capture user intent during a user-database interaction. User intent modeling aims to improve the interaction between the user and the database by improving the quality of search. It has been shown that human intent changes dynamically throughout the interaction; we therefore propose the use of RNNs, which can exhibit temporally dynamic behavior. We also propose the use of an SQL-specific embedding vector, as well as active learning in addition to full and incremental training. Active learning is a training method that seeks the sweet spot between the amount of training data (and time) and prediction accuracy. Experiments are conducted on a dataset constructed from query logs of the SDSS database (https://www.sdss.org/).
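The core mechanism can be sketched in a few lines: a recurrent cell consumes a sequence of (hypothetical) SQL-query embedding vectors, and its hidden state, which summarizes the interaction so far, is read out into a distribution over candidate intents. All dimensions and weights below are illustrative, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

d_emb, d_hid, n_intents = 8, 16, 3
Wx = rng.normal(scale=0.1, size=(d_hid, d_emb))      # input weights
Wh = rng.normal(scale=0.1, size=(d_hid, d_hid))      # recurrent weights
Wo = rng.normal(scale=0.1, size=(n_intents, d_hid))  # intent readout

session = rng.normal(size=(5, d_emb))   # 5 consecutive query embeddings
h = np.zeros(d_hid)
for x in session:                        # hidden state evolves per query,
    h = np.tanh(Wx @ x + Wh @ h)         # capturing temporal dynamics

logits = Wo @ h
probs = np.exp(logits) / np.exp(logits).sum()        # softmax over intents
print("intent distribution:", np.round(probs, 3))
```

Training (full, incremental, or active) would fit `Wx`, `Wh`, and `Wo` to the SDSS query-log labels; here they are random placeholders.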
Evidence-based Factual Error Correction with Fusion in Decoder
This work expands the corrector of an Evidence-Based Factual Error Correction system (Thorne et al., 2021) by adding the Fusion-in-Decoder mechanism (Izacard et al., 2021) to its T5 neural network, thus achieving efficient aggregation of multiple evidence passages and combination of the inputs. The performance of the two systems is compared using the SARI score.
Creating Locally Aligned Embeddings for Robust Schema Linking in Text-to-SQL Translation
Database systems hold vast amounts of information that have become necessary for many human activities, from biomedical research to business planning and organization. This data, however, remains inaccessible to users without technical training. For this reason, great efforts are being made to create natural language interfaces for database systems, enabling all users to access and use this valuable data. One of the main challenges of the text-to-SQL problem is schema linking: the process of identifying the schema entities to which the user's utterance refers. In this work we build on a previous approach for creating table embeddings to present a novel method for creating locally aligned embeddings for schema linking. We automatically create a dataset for training and testing our method, based on the WikiSQL dataset, and evaluate different approaches using locally aligned embeddings.
Manually Detecting Errors for Data Cleaning Using Adaptive Crowdsourcing Strategies
A Learning Based Approach to Predict Shortest-Path Distances
Computing shortest-path distances on road networks is a challenging task, especially in the Big Data era, where there are millions of nodes per area. Traditional methods traverse the graph and compute the needed shortest distances, but they are too slow for live applications that require instant answers, due to their time complexity. Another approach is to store all possible distances between any two nodes, but then the problem is space complexity: for large graphs the storage may exceed terabytes of memory. To solve this problem, the authors developed an approach that combines small space requirements with high accuracy by predicting the distance between two nodes via a multilayer perceptron (MLP). To do so, they first map the nodes to embedding vectors and then learn a distance function that predicts the distance between two nodes with high accuracy.
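A simplified stand-in for this pipeline (the paper feeds learned node embeddings to an MLP; here, to keep the sketch self-contained, the embeddings themselves are trained so that their L1 distance approximates the graph distance) on a tiny path graph with invented parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10                                   # path graph 0-1-2-...-9
true_dist = lambda i, j: abs(i - j)      # shortest-path distance on a path
emb = rng.normal(scale=0.1, size=(n, 2)) # 2-d embedding per node

pairs = [(i, j) for i in range(n) for j in range(n) if i != j]

def loss():
    """Mean squared error of the L1-distance predictor over all pairs."""
    return sum((np.abs(emb[i] - emb[j]).sum() - true_dist(i, j)) ** 2
               for i, j in pairs) / len(pairs)

lr, before = 0.01, loss()
for _ in range(300):                     # plain SGD over all pairs
    for i, j in pairs:
        diff = emb[i] - emb[j]
        err = np.abs(diff).sum() - true_dist(i, j)
        grad = 2 * err * np.sign(diff)
        emb[i] -= lr * grad
        emb[j] += lr * grad

print(f"mean squared error: {before:.2f} -> {loss():.2f}")
```

The appeal is the same as in the paper: storing one small vector per node (plus a learned distance function) replaces an all-pairs distance table whose size grows quadratically with the number of nodes.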
Restaurants in Europe: A Visualization Analysis
Visualizing World Happiness