Tutorials

Tutorial 1:
Sorting in Space and Words

Hanan Samet

Tutorial 2:
Cross-Platform Data Processing: Use Cases and Challenges

Zoi Kaoudi (QCRI), Jorge-Arnulfo Quiane-Ruiz (QCRI)

Tutorial 3:
Data Security and Privacy for Outsourced Data in the Cloud

Cetin Sahin and Amr El Abbadi

Tutorial 4:
Online Temporal Analysis of Complex Systems using IoT Data Sensing

Avigdor Gal, Arik Senderovich, and Matthias Weidlich

Tutorial 5:
Machine Learning to Data Management: A Round Trip

Laure Berti-Equille (Aix-Marseille University, France), Angela Bonifati (University Claude Bernard Lyon 1, France), Tova Milo (Tel Aviv University, Israel)

Tutorial 6:
Blockchains and Databases: A New Era in Distributed Computing

C. Mohan

Sorting in Space and Words

Spatial data has traditionally been specified geometrically and explicitly. Accessing it quickly depends on its representation and is important in many computer graphics applications. Some common representations include quadtrees, octrees, pyramids, and bounding box hierarchies. They enable indexing into space and are really multidimensional sorts, thereby facilitating search. Users of new applications find explicit geometric specifications (e.g., as latitude-longitude number pairs) to be cumbersome, as they do not view locations this way, nor are they accustomed to communicating them this way. Instead, they specify locations textually, which is easy, especially on smartphone devices where a soft keyboard is always present. The text also acts like a polymorphic type in the sense that one size fits all, so that a term such as “Washington” can be interpreted as either a point or an area and users need not be concerned. The drawback of textual specifications is the need to overcome ambiguity issues, which involves recognizing a location and disambiguating it. Locations can also be specified implicitly in a device with a gesturing interface by combining a map with the ability to pan and vary the zoom level of viewing to yield an approximate location specification. Making its interpretation dependent on the zoom level is equivalent to using spatial synonyms.

The course provides a brief overview of hierarchical spatial data structures for regions, points, lines, and small rectangle collections based on sorting them. For nonregion data, we show how they facilitate finding nearest neighbors. We also discuss the advantages and pitfalls of processing textually specified spatial data. We illustrate browsing geometrically and textually specified data with the JAVA Spatial Data Applets, SAND Spatial Browser, and NewsStand spatiotextual query system for news articles.
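As a rough illustration of the remark that hierarchical spatial structures are really multidimensional sorts, the sketch below orders grid points by their Morton (Z-order) code, obtained by interleaving the bits of the two coordinates. This is only one possible linearization and is not taken from the tutorial material; the function names `interleave_bits` and `sort_points`, the 16-bit coordinate width, and the sample points are all assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Morton (Z-order) code of the grid cell (x, y).

    Assumes x and y are non-negative integers that fit in `bits` bits.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code


def sort_points(points, bits: int = 16):
    """Sort 2-D integer points along the Z-order curve."""
    return sorted(points, key=lambda p: interleave_bits(p[0], p[1], bits))


# Hypothetical sample data on a 4 x 4 grid.
points = [(3, 3), (0, 0), (1, 0), (0, 1), (2, 2), (1, 1)]
ordered = sort_points(points)
print(ordered)
```

Under this ordering, points that fall in the same aligned power-of-two block (that is, the same quadtree quadrant at some level) occupy a contiguous run of the sorted sequence, which is what lets a one-dimensional sorted structure stand in for a region quadtree in simple cases.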

Hanan Samet

Hanan Samet is a Distinguished University Professor of Computer Science at the University of Maryland, College Park. He received the B.S. degree in engineering from UCLA, and the M.S. degree in operations research and the M.S. and Ph.D. degrees in computer science from Stanford University. His doctoral dissertation dealt with proving the correctness of translations of LISP programs, which was the first work in translation validation and the related concept of proof-carrying code. He is the author of the recent book Foundations of Multidimensional and Metric Data Structures, published by Morgan-Kaufmann, an imprint of Elsevier, in 2006, an award winner in the 2006 best book in Computer and Information Science competition of the Professional and Scholarly Publishers (PSP) Group of the Association of American Publishers (AAP), and of the first two books on spatial data structures, Design and Analysis of Spatial Data Structures and Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS, both published by Addison-Wesley in 1990. He is the Founding Editor-In-Chief of the ACM Transactions on Spatial Algorithms and Systems (TSAS), the founding chair of ACM SIGSPATIAL, a recipient of the 2009 UCGIS Research Award, the 2011 ACM Paris Kanellakis Theory and Practice Award, and the 2014 IEEE Computer Society Wallace McDowell Award, and a Fellow of the ACM, IEEE, AAAS, IAPR (International Association for Pattern Recognition), and UCGIS (University Consortium for Geographic Information Science). He has received best paper awards in the 2007 Computers & Graphics Journal, the 2008 ACM SIGMOD and SIGSPATIAL ACMGIS Conferences, the 2012 SIGSPATIAL MobiGIS Workshop, and the 2013 SIGSPATIAL GIR Workshop, as well as best demo paper awards at the 2011 and 2016 SIGSPATIAL ACMGIS Conferences. His paper at the 2009 IEEE International Conference on Data Engineering (ICDE) was selected as one of the best papers for publication in the IEEE Transactions on Knowledge and Data Engineering.

Cross-Platform Data Processing: Use Cases and Challenges

There is a zoo of data processing platforms that help users and organizations extract value out of their data. Although each of these platforms excels in specific aspects, users typically end up running their data analytics on suboptimal platforms. This is not only because choosing the right platform among the myriad of big data platforms is a daunting task, but also because today’s data analytics are moving beyond the limits of a single platform. Thus, there is an urgent need for cross-platform data processing, i.e., using more than one data processing platform to perform a data analytics task. Despite the need, achieving this is still a dreadful process in which developers have to become deeply familiar with many systems and write ad hoc scripts for integrating them. This tutorial is motivated by this need. We will discuss the importance of supporting cross-platform data processing in a systematic way as well as the current efforts to achieve that. In particular, we will introduce a classification of the different cases where an application needs or benefits from cross-platform data processing and the challenges of each case. Along with this classification, we will also present the efforts to date to support cross-platform data processing. We will conclude this tutorial with a discussion of several important open problems.

Zoi Kaoudi

Zoi Kaoudi is a scientist at the Qatar Computing Research Institute (QCRI), HBKU. She has previously worked at IMIS-ATHENA RC as a research associate and at INRIA as a postdoctoral researcher. She received her PhD from the National and Kapodistrian University of Athens in 2011. She has previously presented tutorials at ICDE 2013 and SIGMOD 2014. Her research interests include machine learning systems, big data management, and distributed RDF query processing and reasoning.

Jorge-Arnulfo Quiane-Ruiz

Jorge-Arnulfo Quiane-Ruiz is a Senior Scientist at the Qatar Computing Research Institute (QCRI), HBKU. He has previously worked at Saarland University and INRIA. He received his PhD from the University of Nantes in 2008. He has previously presented tutorials at VLDB 2012 and received an Excellent Presentation Award at VLDB 2014. His research mainly focuses on efficient and scalable big data management.

Data Security and Privacy for Outsourced Data in the Cloud

Although outsourcing data to cloud storage has become popular, increasing concerns about data security and privacy in the cloud block broader cloud adoption. Ensuring data security and privacy, therefore, is crucial for better and broader adoption of the cloud. This tutorial provides a comprehensive analysis of the state of the art in data security and privacy for outsourced data. We aim to cover common security and privacy threats for outsourced data, and relevant novel schemes and techniques with their design choices regarding security, privacy, functionality, and performance. Our explicit focus is on recent schemes from both the database and the cryptography and security communities that enable query processing over encrypted data and access-oblivious cloud storage systems.

Cetin Sahin

Cetin Sahin is a Senior Developer at SAP Big Data Services. Cetin earned his B.Sc. in Computer Science from Bilkent University, Turkey, in 2011, and master’s and Ph.D. degrees in Computer Science from the University of California, Santa Barbara, in 2016 and 2017, respectively. His Ph.D. dissertation focuses on data security and privacy in the cloud. He was a summer research assistant at NEC Laboratories in 2013 and 2014.

Amr El Abbadi

Amr El Abbadi is a Professor of Computer Science at the University of California, Santa Barbara. He received his B. Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. Prof. El Abbadi is an ACM Fellow, AAAS Fellow, and IEEE Fellow. He was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as a journal editor for several database journals and as program chair for multiple database and distributed systems conferences. He currently serves on the executive committee of the IEEE Technical Committee on Data Engineering (TCDE) and was a board member of the VLDB Endowment from 2002 to 2008. In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. In 2013, his student Sudipto Das received the SIGMOD Jim Gray Doctoral Dissertation Award. Most recently, Prof. El Abbadi was the co-recipient of the Test of Time Award at EDBT/ICDT 2015. He has published over 300 articles in databases and distributed systems and has supervised over 35 PhD students.

Online Temporal Analysis of Complex Systems using IoT Data Sensing

Temporal analysis for online monitoring and improvement of complex systems such as hospitals, public transportation networks, or supply chains has been the focus of several areas in operations management. These include queueing theory for bottleneck analysis, mathematical scheduling for resource assignments to customers, and inventory management for ordering products under uncertain demand. In recent years, with the increasing availability of data sensed by Internet-of-Things (IoT) infrastructures, these online temporal analyses have been drifting towards automated and data-driven solutions. In this tutorial, we cover existing approaches to answering online temporal queries based on sensed data. We discuss two complementary angles, namely operations management and machine learning. The operational approach is driven by models, while machine learning methods are grounded in feature encoding. Both techniques require methods for translating low-level data readings coming from sensors into high-level activities with their temporal relations. Further, some of the techniques consider only dependencies of the sensed entities on their own individual histories, while others take into account dependencies between entities that share system resources. We outline the state of the art in temporal querying, with demonstrations of interesting phenomena and main results using a real-world case study in the healthcare domain. Finally, we chart the territory of online data analytics for complex systems in a broader context and provide future research directions.

Avigdor Gal

Avigdor Gal is a Professor at the Technion – Israel Institute of Technology, where he leads the Data Science & Engineering program. He specializes in various aspects of data management and mining with about 150 publications in journals (Journal of the ACM (JACM), ACM Transactions on Database Systems (TODS), IEEE Transactions on Knowledge and Data Engineering (TKDE), Information Systems, and the VLDB Journal), books, and conferences (SIGMOD, VLDB, ICDE, CIKM, BPM, ER, CoopIS). He is a co-author of several papers on process mining. He has served as a program co-chair and general co-chair of several conferences, including BPM and DEBS. In the past he gave tutorials at SIGMOD, VLDB, EDBT, and CAiSE. Avigdor Gal is a recipient of the prestigious Yannai award for excellence in academic education.

Arik Senderovich

Arik Senderovich is the Lyon Sachs postdoctoral fellow at the Toronto Intelligent Decision Engineering Laboratory (TIDEL) at the University of Toronto. He received his BSc in Industrial Engineering and Management, followed by an MSc in Statistics and a PhD in Data Science, all at the Technion (Israel Institute of Technology). His PhD thesis focused on combining Queueing Theory, Process Management, and Data Mining.

Matthias Weidlich

Matthias Weidlich is a faculty member at the Department of Computer Science at Humboldt-Universität zu Berlin, where he leads the Process-Driven Architectures group. Matthias’ research focuses on process-oriented and event-based information systems. His results appear regularly in premier conferences (SIGMOD, VLDB, ICDE, IJCAI, BPM, CAiSE) and journals (TSE, TKDE, Information Systems, VLDB Journal) in the field. He is a Junior-Fellow of the German Informatics Society (GI) and in 2016 received the Berlin Research Award (Young Scientist). He serves as PC Co-Chair of the ACM DEBS 2018 conference and is an area editor for Elsevier’s Information Systems. In the past, he gave tutorials at CAiSE, AAMAS, and DEBS.

Machine Learning to Data Management: A Round Trip

With the emergence of machine learning (ML) techniques in database research, ML has already shown tremendous potential to dramatically impact the foundations, algorithms, and models of several data management tasks, such as error detection, data cleaning, data integration, and query inference. Parts of the data preparation, standardization, and cleaning processes, such as data matching and deduplication, could be automated by making an ML model “learn” and predict the matches routinely. Data integration can also benefit from ML, as the data to be integrated can be sampled and used to design the data integration algorithms. After the initial manual work to set up the labels, ML models can start learning from the new incoming data that are being submitted for standardization, integration, and cleaning. The more data supplied to the model, the better the ML algorithm can perform and deliver accurate results. Therefore, ML is more scalable than traditional, time-consuming approaches. Nevertheless, many ML algorithms require tuning beyond their out-of-the-box settings, and their parameters and scope are often not adapted to the problem at hand. For example, in cleaning and integration processes, the window sizes of values used for the ML models cannot be arbitrarily chosen and require an adaptation of the learning parameters. This tutorial will survey the recent trend of applying machine learning solutions to improve data management tasks and establish new paradigms to sharpen data error detection, cleaning, and integration at the data instance level, as well as at schema, system, and user levels.

Laure Berti-Equille

Laure Berti-Equille received her Ph.D. degree in Computer Science from the University of Toulon in France in 1999. From 2000 to 2010, she was a tenured Associate Professor at the University of Rennes 1, and spent two years as a visiting researcher at AT&T Labs Research in New Jersey, USA, as a recipient of the prestigious European Marie Curie Outgoing Fellowship (2007-2009). From 2011 to 2017, she was a Research Director at IRD, the French Institute of Research for Development. From 2014 to 2017, she was a Senior Scientist at Qatar Computing Research Institute (Hamad Bin Khalifa University). She is now a full Professor at Aix-Marseille University (AMU) in France. Her interests are at the intersection of large-scale data science, data analytics, and machine learning with a focus on data quality and truth discovery research. She initiated the very first workshop editions on information and data quality in information systems (IQIS 2005) and in databases (QDB 2009 and 2016) in conjunction with SIGMOD and VLDB respectively, and co-organized the first French workshops on Data and Knowledge Quality in conjunction with EGC (Extraction et Gestion de Connaissances) in 2005, 2006, 2010, and 2011. Laure is serving as an associate editor of the ACM Journal on Data and Information Quality and served as a Program Chair of the International Conferences on Information Quality (ICIQ) in 2012 and 2016. She has received various grants from the French Agency for National Research (ANR), the French National Research Council (CNRS), and the European Union.

Angela Bonifati

Angela Bonifati received her Ph.D. degree in Computer Science from Politecnico di Milano in 2002. After graduating she worked as a postdoctoral researcher at the INRIA research institute in Paris. She then obtained a permanent position as a researcher at the Italian National Research Council in 2003. She has been a full Professor in France since 2011, currently at University of Lyon 1. Her research focuses on advanced database applications such as data integration and exchange, web and graph databases, and query inference, considering both structured and semi-structured data models. She has been a visiting professor at several foreign universities, such as Stanford University, UBC, and Saarland University. Angela served as the Program Chair of several international conferences, including ICDE 2011 (Semi-structured data Track) and ICDE 2018 (Information Extraction and Data Cleaning and Curation Track), WebDB 2013, and XSym 2009. She is currently an associate editor of the VLDB Journal, ACM Transactions on Database Systems (TODS), and Distributed and Parallel Databases. She was the recipient of the prestigious Palse Impulsion Starting Grant at the University of Lyon (IDEX) in 2016. She has received grants from the French and Italian Ministry of Science and the French National Research Council (CNRS).

Tova Milo

Tova Milo received her Ph.D. degree in Computer Science from the Hebrew University, Jerusalem, in 1992. After graduating she worked at the INRIA research institute in Paris and at the University of Toronto and returned to Israel in 1995, joining the School of Computer Science at Tel Aviv University, where she is now a full Professor and holds the Chair of Information Management. She served as the Head of the Computer Science Department from 2011 to 2014. Her research focuses on large-scale data management applications such as data integration, semi-structured information, data-centered business processes, and crowdsourcing, studying both theoretical and practical aspects. Tova served as the Program Chair of several international conferences, including PODS, VLDB, ICDT, XSym, and WebDB, and as the chair of the PODS Executive Committee. She served as a member of the VLDB Endowment and the PODS and ICDT executive boards and as an editor of TODS, IEEE Data Eng. Bull., and the Logical Methods in Computer Science Journal. Tova has received grants from the Israel Science Foundation, the US-Israel Binational Science Foundation, the Israeli and French Ministry of Science, and the European Union. She is an ACM Fellow, a member of Academia Europaea, and a recipient of the 2010 ACM PODS Alberto O. Mendelzon Test-of-Time Award, the 2017 VLDB Women in Database Research award, the 2017 Weizmann award for Exact Sciences Research, and the prestigious EU ERC Advanced Investigators grant.

Blockchains and Databases: A New Era in Distributed Computing

In the last few years, blockchain (also known as distributed ledger), the underlying technology of the permissionless or public Bitcoin network, has become very popular for use in private or permissioned environments. Computer companies like IBM and Microsoft, and many key players in different vertical industry segments, have recognized the utility of blockchains for securely managing assets (physical/digital) other than cryptocurrencies. IBM did some pioneering work by architecting and implementing a private blockchain system, and then open sourcing it. That system, which has since been named Fabric, is being enhanced via the Hyperledger Consortium set up under the auspices of the Linux Foundation. Other efforts in the industry include Enterprise Ethereum and R3 Corda. While currently there is no standard in the blockchain space, all the ongoing efforts involve some combination of database, transaction, encryption, virtualization, consensus, and other distributed systems technologies. Some of the application areas in which blockchain pilots are being carried out are: smart contracts, food safety, logistics, supply chain management, Know Your Customer (KYC), derivatives processing, and provenance management. A number of production deployments are also in place now.

In this tutorial, I survey some of the ongoing projects with respect to their architectures in general and their approaches to some specific technical areas. In particular, I focus on how the functionality of traditional and modern data stores is being utilized or not utilized in the different blockchain projects. Because of the attention the world is paying to blockchain technologies, it is important for the database community to become more aware of the underlying technologies and other developments in this area. Then, the community could try to influence the approaches taken and, in particular, how database technologies could be better utilized or enhanced for blockchains. Since most of the blockchain efforts are still in a nascent state, the time is right for database researchers and practitioners to get more deeply involved!

C. Mohan

Dr. C. Mohan has been an IBM researcher for 36 years in the database area, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the ARIES family of database locking and recovery algorithms, and the Presumed Abort commit protocol. This IBM (1997) and ACM/IEEE (2002) Fellow has also served as the IBM India Chief Scientist for 3 years (2006-2009). In addition to receiving the ACM SIGMOD Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999), and numerous IBM awards, Mohan was elected to the US and Indian National Academies of Engineering (2009), and was named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD at the University of Texas at Austin (1981). He is an inventor of 50 patents. He is currently focused on Blockchain, Big Data, and HTAP technologies (http://bit.ly/CMbcDB, http://bit.ly/CMgMDS). Since 2016, he has been a Distinguished Visiting Professor of China’s prestigious Tsinghua University. He has served on the advisory board of IEEE Spectrum, and on numerous conference and journal boards. Mohan is a frequent speaker in North America, Europe, and India, and has given talks in 40 countries. He is very active on social media and has a huge network of followers. More information can be found on his Wikipedia page at http://bit.ly/CMwIkP