Keynote Talks – 34th IEEE International Conference on Data Engineering

An NVM Carol: Visions of NVM Past, Present, and Future
Margo Seltzer (Harvard, USA)
My Top Ten Fears about the DBMS Field
Michael Stonebraker (MIT, USA)
Data and Data Science in Enterprise Evolution: Some Technical Challenges and Beyond
Jian Pei (Simon Fraser University, Canada)
Dask: Parallelizing the Python Data Ecosystem with Task Scheduling
Matthew Rocklin (Anaconda Inc.)
Actor-Oriented Database Systems
Philip Bernstein (Microsoft Research, USA)
Human Factors in Data Science
Sihem Amer-Yahia (CNRS, University Grenoble Alpes, France)

An NVM Carol: Visions of NVM Past, Present, and Future

Chair: Gustavo Alonso (ETH)

Around 2010, we observed significant research activity around the development of non-volatile memory technologies. Shortly thereafter, other research communities began considering the implications of non-volatile memory for system design, from storage systems to data management solutions to entire systems. Finally, in July 2015, Intel and Micron Technology announced 3D XPoint. We can view non-volatile memory technology and its impact on systems through a historical lens, revealing it as the convergence of several past research trends: starting with the concept of single-level store, encompassing the 1980s hype around bubble memory, building upon persistent object systems, and leveraging recent work in transactional memory. We present this historical context, recalling past ideas that seem particularly relevant and potentially applicable and highlighting aspects that are particularly novel.

Margo I. Seltzer

Margo I. Seltzer is currently the Herchel Smith Professor of Computer Science and the Faculty Director for the Center for Research on Computation and Society in Harvard’s John A. Paulson School of Engineering and Applied Sciences. In September 2018, she will assume a Canada 150 Research Chair and become the Director of the Computer Systems Laboratory at the University of British Columbia. Her research interests are in systems, construed quite broadly: systems for capturing and accessing provenance, file systems, databases, transaction processing systems, storage and analysis of graph-structured data, new architectures for parallelizing execution, and systems that apply technology to problems in healthcare and the judicial system. Dr. Seltzer was a founder and CTO of Sleepycat Software, the makers of Berkeley DB, and is also an Architect at Oracle Corporation. She is a past President of the USENIX Association, a Sloan Foundation Fellow in Computer Science, and an ACM Fellow. She is recognized as an outstanding teacher and mentor, having received the Phi Beta Kappa teaching award in 1996, the Abrahmson Teaching Award in 1999, and the Capers and Marion McDonald Award for Excellence in Mentoring and Advising in 2010. Dr. Seltzer received an A.B. degree in Applied Mathematics from Harvard/Radcliffe College in 1983 and a Ph.D. in Computer Science from the University of California, Berkeley, in 1992.

My Top Ten Fears about the DBMS Field

Chair: Malu Castellanos (Teradata)

In this talk, I present my top ten fears about the future of the DBMS field, with apologies to David Letterman. There are three “big fears”, which I discuss first. Five additional fears are a result of the “big three”. I then conclude with “the big enchilada”, which is a pair of fears. In each case, I indicate what I think is the best way to deal with the current situation.

Michael Stonebraker

Michael Stonebraker is an adjunct professor at MIT and a recipient of the 2014 ACM Turing Award. He earned his bachelor’s degree from Princeton University in 1965 and his master’s degree and his Ph.D. from the University of Michigan in 1967 and 1971, respectively. His awards include the ACM System Software Award (1992), the first SIGMOD Edgar F. Codd Innovations Award (1994), and the IEEE John von Neumann Medal (2005). In 1994 he was inducted as a Fellow of the Association for Computing Machinery. In 1997 he was elected a member of the National Academy of Engineering. In September 2015 he won the 2015 Commonwealth Award, chosen by council members of MassTLC.

Prof. Stonebraker has been a pioneer of database research and technology for more than a quarter of a century. He was the main architect of the INGRES relational DBMS and of the object-relational DBMS POSTGRES. These prototypes were developed at the University of California at Berkeley, where Stonebraker was a Professor of Computer Science for twenty-five years. More recently, at MIT, he was a co-architect of the Aurora/Borealis stream processing engine, the C-Store column-oriented DBMS, the H-Store transaction processing engine, the SciDB array DBMS, and the Data Tamer data curation system.

Data and Data Science in Enterprise Evolution: Some Technical Challenges and Beyond

Chair: Beng Chin Ooi (National University of Singapore)

The importance of data and data science in large enterprises cannot be overemphasized, particularly for a company that is evolving rapidly. In this talk, I will share several exciting stories about data and data science challenges in the context of enterprise evolution, including data integration, platforms, and data products. Those challenges are, more often than not, a mix of technical issues and issues that go beyond the technical. I hope that an understanding of data and data science challenges in enterprise evolution can inspire novel research opportunities and ideas.

Jian Pei

Always eager to meet new challenges and opportunities, Jian Pei is currently Vice President, Big Data Platform and Products/Intelligent Supply Chains, at JD.com, China’s largest online retailer and its biggest overall retailer, as well as the country’s biggest Internet company by revenue. He is currently on leave from Simon Fraser University, where he holds a Canada Research Chair (Tier 1) in Big Data Science. Recognized as an ACM Fellow and an IEEE Fellow, he has published over 200 technical publications, which have been cited more than 76,000 times, including more than 33,000 times in the last 5 years. His research has generated remarkable impact substantially beyond academia.

Dask: Parallelizing the Python Data Ecosystem with Task Scheduling

Chair: Jens Dittrich (Saarland University)

We work to parallelize complex algorithms within the existing Python data analytics stack.

The Python data analytics stack consists of thousands of packages built on foundational libraries like NumPy, Pandas, and Scikit-Learn. These libraries couple ease of use with efficient low-level code and sophisticated algorithms. However, they are also often limited to sequential computation on in-memory data, a limitation that affects the downstream ecosystem of software. Increasing data sizes and the availability of parallel computing encourage us to parallelize this software stack. Unfortunately, the sophistication of the data structures and algorithms used makes this difficult with common big data technologies, which favor tabular data layouts and JVM infrastructure and are often unsuitable for complex array, image, or machine learning algorithms.

We address this with Dask, a library for dynamic distributed task scheduling, which has been used effectively within all of these projects to scale their algorithms to large datasets on either multi-core computers or multi-machine clusters. We discuss both the technical advantages and disadvantages of the task scheduling approach and the social results of this work within the software community.
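As a rough illustration of the task scheduling idea in this abstract, the following is a minimal sketch using only the Python standard library. It is not Dask's implementation or API: the function `run_graph`, the dictionary layout of the graph, and the example tasks are all assumptions made for illustration. Each task runs as soon as its inputs are available, so independent tasks can execute in parallel.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from operator import add, mul


def run_graph(graph, target, max_workers=4):
    """Toy scheduler (hypothetical, not Dask): run each task once its inputs exist.

    `graph` maps a key either to a plain value or to a tuple
    (function, dep_key, dep_key, ...) whose arguments are other keys.
    """
    # Plain values are available immediately.
    results = {k: v for k, v in graph.items() if not isinstance(v, tuple)}
    pending = {k: v for k, v in graph.items() if isinstance(v, tuple)}
    running = {}  # future -> key of the task it computes

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while target not in results:
            # Submit every pending task whose dependencies are all computed.
            for key in list(pending):
                func, *deps = pending[key]
                if all(d in results for d in deps):
                    args = [results[d] for d in deps]
                    running[pool.submit(func, *args)] = key
                    del pending[key]
            if not running:
                raise ValueError("graph has a cycle or a missing dependency")
            # Block until at least one running task finishes, then record it.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results[target]


# "sum" and "prod" do not depend on each other, so they may run concurrently.
graph = {
    "a": 1,
    "b": 2,
    "sum": (add, "a", "b"),
    "prod": (mul, "a", "b"),
    "total": (add, "sum", "prod"),
}
print(run_graph(graph, "total"))  # 5
```

This toy version runs every ready task in the graph, whether or not the target needs it, and runs everything in one process; a production scheduler would also have to handle data locality, failures, and distribution across machines.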

Matthew Rocklin

Matthew is an open source software developer focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra. Matthew lives in Brooklyn, NY, and is employed by Anaconda Inc.

Actor-Oriented Database Systems

Chair: Panos Chrysanthis (University of Pittsburgh)

Many of today’s interactive, stateful, server applications are processor-intensive and must be scalable and elastic. Hence, they are usually implemented as middle-tier objects backed by a key-value store in cloud storage, rather than as stored procedures in a database system. This enables the system to scale elastically by adding or removing inexpensive middle-tier servers. Example applications include multi-player games, social networking, mobile computing, telemetry, and Internet of Things. When the objects are single-threaded and do not share memory, they are called actors. There are dozens of programming frameworks for building actor applications, such as Akka, Erlang, Orbit, and Orleans.

Although the applications do not use a database system, they can benefit from database abstractions, such as transactions, indexing, queries, streams, replication, and geo-distribution. We therefore propose a new type of database system, called an actor-oriented database system, which supports these features over middle-tier objects and which works with any cloud storage system. As in a persistent programming language, these features must be well-integrated into the programming language, but the emphasis here is on distributed computing capabilities rather than language integration. In this talk, we will describe the requirements for such a system, the technical challenges in building it, and solutions to some of the challenges.
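To make the abstract's definition of an actor concrete (an object that is single-threaded and shares no memory), here is a minimal sketch using only the Python standard library. It is not taken from Akka, Erlang, Orbit, or Orleans; the class `CounterActor`, its message names, and the `ask` method are hypothetical names chosen for illustration. All state is touched only by the actor's own thread, and callers interact with it solely by sending messages to its mailbox.

```python
import queue
import threading


class CounterActor:
    """Hypothetical actor: private state, one mailbox, one worker thread."""

    def __init__(self):
        self._count = 0                      # touched only by the actor thread
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Messages are handled strictly one at a time, so no locks are needed.
        while True:
            message, reply = self._mailbox.get()
            if message == "stop":
                reply.put(None)
                return
            if message == "increment":
                self._count += 1
                reply.put(self._count)
            elif message == "get":
                reply.put(self._count)
            else:
                reply.put(None)              # unknown messages are ignored

    def ask(self, message):
        """Send a message and wait for the actor's reply."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((message, reply))
        return reply.get()


actor = CounterActor()

def client():
    for _ in range(100):
        actor.ask("increment")

# Four concurrent clients; the mailbox serializes their requests.
clients = [threading.Thread(target=client) for _ in range(4)]
for t in clients:
    t.start()
for t in clients:
    t.join()

print(actor.ask("get"))  # 400
actor.ask("stop")
```

In this sketch the state lives only in memory. The abstract's proposal concerns layering database features such as transactions, indexing, and replication over objects of this kind, which this example does not attempt.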

Philip Bernstein

Philip A. Bernstein is a Distinguished Scientist at Microsoft Research. He has published over 150 papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, which are still the major focus of his research. He is an ACM Fellow, a AAAS Fellow, a winner of ACM SIGMOD’s Codd Innovations Award, a member of the Washington State Academy of Sciences, and a member of the U.S. National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. degrees from the University of Toronto.

Human Factors in Data Science

Chair: Ioana Manolescu (INRIA)

Data Science (DS) has been shifting from libraries and stacks to usage and impact. While “database thinking” is permeating all levels in a DS stack, the DS lifecycle can only be fully realized by looping in humans in a principled and safe fashion. This talk’s focus is on the role of humans and user data in DS. It starts with the impact of human factors on the design of sustainable and fair data generation and curation. It then reviews how data processing and mining are being revisited to derive insights from user data. That is followed by how human-data interaction shapes the way we think about evaluating DS applications. The talk ends with opportunities that arise when bringing together database thinking and DS for humans.

Sihem Amer-Yahia

Sihem is a CNRS Research Director. Her interests are at the intersection of large-scale data management and social data exploration. Sihem held positions at QCRI, Yahoo! Research, and AT&T Labs. She served on the SIGMOD Executive Board, the VLDB Endowment, and the EDBT Board. She is Editor-in-Chief of the VLDB Journal. Sihem serves as co-chair of PVLDB 2018, WWW 2018 Tutorials, and ICDE 2019 Tutorials.