Distributed Data Summit Confirmed Sessions

We are continuing to announce the sessions. Bookmark this page and expect frequent updates.

Keynote: Apache Cassandra: The Road to 4.0

Nate McCall - Apache Software Foundation

We have come a long way in 10 years. What started as a side project to solve a specific problem using a mashup of distributed computing techniques has morphed into a mission-critical database relied on by some of the world's largest companies. The journey to this point had its ups and downs, drama and intrigue. But it sure makes for a good story.
In this keynote, we'll tell this story, talking frankly about the changes in the community over the past couple of years and where we are now as a project. From there, we'll look out towards our future with the upcoming release of 4.0, detailing some of the latest features and improvements, both in the technology and in the community.
Attendees of this keynote will gain an understanding of where we are as a project, how we got to this point, where we are going next and, most importantly, how they can become involved in the journey.

Documentation, documentation, documentation!!!

Pedro Vidigal - Pythian

The purpose of this talk is to discuss the importance of documentation in an open source project such as Cassandra, and to show how easy it can be for anyone to build and improve the Apache Cassandra documentation.

An Introduction to FoundationDB for Cassandra Users

Ted Wilmes - Expero

Apache Cassandra and its variants handle a wide variety of modern distributed data problems with aplomb, but certain use cases may benefit from the addition of a complementary tool to your distributed database toolbox: FoundationDB. FoundationDB is a distributed ACID key-value store that was recently open sourced by Apple and is gaining interest due to the unique and powerful primitives that it provides application and database developers. This talk will introduce the FoundationDB architecture, data model, and client drivers, and walk through the development of a simple time series application that makes use of FoundationDB's ACID guarantees to overcome challenges that eventual consistency may pose in certain scenarios.
Intended audience: Cassandra and other distributed database users that are interested in adding a distributed ACID option into their toolkits.
Requirements: Basic knowledge of distributed databases.

VNodes... and 4.0

Kurt Greaves - Instaclustr

A history of vnodes since their introduction in Apache Cassandra 1.2, analysing their usage and the bumps we've hit along the way, and where vnodes are going from here. This talk will give an in-depth overview of vnodes, the problems they are trying to solve, and how they interact with the rest of the system. I'll be talking about what use cases vnodes are good for, when they should be avoided, and what hurdles you can expect if you use them. As this is a discussion of the token ring, we end up talking about the most important aspects of Cassandra, such as repairs, [anti-]compactions, queries and replication. I'll then talk about how upcoming changes in 4.0 will improve the vnodes use case.
Intended audience: Apache Cassandra administrators, contributors, developers
Technical skills and concepts required: Some knowledge of the token ring; basic knowledge of repairs, compactions, and replication.

Five things I wish I'd known about distributed systems

Jonathan Ellis - DataStax

In 1999, Eric Brewer published the CAP theorem, showing that distributed systems can offer at most two of Consistency, Availability, and Partition tolerance. Apache Cassandra is almost unique among distributed systems in choosing primarily an AP design. In this talk, DataStax co-founder and former Apache Cassandra PMC Chair, Jonathan Ellis, will describe some of the unique challenges in building an AP database, how Cassandra met them, and how those decisions have proven out with time.

The Why, What, and How of Cassandra Repair

Aaron Morton - The Last Pickle

No one likes Repair, but (almost) everyone needs it. It is the way we have to ensure consistent data on disk, which makes everyone feel better. In this talk Aaron Morton, CEO at The Last Pickle and Apache Cassandra PMC member, will discuss why we need Repair, what it does, how it has changed, and how the Cassandra Reaper tool is used to manage repair for installations that have 1,000s of nodes.

Was Cassandra the right baseline for ScyllaDB?

Dor Laor - ScyllaDB

The ScyllaDB team has been re-implementing Cassandra in C++ for the past 4 years, but was Cassandra the right baseline? Sure, Cassandra has hundreds of man-years behind it and design wins that other frameworks can only imagine, but there are caveats too. If you follow the rest of the industry, from multi-tenant CosmosDB to transactional CockroachDB, there’s a lot going on. Even within the Cassandra community there are different voices around sidecar management processes.
It’s a good time to take a step back and consider whether it was right to embrace the Cassandra design and APIs versus starting from a clean slate. This talk will cover programming language choices (Java, Go, C++), the use of existing engines (RocksDB), implementation of secondary indexes and management consoles, and more.

Running Apache Cassandra on Kubernetes

Ben Bromhead - Instaclustr

This session will introduce the Cassandra Kubernetes Operator, an open source Cassandra controller that provides robust, managed Cassandra deployments on Kubernetes.
Kubernetes has become the most popular container orchestration and management API with cloud-native support from AWS, GCP, Azure and a growing enterprise support ecosystem. Leveraging Kubernetes to provide tested, repeatable deployment patterns that follow best practices is a win for both developers and operators. Specifically, this talk will cover:
• What are containers and why are they useful?
• What is Kubernetes and why is it useful?
• Managing state in container environments
• Running Apache Cassandra in Kubernetes

OLTP or Analytics? Why not both?

Avi Kivity - ScyllaDB

OLTP and Analytics are very different. One is characterized by many concurrent small requests, with a high sensitivity to latency, while the other typically processes large streams of data with more emphasis on throughput.
The talk will cover:
- the different requirements of the two workloads
- how ScyllaDB optimizes for both
- performance isolation of different workloads within ScyllaDB
- how ScyllaDB supports concurrent OLTP and Analytics without sacrificing either latency or throughput
- measurements
Intended audience: NoSQL users that have both OLTP and analytics requirements.
Requirements: Familiarity with NoSQL technologies such as ScyllaDB, Cassandra, and Spark is helpful.

Cassandra Performance Tuning and Crushing SLAs

Jon Haddad - The Last Pickle

In an ideal world, everything would just be fast out of the box. Unfortunately, we’re not quite there yet. Getting the best performance out of a database means understanding your entire system, from the hardware and OS to the database’s internals. In this talk, Jon Haddad will discuss a wide range of performance tuning techniques. We’ll start by examining how to measure and interpret the statistics from the different components on our machines. Once we understand how to identify what exactly is holding our performance back, we can take the necessary steps to address the problem and move to the next issue. We’ll examine common pitfalls and problems, and learn how to tune:
* Counters
* Compaction
* Garbage collection
* Compression, and more.
If you’re working on a low latency, high throughput system (or just want to save some money on your AWS bill!) you won’t want to miss this talk.

Looking towards an Official Cassandra Sidecar

Vinay Chella / Joseph Lynch - Netflix

Operating C* can involve a lot of required manpower, complex automation, or both. Some of this complexity comes from operational/configuration activity of the underlying kernel and hardware, but much of it is operational complexity stemming from C* itself. Some examples of this complexity are restarting the database in a safe way, reliably backing up and restoring snapshots, monitoring the health of the datastore, and even ensuring eventual consistency through repair. As a result of these complexities, C* operators end up with complicated operational setups, which are expensive to build, manage and monitor. As part of this talk we will share lessons learned in managing such complexity via our Priam sidecar, including recent innovations in how our sidecar ensures the highest possible uptime and correctness of Cassandra. We then use this to motivate building the management sidecar directly into C* itself (CASSANDRA-14395). At the end of this talk, the audience will know more about what management techniques are required to operate Cassandra, and what to expect and what not to expect in the official sidecar going forward.
Intended audience: This talk is intended for users who are running C* at any scale in their environment.

Query and Audit logging in Cassandra

Vinay Chella - Netflix

Audit logging is one of the most critical features in an enterprise-ready database in terms of security compliance. Furthermore, live traffic troubleshooting is critical for operators to resolve production issues quickly. While past versions have lacked these critical features, the Cassandra team understood the need for better solutions, and in the upcoming release of Cassandra both of these features come out of the box, which makes Cassandra even more awesome to work with. Cassandra now supports audit logging and query logging as part of C* itself. As part of this talk, the audience will learn how to enable, configure, and tune audit logging for their C* clusters and how to log live traffic/queries for several needs, including troubleshooting or even live traffic replay.
Intended audience: This talk is intended for users who are using C* as their business-critical database, and for users who need some sort of security compliance on their C* deployments.

Real time Video Impression Store @ Netflix

Nitin Sharma - Netflix

Have you ever wondered how Netflix is able to personalize your experience so well? Knowing which recommended content was seen (impressions) while browsing Netflix is an important signal to our personalization algorithms. Building a scalable real-time distributed infrastructure to process and store billions of impressions is an onerous challenge.
In this talk, I will do a deep dive into our Apache Flink based infrastructure, which processes impressions in near real time, produces 100s of TBs of data and updates EVCache, a globally replicated and distributed data store. We will also look at the gRPC-based impressions serving infrastructure that guarantees millisecond latencies and how we addressed the following challenges:
● How do we guarantee exactly once real-time processing semantics for hundreds of billions of events using Apache Flink?
● How does EVCache handle large scale updates while guaranteeing tail query lookups within milliseconds for 100s of TB of data?
● How do we balance cost with high availability and data consistency?
● How do we leverage RocksDB in Flink to do state management and also guarantee zero data loss during infrastructure failures?

Solving IO bottlenecks in Distributed Data Systems: Cassandra Case Study

Prasanna Sundararajan / Nishant Vyas - rENIAC

Most database workloads are IO bound. In a traditional data store, data has to travel through the network, OS kernel, main memory, and I/O subsystem, and into and out of the storage media. These layers are not optimized to fully utilize the available IO bandwidth. The result is an inefficient data flow that reduces the total IO throughput per node and forces users to deploy more nodes to scale IO.
With recent advancements in FPGA-enabled hardware architecture, where the FPGA has direct access to the network, CPU and storage, rENIAC has developed a simpler, more scalable I/O stack, one that is an order of magnitude better than the traditional data flow and well suited for advanced distributed database architectures like Cassandra DB.
We will present a case study of addressing the IO problems of Cassandra DB. The results show up to 25x lower latency and 5x higher throughput. We will also elaborate on how our approach reduces the impact of compaction.
Please come join us to learn how FPGAs can increase IO throughput and data density per node.
Intended audience: Distributed Data System (e.g. Cassandra, Dynamo etc.) users and developers interested in learning about deploying and/or building hardware accelerators using FPGAs.
Technical skills and concepts required: Basic knowledge of distributed NoSQL databases and curiosity about hardware acceleration

Cassandra's Odyssey at Netflix

Roopa Tangirala - Netflix

This talk will walk through the journey of Cassandra at Netflix. It will go into 3-4 specific use cases where Cassandra stands out from the rest of the data stores and is being used at Netflix, bringing a great viewing experience to all customers globally. Roopa will go into the specifics of the data model being used, where Cassandra stands out with its strengths, and the places where the team learnt the hard way. Roopa will then share some of the best practices and the self-service platform being used for Cassandra to cater to their developers' needs.

Taking advantage of open source to sidestep SizeTierCompactionStrategy limitations

Alexander Filipchik - Sony

Cassandra currently provides 4 perfect strategies that are designed to handle any possible use case. SizeTier, Leveled, TimeWindow and DateTiered were created a long time ago and their code is pretty much static. But have you ever wondered whether we can do better? Have you ever run a cluster with SizeTier compaction, constantly watching disk space to make sure it stays well below 50%? Maybe even asking yourself, why can't I use the other 50%? If you did, you might be interested in the story of running a cluster that is mission critical for Sony PlayStation, is deployed across multiple regions, holds more than 100 TB of data and was a constant headache for our Ops team. I will go through the design of the system and the issues we had, and will dive into the fixes that I made to SizeTierCompactionStrategy to mitigate them. It will soon celebrate 2 years in production, and today we have nodes living happy lives with disks 80% full.

Relational model in Cassandra: Will it fit?

Matija Gobec - SmartCat

In the process of building a highly nested UDT structure, Matija encountered some peculiar use cases, which are the basis of this presentation. Whether it's using nested UDTs, JSON in a text field or even a blob of serialized data, all of the approaches to building a relational model have their pros and cons. Matija will discuss when the relational model should be used, and when to stick to a denormalized data model. He'll also explain how writing a highly nested structure from Spark can be a painful experience and what he learned from it.

Shard per core vs threads in databases

Avi Kivity - ScyllaDB

This talk will cover the shard-per-core architecture of Scylla and compare it to the traditional threaded design employed by Cassandra. The recent adoption of thread-per-core by DataStax is further evidence of the gains this architecture offers over a threaded design. We will discuss the various design decisions that make a shard-per-core design ideal and how to tackle the architecture issues that come with it.
Intended audience: Cassandra/Scylla users and developers
Technical skills and concepts required: NoSQL implementations as a user and a developer

Need for speed: Boosting Apache Cassandra's performance using Netty

Dinesh A. Joshi

Apache Cassandra 4.0 has several enhancements. One of the biggest is the switch from blocking network IO using JDK Sockets to non-blocking IO with Netty. As a result, Cassandra has seen gains in performance and efficiency. These gains translate into real-world cost savings and allow Cassandra to scale better. This presentation will take you on a tour of the improvements to Cassandra's network layer (old & new) and help quantify the gains in real-world terms.
At the end of the talk the audience will have learned about the motivations behind changing Cassandra's internode and streaming communication to Netty. They will also learn how these changes significantly affect scalability, and about recent enhancements such as zero copy streaming over Netty that make scaling and rebuilding a cluster faster than ever before.

Cassandra and ScyllaDB at Yahoo! Japan

Shogo Hoshii - Yahoo! Japan
Murukesh Mohanan - Yahoo! Japan

Yahoo! JAPAN is one of the most successful internet service companies in Japan, unlike in other countries. Shogo will introduce his company's business and its scale. He will go on to discuss the number of Cassandra nodes and clusters, and what kind of data is stored in them. He will then introduce some of the incidents that happened with the company's Cassandra implementation, and how they managed to handle them.
Murukesh Mohanan will go on to discuss the Yahoo! Japan NoSQL Team's evaluation of ScyllaDB as a successor to Cassandra under exceedingly heavy traffic.

Distributed Secondary Indexes in Manhattan @ Twitter's scale

Karthik Katooru - Twitter

Manhattan is Twitter's distributed, real-time Key-Value database that's behind some critical use cases including Tweets and Direct Messages. This talk will cover how we implemented scalable, distributed Secondary Indexes in Manhattan. Over the years, some use cases have evolved to require capabilities beyond the data access APIs supported by Manhattan's data model. Some applications started adding workarounds due to the lack of native secondary indexes. We have built two types of secondary indexes, Global and Local Secondary Indexes, to address different access patterns. Secondary Indexes in Manhattan are scalable: the indexes are distributed across the cluster and can be retrieved with similar time efficiency as reading base data. The talk will mainly focus on the architecture of Secondary Indexing in Manhattan and how we made the consistency level, space and time complexity highly tunable for the users.
Intended audience: Anyone with experience building distributed databases or users interested in the internals of database architecture.
Technical skills and concepts required: Introductory knowledge of the architecture of distributed databases like Cassandra. Optional: Understanding of consistency levels and partial failures in distributed systems.

Securing Cassandra with Vault

Carlos Rolo - Pythian

Security is one of the crucial things to take into account today. Vault is a tool made to manage credentials and audit access to them. In this presentation, Carlos will explain how to set up Vault to be resilient by using Cassandra as storage, and how to use Vault to control Cassandra credentials and access.

Operational Excellence with Apache Cassandra in Enterprises

Hayato Shimizu - digitalis.io

Operational management and performance tuning of Apache Cassandra clusters in enterprises require discipline across a number of areas in order to satisfy the business requirements and keep the Cassandra promise of always on, never down. As with any type of enterprise system, keeping the lights on with Cassandra in production environments requires an extensive set of monitoring and alerting. "Are the query response times within the defined SLA? What were the characteristics of the cluster that led to a catastrophic event? What is the fastest path to recovery?" etc.
Effective monitoring of Cassandra comprises 3 distinct types of information being actively captured from running systems: service availability, performance metrics, and events/logs. We started by using many open source tools to capture these and created dashboards and alerting, largely solving the problem of being able to identify Cassandra issues quickly.
Another aspect of keeping enterprise Cassandra clusters running smoothly is to implement a scheduler to perform regular repairs and backups, which itself needs to be reliable. All of the above require many server components, various dashboards, and several agents to be deployed on Cassandra servers, creating messy firewall configs and change approvals, and leaving a large number of components to maintain, test, patch, and upgrade. In this talk we will discuss how we are solving these problems to successfully manage Cassandra for our customers.

Performance at Scale on Cassandra

John Sumsion - FamilySearch

John works for FamilySearch on the world-wide Family Tree project that runs on Cassandra. He's an alumni speaker of both the Cassandra Summit and Apache Big Data conferences. John will be sharing what they've learned in developing the Family Tree data model -- focusing on what it took to overcome the problems encountered as their traffic increased over 5x. John's talk will cover:
- Description of Family Tree data model
- How writes are denormalized
- How reads are optimized
- How we avoided lots of tombstones [new]
- How we avoided reading partial/failed writes [new]
- Tuning for better read performance [new]
- Cluster scale warning signs [new]

Cassandra data ingestion with CDC

Jay Zhuang - Uber

Jay has spoken previously about general Cassandra usage and management at Uber. For this presentation, Jay will describe the new CDC design, and how Uber integrated it with Kafka for data ingestion and Redis for cache invalidation. The talk will cover the challenges for data ingestion from Cassandra, the current CDC design and limitations, and CDCv2 with a plugin interface. Jay will wrap up with the pros and cons, benchmark, and stress-test results. This will be a great talk for Cassandra users who need to have a data ingestion pipeline.

Running Apache Cassandra on Docker

Jim Hatcher - DataStax

Containers are a lightweight alternative to Virtual Machines for isolating applications within a single host machine. They are widely used in the web- and application-tiers of systems, but are they really appropriate for the stateful data tier? In this talk, we will explore that subject and walk through how to get Cassandra running using Docker.
Intended audience: Technical Architects
Technical skills and concepts required: Cassandra, Virtualization, Containers, Docker

Deploying Cassandra in Multi-cloud

Jeff Carpenter - DataStax

One of the recent trends in cloud computing is the tendency for enterprises to use multiple clouds, either to take advantage of unique offerings per cloud vendor or to hedge their bets. Creating a data tier that spans multiple clouds is a key consideration in these architectures. Apache Cassandra is an ideal technology for the multi-cloud data tier due to its flexibility and resilience. In this talk we’ll discuss the challenges you’ll face in developing the data tier of multi-cloud applications, provide recommendations on how to address these challenges with Cassandra, and show how we created a sample application to demonstrate these recommendations in practice.
Intended audience: developers, operators, architects
Skill level: some experience with deploying/configuring Cassandra will be useful

Getting the most out of Cassandra

Jeff Carpenter - DataStax

If you’re just getting started on your journey with Apache Cassandra, this talk is for you. We’ll talk about when (and when not) to use Cassandra and share some resources that can help you get up to speed quickly on data modeling and application development. We’ll look at how to avoid some common pitfalls, and talk about how to approach Cassandra operations. Finally, we’ll discuss some tangible ways you can give back to the community.
Intended audience: developers, operators, architects
Skill level: beginner with SW dev background