Distributed Data Summit Confirmed Sessions

We are continuing to announce the sessions. Bookmark this page and expect frequent updates.

Keynote: Apache Cassandra: The Road to 4.0

Nate McCall - Apache Software Foundation

We have come a long way in 10 years. What started as a side project to solve a specific problem using a mashup of distributed computing techniques has morphed into a mission-critical database relied on by some of the world's largest companies. The journey to this point had its ups and downs, drama and intrigue. But it sure makes for a good story.
In this keynote, we'll tell this story, talking frankly about the changes in the community over the past couple of years and where we are now as a project. From there, we'll look out towards our future with the upcoming release of 4.0, detailing some of the latest features and improvements, both in the technology and in the community.
Attendees of this keynote will gain an understanding of where we are as a project, how we got to this point, where we are going next and, most importantly, how they can become involved in the journey.

Cassandra's Odyssey at Netflix

Roopa Tangirala - Netflix

This talk will walk through the journey of Cassandra at Netflix. It will go into 3-4 specific use cases where Cassandra stands out from the rest of the data stores and is used at Netflix to bring a great viewing experience to customers globally. Roopa will go into the specifics of the data models being used, covering where Cassandra's strengths stand out and where the team learned lessons the hard way. Roopa will then share some of the best practices and the self-service platform used for Cassandra to cater to developer needs.

Taking advantage of the open-source to sidestep SizeTierCompactionStrategy limitations

Alexander Filipchik - Sony

Cassandra currently provides 4 perfect strategies that are designed to handle any possible use case. SizeTier, Leveled, TimeWindow and DateTiered were created a long time ago and their code is pretty much static. But have you ever wondered whether we can do better? Have you ever run a cluster with SizeTier compaction, constantly watching disk space to make sure it stays well below 50%? Maybe even asking yourself, why can't I use the other 50%? If you did, you might be interested in the story of running a cluster that is mission-critical for Sony PlayStation, is deployed across multiple regions, holds more than 100 TB of data and was a constant headache for our Ops team. I will go through the design of the system and the issues we had, and will dive into the fixes that I made to SizeTierCompactionStrategy to mitigate them. It will soon celebrate 2 years in production, and today we have nodes living happy lives with disks 80% full.

Relational model in Cassandra: Will it fit?

Matija Gobec - SmartCat

In the process of building a highly nested UDT structure, Matija encountered some peculiar use cases, which are the basis of this presentation. Whether it's using nested UDTs, JSON in a text field or even a blob of serialized data, all of the approaches to building a relational model have their pros and cons. Matija will discuss when the relational model should be used, and when to stick to a denormalized data model. He'll also explain how writing a highly nested structure from Spark can be a painful experience and what he learned from it.

Need for speed: Boosting Apache Cassandra's performance using Netty

Dinesh A. Joshi

Apache Cassandra 4.0 has several enhancements. One of the biggest is the switch from blocking network IO using JDK sockets to non-blocking IO with Netty. As a result, Cassandra has seen gains in performance and efficiency. These gains translate into real-world cost savings and allow Cassandra to scale better. This presentation will take you on a tour of the improvements to Cassandra's network layer (old and new) and help quantify the gains in real-world terms.
By the end of the talk, the audience will have learned about the motivations behind changing Cassandra's internode and streaming communication to Netty. They will also learn how these changes significantly affect scalability, and about recent enhancements such as zero-copy streaming over Netty, which makes scaling and rebuilding clusters faster than ever before.

Cassandra and ScyllaDB at Yahoo! Japan

Shogo Hoshii - Yahoo! Japan
Murukesh Mohanan - Yahoo! Japan

Yahoo! JAPAN is one of the most successful internet service companies in Japan, unlike its counterparts in other countries. Shogo will introduce his company's business and its scale. He will then discuss the number of Cassandra nodes and clusters, and what kind of data is stored in them. Finally, he will describe some of the incidents that happened with the company's Cassandra deployment, and how the team handled them.
Murukesh Mohanan will then discuss the Yahoo! JAPAN NoSQL Team's evaluation of ScyllaDB as a successor to Cassandra under exceedingly heavy traffic.

Distributed Secondary Indexes in Manhattan @ Twitter's scale

Karthik Katooru - Twitter

Manhattan is Twitter's distributed, real-time key-value database that's behind some critical use cases, including Tweets and Direct Messages. This talk will cover how we implemented scalable, distributed secondary indexes in Manhattan. Over the years, some use cases have evolved to require capabilities beyond the data access APIs supported by Manhattan's data model. Some applications started adding workarounds due to the lack of native secondary indexes. We have built two types of secondary indexes, Global and Local, to address different access patterns. Secondary indexes in Manhattan are scalable: the indexes are distributed across the cluster and can be retrieved with similar time efficiency to reading base data. The talk will mainly focus on the architecture of secondary indexing in Manhattan and on how we made the consistency level and the space and time complexity highly tunable for users.
Intended audience: Anyone with experience building distributed databases, or users interested in the internals of database architecture.
Technical skills and concepts required: Introductory knowledge of the architecture of distributed databases like Cassandra. Optional: an understanding of consistency levels and partial failures in distributed systems.

Securing Cassandra with Vault

Carlos Rolo - Pythian

Security is one of the crucial things to take into account today. Vault is a tool made to manage credentials and audit access to them. In this presentation, Carlos will explain how to set up Vault to be resilient by using Cassandra as storage, and how to use Vault to control Cassandra credentials and access.

Operational Excellence with Apache Cassandra in Enterprises

Hayato Shimizu - digitalis.io

Operational management and performance tuning of Apache Cassandra clusters in enterprises require discipline across a number of areas in order to satisfy the business requirements and keep the Cassandra promise of always on, never down. As with any type of enterprise system, keeping the lights on with Cassandra in production environments requires an extensive set of monitoring and alerting. "Are the query response times within the defined SLA? What were the characteristics of the cluster that led to a catastrophic event? What is the fastest path to recovery?" And so on.
Effective monitoring of Cassandra comprises 3 distinct types of information being actively captured from running systems: service availability, performance metrics, and events/logs. We started by using many open source tools to capture these and created dashboards and alerting, largely solving the problem of being able to identify Cassandra issues quickly.
Another aspect of keeping enterprise Cassandra clusters running smoothly is to implement a scheduler to perform regular repairs and backups, which itself needs to be reliable. All of the above require many server components, various dashboards, and several agents to be deployed on Cassandra servers, creating messy firewall configs and change approvals, and leaving a large number of components to maintain, test, patch, and upgrade. In this talk we will discuss how we are solving these problems to successfully manage Cassandra for our customers.

Performance at Scale on Cassandra

John Sumsion - FamilySearch

John works for FamilySearch on the worldwide Family Tree project that runs on Cassandra. He's an alumnus speaker of both the Cassandra Summit and Apache Big Data conferences. John will be sharing what they've learned in developing the Family Tree data model, focusing on what it took to overcome the problems encountered as their traffic increased more than 5x. John's talk will cover:
- Description of Family Tree data model
- How writes are denormalized
- How reads are optimized
- How we avoided lots of tombstones [new]
- How we avoided reading partial/failed writes [new]
- Tuning for better read performance [new]
- Cluster scale warning signs [new]

Cassandra data ingestion with CDC

Jay Zhuang - Uber

Jay has spoken previously about general Cassandra usage and management at Uber. For this presentation, Jay will describe the new CDC design, and how Uber integrated it with Kafka for data ingestion and Redis for cache invalidation. The talk will cover the challenges of data ingestion from Cassandra, the current CDC design and its limitations, and CDCv2 with a plugin interface. Jay will wrap up with the pros and cons, benchmark, and stress-test results. This will be a great talk for Cassandra users who need a data ingestion pipeline.

Running Apache Cassandra on Docker

Jim Hatcher - DataStax

Containers are a lightweight alternative to Virtual Machines for isolating applications within a single host machine. They are widely used in the web and application tiers of systems, but are they really appropriate for the stateful data tier? In this talk, we will explore that subject and walk through how to get Cassandra running using Docker.
Intended audience: Technical Architects
Technical skills and concepts required: Cassandra, Virtualization, Containers, Docker

Deploying Cassandra in Multi-cloud

Jeff Carpenter - DataStax

One of the recent trends in cloud computing is the tendency for enterprises to use multiple clouds, either to take advantage of unique offerings per cloud vendor or to hedge their bets. Creating a data tier that spans multiple clouds is a key consideration in these architectures. Apache Cassandra is an ideal technology for the multi-cloud data tier due to its flexibility and resilience. In this talk we’ll discuss the challenges you’ll face in developing the data tier of multi-cloud applications, provide recommendations on how to address these challenges with Cassandra, and show how we created a sample application to demonstrate these recommendations in practice.
Intended audience: developers, operators, architects
Skill level: some experience with deploying/configuring Cassandra will be useful

Getting the most out of Cassandra

Jeff Carpenter - DataStax

If you’re just getting started on your journey with Apache Cassandra, this talk is for you. We’ll talk about when (and when not) to use Cassandra and share some resources that can help you get up to speed quickly on data modeling and application development. We’ll look at how to avoid some common pitfalls, and talk about how to approach Cassandra operations. Finally, we’ll discuss some tangible ways you can give back to the community.
Intended audience: developers, operators, architects
Skill level: beginner with SW dev background