CS 294-91 Distributed Computing

Winter 2013

UC Berkeley

This course provides basic theoretical and practical foundations of distributed systems. Students learn about system models, safety and liveness of protocols, different failure models, reliable group communication abstractions, and more. It utilizes a textbook and additional research paper-based lectures.

17 covered concepts

Slides / notes available

No videos available

Assignments available

Other resources available

Course Page

Overview

In the past decade evermore applications and services, which previously were running on local PCs, have moved to the Internet, in data centers, accessible through the Web. This puts distributed systems at the center of many of s application architectures. Distributed systems (or distributed computing) concerns systems in which many nodes (machines) solve a common problem, using message passing over a network that connects those nodes. The aim of this course is to establish familiarity with the basic theoretical and practical foundations of distributed systems.

Distributed computing is challenging due to two fundamental problems: (i) partial-failures, and (ii) asynchrony. Partial failures means that parts of the system (network or machines) can be faulty, but it is desirable for the rest of the system to function correctly. Asynchrony is due to the variance in the time it takes to send messages between computers and the operating speed of different computers. It is therefore desirable to make the system function correctly while events are happening asynchronously.

Over the years, many recurring problems have been studied with respect to the two aforementioned challenges. Furthermore, many abstractions have been proposed that simplify dealing with these two challenges when building distributed systems. In this course we will study many of these problems and abstractions, including the following: today

Models of distributed systems
Safety and liveness of distributed protocols
Different failure models for distributed systems (fail-stop, fail-noisy, Byzantine)
Reliable group communication abstractions (reliable, atomic, etc)
Shared memory and consistency models (linearizable, regular etc)
Failure detectors and their relationship and implementation in real systems
Impossibility of Consensus
Consensus and Paxos
Replicated State Machines and Reconfiguration
Byzantine Fault Tolerance

Prerequisites

No data.

Learning objectives

No data.

Textbooks and other notes

We will loosely follow the following textbook, but also have additional lectures based on research papers:

Introduction to Reliable and Secure Distributed Programming, C. Cachin, R. Guerraoui, L. Rodrigues, Springer, 2011.

Other courses in Parallelism and Distributed Systems

Courseware availability

Lecture slides available at Lectures

No videos available

Homework available at Homework

Resources available at Resources

Covered concepts

Byzantine Fault Tolerance Causal Broadcast Consensus Consensus Algorithms Events and Links Failure Detectors Fast Consensus Formal Models Impossibility of Consensus Non-blocking Atomic Commit Paxos Probabilistic Consensus Reliable Broadcast Primitives Replicated State Machines and Reconfiguration Safety and liveness of distributed protocols Shared Memory/Data Consistency Terminating Reliable Broadcast (TRB)

About Feedback

Discord

CS 294-91 Distributed Computing

Overview

Prerequisites

Learning objectives

Textbooks and other notes

Other courses in Parallelism and Distributed Systems

CS 267: Applications of Parallel Computers

CS 256 Formal Methods for Reactive Systems

CS 149 PARALLEL COMPUTING

CSE 452 Distributed Systems

Courseware availability

Covered concepts