Computer Science
>
>

CS 294-91 Distributed Computing

Winter 2013

UC Berkeley

This course provides basic theoretical and practical foundations of distributed systems. Students learn about system models, safety and liveness of protocols, different failure models, reliable group communication abstractions, and more. It utilizes a textbook and additional research paper-based lectures.

Course Page

Overview

In the past decade evermore applications and services, which previously were running on local PCs, have moved to the Internet, in data centers, accessible through the Web. This puts distributed systems at the center of many of s application architectures. Distributed systems (or distributed computing) concerns systems in which many nodes (machines) solve a common problem, using message passing over a network that connects those nodes. The aim of this course is to establish familiarity with the basic theoretical and practical foundations of distributed systems.

Distributed computing is challenging due to two fundamental problems: (i) partial-failures, and (ii) asynchrony. Partial failures means that parts of the system (network or machines) can be faulty, but it is desirable for the rest of the system to function correctly. Asynchrony is due to the variance in the time it takes to send messages between computers and the operating speed of different computers. It is therefore desirable to make the system function correctly while events are happening asynchronously.

Over the years, many recurring problems have been studied with respect to the two aforementioned challenges. Furthermore, many abstractions have been proposed that simplify dealing with these two challenges when building distributed systems. In this course we will study many of these problems and abstractions, including the following: today

  • Models of distributed systems
  • Safety and liveness of distributed protocols
  • Different failure models for distributed systems (fail-stop, fail-noisy, Byzantine)
  • Reliable group communication abstractions (reliable, atomic, etc)
  • Shared memory and consistency models (linearizable, regular etc)
  • Failure detectors and their relationship and implementation in real systems
  • Impossibility of Consensus
  • Consensus and Paxos
  • Replicated State Machines and Reconfiguration
  • Byzantine Fault Tolerance

Prerequisites

No data.

Learning objectives

No data.

Textbooks and other notes

We will loosely follow the following textbook, but also have additional lectures based on research papers:

Introduction to Reliable and Secure Distributed Programming, C. Cachin, R. Guerraoui, L. Rodrigues, Springer, 2011.

Other courses in Parallelism and Distributed Systems

CS 256 Formal Methods for Reactive Systems

Winter 2023

Stanford University

CS 149 PARALLEL COMPUTING

Fall 2022

Stanford University

CSE 452 Distributed Systems

Winter 2022

University of Washington

Courseware availability

Lecture slides available at Lectures

No videos available

Homework available at Homework

Resources available at Resources

Covered concepts