Introduction
Two popular camps for designing concurrent programs:
- Threads with locks
- Top-level select loops with events
Should we use threads or events?
- Andrew Birrell: threads.
- John Ousterhout: events.
- John DeTreville: no.
- Rob Pike: yes.
Bell Labs approach: threads and events.
- Squeak, Newsqueak, Alef, Limbo
Threads vs. Events
Drawbacks of threads (Ousterhout):
- synchronization, deadlock, hard to debug
- library layering, callbacks complicated by locks
- good performance is hard, threads not well supported
- “often don't want concurrency anyway”
Drawbacks of events (Ousterhout):
- long-running handlers make apps non-responsive
- can't maintain local state across events
- no CPU concurrency
- event-driven I/O not always well supported
Actually drawbacks of locks and top-level select loops!
The “Plan 9” Way: Threads and Events
Model:
- Each thread maintains private, local state.
- Threads interact by communication (messages, aka events).
- No mutable global variables.
- Thread creation, context switches, and message sends are very cheap.
- Shared data sits behind protocol-serving threads.
Benefits:
- no locking problems
- can maintain local state across events
- can use CPU concurrency
- long-running handlers don't stop other threads
- don't need event-driven I/O support
Talk content
At the end, I hope you understand:
- why “threads vs. events” is a nonsensical question
- how to think about programming threads using events
- how to use threads with events to build simple programs
- some of the flavor of Newsqueak, Alef, and Limbo programs
- some of the history behind the Bell Labs approach to concurrency
For more information:
- lots of links throughout talk (in red)
- talk is online; click on the links
- http://swtch.com/~rsc/talks/threads07/
Outline
Introduction and motivation
Primitives
Examples
- venti indexer
- publish/subscribe
Comparison with locking, select loops
History and related topics
Cooperative (non-preemptive) scheduling
Primitives
Pseudocode in this talk
- Essential syntax common to Newsqueak, Alef, Limbo
- Translation to C and libthread at end
Process creation
- proc f(x, y, z)
- Creates new proc (thread) executing f(x, y, z).
Channel: the communication abstraction
- chan of int
- Unidirectional channel for sending and receiving ints.
- c = chan of int
- Allocate unbuffered channel
- c = chan[5] of int
- Allocate buffered channel (queue)
Primitives - send and receive
Send
- c <-= x;
- Send the value of x on c.
Receive
- x = <-c;
- Receive a value from c and assign it to x.
Semantics:
- Guaranteed in-order delivery.
- Receive blocks until there is data to receive.
- Send on unbuffered channel blocks until a receiver comes along.
- Send on buffered channel blocks until there is room in the buffer.
- Multiple readers, multiple writers okay.
Primitives - alt
Alt (aka select)
alt {
x = <-c1:
print("received %d from c1\n", x);
c2 <-= y:
print("sent %d to c2\n", y);
}
- Blocks until at least one of the communications can proceed.
- Executes one of the possible communications, followed by its code block.
- Advanced form: cond && c <-= y is eligible only when cond is true.
Warm-up: concurrent prime sieve (McIlroy)
Can generate primes using a pipeline of procs.
p = receive from left neighbor
print p
loop:
x = receive from left neighbor
if (p does not divide x)
send x to right neighbor
Generating process sends 2, 3, 4, 5, 6, ... to leftmost proc.
Warm-up: code
Implementation in our pseudo-language.
filter(c: chan of int)
{
p = <-c;
print("%d\n", p);
c1 = chan of int;
proc filter(c1);
while(x = <-c)
if(x%p)
c1 <-= x;
}
c = chan of int;
proc filter(c);
for(i=2;; i++)
c <-= i;
Not representative of systems uses; hopefully mind-expanding.
Example: venti indexer
Background
- Venti stores data across many arena disks and index disks.
- Indexer's job is to read each block on the arena disks and store index entries on the index disks.
- Assignment of arena block to index disk is (pseudo)random.
For performance, want to keep all the disks busy.
- Read from all arena disks simultaneously.
- Write to all index disks simultaneously.
With locks?
- one reader per arena, using locks to synchronize index writes?
- ???
Example: venti indexer
Channel-based solution:
- One thread per arena disk, reading.
- One thread per index disk, accumulating index writes and writing.
- One channel per index disk.
- Arena thread writes each entry to the appropriate index disk's channel.
- Index thread reads its disk's channel.
Arena reader:
arenadiskproc(d: ArenaDisk)
{
for(each block b in d){
xd = indexdisk(b);
xd.c <-= indexentry(b);
}
}
Example: venti indexer (source code)
Arena reader:
arenadiskproc(d: ArenaDisk)
{
for(each block b in d){
xd = indexdisk(b);
xd.c <-= indexentry(b);
}
}
Index writer:
indexdiskproc(d: IndexDisk)
{
nentries = 0;
while(ie = <-d.c){
entries[nentries++] = ie;
if(nentries == nelem(entries)){
bufwrite(entries, nentries);
nentries = 0;
}
}
bufwrite(entries, nentries);
}
Example: publish/subscribe
Pub/sub server publishes pair of channels, for publish and subscribe.
typedef message = (string, string); /* (topic, body) */
struct Server
{
csub: chan of (string, chan of message);
cpub: chan of message;
};
Subscriber registers a channel with its topic of interest.
c = chan of message;
server.csub <-= ("slashdot", c);
for(;;){
(t, s) = <-c;
print("slashdot: %s\n", s);
}
Publisher sends messages on server.cpub.
server.cpub <-= ("slashdot", "l337 hax0r dudes");
Example: pub/sub server
Event loop handling just the two server channels.
subscribers = list of (string, chan of message);
for(;;)
alt {
(topic, c) = <-csub:
append(subscribers, (topic, c));
(topic, body) = <-cpub:
for((t, c) in subscribers)
if(t == topic)
c <-= (topic, body);
}
Problem: publishing can block if client stops reading.
Example: pub/sub server
Introduce new per-client thread to take care of queueing messages.
buffer(dst: chan of message): chan of message
{
c = chan of message;
proc bufferproc(c, dst);
return c;
}
bufferproc(src: chan of message, dst: chan of message)
{
messages = list of string;
for(;;)
alt {
msg = <-src:
append(messages, msg);
len(messages) > 0 && dst <-= messages[0]:
pop(messages);
}
}
Server loop barely changes, because interface is the same.
(topic, c) = <-csub:
append(subscribers, (topic, buffer(c)));
Example: pub/sub server proxy
proxy(remote: Server): Server
{
new_csub = chan of (string, chan of message);
local = Server(new_csub, remote.cpub);
proc proxyproc(remote, local);
return local;
}
proxyproc(remote: Server, local: Server)
{
topics = list of string;
subscribers = list of (string, chan of message);
from_remote = chan of message;
for(;;)
alt {
(topic, c) = <-local.csub:
append(subscribers, (topic, c));
if(topic not in topics){
append(topics, topic);
remote.csub <-= (topic, from_remote);
}
(topic, body) = <-from_remote:
for((t, c) in subscribers)
if(t == topic)
c <-= (topic, body);
}
}
Pub/sub lessons
Communication abstraction makes both sides simple.
- Have server loop and client loop.
- Thread & lock solutions much more complicated.
- Subscription registers a callback?
- If one thread publishes, what thread do the callbacks run in?
- If subscription has a get() method, how does that block and synchronize with publishers?
Channel (or set of channels) is an interface.
- Defines exactly how to interact with a given resource.
- Easy to use filters that preserve that interface.
- Rob Pike built a window system as a channel filter on top of the usual hardware screen interface.
Outline
Introduction and motivation
Primitives
Examples
- venti indexer
- publish/subscribe
Comparison with locking, select loops
History and related topics
Cooperative (non-preemptive) scheduling
Comparison with Locks
Locks protect shared mutable state.
- Specifically, they protect invariants about the state.
- lock(x) says “I depend on the invariants protected by x” or “I'm about to break the invariants protected by x. Look away!”
- unlock(x) says “I'm done (and if I broke the invariants, I've fixed them).”
- Invariants always existed, just easier to break with multithreading.
- Locks unnecessary in single-threaded program.
Message-passing moves mutable state into a single-threaded loop.
- Invariants still exist, like in single-threaded code.
- But the server thread is itself single-threaded, easier to reason about.
- Can inspect individual server threads for correctness.
- No need to worry about other threads — they can't see into our state.
Comparison with event loops
Top-level event loops provide one single thread of execution.
- Single-threaded, so easy to reason about.
- Handlers might have their own constraints, simulating logical threads.
- Handlers can't maintain local state (stack variables).
- Handlers can't block easily (stack ripping).
Threaded message-passing provides many single threads of execution.
- (If one event loop is good, many are better!)
- Each is single-threaded and isolated, so easy to reason about.
- Protocol for interacting with other threads is specified.
- Each message loop can maintain local state, can block if needed.
History: Rumblings
Doug McIlroy, 1964
- “We should have some ways of coupling programs like garden hose--screw in another segment when it becomes necessary to massage data in another way. This is the way of IO also.”
Doug McIlroy, 1968
- “Coroutines: semantics in search of a syntax”
- Unpublished Bell Labs memorandum
- [McIlroy] “It was clearly a beautiful mental model, this idea that the output from one process would just feed in as input to another. There was syntactic difficulty in talking about that mental model, and I have a co-routine paper written in 1968 that was never, never printed, because it was always a little too ugly, struggling with syntax.”
History: Pipes
Thompson recalls:
- “Yeah,well, Doug had was for years and years, well it seemed like years, I don't know the actual span was probably one year, Doug had uh, and he talked to us continually about it, a notion of interconnecting computers in grids, and arrays, you know very complex, you know, and there were always problems in his proposals. That what you would type would be linear and what he wanted was three-dimensional...n-dimensional...I mean he wanted just topological connection of programs and to build programs with loops and and you know horrid things. I mean he had such grandiose ideas and we were just saying, you know, “God, it's worthless, the complexity you're generating just can't be fathomed. You don't sit down and you don't type these kind of connections together.” ”
History: Pipes
Thompson continued:
- “And [Doug] persisted with his the grandiose ideas where you get into Kirchoff's law problems, where you get into you know, what happens if you have a feedback loop and every program doubles the number of characters, you know, it reads one and writes two? You know, what happens to...it's got to go somewhere you know. And you get these synchronization just, I mean there's just no way to implement his ideas and we kept trying to pare him down and weed him down and get him down, you know, and get something useful and distill it. What was going on, what was needed, what was real ideas, what was the fantasy of his ...and we there were constant discussions all through this period, and it hit just one night, it just hit, and they went in instantly, I mean they are utterly trivial.”
History: Pipes
McIlroy recalls:
- “Over a period from 1970 til '72, I'd from time to time, say “How about making something like this?”, and I would put up another proposal, another proposal, another proposal. Then one day I came up with a syntax for the shell that went along with the piping and Ken said, “I'm gonna do it.” He was tired of hearing all this stuff....
- “That was absolutely a fabulous day, the next day....
- “[Ken] put pipes into Unix. He put this notation into the shell, all in one night.... Most of the programs up until that time couldn't take standard input, because, there wasn't the real need. They all had file arguments. grep had a file argument, cat had a file argument. Thompson saw that that wasn't going to fit into this scheme of things, and he went in and changed all those programs in the same night. I don't know how. And the next morning we had this orgy of “one liners.” Everybody had another one liner. Look at this, look at that.”
Side note: prime sieve in shell script
#!/bin/sh
filter()
{
	read p || exit
	echo $p
	while read x; do
		if [ `expr $x % $p` != 0 ]; then
			echo $x
		fi
	done | filter
}
seq 2 100 | filter
History: Communicating Sequential Processes (CSP)
Tony Hoare, 1978
Luca Cardelli and Rob Pike, 1985
Rob Pike, 1989
Result is like pipes but send typed messages instead of text.
- Implementation is more efficient (user space)
Related languages: Erlang
Erlang is all message-passing, completely functional.
- Developed at the telecom company Ericsson starting in the late 1980s.
- No way to create shared mutable data.
- Used in many telecommunications apps, with one thread per call!
- Starting to get traction for other network programming.
As in CSP, Erlang does not have first-class channels.
- Sends are targeted at proc ids.
- Receives can restrict acceptable messages with pattern matching.
Erlang message sends are asynchronous, unreliable, unordered!
- Perhaps a better model for real-world networks.
- Makes it easier to express real-world failures.
Related languages: Haskell
Haskell is a lazy functional programming language.
- On the surface, nothing to do with concurrency.
- Lazy computation ends up creating lots of parallel threads of control.
Concurrent prime sieve uses threads to implement lazy streams.
If language supports laziness directly, can write even cleaner programs.
Plan 9 thread library cheat sheet
#include <thread.h>
Channel *c;
c = chancreate(sizeof(int), 0); /* c = chan of int */
c = chancreate(sizeof(int), 5); /* c = chan[5] of int */
recv(c, &x); /* x = <-c */
x = recvp(c); /* receive a pointer value */
x = recvul(c); /* receive an unsigned long */
send(c, &x); /* c <-= x */
sendp(c, x); /* send a pointer value */
sendul(c, x); /* send an unsigned long */
Plan 9 thread library cheat sheet - alt
alt {
x = <-c1:
print("received %d from c1\n", x);
cond && c2 <-= y:
print("sent %d to c2\n", y);
}
C + Libthread:
Alt alts[3];
alts[0] = (Alt){c1, &x, CHANRCV};
alts[1] = (Alt){c2, &y, cond ? CHANSND : CHANNOP};
alts[2] = (Alt){nil, nil, CHANEND};
switch(alt(alts)){
case 0:
print("received %d from c1\n", x);
break;
case 1:
print("sent %d to c2\n", y);
break;
}
Plan 9 thread library cheat sheet - proc
proc f(x, y, z);
C + Libthread:
struct Args
{
int x;
int y;
int z;
};
Args *a = malloc(sizeof *a);
*a = (Args){ x, y, z };
proccreate(run_f, a, STACK);
void
run_f(void *v)
{
Args a = *(Args*)v;
free(v);
f(a.x, a.y, a.z);
}
Duality
Lauer and Needham, 1979:
Many operating system designs can be placed into one of two
very rough categories, depending upon how they implement and use
the notions of process and synchronization. One category, the “Message-oriented System,”
is characterized by a relatively small, static number of processes with
an explicit message system for communicating among them.
The other category, the “Procedure-oriented system,” is characterized by a large,
rapidly changing number of small processes and a process synchronization
mechanism based on shared data.
In this paper, it is demonstrated that these two categories are duals of
each other and that a system which is constructed according to one model
has a direct counterpart in the other. The principal conclusion is that neither model
is inherently preferable, and the main consideration for choosing between them
is the nature of the machine architecture upon which the system is being
built, not the application which the system will ultimately support.
Duality, continued
Lauer and Needham's terminology:
- Traditional threads and locks == “Procedure-oriented system”
- Threads with message-passing == “Message-oriented system”
- One big event loop with message-passing == ???
Libevent, libasync, etc. are not dual to threads and locks. They are crippled “Message-oriented systems”!
Alef, which provides locks and channels (and a lot of other things), is dual to itself!
Duality, continued.
If they are equivalent, why choose one over the other?
- “the main consideration for choosing between them is the nature of the machine architecture upon which the system is being built”
- Channels are a good model of the external world.
- As we get more and more parallelism, exposing communication may be a big architectural win.
Complexity issues
- Using channels acts as a complexity damper in a way that locks do not.
- Too easy to just throw in a few locks.
- A little more work to convert single-threaded code to channels.
If all you have is locks, use them to build multiple-reader, multiple-writer finite queues.
- (This is what Plan 9's libthread does too.)
Summary
The “threads vs. events” question is misphrased.
Locks and top-level event loops are both evil.
Threads with events (message passing) works very well.
- exposes interfaces, often composable ones
- can produce simpler, easier-to-understand programs
Plan 9 thread library provides necessary primitives.
Go forth and sin no more.
http://swtch.com/~rsc/talks/
Cooperative scheduling
Alef introduced cooperatively-scheduled tasks.
- Tasks are hard-wired to a particular proc.
- In a proc, only one task runs at a time.
- Must block explicitly (send, receive, alt).
- System calls block the entire proc, don't reschedule the task.
- Tasks within a proc can share data in tiny critical sections without locks (as long as the sections cannot block).
- Tasks in other procs still run independently.
Typical use: procs for I/O, with one main proc for all tasks.
Example: acme fsys proc
Acme's implementation relies heavily on tasks.
- Almost all tasks in main proc, sharing data, no locks.
- 9P protocol I/O (blocking system calls) must be in a separate proc.
- 9P service logic needs to run in the main proc.
Solution: a pool of execution server tasks running in the main proc:
Solution: a pool of execution server tasks running in the main proc:
Xfid.ctl(Xfid *x)
{
for(;;){
(*<-x->c)(x); /* f = <- x->c; f(x); */
bflush(); /* update screen */
cxfidfree <-= x; /* make x available for reuse */
}
}
- Actually many of these tasks, allocated on demand.
Example: acme fsys proc
Rob Pike, Acme: a User Interface for Programmers
- “Although this sequence may seem complicated, it is just a few lines of code and is in fact far simpler than the management of the I/O queues in 8½. The hard work of synchronization is done by the Alef run time system. Moreover, the code worked the first time, which cannot be said for the code in 8½.”
Cooperative scheduling: a warning
Cooperative scheduling is convenient, seductive.
- no need to worry about concurrency, locks
- can safely share data between tasks in a single proc
Also subtle, precarious.
- might accidentally call a function that blocks, breaking up what needs to be an atomic section
- system call I/O (blocks proc) is different from channel I/O (reschedules task)
- Plan 9 from User Space adds a “turn off task rescheduling in this proc” primitive to mimic system call I/O even when using channels.
Ultimately, perhaps not worth the trouble. But there it is.
- tempting to reintroduce locks anyway (p2psim)
- Dorward and Winterbottom dropped tasks when designing Limbo.