Locking in WebKit

Back in August 2015, we replaced all spinlocks and OS-provided mutexes in WebKit with the new WTF::Lock (WTF stands for Web Template Framework). We also replaced all OS-provided condition variables with WTF::Condition . These new primitives have some cool properties:

  1. WTF::Lock and WTF::Condition only require one byte of storage each. WTF::Lock only needs two bits in that byte. The small size encourages using huge numbers of very fine-grained locks. OS mutexes often require 64 bytes or more. The small size of WTF::Lock means that there’s rarely an excuse for not having one, or even multiple, fine-grained locks in any object that has things that need to be synchronized.
  2. WTF::Lock is super fast in the case that matters most: uncontended lock acquisition. Parallel algorithms tend to avoid contention by having many fine-grained locks. This means that a mature parallel algorithm will have many uncontended lock acquisitions – that is, calls to lock() when the lock is not held, and calls to unlock() when nobody is waiting in line. Similarly, WTF::Condition optimizes for the common case of calling notify when no threads are waiting.
  3. WTF::Lock is fast under microcontention. A microcontended lock is one that is contended and the critical section is short. This means that shortly after any failing lock attempt, the lock will become available again since no thread will hold the lock for long. This is the most common kind of contention in parallel code, since it’s common to go to great pains to do very little work while holding a lock.
  4. WTF::Lock doesn’t waste CPU cycles when a lock is held for a long time. WTF::Lock is adaptive : it changes its strategy for how to wait for the lock to become available based on how long it has been trying. If the lock doesn’t become available promptly, WTF::Lock will suspend the calling thread until the lock becomes available.

Compared to OS-provided locks like pthread_mutex , WTF::Lock is 64 times smaller and up to 180 times faster. Compared to OS-provided condition variables like pthread_cond , WTF::Condition is 64 times smaller. Using WTF::Lock instead of pthread_mutex means that WebKit is 10% faster on JetStream , 5% faster on Speedometer , and 5% faster on our page loading test.

Making WTF::Lock and WTF::Condition fit in one byte is not easy and the technique behind this has multiple layers. Lock and Condition offload all thread queueing and suspending functionality to WTF::ParkingLot , which manages thread queues keyed by the addresses of locks. The design of ParkingLot is inspired by futexes . ParkingLot is a portable user-level implementation of the core futex API. ParkingLot also needs fast locks, so we built it its own lock called WTF::WordLock .

This post starts by describing some background on locking. Then we describe the ParkingLot abstraction and how we use this to build WTF::Lock and WTF::Condition . This section also shows some alternate locking algorithms on top of ParkingLot . Then we describe how ParkingLot and WordLock are implemented, hopefully in enough detail to allow for meaningful scrutiny. The post concludes with some performance data, including comparisons to a bunch of lock algorithms.

Background

This section describes some background about locks. This includes the atomic operations that we use to implement locks as well as some classic algorithms like spinlocks and adaptive locks. This section also tries to give appropriate shout-outs to other lock implementations.

Atomic Operations

CPUs and programming languages provide few guarantees about the way that memory accesses interact with each other when executed concurrently, since the expectation is that programmers will prevent concurrent accesses to the same memory by guarding them with locks. But if you’re like me and you like to implement your own locks, you’ll want some lower-level primitives that do have strong guarantees.

C++ provides std::atomic for this purpose. You can use it to wrap a primitive type (like char , int , or a pointer type) with some atomicity guarantees. Provided you stick to the strongest memory ordering ( seq_cst ), you can be sure that the concurrent executions of std::atomic operations will behave as if they were executed sequentially. This makes it possible to consider whether an algorithm is sound even when run concurrently by envisioning all of the possible ways that its various atomic operations could interleave.

In WebKit, we use our own wrapper called WTF::Atomic . It’s a stripped-down version of what std::atomic provides. We’ll just consider three of its atomic methods: T load() , store(T) , and bool compareExchangeWeak(T expected, T desired) .

Load and store are self-explanatory. The interesting one is compareExchangeWeak , which implements the atomic CAS (compare-and-swap) operation. This is a CPU primitive that can be thought of as running the following pseudocode atomically:

bool CAS(T* pointer, T expected, T desired)
{
    if (*pointer != expected)
        return false;
    *pointer = desired;
    return true;
}

This method is implemented in terms of std::atomic::compare_exchange_weak , which is implemented in terms of the lock; cmpxchg instruction on x86 or in terms of ldrex and strex on ARM. This form of CAS is called weak because we allow it to spuriously do nothing and return false. The opposite is not true – if the CAS returns true, then it must be that during its atomic execution, it saw *pointer being equal to expected and then it changed it to desired .

Spinlocks

Armed with WTF::Atomic , we can implement the simplest kind of lock, called a spinlock . Here’s what it looks like:

class Spinlock {
public:
    Spinlock()
    {
        m_isLocked.store(false);
    }

    void lock()
    {
        while (!m_isLocked.compareExchangeWeak(false, true)) { }
    }

    void unlock()
    {
        m_isLocked.store(false);
    }

private:
    Atomic<bool> m_isLocked;
};

This works because the only way that lock() can succeed is if compareExchangeWeak returns true. If it returns true, then it must be that this thread observed m_isLocked being false and then instantly flipped it to true before any other thread could also see that it had been false. Therefore, no other thread could be holding a lock. Either no other thread is calling lock() at all, or their calls to lock() are still in the while loop because compareExchangeWeak is continuing to return false.

Adaptive Locks

Adaptive locks spin only for a little bit and then suspend the current thread until some other thread calls unlock() . This guarantees that if a thread has to wait for a lock for a long time, it will do so quietly. This is a desirable guarantee for conserving CPU time and power.

By contrast, spinlocks can only handle contention by spinning. The simplest spinlocks will make contending threads appear to be very busy and so the OS will make sure to schedule them. This will waste tons of power and starve other threads that may really be able to do useful work. A simple solution would be to make the spinlock sleep – maybe for a millisecond – in between CAS attempts. This turns out to hurt performance on real code because it postpones progress when the lock becomes available during the sleep interval. Also, sleeping doesn’t completely solve the problem of inefficiency when spinning.

WebKit has some locks that may be held for a long time. For example, the lock used to control the interleaving between compiler threads and the garbage collector is usually uncontended but may sometimes be held, and contended, for the entire time that it takes to collect the JS heap or the entire time that it takes to compile some function. In the most extreme case, a lock may protect blocking I/O. That could happen unexpectedly, for example if a page fault on an innocent-looking load leads to swapping. Some critical sections can take a while, and we don’t want contending threads to poll during that time, even at only 1 kHz.

There’s no good way to make spinning efficient. If we increase the delay between CAS attempts then we’re just increasing the delay between when the lock gets unlocked and when a contending thread can get it. If we decrease the delay, the lock becomes less efficient. We want a locking algorithm that ensures that if spinning doesn’t quickly give us the lock, our thread will quietly wait until exactly the moment when the lock is released. This is what adaptive locks try to do.

The kinds of adaptive locks that we will implement can be split into two separate data structures:

  1. Some small, atomic field in memory that summarizes the lock’s state. It can answer questions like, “does anyone hold the lock?” and “is anyone waiting to get the lock?” We will call this the atomic state .
  2. A queue of threads that are waiting to get the lock, and a mechanism for suspending and resuming those threads. The queue must have its own synchronization primitives and some way to be kept in sync with the atomic state. We say parking to mean enqueuing a thread and suspending it. We say unparking to mean dequeuing a thread and resuming it.

WTF::Lock is WebKit’s implementation of an adaptive lock, optimized for the things that we care about most: low space usage and a great fast path for uncontended lock acquisition. The lock object contains only the atomic state, while the queue is created on-demand inside the ParkingLot . This allows our locks to only require two bits. Like other adaptive locks, WTF::Lock provides a guarantee that if a thread has to wait a long time for a lock, it will do so quietly.

Related Work

If you know that you need an adaptive lock, you can be sure that the mutex implementation on modern OSes will safely adapt to contention and avoid spinning. Unfortunately, most of those OS mutexes will be slower and bigger than a spinlock because of artificial constraints that arise out of compatibility (a pthread_mutex_t is 64 bytes because of binary compatibility with programs that were compiled against ancient implementations that had to be 64 bytes) or the need to support features that you may not need (like recursive locking – even if you don’t use it, pthread_mutex_lock() may have to at least do a check to see if you asked for it).

WebKit’s locking infrastructure is most inspired by Linux’s excellent futex primitive. Futexes empower developers to write their own adaptive locks in just a few lines of code ( Franke, Russell, and Kirkwood ’02 ). Like with futexes, we materialize most of the data for a lock only when that lock experiences contention, and we locate that data by hashing the address of the lock. Unlike futexes, our implementation does not rely on kernel support and so it will work on any OS. The ParkingLot API has some functionality that futexes lack, like invoking callbacks while holding internal ParkingLot locks.

The idea of using very few bits per adaptive lock is widespread, especially in Java virtual machines. For example, HotSpot usually only needs two or three bits for the state of an object’s lock . I’ve co-authored a paper on locking in another Java VM , which also compressed an adaptive lock into a handful of bits. We can trace some of the ideas about how to build locks that are small and fast to meta-locks ( Agesen et al ’99 ) and tasuki locks ( Onodera and Kawachiya ’99 ).

New proposals like std::synchronic and hardware transactional memory also seek to speed up locking. We will show that these techniques don’t exhibit the performance qualities that we want for WebKit.

Despite the basic techniques being well understood in certain communities, it’s hard to find a lock implementation for C++ that has the qualities we want. Spinlocks are widely available, and those are often optimized for using little memory and having great fast paths for uncontended lock acquisition and microcontention. But spinlocks will waste CPU time when the lock isn’t immediately available. OS mutexes know how to suspend threads if the lock is not available, so they are more efficient under contention – but they usually have a slower uncontended fast path, they don’t necessarily have the behavior we want under microcontention, and they require more memory.

C++ provides access to OS mutexes with std::mutex . Prior to WTF::Lock , WebKit had a mix of spinlocks and OS mutexes, and we would try to pick which one to use based on guesses about whether they would benefit more from uncontended speed or efficiency under contention. If we needed both of those qualities, we would have no choice but to punt on one of them. For example, we had a spinlock in CodeBlock that should have been adaptive because it protected long critical sections, and we had an OS mutex in our parallel GC that accounted for 3% of our time in the Octane Splay benchmark because of shortcomings in fast path performance. These issues are resolved thanks to WTF::Lock .

Also, there was no way to have a small lock (1 byte or less) that was also efficient under contention, since OS mutexes tend to be large. In the most extreme case we will have one lock per JavaScript object, so we care about the sizes of our locks.

Building Locks With ParkingLot

WTF::ParkingLot is a framework for building adaptive locks and other synchronization primitives. Both WTF::Lock and WTF::Condition use it for parking threads until the lock becomes available or the condition is notified. ParkingLot also gives us everything we’ll need to implement the synchronization schemes that are coming to Web standards like SharedArrayBuffer .

Adaptive locks need to be able to park and unpark threads. We believe that synchronization primitives shouldn’t have to maintain their own parking queues; instead, a single global data structure should provide a way to access queues by using the address of the lock’s atomic state as a key. This concurrent hashtable of queues is called WTF::ParkingLot . Since each thread can be queued only once at any given time, ParkingLot ‘s memory usage is bounded by the number of threads. This means that locks don’t have to pay the price in space for a queue. This makes a lot of sense for WebKit, which usually runs a small number of threads (about ten on my system) but can easily allocate millions of locks (in the worst case, one per JavaScript object).

ParkingLot takes care of queueing and thread suspension so that lock algorithms can focus on other things, like how long to spin for, what kind of delays to introduce into spinning, and which threads to favor when unlocking.

ParkingLot API

Parking refers to suspending the thread while simultaneously enqueuing it on a queue keyed by some address. Unparking refers to dequeuing a thread from a queue keyed by some address and resuming it. This kind of API must have a mechanism for resolving the suspend-resume race, where if a resume operation happens moments before the suspend, then the thread will suspend anyway. ParkingLot resolves this by exposing the fact that the queues are protected by locks. Parking invokes a client callback while the queue lock is held, and gives the client a chance to decide whether they want to proceed or not. Unparking invokes a client callback while the queue lock is held, and tells the client if a thread was dequeued and if there are any more threads left on the queue. The client can rely on this additional synchronization to ensure that racy deadlocks don’t happen.

The basic API of ParkingLot comprises parkConditionally , unparkOne , and unparkAll .

bool parkConditionally(address, validation, beforeSleep, timeout) . This takes the const void* address and uses it as a key to find, and lock, that address’s queue. Calls the bool validation() callback (usually a C++ lambda ) while the lock is held. If the validation returns false, the queue is unlocked and parkConditionally() returns false.

If the validation returns true, the current thread is placed on the queue and the queue lock is released. Once the queue lock is released, this calls the void beforeSleep() callback. This turns out to be useful for some synchronization primitives, but most locks will pass an empty thunk. At this point, the thread is suspended until some call to unparkOne() dequeues this thread and resumes it. The client can supply a timeout using a ParkingLot::Clock::time_point ( ParkingLot::Clock is a typedef for std::chrono::steady_clock ). The thread will not stay suspended past that time point.

void unparkOne(address, callback) . This takes a const void* address and uses it as a key to find, and lock, that address’s queue. Then unparkOne tries to dequeue one thread. Once it does this, it calls the void callback(UnparkResult) , passing a struct that reports whether a thread was dequeued and whether the queue is now empty. Then it unlocks the queue lock. If it dequeued a thread, it signals that thread now.

void unparkAll(address) . Unparks all threads on the queue associated with the given address.

This API gives us everything we need to implement the locking primitives we need: WTF::Lock and WTF::Condition . It also allows us to build userland versions of FUTEX_WAIT / FUTEX_WAKE operations, which are required by SharedArrayBuffer atomics.

WTF::Lock

We have many locks, including locks inside very frequently allocated objects like JavaScriptCore’s Structure . We want a lock object that takes as little memory as possible, so we go to great lengths to make the lock fit in one byte. We do many such tricks in Structure since there is so much pressure to make that object small. We also want the core algorithm to leave open the possibility of having the lock embedded in bitfields, though Lock doesn’t support this because C++ requires objects to be at least one byte . As this section will show, ParkingLot makes it so easy to implement fast locking algorithms that if clients did need to embed a lock into a bitfield, it would be reasonable for them to have their own implementation of this algorithm.

Our goals are to have a lock that:

  1. Uses as few bits as possible.
  2. Requires only a CAS on the fast path for locking and unlocking to maximize uncontended throughput.
  3. Is adaptive.
  4. Maximizes throughput under contention.

Making the fast path require only a CAS means that WTF::Lock ‘s atomic state must be able to tell us if there are any threads parked. Otherwise, the unlock() function would have to always call ParkingLot::unparkOne() in case there were threads parked. While such an implementation would be functional, it would be far from optimal. After all, ParkingLot::unparkOne() is obligated to do hashing, acquire some queue lock, and call a callback. This is a lot more work than we want in the common path of unlock() .

This implies having two bits for the atomic state:

  • isLockedBit to indicate if the lock is locked.
  • hasParkedBit to indicate if there may be threads parked.

Locking is allowed to proceed any time the isLockedBit is clear even if the hasParkedBit is set. This property is called barging . We will dive into
