Go and Quasar: a comparison of style and performance on the Skynet benchmark

May 03, 2016

By Ron

A user recently made us aware of the Skynet benchmark, a microbenchmark for “extreme” multithreading (1M threads). We are generally wary of such microbenchmarks because they are often tailored to measure a specific strength of a particular platform, without taking into account how relevant that strength is for real applications. For example, a platform with a 1000x faster implementation of sqrt would be hard pressed to yield even a 0.01% improvement in performance when running real applications. With threads the situation is a bit different: when many threads are active (say, over 10K) processing transactions in short bursts, the kernel thread scheduling overhead might become onerous, and your application may then spend a significant portion of its time waiting for the kernel to schedule your code. Lightweight thread (AKA fiber) implementations, like those provided by Go, Erlang, and, on the JVM, Quasar (and Kilim), can reduce this overhead by two orders of magnitude. This may be the difference between your server application being able to handle 500 or 5000 requests per second (some benchmarks can be found here and here).
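The sqrt argument above is simple arithmetic: speeding up one part of a program can save at most the share of time that part took in the first place. The following Go sketch makes that concrete. The function name `timeSaved` and the time shares plugged into it (0.01%, 1% and 50%) are illustrative assumptions, not measurements from any real application.

```go
package main

import "fmt"

// timeSaved returns the fraction of total running time that is saved when a
// part accounting for `fraction` of the time becomes `speedup` times faster.
// The remaining cost of that part is fraction/speedup, so the saving is
// fraction - fraction/speedup.
func timeSaved(fraction, speedup float64) float64 {
	return fraction * (1 - 1/speedup)
}

func main() {
	// Hypothetical: sqrt takes 0.01% of the run time and becomes 1000x faster.
	fmt.Printf("sqrt, 0.01%% share, 1000x faster:   %.4f%% saved\n", 100*timeSaved(0.0001, 1000))
	// Hypothetical: scheduling takes 1% of the run time and becomes 1000x cheaper.
	fmt.Printf("scheduling, 1%% share, 1000x faster: %.4f%% saved\n", 100*timeSaved(0.01, 1000))
	// Hypothetical: scheduling takes 50% of the run time and becomes 100x cheaper.
	fmt.Printf("scheduling, 50%% share, 100x faster: %.4f%% saved\n", 100*timeSaved(0.5, 100))
}
```

Only the last case, where overhead dominates the run time, produces a saving large enough to change how many requests a server can handle.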

However, once the threading overhead is reasonably low, differences in a particular implementation matter less and less: if the overhead is, say, under 1% of the total time, reducing it by 1000x would result in an improvement of no more than 1%. Because the JVM does not yet have built-in fibers, Quasar is required to implement them in a way that adds more overhead than platforms with native implementations. This is why in a microbenchmark that tests scheduling overhead alone, a generally slow runtime like Erlang’s BEAM may outperform a very fast runtime like HotSpot, even though once there’s any actual workload, the JVM quickly makes up the difference and then some (and then a lot, really). To further confuse the picture, some classical scheduling benchmarks like the ring benchmark actually reward schedulers that are only good at single-core scheduling and penalize schedulers that are good at sharing load among many cores.
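For readers who have not seen it, a ring benchmark can be sketched in Go roughly as follows. This is a minimal illustration written for this comparison, not the code of any particular published benchmark; the function name `ring` and the sizes used in `main` are assumptions. Note that only one goroutine is ever runnable at a time, because there is a single token, so extra cores have nothing to do.

```go
package main

import "fmt"

// ring connects n goroutines (n >= 2) in a circle with unbuffered channels
// and passes a single token around until it has been received `hops` times
// after the first. Each goroutine increments the token before forwarding it.
// The function returns the final token value, which equals hops.
func ring(n, hops int) int {
	chans := make([]chan int, n)
	for i := range chans {
		chans[i] = make(chan int)
	}
	done := make(chan int)

	for i := 0; i < n; i++ {
		in, out := chans[i], chans[(i+1)%n]
		go func() {
			for token := range in {
				if token == hops {
					done <- token
					return
				}
				out <- token + 1
			}
		}()
	}

	chans[0] <- 0 // inject the token
	result := <-done

	// The token has been consumed, so no goroutine is sending any more;
	// closing the channels lets the remaining goroutines exit.
	for _, c := range chans {
		close(c)
	}
	return result
}

func main() {
	fmt.Println(ring(1000, 100000))
}
```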

Threads in Skynet fan out to create child threads and synchronize on them in more interesting ways than in the crude ring structure, though. While this is still an “overhead only” benchmark, it at least measures not only the ability to block and unblock a thread, but also the ability to make good use of all available processor cores. That’s why we decided to give it a try by translating the Go implementation into Java + Quasar. This is the original Go code, taken from here:

    package main

    import "fmt"
    import "time"

    func skynet(c chan int, num int, size int, div int) {
        // Leaf: report this goroutine's own number.
        if size == 1 {
            c <- num
            return
        }

        // Fan out into div children, then sum their results.
        rc := make(chan int)
        var sum int
        for i := 0; i < div; i++ {
            subNum := num + i*(size/div)
            go skynet(rc, subNum, size/div, div)
        }
        for i := 0; i < div; i++ {
            sum += <-rc
        }
        c <- sum
    }

    func main() {
        c := make(chan int)
        start := time.Now()
        go skynet(c, 0, 1000000, 10)
        result := <-c
        took := time.Since(start)
        fmt.Printf("Result: %d in %d ms.\n", result, took.Nanoseconds()/1e6)
    }

And this is the Java code (using Quasar), translated from Go pretty much line by line (taken from here):

    import co.paralleluniverse.fibers.*;
    import co.paralleluniverse.strands.channels.Channel;
    import static co.paralleluniverse.strands.channels.Channels.*;
    import java.time.*;

    public class Skynet {
        static void skynet(Channel<Long> c, long num, int size, int div) throws SuspendExecution, InterruptedException {
            // Leaf: report this fiber's own number.
            if (size == 1) {
                c.send(num);
                return;
            }

            // Fan out into div child fibers, then sum their results.
            Channel<Long> rc = newChannel(BUFFER);
            long sum = 0L;
            for (int i = 0; i < div; i++) {
                long subNum = num + i * (size / div);
                new Fiber<Void>(() -> skynet(rc, subNum, size / div, div)).start();
            }
            for (int i = 0; i < div; i++)
                sum += rc.receive();
            c.send(sum);
        }

        public static void main(String[] args) throws Exception {
            // Repeat the benchmark so the JVM has a chance to warm up.
            for (int i = 0; i < RUNS; i++) {
                Instant start = Instant.now();

                Channel<Long> c = newChannel(BUFFER);
                new Fiber<Void>(() -> skynet(c, 0, 1_000_000, 10)).start();
                long result = c.receive();

                Duration elapsed = Duration.between(start, Instant.now());
                System.out.println((i + 1) + ": " + result + " (" + elapsed.toMillis() + " ms)");
            }
        }

        static final int RUNS = 4;
        static final int BUFFER = 0; // = 0 unbuffered; > 0 buffered; < 0 unlimited
    }

The first thing to notice is how similar the Java code is to the Go code. Quasar basically imports the entire Go and Erlang programming models into Java, including channel selection from Go, as well as actor supervision, behaviors and hot code reloading from Erlang.

The initial benchmark results were less than stellar (we also uncovered a hidden bug in the process), but after profiling and making some straightforward improvements we got these average figures on my MacBook laptop (using go1.6.2 and java 1.8.0_40, and after dropping the first couple of Java runs, required for JVM warmup):

    Go, unbuffered channels:     350 ms
    Go, buffered channels:       310 ms

    Quasar, unbuffered channels: 900 ms
    Quasar, buffered channels:   360 ms

There’s apparently a big difference depending on the kind of Quasar channel used: unbuffered channels introduce many more synchronization events (because every send must wait for a receive) and perform significantly worse than buffered channels, whereas in Go the difference is very small. Careful profiling uncovered that in the unbuffered channel case, the bulk of the overhead is indeed spent in the synchronization code, while in the buffered case the overhead was mostly the internal implementation of continuations employed by Quasar. We’ve found more room for improvement in the channel synchronization code and we’re confident that we can get even better results, although it’s not critically important. In real use cases, the current level of overhead introduced by Quasar is low enough that most workloads, even minor ones, would drown it out completely.
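The Go listing above uses only unbuffered channels. A buffered variant can be sketched by passing a capacity to `make`, as below. The parameter `buf`, the helper `run`, and the choice of `div` as the capacity (each parent receives exactly `div` results, so no child ever blocks on send) are assumptions made for illustration; the capacity used for the measurements above is not stated. The sketch also assumes a 64-bit `int`, as the original program does, since the full result is 499999500000.

```go
package main

import (
	"fmt"
	"time"
)

// skynet is the benchmark function with one change: the capacity of every
// result channel is a parameter (0 means unbuffered).
func skynet(c chan int, num, size, div, buf int) {
	if size == 1 {
		c <- num
		return
	}

	rc := make(chan int, buf)
	sum := 0
	for i := 0; i < div; i++ {
		subNum := num + i*(size/div)
		go skynet(rc, subNum, size/div, div, buf)
	}
	for i := 0; i < div; i++ {
		sum += <-rc
	}
	c <- sum
}

// run executes the benchmark once and returns the result and the elapsed time.
func run(size, div, buf int) (int, time.Duration) {
	c := make(chan int, buf)
	start := time.Now()
	go skynet(c, 0, size, div, buf)
	result := <-c
	return result, time.Since(start)
}

func main() {
	const size, div = 1000000, 10
	// The result is the sum 0 + 1 + ... + (size-1) = size*(size-1)/2.
	fmt.Printf("expected result: %d\n", size*(size-1)/2)
	for _, buf := range []int{0, div} {
		result, took := run(size, div, buf)
		fmt.Printf("buffer=%d: result %d in %d ms\n", buf, result, took.Milliseconds())
	}
}
```

Timings from a sketch like this will vary by machine and Go version, so they should not be expected to match the figures quoted above.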

One more thing: the above Java code is not normally how you’d write this program. Every Quasar fiber can return a result, and calling Fiber.get() blocks and waits for the fiber to return it (in fact, the Fiber class implements java.util.concurrent.Future). The last code sample, which you’ll find below, is how you’d idiomatically write Skynet in Java with Quasar.

On my machine that code runs in ~300 ms – same as or ahead of Go’s result – with a similar number of synchronization events as the buffered channel case, as the fibers aren’t contending on the same channel to write their results.

This overhead-only microbenchmark, as expected, gives the advantage to the platform that handles the overhead natively, but we were surprised by how slight the advantage is, especially as there’s room for improvement in Quasar’s channel implementation. I think this is yet another testament to the versatility of the JVM as a general-purpose, polyglot, high-performance platform.

Here’s the more idiomatic implementation of the Skynet benchmark in Quasar (code taken from here; to run, clone this repo and run ./gradlew):

    import co.paralleluniverse.fibers.*;
    import java.util.concurrent.*;
    import java.time.*;

    public class Skynet {
        static long skynet(long num, int size, int div) throws SuspendExecution, InterruptedException {
            try {
                if (size == 1)
                    return num;

                // Each child fiber returns its partial sum directly; no channels needed.
                Fiber<Long>[] children = new Fiber[div];
                long sum = 0L;
                for (int i = 0; i < div; i++) {
                    long subNum = num + i * (size / div);
                    children[i] = new Fiber<>(() -> skynet(subNum, size / div, div)).start();
                }
                // Fiber.get() blocks the calling fiber until the child has finished.
                for (Fiber<Long> c : children)
                    sum += c.get();
                return sum;
            } catch (ExecutionException e) {
                throw (RuntimeException) e.getCause();
            }
        }

        public static void main(String[] args) throws Exception {
            for (int i = 0; i < RUNS; i++) {
                Instant start = Instant.now();

                long result = new Fiber<>(() -> skynet(0, 1_000_000, 10)).start().get();

                Duration elapsed = Duration.between(start, Instant.now());
                System.out.println((i + 1) + ": " + result + " (" + elapsed.toMillis() + " ms)");
            }
        }

        static final int RUNS = 4;
    }
