In my Saturday community day session at SQL Bits I mentioned an article by Linchi Shea on the sqlblog site, in which he demonstrates that 100% foreign (remote) memory access does not incur the drop in query throughput you might expect. The article can be found here. This got me thinking: what if you were to reproduce a similar test with the in-memory OLTP engine? Given that there is zero overhead for locking and latching, and zero overhead for interpreting the SQL language, would the effect of remote memory access be more pronounced?
To recap, NUMA is the abbreviation for non-uniform memory access: each CPU socket has its own private memory bus to its own bank of memory. SQLOS is NUMA aware and creates a memory node for each hardware NUMA node:
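As a sketch, the memory nodes SQLOS has created can be inspected with the query below (column names are as per SQL Server 2012 and later; on older versions the page allocator columns are named differently):

```sql
-- One row per SQLOS memory node; memory_node_id 64 is the dedicated admin connection (DAC) node
SELECT memory_node_id,
       virtual_address_space_reserved_kb,
       virtual_address_space_committed_kb,
       pages_kb
FROM   sys.dm_os_memory_nodes;
```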
The rationale for NUMA existing in the first place is that, prior to this, all CPU sockets shared a single memory bus; as the number of CPUs in a server increased, so did the pressure on this single bus, impeding scalability:
This is my test server:
Using this I’m going to carry out three tests:
- Set min server memory to 16GB, max server memory to 28GB and the CPU affinity mask to use only the first NUMA node. The 64GB of memory is divided evenly between the two sockets, so my hope is that by setting max memory to a value less than half of this (less, because the operating system requires some memory), memory access for node 0 should be biased towards local NUMA node memory; querying sys.dm_os_memory_nodes hints that this might be the case:
If I change the CPU affinity mask to only use node 1, this is what sys.dm_os_memory_nodes returns:
The key column to look at is pages_kb, defined in Books Online as:
Specifies the amount of committed memory, in KB, which is allocated from this NUMA node by Memory Manager Page Allocator. Not nullable.
- As I have ten cores per socket, I will then fire off ten concurrent threads, each of which inserts 250,000 rows into a memory optimized table with a hash index.
- The same test as the previous one, except with the CPU affinity mask set to use only the cores from the second CPU socket.
- A similar test to 1., but with max memory set to 56GB and the CPU affinity mask set to use 5 cores from each socket.
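For reference, a hedged sketch of how the settings for the first test could be applied; sp_configure and ALTER SERVER CONFIGURATION are the documented mechanisms, but the exact core numbering is hardware dependent, and the values change accordingly for the other two tests:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'min server memory (MB)', 16384;  -- 16GB
EXEC sp_configure 'max server memory (MB)', 28672;  -- 28GB
RECONFIGURE;

-- Affinitise the engine to the first NUMA node only (test 1);
-- for test 3 this would instead be CPU = 0 TO 4, 10 TO 14 (five cores per socket)
ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0;
```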
I’m going to use ostress from the RML utilities to fire off the ten threads, and this is the code I will use:
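The ostress invocation looks something like the following; the server and database names are illustrative placeholders, -n is the number of connections and -r the number of iterations per connection:

```
ostress -E -SMyServer -dMyInMemoryDb -Q"EXEC dbo.usp_insertImMain" -n10 -r1 -q
```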
CREATE TABLE [dbo].[MyBigTableIm]
(
     [c1] [uniqueidentifier] NOT NULL DEFAULT NEWID()
    ,[c2] [datetime]         NOT NULL DEFAULT GETDATE()
    ,[c3] [char](17)         NOT NULL DEFAULT (CONVERT([char](17), rand()))
    ,[c4] [int]              NOT NULL DEFAULT (round(rand()*(1000),   (0)))
    ,[c5] [int]              NOT NULL DEFAULT (round(rand()*(10000),  (0)))
    ,[c6] [bigint]           NOT NULL DEFAULT (round(rand()*(100000), (0)))
    PRIMARY KEY NONCLUSTERED HASH ( [c1] ) WITH ( BUCKET_COUNT = 524288 )
)
WITH ( MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA );
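Note that 524,288 buckets is roughly a fifth of the 2.5 million rows the ten threads will insert in total, so average chains of around five rows per bucket can be expected once the table is fully populated. As a sketch, the health of the hash index can be checked afterwards with:

```sql
-- avg_chain_length and empty_bucket_count indicate whether BUCKET_COUNT is sized well
SELECT i.name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
FROM   sys.dm_db_xtp_hash_index_stats AS hs
JOIN   sys.indexes AS i
       ON  i.object_id = hs.object_id
       AND i.index_id  = hs.index_id
WHERE  hs.object_id = OBJECT_ID(N'dbo.MyBigTableIm');
```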
Some code is required to perform the actual inserts; two procedures, in fact. First, a ‘driver’ procedure:
CREATE PROCEDURE [dbo].[usp_insertImMain]
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'English')
    DECLARE @i INT = 0;
    WHILE @i < 250000
    BEGIN
        EXEC dbo.usp_insertIm;
        SET @i += 1;
    END;
END;
and a procedure to perform the actual insert which is invoked by the other procedure:
CREATE PROCEDURE [dbo].[usp_insertIm]
WITH NATIVE_COMPILATION, SCHEMABINDING
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'English')
    INSERT INTO dbo.MyBigTableIm DEFAULT VALUES;
END;
To ensure that the environment is ‘clean’ each time a test is run, the database engine service is stopped and started beforehand. I also ran each test twice in succession. Here are the results:
Interpreting The Results
- Why does the test run faster on node 1? This is pure conjecture; however, I suspect it is due to the way SQLOS flips nodes: because the operating system loads up node 0 with threads, SQLOS swaps nodes. There is a trace flag for preventing this behaviour, which I will get around to testing at some stage.
- Why does the test run faster with 5 threads running on each node (CPU socket)? The table, when fully populated, takes up 248MB in memory, which means it can easily be accommodated within the bank of memory attached to either CPU socket. However, it is my belief that the translation look-aside buffer (TLB) plays a key part in why the third test is the fastest. The TLB is a small per-core cache of recently used virtual-to-physical page address translations. The nature of a hash index for a memory optimised table is such that rows are hashed evenly across all buckets; I suspect this random access pattern leads to the TLBs being thrashed when only one socket is in use. Remember also that each socket has its own integrated memory controller; I suspect that using both sockets, and hence all eight memory channels (four per socket) rather than four, might also help:
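The 248MB figure can be verified with the in-memory OLTP memory consumption DMV; a sketch, assuming the SQL Server 2014 column names:

```sql
-- Memory consumed by the memory optimised table and its hash index
SELECT OBJECT_NAME(object_id)              AS table_name,
       memory_allocated_for_table_kb,
       memory_used_by_table_kb,
       memory_allocated_for_indexes_kb,
       memory_used_by_indexes_kb
FROM   sys.dm_db_xtp_table_memory_stats
WHERE  object_id = OBJECT_ID(N'dbo.MyBigTableIm');
```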