STL, allocator, its shared memory and its features



Shared memory is the fastest way to exchange data between processes. But unlike streaming mechanisms (pipes, sockets of all stripes, file queues ...), here the programmer has complete freedom of action; as a result, everyone writes whatever they please.



So the author once wondered: what if the addresses of the shared memory segments coincided in different processes? This is actually what happens when a process using shared memory forks, but what about unrelated processes? Besides, not every system has fork.



It would seem: so the addresses coincide, and so what? At the very least, you can use absolute pointers, which saves a lot of headaches. It becomes possible to work with C++ strings and containers constructed in shared memory.



An excellent example, by the way. Not that the author is particularly fond of STL, but it is an opportunity to demonstrate a compact and understandable test of the viability of the proposed technique, a technique that (as we will see) can significantly simplify and speed up interprocess communication. Whether it works, and what we have to pay for it, we will figure out below.



Introduction



The idea of shared memory is simple and elegant: since each process operates in its own virtual address space, which is mapped onto the system-wide physical one, why not allow two segments from different processes to look at the same physical memory area?



And with the proliferation of 64-bit operating systems and the ubiquitous use of coherent caches, the idea of shared memory got a second wind. Now it is not just a cyclic buffer, a DIY implementation of a "pipe", but a real "continuum transfunctioner": an extremely mysterious and powerful device whose mystery is matched only by its power.



Let's look at a few examples of use.



  • MS SQL Server supports “shared memory” as a client connection transport; according to the documentation this gives roughly a 10...15% gain.
  • MySQL on Windows also offers a “shared memory” connection protocol.
  • SQLite uses shared memory for the index of its WAL file.
  • PostgreSQL creates its worker processes with fork, so they inherit the shared memory of the parent process.





    Fig. 1 PostgreSQL


Generally speaking, what would an ideal shared memory look like? That is easy to answer: we would like objects in it to be usable as if they were objects shared between threads of the same process. Yes, you need synchronization (and you need it anyway), but otherwise you just take it and use it! Perhaps ... this can be arranged.



A proof of concept requires a minimal meaningful task:



  • there is an analogue of std::map<std::string, std::string> located in shared memory
  • there are N processes that asynchronously add/change values with a prefix corresponding to the process number (ex: key_1_... for process number 1)
  • at the end, we can check the final result


Let's start with the simplest thing: since we have std::string and std::map, we need a dedicated STL allocator.



STL allocator



Suppose there are xalloc/xfree functions for working with shared memory, analogous to malloc/free. In that case the allocator looks like this:



#include <cstddef>	// size_t, ptrdiff_t

// assumed to exist; defined further below on top of the shared memory allocator
void *xalloc(size_t size);
void  xfree(void *ptr);

template <typename T>
class stl_buddy_alloc
{
public:
	typedef T                 value_type;
	typedef value_type*       pointer;
	typedef value_type&       reference;
	typedef const value_type* const_pointer;
	typedef const value_type& const_reference;
	typedef ptrdiff_t         difference_type;
	typedef size_t            size_type;
public:
	stl_buddy_alloc() throw()
	{	// construct default allocator (do nothing)
	}
	stl_buddy_alloc(const stl_buddy_alloc<T> &) throw()
	{	// construct by copying (do nothing)
	}
	template<class _Other>
	stl_buddy_alloc(const stl_buddy_alloc<_Other> &) throw()
	{	// construct from a related allocator (do nothing)
	}

	void deallocate(pointer _Ptr, size_type)
	{	// deallocate object at _Ptr, ignore size
		xfree(_Ptr);
	}
	pointer allocate(size_type _Count)
	{	// allocate array of _Count elements
		return (pointer)xalloc(sizeof(T) * _Count);
	}
	pointer allocate(size_type _Count, const void *)
	{	// allocate array of _Count elements, ignore hint
		return (allocate(_Count));
	}
};

// the allocator is stateless, so any two instances are interchangeable
template<class T, class U>
bool operator==(const stl_buddy_alloc<T> &, const stl_buddy_alloc<U> &) throw() { return true; }
template<class T, class U>
bool operator!=(const stl_buddy_alloc<T> &, const stl_buddy_alloc<U> &) throw() { return false; }


This is enough to hook std::map and std::string onto it:




template <typename _Kty, typename _Ty>
class q_map : 
    public std::map<
        _Kty, 
        _Ty, 
        std::less<_Kty>, 
        stl_buddy_alloc<std::pair<const _Kty, _Ty> > 
    >
{ };

typedef std::basic_string<
        char, 
        std::char_traits<char>, 
        stl_buddy_alloc<char> > q_string;


Before dealing with the declared xalloc/xfree functions, which implement the allocator on top of shared memory, it is worth understanding the shared memory itself.



Shared memory



Different threads of the same process are in the same address space, which means that every non-thread_local pointer in any thread looks at the same place. With shared memory, it takes extra effort to achieve this effect.



Windows



  • Let's create a file-to-memory mapping. Like ordinary memory, shared memory is covered by the paging mechanism; here, among other things, we decide whether to back it with the common paging file or allocate a dedicated file for it.



    HANDLE hMapFile = CreateFileMapping(
    	INVALID_HANDLE_VALUE,     // use paging file
    	NULL,                     // default security
    	PAGE_READWRITE,           // read/write access
    	(alloc_size >> 32),       // maximum object size (high-order DWORD)
    	(alloc_size & 0xffffffff),// maximum object size (low-order DWORD)
    	"Local\\SomeData");       // name of mapping object


    The “Local\\” name prefix means that the object will be created in the local namespace of the session.
  • To join a mapping already created by another process, use



    HANDLE hMapFile = OpenFileMapping(
    	FILE_MAP_ALL_ACCESS,      // read/write access
    	FALSE,                    // do not inherit the name
    	"Local\\SomeData");       // name of mapping object
  • Now you need to map a view of the existing mapping into the address space



    void *hint = (void *)0x200000000000ll;
    unsigned char *shared_ptr = (unsigned char*)MapViewOfFileEx(
    	hMapFile,                 // handle to map object
    	FILE_MAP_ALL_ACCESS,      // read/write permission
    	0,                        // offset in map object (high-order DWORD)
    	0,                        // offset in map object (low-order DWORD)
    	0,                        // segment size,
    	hint);                    // desired base address of the view
    


    A view size of 0 means that the full size of the mapping, minus the offset, will be mapped.



    The most important thing here is hint. If it is not specified (NULL), the system picks an address at its own discretion. But if the value is nonzero, an attempt is made to create a view of the desired size at the desired address. It is by choosing the same hint value in different processes that we make the shared memory addresses coincide. In 32-bit mode it is hard to find a large unallocated contiguous chunk of the address space, but in 64-bit mode there is no such problem: something suitable can always be found. A combined sketch of the Windows side is shown below.
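Putting the Windows snippets together, a minimal sketch might look as follows (the map_shared helper and its error handling are illustrative assumptions, not part of the original code):

#include <windows.h>
#include <cstdint>

// a sketch: create (creator == true) or open the "Local\\SomeData" mapping
// and try to map a view of it exactly at the given hint address
unsigned char *map_shared(bool creator, uint64_t alloc_size, void *hint)
{
	HANDLE hMapFile = creator
		? CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
			(DWORD)(alloc_size >> 32), (DWORD)(alloc_size & 0xffffffff),
			"Local\\SomeData")
		: OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, "Local\\SomeData");
	if (hMapFile == NULL)
		return NULL;

	// MapViewOfFileEx returns NULL if the requested address range is not free
	unsigned char *p = (unsigned char *)MapViewOfFileEx(
		hMapFile, FILE_MAP_ALL_ACCESS, 0, 0, 0, hint);
	if (p == NULL)
		CloseHandle(hMapFile);
	return p;
}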


Linux



Everything is basically the same here.



  • Create a shared memory object



      int fd = shm_open(
                     "/SomeData",               // object name, must begin with '/'
                     O_CREAT | O_EXCL | O_RDWR, // flags, as for open
                     S_IRUSR | S_IWUSR);        // mode, as for open
    
      ftruncate(fd, alloc_size);
    


    Note the ftruncate call, which sets the size of the object. shm_open creates a file under /dev/shm/. There is also the older SysV shmget/shmat API, where the key is produced by ftok (from a file inode).




  • To attach to an object already created by another process:

      int fd = shm_open("/SomeData", O_RDWR, 0);




  • Map the object into the address space at the desired address

      void *hint = (void *)0x200000000000ll;
      unsigned char *shared_ptr = (unsigned char*)mmap(
                       hint,                      // desired address
                       alloc_size,                // segment size,
                       PROT_READ | PROT_WRITE,    // protection flags
                       MAP_SHARED,                // sharing flags
                       fd,                        // handle to map object
                       0);                        // offset
    


    The hint has the same meaning here as in the Windows case; a combined sketch follows.
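Putting the Linux snippets together, a minimal sketch might look like this (the map_shared helper, its error handling, and the strictness about the hint are illustrative assumptions):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

// a sketch: create (creator == true) or open the "/SomeData" object
// and try to map it exactly at the given hint address
unsigned char *map_shared(bool creator, uint64_t alloc_size, void *hint)
{
	int fd = creator
		? shm_open("/SomeData", O_CREAT | O_EXCL | O_RDWR, S_IRUSR | S_IWUSR)
		: shm_open("/SomeData", O_RDWR, 0);
	if (fd < 0)
		return NULL;
	if (creator && ftruncate(fd, (off_t)alloc_size) != 0)
		return NULL;

	void *p = mmap(hint, alloc_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);                       // the established mapping keeps the object alive
	if (p == MAP_FAILED)
		return NULL;
	if (p != hint) {                 // without MAP_FIXED the kernel may ignore the hint
		munmap(p, alloc_size);
		return NULL;             // the caller decides whether to retry elsewhere
	}
	return (unsigned char *)p;
}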




Regarding hint, what are the restrictions on its value? Actually, there are different kinds of restrictions.



First, the architectural/hardware ones. A few words should be said here about how a virtual address turns into a physical one. On a TLB cache miss, the hardware has to walk a tree structure called the page table. For example, in IA-32 it looks like this:





Fig. 2 IA-32, the case of 4K pages



The entry point into the tree is the contents of the CR3 register; the indexes into the page tables of the different levels are fragments of the virtual address. In this case, 32 bits turn into 32 bits, everything is fair.



In AMD64, the picture looks a little different.





Fig. 3 AMD64, 4K pages



CR3 now holds 40 significant bits instead of the previous 20; in a tree of 4 levels of page tables the physical address is limited to 52 bits, while the virtual address is limited to 48 bits.



And only starting with the Ice Lake microarchitecture (Intel) is it allowed to use 57 bits of the virtual address (still 52 physical) when working with a 5-level page table.



So far we have only talked about Intel/AMD. Just for variety, in the AArch64 architecture the page table can have 3 or 4 levels, allowing the use of 39 or 48 bits of the virtual address, respectively (1).



Secondly, the software restrictions. Microsoft, in particular, imposes its own limits (44 bits up to Windows 8.1/Server 2012, 48 bits starting from there) on different OS editions, based, among other things, on marketing considerations.



By the way, 48 bits is 65 thousand chunks of 4GB each; in such an expanse there is surely a corner where you can land with your hint.



Shared memory allocator



First of all. The allocator must live inside the allocated shared memory, placing all of its internal data there.



Secondly. We are talking about an interprocess communication tool; any optimizations based on the use of TLS are irrelevant.



Thirdly. Since several processes are involved, the allocator itself can live for a very long time, so reducing external memory fragmentation is of particular importance.



Fourth. Calling the OS for additional memory is not allowed. dlmalloc, for example, allocates relatively large chunks directly via mmap; yes, it can be discouraged from doing so by raising the threshold, but nevertheless.



Fifth. Standard in-process synchronization facilities are not suitable: either global objects (with the corresponding overhead) or something located directly in shared memory, such as spinlocks, is required. Let's say thanks to the coherent cache. POSIX also has unnamed process-shared semaphores for this case.
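As an illustration of that last option, a minimal sketch of a POSIX unnamed semaphore placed in shared memory; this is not what the tests below use (they use a spinlock), it is just the alternative mentioned above:

#include <semaphore.h>

// a sketch: sem lives inside the mapped segment; pshared = 1 makes it
// usable from any process that has the segment mapped
sem_t *init_shared_lock(void *mem_in_shared_segment)
{
	sem_t *sem = (sem_t *)mem_in_shared_segment;
	if (sem_init(sem, /*pshared=*/1, /*value=*/1) != 0)   // initial value 1 => a mutex
		return NULL;
	return sem;
}

// usage: sem_wait(sem); ...critical section... sem_post(sem);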



All in all, taking into account all of the above, and also because a working buddy allocator happened to be at hand (kindly provided by Alexander Artyushin, slightly revised), the choice turned out to be easy.



Let's leave the description of the implementation details until better times; for now, the public interface is what is interesting:



class BuddyAllocator {
public:
	BuddyAllocator(uint64_t maxCapacity, u_char * buf, uint64_t bufsize);
	~BuddyAllocator(){};

	void *allocBlock(uint64_t nbytes);
	void freeBlock(void *ptr);
...
};


The destructor is trivial because BuddyAllocator does not grab any extraneous resources.



Final preparations



Since everything is located in shared memory, this memory must have a header. For our test, this header looks like this:



struct glob_header_t {
	// a magic value to check that the segment is initialized
	uint64_t magic_;
	// the address (hint) at which the segment was created
	const void *own_addr_;
	// the allocator
	BuddyAllocator alloc_;
	// the spinlock
	std::atomic_flag lock_;
	// the container under test
	q_map<q_string, q_string> q_map_;

	static const size_t alloc_shift = 0x01000000;
	static const size_t balloc_size = 0x10000000;
	static const size_t alloc_size = balloc_size + alloc_shift;
	static glob_header_t *pglob_;
};
static_assert (
    sizeof(glob_header_t) < glob_header_t::alloc_shift, 
    "glob_header_t size mismatch");

glob_header_t *glob_header_t::pglob_ = NULL;


  • own_addr_ is written when the shared memory is created, so that anyone who attaches to it by name can find out the actual address (the hint) and re-attach at it if necessary
  • it's not good to hardcode the dimensions like this, but it's acceptable for tests
  • the constructors should be called by the process that creates the shared memory; it looks like this:



    glob_header_t::pglob_ = (glob_header_t *)shared_ptr;
    
    new (&glob_header_t::pglob_->alloc_)
            qz::BuddyAllocator(
                    // maximum capacity
                    glob_header_t::balloc_size,
                    // start of the allocator's arena
                    shared_ptr + glob_header_t::alloc_shift,
                    // size of the arena
                    glob_header_t::alloc_size - glob_header_t::alloc_shift);
    
    new (&glob_header_t::pglob_->q_map_) 
                    q_map<q_string, q_string>();
    
    glob_header_t::pglob_->lock_.clear();
    
    
  • the process that attaches to the shared memory gets everything ready to use (a sketch of the attach side is shown right after this list)
  • now we have everything we need for the tests except the xalloc/xfree functions



    void *xalloc(size_t size)
    {
    	return glob_header_t::pglob_->alloc_.allocBlock(size);
    }
    void xfree(void* ptr)
    {
    	glob_header_t::pglob_->alloc_.freeBlock(ptr);
    }
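
For completeness, a possible sketch of the attach side; the GLOB_MAGIC constant and the retry policy are assumptions, the article only states the intent:

// a sketch: after mapping the segment at some address, check whether we landed
// where the creator did; if not, the caller should remap at own_addr_ (or give up)
bool attach(unsigned char *shared_ptr)
{
	glob_header_t::pglob_ = (glob_header_t *)shared_ptr;
	if (glob_header_t::pglob_->magic_ != GLOB_MAGIC)      // GLOB_MAGIC is illustrative
		return false;                                 // segment not initialized yet
	if (glob_header_t::pglob_->own_addr_ != shared_ptr)   // the hint was not honored
		return false;                                 // unmap and retry at own_addr_
	return true;                                          // alloc_ and q_map_ are ready to use
}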
    


It looks like we can start.



Experiment



The test itself is very simple:



for (int i = 0; i < 100000000; i++)
{
        char buf1[64];
        sprintf(buf1, "key_%d_%d", curid, (i % 100) + 1);
        char buf2[64];
        sprintf(buf2, "val_%d", i + 1);

        LOCK();

        qmap.erase(buf1); // erase the previous value first
        qmap[buf1] = buf2;

        UNLOCK();
}


curid is the process/thread number; the process that created the shared memory has curid zero, but this does not matter for the test.

qmap and LOCK/UNLOCK are different for the different tests.



Let's do some tests



  1. THR_MTX - a multithreaded application, synchronization goes through std::recursive_mutex,

    qmap - a global std::map<std::string, std::string>
  2. THR_SPN is a multithreaded application, synchronization goes through a spinlock:



    std::atomic_flag slock;
    ..
    while (slock.test_and_set(std::memory_order_acquire));  // acquire lock
    …
    slock.clear(std::memory_order_release);                 // release lock


    qmap - a global std::map<std::string, std::string>
  3. PRC_SPN - several running processes, synchronization goes through a spinlock:



    while (glob_header_t::pglob_->lock_.test_and_set(              // acquire lock
            std::memory_order_acquire));                          
    …
    glob_header_t::pglob_->lock_.clear(std::memory_order_release); // release lock
    qmap - glob_header_t::pglob_->q_map_
  4. PRC_MTX - several running processes, synchronization goes through a named mutex .



    qmap - glob_header_t::pglob_->q_map_


Results (test type vs. number of processes/threads):

             1        2        4        8        16
  THR_MTX    1'56''   5'41''   7'53''   51'38''  185'49''
  THR_SPN    1'26''   7'38''   25'30''  103'29'' 347'04''
  PRC_SPN    1'24''   7'27''   24'02''  92'34''  322'41''
  PRC_MTX    4'55''   13'01''  78'14''  133'25'' 357'21''


The experiment was carried out on a dual-processor (48 cores) computer with Xeon® Gold 5118 2.3GHz, Windows Server 2016.



Summary



  • Yes, it is possible to use STL objects/containers (allocated in shared memory) from different processes, provided they are designed appropriately.
  • Curiously, judging by the numbers, PRC_SPN even comes out slightly ahead of THR_SPN; apparently the BuddyAllocator copes with this workload better than the MS runtime's malloc/free.
  • Scalability is poor in every variant; the culprit is the single global lock around the map (a std::mutex or a spinlock). Lock-free structures are a separate story.




Shared memory is often used to transfer large data streams as a kind of hand-made "pipe". This is a great idea, even though you need to arrange expensive synchronization between processes. That this is not cheap we saw in the PRC_MTX test, where even without any contention, working within a single process, the named mutex degraded performance significantly.



The explanation of the high cost is simple: while std::(recursive_)mutex (a critical section on Windows) can work as a spinlock, a named mutex is a system call, an entry into kernel mode with the corresponding costs. Besides, the loss of execution context by a thread/process is always very expensive.



But since synchronization between processes is inevitable, how can we reduce its cost? The answer was invented long ago: buffering. It is not every single packet that is synchronized, but a certain amount of data, the buffer into which this data is serialized. If the buffer is noticeably larger than the packet size, synchronization is needed much less often.



It is convenient to mix the two techniques: the data lives in shared memory, and only relative pointers (offsets from the beginning of the shared memory) are sent through the interprocess data channel (ex: a loopback connection via localhost). Since a pointer is usually much smaller than the data packet, this saves on synchronization. A sketch of the idea follows.
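A minimal sketch of the relative-pointer idea (to_offset/from_offset are illustrative helpers; shared_ptr is the base address of the mapped segment in the current process):

#include <cstdint>

// convert between an absolute pointer and an offset from the segment base;
// the offset stays valid in any process, whatever address the segment is mapped at
inline uint64_t to_offset(const void *p, const unsigned char *base)
{
	return (uint64_t)((const unsigned char *)p - base);
}
inline void *from_offset(uint64_t off, unsigned char *base)
{
	return base + off;
}

// sender:   uint64_t off = to_offset(obj, shared_ptr);  write(fd, &off, sizeof(off));
// receiver: read(fd, &off, sizeof(off));  T *obj = (T *)from_offset(off, shared_ptr);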



And in the case when shared memory is available to different processes at the same virtual address, you can add a little more performance.



  • do not serialize data for sending, do not deserialize on receipt
  • send honest pointers to objects created in shared memory through the stream
  • having received a (pointer to an) object, we use it and then delete it with a regular delete; all memory is freed automatically. This saves us from fussing with a ring buffer.
  • you can even send not a pointer but merely a notification of the fact that there is something in the queue (the minimum possible: one byte meaning "you have mail")


Finally



What you can and cannot do with objects living in shared memory.



  1. Do not use RTTI. For obvious reasons: the std::type_info object lives outside the shared memory and is not accessible across processes.
  2. Do not use virtual methods, for the same reason: the virtual function tables, and the functions themselves, are not accessible across processes.
  3. If we are talking about STL, all executables of the processes sharing the memory must be built with the same compiler and the same settings, and the STL implementation itself must be the same.


PS: thanks to Alexander Artyushin and Dmitry Iptyshev (Dmitria) for their help in preparing this article.


