Sharing array of objects with Python multiprocessing

Asked by 119631 on Stack Overflow, December 23, 2021

For this question, I refer to the example in the Python docs discussing the "use of the SharedMemory class with NumPy arrays, accessing the same numpy.ndarray from two distinct Python shells".

The major change I’d like to make is to manipulate an array of class objects rather than integer values, as I demonstrate below.

import numpy as np
from multiprocessing import shared_memory    

# a simplistic class example
class A(): 
    def __init__(self, x): 
        self.x = x

# numpy array of class objects 
a = np.array([A(1), A(2), A(3)])       

# create a shared memory instance
shm = shared_memory.SharedMemory(create=True, size=a.nbytes, name='psm_test0')

# numpy array backed by shared memory
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)                                    

# copy the original data into shared memory
b[:] = a[:]                                  

print(b)                                            

# array([<__main__.A object at 0x7fac56cd1190>,
#       <__main__.A object at 0x7fac56cd1970>,
#       <__main__.A object at 0x7fac56cd19a0>], dtype=object)

Now, in a different shell, we attach to the shared memory space and try to manipulate the contents of the array.

import numpy as np
from multiprocessing import shared_memory

# attach to the existing shared space
existing_shm = shared_memory.SharedMemory(name='psm_test0')

c = np.ndarray((3,), dtype=object, buffer=existing_shm.buf)

Even before we try to manipulate c, merely printing it results in a segmentation fault. Admittedly, I cannot expect behaviour that was never written into the module, so my question is: what can I do to work with a shared array of objects?

I’m currently pickling the list, but the protected read/writes add a fair bit of overhead. I’ve also tried using a Namespace, which was quite slow because indexed writes are not allowed. Another idea could be to use a shared ctypes Structure in a ShareableList, but I wouldn’t know where to start with that.
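
For context, a stripped-down sketch of what I mean by "pickling with protected read/writes" (the function names are just illustrative, not my real code): the whole list is pickled into the shared block and every access is wrapped in a lock.

import pickle
from multiprocessing import Lock, shared_memory

lock = Lock()  # must be created before forking / handed to the workers

def write_objects(shm, objects):
    # serialize the whole list and copy it into the shared block
    data = pickle.dumps(objects)
    with lock:
        shm.buf[:len(data)] = data

def read_objects(shm):
    # pickle.loads stops at the pickle's STOP opcode, so the unused
    # trailing bytes of the block are ignored
    with lock:
        return pickle.loads(bytes(shm.buf))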

There is also a design aspect: it appears that there is an open bug in shared_memory that may affect my implementation, wherein several processes work on different elements of the array.

Is there a more scalable way of sharing a large list of objects between several processes so that at any given time all running processes interact with a unique object/element in the list?

UPDATE: At this point, I will also accept partial answers that talk about whether this can be achieved with Python at all.

One Answer

So, I did a bit of research (Shared Memory Objects in Multiprocessing) and came up with a few ideas:

Pass a numpy array of bytes

Serialize the objects, then save them as byte strings to a numpy array. The problems here are that

  1. One needs to pass the data type from the creator of 'psm_test0' to any consumer of 'psm_test0'. This could be done with another shared memory block, though (a sketch of this follows the consumer code below).

  2. Pickling and unpickling are essentially like a deepcopy, i.e. they actually copy the underlying data.

The code for the 'main' process reads:

import pickle
from multiprocessing import shared_memory
import numpy as np


# a simplistic class example
class A():
    def __init__(self, x):
        self.x = x

    def pickle(self):
        return pickle.dumps(self)

    @classmethod
    def unpickle(cls, bts):
        return pickle.loads(bts)


if __name__ == '__main__':
    # Test pickling procedure
    a = A(1)
    print(A.unpickle(a.pickle()).x)
    # >>> 1

    # numpy array of byte strings
    a_arr = np.array([A(1).pickle(), A(2).pickle(), A('This is a really long test string which should exceed 42 bytes').pickle()])

    # create a shared memory instance
    shm = shared_memory.SharedMemory(
        create=True,
        size=a_arr.nbytes,
        name='psm_test0'
    )

    # numpy array backed by shared memory
    b_arr = np.ndarray(a_arr.shape, dtype=a_arr.dtype, buffer=shm.buf)

    # copy the original data into shared memory
    b_arr[:] = a_arr[:]

    print(b_arr.dtype)
    # 'S105'

and for the consumer

import numpy as np
from multiprocessing import shared_memory
from test import A  # assumes the creator script above is saved as test.py

# attach to the existing shared space
existing_shm = shared_memory.SharedMemory(name='psm_test0')

c = np.ndarray((3,), dtype='S105', buffer=existing_shm.buf)

# Test data transfer
arr = [a.x for a in list(map(A.unpickle, c))]
print(arr)
# [1, 2, ...]
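
Regarding point 1 above, here is a minimal sketch of how the dtype could be published through a second shared memory block so the consumer does not have to hard-code 'S105' (the block name 'psm_test0_meta' is just an example; the shape would need the same treatment):

import numpy as np
from multiprocessing import shared_memory

# creator side: publish the dtype string alongside the data block
dtype_bytes = str(a_arr.dtype).encode()  # e.g. b'|S105'
meta = shared_memory.SharedMemory(create=True, size=len(dtype_bytes), name='psm_test0_meta')
meta.buf[:len(dtype_bytes)] = dtype_bytes

# consumer side: reconstruct the dtype before attaching to 'psm_test0'
meta = shared_memory.SharedMemory(name='psm_test0_meta')
dtype = np.dtype(bytes(meta.buf).rstrip(b'\x00').decode())
existing_shm = shared_memory.SharedMemory(name='psm_test0')
c = np.ndarray((3,), dtype=dtype, buffer=existing_shm.buf)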

I'd say you have a few ways of going forward:

  1. Stay with simple data types.

  2. Implement something using the C api, but I can't really help you there.

  3. Use Rust

  4. Use Managers. You may lose out on some performance (I'd like to see a real benchmark, though), but you get a relatively safe and simple interface for shared objects; a minimal sketch follows this list.

  5. Use Redis, which also has Python bindings...
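
To illustrate option 4, here is a minimal sketch of sharing a list of objects through a multiprocessing.Manager (class and names are just for illustration). The manager keeps the canonical list in a server process; every access goes through a proxy, which pickles the objects behind the scenes.

from multiprocessing import Manager, Process


class A:
    def __init__(self, x):
        self.x = x


def worker(shared, i):
    # proxies return copies, so mutate a local copy and write it back;
    # in-place mutation of shared[i].x would not be propagated
    obj = shared[i]
    obj.x += 10
    shared[i] = obj


if __name__ == '__main__':
    with Manager() as manager:
        shared = manager.list([A(1), A(2), A(3)])
        procs = [Process(target=worker, args=(shared, i)) for i in range(3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print([shared[i].x for i in range(3)])
        # [11, 12, 13]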

Answered by AlexNe on December 23, 2021
