The previous block size held only 256 elements (8KB), which put heavy
pressure on the memory allocator. Storing 65536 elements (2MB) per block
removes almost all of that pressure.
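As a rough check of the numbers above, the sketch below works through the
block-size arithmetic. The 32-byte element size is an assumption inferred
from 256 elements occupying 8KB; the constant and function names are
hypothetical and do not come from the actual code.

```cpp
#include <cstddef>

// Assumption: 32-byte elements, inferred from 256 elements == 8KB.
constexpr std::size_t kElementSize = 32;
constexpr std::size_t kOldBlockElements = 256;
constexpr std::size_t kNewBlockElements = 65536;

// Size in bytes of one block holding `elements` elements.
constexpr std::size_t blockBytes(std::size_t elements) {
    return elements * kElementSize;
}

// Number of block allocations needed to store `count` elements
// (rounded up to whole blocks).
constexpr std::size_t blocksNeeded(std::size_t count, std::size_t perBlock) {
    return (count + perBlock - 1) / perBlock;
}

static_assert(blockBytes(kOldBlockElements) == 8 * 1024,
              "old block is 8KB");
static_assert(blockBytes(kNewBlockElements) == 2 * 1024 * 1024,
              "new block is 2MB");
```

Under this assumption, each new block replaces 256 old blocks, so the
number of allocator calls for the same amount of data drops by the same
factor.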
("The queue" below refers to the per-thread partial queue.)
This fixes a problem where one thread writes to the queue and then
terminates, making the (partially filled) queue available for other
threads to recycle. If another thread took ownership of the queue, it
changed the associated thread id even though part of the queue had been
filled by the original thread. The leftover elements were then
attributed to the wrong thread during dequeue, producing invalid data.
The fix makes the recycling process check not only that the queue is
inactive (it is marked inactive when the original thread terminates),
but also that the queue is empty, which prevents mixing data from
different threads.
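The recycling condition described above can be sketched roughly as
follows. This is a minimal illustration, not the actual implementation:
the ThreadQueue struct, its fields, and the canRecycle/tryRecycle helpers
are all hypothetical names chosen for the example.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical per-thread partial queue; all names are illustrative.
struct ThreadQueue {
    std::atomic<bool> inactive{false};    // set when the owning thread terminates
    std::atomic<std::size_t> pending{0};  // elements enqueued but not yet dequeued
    std::uint64_t threadId = 0;           // id attributed to the queue's elements
};

// A queue may be handed to a new thread only if its owner has terminated
// AND no elements from that owner are still waiting to be dequeued.
inline bool canRecycle(const ThreadQueue& q) {
    return q.inactive.load(std::memory_order_acquire) &&
           q.pending.load(std::memory_order_acquire) == 0;
}

// Tries to take ownership of `q` for `newThreadId`.
// Returns false if the queue must not (or can no longer) be recycled.
inline bool tryRecycle(ThreadQueue& q, std::uint64_t newThreadId) {
    if (!canRecycle(q)) {
        return false;
    }
    bool expected = true;
    // Claim the queue; if several threads race, only one exchange succeeds.
    if (!q.inactive.compare_exchange_strong(expected, false,
                                            std::memory_order_acq_rel)) {
        return false;
    }
    q.threadId = newThreadId;  // safe: the queue holds no old elements
    return true;
}
```

With only the inactivity check, a queue that still holds elements from
the terminated thread would be re-owned and relabeled; the added
emptiness check defers recycling until those elements have been drained.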