Wednesday, April 9, 2014

Sorting - ARM NEON: sorting an array of 16 bytes - Stack Overflow


tl;dr: What is the fastest way to sort a uint8x16_t?


I need to sort many arrays of exactly 16 unsigned bytes (in descending order, though the direction doesn't really matter), and I'm trying to speed up the sort with ARM NEON vectorization.


I find it quite a fancy puzzle, as it seems that there "must" exist a short combination of NEON instructions (such as vmax/vpmax/vmin/vpmin, vzip/vuzp) that reliably produces a sorted array.


For example, if we transform a pair (A, B) of two 8-byte arrays into (vpmax(A,B), vpmin(A,B)), we obtain the same 16 values, just in a different order. If we repeat this operation four times, the array maximum is guaranteed to be in the first cell and the array minimum in the last cell; we cannot be sure about the middle elements, though.
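To make the repeated pairwise step concrete, here is a minimal scalar sketch in plain C. It does not use the NEON intrinsics; it only models what vpmax.u8 and vpmin.u8 do on two 8-byte vectors (results from adjacent pairs of the first operand land in lanes 0..3, results from the second operand in lanes 4..7). The helper names `model_vpmax_u8`, `model_vpmin_u8`, `pairwise_step` and `four_pairwise_steps` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of vpmax.u8 on two 8-byte vectors (hypothetical helper):
   lanes 0..3 hold the max of adjacent pairs of a, lanes 4..7 those of b. */
static void model_vpmax_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[i]     = a[2 * i] > a[2 * i + 1] ? a[2 * i] : a[2 * i + 1];
        out[i + 4] = b[2 * i] > b[2 * i + 1] ? b[2 * i] : b[2 * i + 1];
    }
}

/* Scalar model of vpmin.u8, same lane layout as above (hypothetical helper). */
static void model_vpmin_u8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[i]     = a[2 * i] < a[2 * i + 1] ? a[2 * i] : a[2 * i + 1];
        out[i + 4] = b[2 * i] < b[2 * i + 1] ? b[2 * i] : b[2 * i + 1];
    }
}

/* One step of the idea above: (A, B) <- (vpmax(A, B), vpmin(A, B)). */
static void pairwise_step(uint8_t a[8], uint8_t b[8]) {
    uint8_t hi[8], lo[8];
    model_vpmax_u8(a, b, hi);
    model_vpmin_u8(a, b, lo);
    memcpy(a, hi, 8);
    memcpy(b, lo, 8);
}

/* Treats v[0..7] as A and v[8..15] as B and applies the step four times. */
static void four_pairwise_steps(uint8_t v[16]) {
    for (int k = 0; k < 4; k++) {
        pairwise_step(v, v + 8);
    }
}
```

In this model, after `four_pairwise_steps(v)` the largest byte sits in `v[0]` and the smallest in `v[15]`, while the 16 values themselves are only permuted. The order of the 14 bytes in between is not guaranteed.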


Another example: if we first do (C,D)=(vmax(A,B),vmin(A,B)), then (E,F)=(vpmax(C,D),vpmin(C,D)), and then (G,H)=vzip(E,F), the array ends up split into four groups of four bytes, and in each group we already know the largest and the smallest element. The next naive step would probably be to deinterleave this array so that the top byte of each group sits at the start of the array (these won't necessarily be the top 4 elements of the array, just the top bytes of their respective groups) and repeat; I'm not yet sure where that leads in the end.
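The vmax/vmin, vpmax/vpmin, vzip sequence can be traced with a similar scalar sketch. Again this is plain C that models the lane behaviour of the instructions rather than calling NEON intrinsics, and the names `max_u8`, `min_u8` and `max_min_pairwise_zip` are hypothetical. Under this model, the four values that started in lanes 2i and 2i+1 of A and B end up in G[2i], G[2i+1], H[2i] and H[2i+1], with the group's largest value in G[2i] and its smallest in H[2i+1], so the groups are not contiguous in memory.

```c
#include <stdint.h>

static uint8_t max_u8(uint8_t x, uint8_t y) { return x > y ? x : y; }
static uint8_t min_u8(uint8_t x, uint8_t y) { return x < y ? x : y; }

/* Hypothetical helper: takes A in v[0..7] and B in v[8..15], runs the three
   steps, and leaves G in v[0..7] and H in v[8..15]. */
static void max_min_pairwise_zip(uint8_t v[16]) {
    uint8_t c[8], d[8], e[8], f[8];
    const uint8_t *a = v;
    const uint8_t *b = v + 8;

    /* (C, D) = (vmax(A, B), vmin(A, B)): lane-wise max and min. */
    for (int i = 0; i < 8; i++) {
        c[i] = max_u8(a[i], b[i]);
        d[i] = min_u8(a[i], b[i]);
    }

    /* (E, F) = (vpmax(C, D), vpmin(C, D)): adjacent pairs; results taken
       from C go to lanes 0..3, results taken from D go to lanes 4..7. */
    for (int i = 0; i < 4; i++) {
        e[i]     = max_u8(c[2 * i], c[2 * i + 1]);
        e[i + 4] = max_u8(d[2 * i], d[2 * i + 1]);
        f[i]     = min_u8(c[2 * i], c[2 * i + 1]);
        f[i + 4] = min_u8(d[2 * i], d[2 * i + 1]);
    }

    /* (G, H) = vzip(E, F): G = E0 F0 E1 F1 E2 F2 E3 F3,
                            H = E4 F4 E5 F5 E6 F6 E7 F7. */
    for (int i = 0; i < 4; i++) {
        v[2 * i]         = e[i];
        v[2 * i + 1]     = f[i];
        v[8 + 2 * i]     = e[i + 4];
        v[8 + 2 * i + 1] = f[i + 4];
    }
}
```

The two middle values of each group, G[2i+1] and H[2i], are not ordered relative to each other by these steps; one more compare-exchange between them would finish sorting each group of four.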


Is there any known method for this particular problem or for other similar problems (for different array sizes or whatever)? Any ideas are appreciated :)




