i.e. we roll our own optimised code that plays nicely with our custom memory design
e.g. a naive memcpy that's optimised for LuaJIT:
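function __FUNCS__.memcpy(dest, src, len)
    -- round len down to a multiple of 4 (-4 has the same bit pattern as bit.bnot(3))
    local fastLen = bit.band(len, -4)
    -- bulk copy, one 32-bit word at a time
    for i = 0, fastLen - 1, 4 do
        __MEMORY_WRITE_32__(mem_0, dest + i, __MEMORY_READ_32__(mem_0, src + i))
    end
    -- copy the remaining 0-3 tail bytes; start at fastLen, not len - fastLen
    for i = fastLen, len - 1 do
        __MEMORY_WRITE_8__(mem_0, dest + i, __MEMORY_READ_8__(mem_0, src + i))
    end
    return dest
end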
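To sanity-check the word loop plus byte tail outside the generated runtime, something like the following stand-alone sketch could be used. The byte-table memory and the `read8`/`write8`/`read32`/`write32` helpers are stand-ins invented for this example (assumptions), not the project's real `__MEMORY_*__` implementations, and `len - len % 4` is used in place of `bit.band(len, -4)` so it also runs on plain Lua without the `bit` library.

```lua
-- Hypothetical stand-in memory: a plain Lua table with one byte per slot.
-- The real __MEMORY_*__ helpers are not shown here; these are assumptions.
local function newMemory(size)
    local mem = {}
    for i = 0, size - 1 do mem[i] = 0 end
    return mem
end

local function read8(mem, addr) return mem[addr] end
local function write8(mem, addr, value) mem[addr] = value % 256 end

-- Little-endian 32-bit access built on top of the byte slots.
local function read32(mem, addr)
    return mem[addr] + mem[addr + 1] * 256
         + mem[addr + 2] * 65536 + mem[addr + 3] * 16777216
end

local function write32(mem, addr, value)
    for k = 0, 3 do
        mem[addr + k] = value % 256
        value = math.floor(value / 256)
    end
end

-- Word-sized bulk loop followed by a byte tail that starts at fastLen.
local function memcpy(mem, dest, src, len)
    local fastLen = len - len % 4 -- same result as bit.band(len, -4) for len >= 0
    for i = 0, fastLen - 1, 4 do
        write32(mem, dest + i, read32(mem, src + i))
    end
    for i = fastLen, len - 1 do
        write8(mem, dest + i, read8(mem, src + i))
    end
    return dest
end

local mem = newMemory(64)
for i = 0, 6 do write8(mem, i, i + 1) end -- source bytes 1..7 at address 0
memcpy(mem, 32, 0, 7)                     -- 4 bytes via the word loop, 3 via the tail
memcpy(mem, 48, 0, 3)                     -- shorter than one word: tail loop only
```

The second call is the interesting case: with fewer than 4 bytes the word loop never runs, so everything depends on the tail loop starting at the right index.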
Before / After: (images missing)

This can be optimised further in the pure-Lua memory version by letting memcpy copy the fpMap entries too, so that memory type hints are preserved and we keep the speed without having to convert between floats and ints.
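A rough sketch of that idea follows. The layout of fpMap isn't shown here, so this assumes a purely hypothetical memory object where `words[i]` holds the value of word `i` and `fpMap[i] == true` marks that word as a float; `newMemory`, `writeFloat`, `writeInt` and `copyWords` are names made up for illustration, not the project's API.

```lua
-- Hypothetical pure-Lua memory (assumed layout, for illustration only):
--   words[i] : value stored in word i
--   fpMap[i] : true if word i was last written as a float, nil otherwise
local function newMemory()
    return { words = {}, fpMap = {} }
end

local function writeFloat(mem, idx, value)
    mem.words[idx] = value
    mem.fpMap[idx] = true
end

local function writeInt(mem, idx, value)
    mem.words[idx] = value
    mem.fpMap[idx] = nil
end

-- Word-aligned copy that carries the type hint along with the value,
-- so a copied float is still flagged as a float at the destination.
-- Assumes the source and destination ranges do not overlap.
local function copyWords(mem, destIdx, srcIdx, count)
    for i = 0, count - 1 do
        mem.words[destIdx + i] = mem.words[srcIdx + i]
        mem.fpMap[destIdx + i] = mem.fpMap[srcIdx + i]
    end
    return destIdx
end

local mem = newMemory()
writeFloat(mem, 0, 1.5)
writeInt(mem, 1, 42)
copyWords(mem, 8, 0, 2)
```

Copying the hint alongside the value is what avoids the float-to-int round trip; unaligned head and tail bytes would still need the byte-wise path.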