I thought you did. (From my tests, 8-bit calculations are simply not enough. I got an 8-bit version working to try to save more memory, but the results aren’t good enough, so I’ll stick to 16 bits as well.)
What I don’t understand is when you say:

“Yes, I use 1024 values of the signal and get the 256 points of the spectrum.”
If you use 1024 samples, you can get 512 frequency buckets to display (one half of the mirror output).
How come you only get 256?
512 real samples can give you 256 FFT frequency buckets…
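
For anyone following along, here is a rough sketch of the "N real samples give N/2 usable buckets" rule being discussed. It is not the code from either library in this thread: it uses a naive DFT in plain Python, and the names `dft`, `display_buckets`, `N` and `TONE_BUCKET` are made up for the illustration. It uses 64 samples to keep the slow DFT quick; the same ratio applies to 1024 samples (512 buckets) or 512 samples (256 buckets).

```python
import cmath
import math


def dft(samples):
    """Naive O(N^2) DFT. Slow, but enough for a small demonstration."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def display_buckets(samples):
    """Magnitudes of the non-mirrored half: buckets 0 .. N/2 - 1."""
    spectrum = dft(samples)
    return [abs(c) for c in spectrum[: len(samples) // 2]]


N = 64            # number of real input samples (assumed for the demo)
TONE_BUCKET = 5   # test tone with exactly 5 cycles across the window

signal = [math.sin(2 * math.pi * TONE_BUCKET * t / N) for t in range(N)]

spectrum = dft(signal)
buckets = display_buckets(signal)

# For real input, bucket N - k is the complex conjugate of bucket k,
# so the upper half carries no extra information.
mirror_ok = all(
    abs(spectrum[k] - spectrum[N - k].conjugate()) < 1e-9
    for k in range(1, N // 2)
)

# Index of the strongest bucket in the half we would display.
peak = max(range(len(buckets)), key=buckets.__getitem__)

print(len(buckets), mirror_ok, peak)
```

With 64 real samples this leaves 32 buckets to display, and the test tone lands in bucket 5. Getting only a quarter of the sample count would mean something else is going on, for example the input being treated as interleaved real/imaginary pairs.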
I actually had the scaling wrong by one bit. I’ll release a fixed version soon.