Embedded World 2023 - STM32 CORDIC CO-PROCESSOR

runger · March 21, 2023, 12:04pm

I have to think about this and look at the code more carefully. I think the normalisation should ensure the angle is in the range [0, 2π), i.e. it should never actually take the value 2π, because that is the same as 0. Otherwise we have two 0 values.
On the other hand, the way this function is used it doesn’t matter as long as the result of _sin(0) and _sin(2π) is the same. But it does matter when generating the LUT, to make sure the discretisation is correct.
Another thought is for the interpolation, where it could be convenient to have the extra array element for “interpolating the wrap-around”…
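
A minimal sketch of the half-open range discussed above, assuming an fmod-based wrap. The helper name wrap_0_2pi and the constant TWO_PI_F are illustrative and not taken from the library:

```cpp
#include <cmath>

static const float TWO_PI_F = 6.28318530718f;  // illustrative constant

// Wrap any angle into [0, 2*pi): 2*pi itself maps to 0, so there is only one zero.
// Caveat: a tiny negative input can still round up to 2*pi in float arithmetic.
float wrap_0_2pi(float angle) {
  float a = std::fmod(angle, TWO_PI_F);     // result lies in (-2*pi, 2*pi)
  return (a >= 0.0f) ? a : (a + TWO_PI_F);  // shift negative results up by one turn
}
```

With this convention a LUT only needs to cover indices for angles strictly below 2π, which is what matters for the discretisation.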

Your point regarding the Lepton is a good one, we can easily afford some extra bytes due to the uint16_t savings.

Just thinking about what it will take to get this included into the code-base:

I think we can start by making the _sin() and _cos() implementations weakly bound, so that people can easily replace them with their own implementations (for example CORDIC-based ones).
But as we saw, most of the performance problems are caused by _normalizeAngle(). This is also needed outside of the context of sine/cosine operations, so I have to think about this some more.
We’ll do some more precision tests and speed tests, but from your results it looks like we could improve both at the same time, while also reducing the memory used.

dekutree64 · March 21, 2023, 1:44pm

The bitmask normalization in this function does wrap 2pi back to 0. The extra entry is for quadrant turnaround points. Think of it as the zero entry for descending quadrants. And it does serve as the interpolation endpoint. By the time sine_array[16] becomes the ‘a’ value for interpolation, you’ve stepped into the next quadrant and the ‘b’ value is sine_array[15], headed back toward the start of the array.
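
A rough sketch of the quadrant turnaround described above, assuming a 17-entry quarter-wave table with 8 fractional bits for interpolation. The names quarter17, init_quarter17 and sin17_sketch are hypothetical, and the table is filled at startup here only to keep the example short:

```cpp
#include <cmath>
#include <cstdint>

static const float TWO_PI_F = 6.28318530718f;
static uint16_t quarter17[17];  // entries 0..16 cover 0..pi/2; entry 16 is the peak

void init_quarter17() {
  for (int k = 0; k <= 16; k++) {
    quarter17[k] = (uint16_t)std::lround(32768.0 * std::sin(k * (1.5707963267948966 / 16.0)));
  }
}

float sin17_sketch(float angle) {
  // 64 segments per turn, 8 fractional bits inside each segment
  unsigned int i = (unsigned int)(angle * (64 * 256 / TWO_PI_F));
  int frac = i & 0xff;
  i = (i >> 8) & 0x3f;  // the bitmask wraps 2*pi back to 0
  int a, b;
  if (i < 16)      { a = quarter17[i];        b = quarter17[i + 1]; }    // rising
  else if (i < 32) { a = quarter17[32 - i];   b = quarter17[31 - i]; }   // entry 16 is 'a' at i == 16
  else if (i < 48) { a = -quarter17[i - 32];  b = -quarter17[i - 31]; }  // negative half, falling
  else             { a = -quarter17[64 - i];  b = -quarter17[63 - i]; }  // negative half, rising
  // Divide instead of shifting so negative differences are handled portably.
  return (1.0f / 32768.0f) * (a + ((b - a) * frac) / 256);
}
```

At i == 16 the 'a' value is quarter17[16] and 'b' is quarter17[15], which is the "zero entry for descending quadrants" behaviour.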

After a quick look through the code, I think the call to _normalizeAngle in BLDCMotor::setPhaseVoltage for SpaceVectorPWM is the only one this will actually eliminate. The one for SinePWM is unnecessary as is, the trapezoid modes will still need it, and conceptually FOCMotor::electricalAngle() should return a normalized value, so I don’t think we should remove that one even though it will be technically unneeded with the new sin/cos.

This interpolating version will need speed testing against the original. It may actually be slower and not worth doing, now that I know the normalize calls can’t be removed…

Juan-Antonio_Soren_E · March 21, 2023, 3:51pm

Added initial CORDIC support here →

Initial 8pwm by Juanduino · Pull Request #260 · simplefoc/Arduino-FOC (github.com)

runger · March 21, 2023, 5:29pm

That wasn’t my conclusion, but maybe I need to look again. Based on the comments in the code, I thought it gets rid of most of them, except the initial electrical angle determination and the trapezoidal modes…

In terms of my tests, the 65 entry LUT with interpolation did not do so well, worse than the CORDIC. Apparently ldexp is slow…

But the 256 entry LUT version performs like this:

Timing CORDIC vs stdlib sin vs SimpleFOC Sine calculations...

CORDIC:
CORDIC Time (us) for 3217 steps: 6572
Result: 2048.00

SimpleFOC _sin:
SimpleFOC _sin time (us) for 3217 steps: 926
Result: 2047.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 2684
Result: 2048.00

Deku sin:
Deku sin time (us) for 3217 steps: 793
Result: 2047.94

Comparing accuracy...
RMS difference between CORDIC and stdlib: 0.00000046
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku Sine and stdlib: 0.00125757

So a little faster, and a little more accurate. And of course the bigger gain is the built-in normalisation and avoiding fmod():

SimpleFOC sin + normalizeAngle:
SimpleFOC + normalizeAngle time (us) for 3217 steps: 3147
Result: 2047.98

dekutree64 · March 21, 2023, 10:19pm

Oh right, I’ve been going with maximum difference rather than RMS. Here are measurements of both methods on PC (minor discrepancies with your numbers are probably due to use of double rather than float for the stdlib sin() and the error accumulator):

RMS difference between SimpleFOC and stdlib:   0.001611
Max difference between SimpleFOC and stdlib: +-0.003970

RMS difference between 257 entry LUT and stdlib:   0.001253
Max difference between 257 entry LUT and stdlib: +-0.003108

RMS difference between 17 entry LUT with interpolation and stdlib:   0.000642
Max difference between 17 entry LUT with interpolation and stdlib: +-0.001290
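
A small sketch of how both figures can be collected in one pass on a PC. The sweep over [0, π) in steps of 1/1024 is an assumption, and compare_to_stdlib and crude_sin are hypothetical names; crude_sin is only a stand-in so the helper has something to measure:

```cpp
#include <cmath>

struct ErrorStats { double rms; double max_abs; };

// Compare an approximation against std::sin and report RMS and worst-case error.
ErrorStats compare_to_stdlib(float (*approx)(float)) {
  double sum_sq = 0.0, max_abs = 0.0;
  int steps = 0;
  for (float x = 0.0f; x < 3.14159265f; x += 1.0f / 1024.0f, steps++) {
    double diff = std::sin((double)x) - (double)approx(x);  // double accumulator
    sum_sq += diff * diff;
    if (std::fabs(diff) > max_abs) max_abs = std::fabs(diff);
  }
  return { std::sqrt(sum_sq / steps), max_abs };
}

// Deliberately crude parabola approximation of sine on [0, pi].
float crude_sin(float x) {
  return x * (3.14159265f - x) * (4.0f / (3.14159265f * 3.14159265f));
}
```

The maximum error is always at least as large as the RMS error, which is why the two columns above differ by a factor of two to three.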

If you do more speed tests, it would be interesting to see the 17 LUT+interpolation without ldexp. You only need to change the last line to return (1.0f/32768.0f) * (a + (((b - a) * frac) >> 8));
But it will still be slow due to the two loads and the interpolation multiply. Its main advantage is low flash space due to the tiny LUT. If that allows including the combined _sincos, you could measure that against SimpleFOC _sin and _cos. It may end up winning after all.

But the 257 LUT is my favorite. Fast and no more range restriction. And could be even faster if I can figure out how to get rid of the + 1) >> 1) rounding operation without sacrificing accuracy… But if the larger table is too much, then maybe we should just stick with the original.
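
For readers wondering about the + 1) >> 1) step: it rounds to the nearest table index by scaling to twice the resolution, adding one and shifting right once, which is equivalent to floor(x + 0.5). A minimal sketch with hypothetical helper names:

```cpp
static const float TWO_PI_F = 6.28318530718f;

// Round-to-nearest index on a 1024-step circle (4 quadrants x 256 steps).
unsigned int index_rounded(float a) {
  return ((unsigned int)(a * (256 * 8 / TWO_PI_F) + 1) >> 1) & 0x3ff;
}

// Plain truncation for comparison: one add and one shift cheaper,
// but biased downward by half a step on average.
unsigned int index_truncated(float a) {
  return (unsigned int)(a * (256 * 4 / TWO_PI_F)) & 0x3ff;
}
```

Dropping the rounding would therefore shift the whole curve by half a table step, which is where the accuracy loss would come from.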

One possibility would be to make a weak _sincos that just calls _sin and _cos, and modify all the library code to use it. Then you can override it with the fastest 257 LUT _sincos if you have the flash space for it. I tried compiling for my Lepton with it to see how much it would add, but I’m getting nonsensical results. More code uses less flash space…
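
A sketch of what such a weak default could look like, assuming a GCC or Clang toolchain where __attribute__((weak)) is available. The names lib_sin, lib_cos and lib_sincos are placeholders, not the library's functions:

```cpp
#include <cmath>

// Stand-ins for the library's separate lookups (hypothetical names).
float lib_sin(float a) { return std::sin(a); }
float lib_cos(float a) { return std::cos(a); }

// Default combined version. Because it is weakly bound, a user sketch can
// provide its own non-weak lib_sincos() (for example a 257 entry LUT version)
// and the linker will pick that one instead.
__attribute__((weak)) void lib_sincos(float a, float* s, float* c) {
  *s = lib_sin(a);
  *c = lib_cos(a);
}
```

Library code would then call lib_sincos() everywhere it currently needs both values for the same angle.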

runger · March 21, 2023, 10:30pm

Here are my results on the Nano (ATMega328):

Timing Deku vs stdlib sin vs SimpleFOC Sine calculations...


SimpleFOC _sin:
SimpleFOC _sin time (us) for 3217 steps: 206988
Result: 2047.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 415776
Result: 2048.00

Deku sin:
Deku sin time (us) for 3217 steps: 172564
Result: 2047.94

SimpleFOC sin + normalizeAngle:
SimpleFOC + normalizeAngle time (us) for 3217 steps: 230452
Result: 2047.98

Comparing accuracy...
RMS difference between Deku65 Sine and stdlib: 0.01027036
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku256 Sine and stdlib: 0.00125757
Test complete.

And on the Raspberry Pi Pico:


Timing Deku vs stdlib sin vs SimpleFOC Sine calculations...


SimpleFOC _sin:
SimpleFOC _sin time (us) for 3217 steps: 16265
Result: 2047.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 19578
Result: 2048.00

Deku sin:
Deku sin time (us) for 3217 steps: 13543
Result: 2047.94

SimpleFOC sin + normalizeAngle:
SimpleFOC + normalizeAngle time (us) for 3217 steps: 18670
Result: 2047.98

Comparing accuracy...
RMS difference between Deku65 Sine and stdlib: 0.00006480
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku256 Sine and stdlib: 0.00125757
Test complete.

runger · March 23, 2023, 8:30pm

Playing around some more with this…

The performance winner so far on the STM32G4 is this:

unsigned short sine_array4[257] = {0, 201, 402, 603, 804, 1005, 1206, 1407, 1608, 1809, 2009, 2210, 2411, 2611, 2811, 3012, 3212, 3412, 3612, 3812, 4011, 4211, 4410, 4609, 4808, 5007, 5205, 5404, 5602, 5800, 5998, 6195, 6393, 6590, 6787, 6983, 7180, 7376, 7571, 7767, 7962, 8157, 8351, 8546, 8740, 8933, 9127, 9319, 9512, 9704, 9896, 10088, 10279, 10469, 10660, 10850, 11039, 11228, 11417, 11605, 11793, 11980, 12167, 12354, 12540, 12725, 12910, 13095, 13279, 13463, 13646, 13828, 14010, 14192, 14373, 14553, 14733, 14912, 15091, 15269, 15447, 15624, 15800, 15976, 16151, 16326, 16500, 16673, 16846, 17018, 17190, 17361, 17531, 17700, 17869, 18037, 18205, 18372, 18538, 18703, 18868, 19032, 19195, 19358, 19520, 19681, 19841, 20001, 20160, 20318, 20475, 20632, 20788, 20943, 21097, 21251, 21403, 21555, 21706, 21856, 22006, 22154, 22302, 22449, 22595, 22740, 22884, 23028, 23170, 23312, 23453, 23593, 23732, 23870, 24008, 24144, 24279, 24414, 24548, 24680, 24812, 24943, 25073, 25202, 25330, 25457, 25583, 25708, 25833, 25956, 26078, 26199, 26320, 26439, 26557, 26674, 26791, 26906, 27020, 27133, 27246, 27357, 27467, 27576, 27684, 27791, 27897, 28002, 28106, 28209, 28311, 28411, 28511, 28610, 28707, 28803, 28899, 28993, 29086, 29178, 29269, 29359, 29448, 29535, 29622, 29707, 29792, 29875, 29957, 30038, 30118, 30196, 30274, 30350, 30425, 30499, 30572, 30644, 30715, 30784, 30853, 30920, 30986, 31050, 31114, 31177, 31238, 31298, 31357, 31415, 31471, 31527, 31581, 31634, 31686, 31737, 31786, 31834, 31881, 31927, 31972, 32015, 32058, 32099, 32138, 32177, 32214, 32251, 32286, 32319, 32352, 32383, 32413, 32442, 32470, 32496, 32522, 32546, 32568, 32590, 32610, 32629, 32647, 32664, 32679, 32693, 32706, 32718, 32729, 32738, 32746, 32753, 32758, 32762, 32766, 32767, 32768};

float deku_sin257(float a) {
  unsigned int i = ((unsigned int)(a * (256*8 /_2PI) + 1) >> 1) & 0x3ff;
  if (i < 256) {
    return (1/32768.0f)*sine_array4[i];
  }
  else if(i < 512) {
    return (1/32768.0f)*sine_array4[512 - i];
  }
  else if(i < 768) {
    return -(1/32768.0f)*sine_array4[-512 + i];
  }
  else {
    return -(1/32768.0f)*sine_array4[1024 - i];
  }
}
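
The integer table above appears to be round(32768 · sin(k · π/512)) for k = 0..256. A sketch that regenerates it under that assumption (fill_sine_table257 is a hypothetical helper, handy for trying other table sizes):

```cpp
#include <cmath>
#include <cstdint>

// Fill a 257-entry quarter-wave table scaled so that sin(pi/2) maps to 32768.
void fill_sine_table257(uint16_t table[257]) {
  const double quarter_turn = 1.5707963267948966;  // pi/2
  for (int k = 0; k <= 256; k++) {
    table[k] = (uint16_t)std::lround(32768.0 * std::sin(k * quarter_turn / 256.0));
  }
}
```

Note that the last entry is 32768, which fits in an unsigned 16-bit value but not in a signed one.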

A close second is a pure float table:

float f_sine_array[257] = { 0.0f, 0.006135884649154475f, 0.012271538285719925f, 0.01840672990580482f, 0.024541228522912288f, 0.030674803176636626f, 0.03680722294135883f, 0.04293825693494082f, 0.049067674327418015f, 0.05519524434968994f, 0.06132073630220858f, 0.06744391956366405f, 0.07356456359966743f, 0.07968243797143013f, 0.0857973123444399f, 0.09190895649713272f, 0.0980171403295606f, 0.10412163387205459f, 0.11022220729388306f, 0.11631863091190475f, 0.1224106751992162f, 0.12849811079379317f, 0.13458070850712617f, 0.1406582393328492f, 0.14673047445536175f, 0.15279718525844344f, 0.15885814333386145f, 0.16491312048996992f, 0.17096188876030122f, 0.17700422041214875f, 0.18303988795514095f, 0.1890686641498062f, 0.19509032201612825f, 0.2011046348420919f, 0.20711137619221856f, 0.21311031991609136f, 0.2191012401568698f, 0.22508391135979283f, 0.2310581082806711f, 0.2370236059943672f, 0.24298017990326387f, 0.24892760574572015f, 0.25486565960451457f, 0.2607941179152755f, 0.26671275747489837f, 0.272621355449949f, 0.27851968938505306f, 0.2844075372112719f, 0.29028467725446233f, 0.2961508882436238f, 0.3020059493192281f, 0.30784964004153487f, 0.3136817403988915f, 0.3195020308160157f, 0.3253102921622629f, 0.33110630575987643f, 0.33688985339222005f, 0.3426607173119944f, 0.34841868024943456f, 0.35416352542049034f, 0.3598950365349881f, 0.36561299780477385f, 0.37131719395183754f, 0.37700741021641826f, 0.3826834323650898f, 0.38834504669882625f, 0.3939920400610481f, 0.3996241998456468f, 0.40524131400498986f, 0.4108431710579039f, 0.41642956009763715f, 0.4220002707997997f, 0.4275550934302821f, 0.43309381885315196f, 0.43861623853852766f, 0.4441221445704292f, 0.44961132965460654f, 0.45508358712634384f, 0.46053871095824f, 0.4659764957679662f, 0.47139673682599764f, 0.4767992300633221f, 0.4821837720791227f, 0.487550160148436f, 0.49289819222978404f, 0.49822766697278187f, 0.5035383837257176f, 0.508830142543107f, 0.5141027441932217f, 0.5193559901655896f, 0.524589682678469f, 0.5298036246862946f, 
0.5349976198870972f, 0.5401714727298929f, 0.5453249884220465f, 0.5504579729366048f, 0.5555702330196022f, 0.560661576197336f, 0.5657318107836131f, 0.5707807458869673f, 0.5758081914178453f, 0.5808139580957645f, 0.5857978574564389f, 0.5907597018588742f, 0.5956993044924334f, 0.600616479383869f, 0.6055110414043255f, 0.6103828062763095f, 0.6152315905806268f, 0.6200572117632891f, 0.6248594881423863f, 0.629638238914927f, 0.6343932841636455f, 0.6391244448637757f, 0.6438315428897914f, 0.6485144010221124f, 0.6531728429537768f, 0.6578066932970786f, 0.6624157775901718f, 0.6669999223036375f, 0.6715589548470183f, 0.6760927035753159f, 0.680600997795453f, 0.6850836677727004f, 0.6895405447370668f, 0.6939714608896539f, 0.6983762494089729f, 0.7027547444572253f, 0.7071067811865475f, 0.7114321957452163f, 0.7157308252838186f, 0.7200025079613817f, 0.7242470829514669f, 0.7284643904482252f, 0.7326542716724127f, 0.7368165688773698f, 0.740951125354959f, 0.745057785441466f, 0.7491363945234593f, 0.7531867990436124f, 0.7572088465064845f, 0.7612023854842618f, 0.7651672656224588f, 0.7691033376455796f, 0.7730104533627369f, 0.7768884656732324f, 0.7807372285720944f, 0.7845565971555752f, 0.7883464276266062f, 0.7921065773002123f, 0.7958369046088835f, 0.799537269107905f, 0.8032075314806448f, 0.8068475535437992f, 0.8104571982525948f, 0.8140363297059483f, 0.8175848131515837f, 0.8211025149911046f, 0.8245893027850253f, 0.8280450452577557f, 0.8314696123025451f, 0.83486287498638f, 0.838224705554838f, 0.8415549774368983f, 0.844853565249707f, 0.8481203448032971f, 0.8513551931052652f, 0.8545579883654005f, 0.8577286100002721f, 0.8608669386377672f, 0.8639728561215867f, 0.8670462455156926f, 0.8700869911087113f, 0.87309497841829f, 0.8760700941954065f, 0.8790122264286334f, 0.8819212643483549f, 0.8847970984309378f, 0.8876396204028539f, 0.8904487232447579f, 0.8932243011955153f, 0.8959662497561851f, 0.8986744656939538f, 0.901348847046022f, 0.9039892931234433f, 0.9065957045149153f, 0.9091679830905224f, 
0.9117060320054299f, 0.9142097557035307f, 0.9166790599210427f, 0.9191138516900578f, 0.9215140393420419f, 0.9238795325112867f, 0.9262102421383114f, 0.9285060804732155f, 0.9307669610789837f, 0.9329927988347388f, 0.9351835099389475f, 0.937339011912575f, 0.9394592236021899f, 0.9415440651830208f, 0.9435934581619604f, 0.9456073253805213f, 0.9475855910177411f, 0.9495281805930367f, 0.9514350209690083f, 0.9533060403541938f, 0.9551411683057707f, 0.9569403357322089f, 0.9587034748958716f, 0.9604305194155658f, 0.9621214042690416f, 0.9637760657954398f, 0.9653944416976894f, 0.9669764710448521f, 0.9685220942744173f, 0.970031253194544f, 0.9715038909862518f, 0.9729399522055601f, 0.9743393827855759f, 0.9757021300385286f, 0.9770281426577544f, 0.9783173707196277f, 0.9795697656854405f, 0.9807852804032304f, 0.9819638691095552f, 0.9831054874312163f, 0.984210092386929f, 0.9852776423889412f, 0.9863080972445987f, 0.9873014181578584f, 0.9882575677307495f, 0.989176509964781f, 0.9900582102622971f, 0.99090263542778f, 0.9917097536690995f, 0.99247953459871f, 0.9932119492347945f, 0.9939069700023561f, 0.9945645707342554f, 0.9951847266721968f, 0.9957674144676598f, 0.996312612182778f, 0.9968202992911657f, 0.9972904566786902f, 0.9977230666441916f, 0.9981181129001492f, 0.9984755805732948f, 0.9987954562051724f, 0.9990777277526454f, 0.9993223845883495f, 0.9995294175010931f, 0.9996988186962042f, 0.9998305817958234f, 0.9999247018391445f, 0.9999811752826011f, 1.0 };

float float_sine257(float a) {
  unsigned int i = ((unsigned int)(a * (256*8 /_2PI) + 1) >> 1) & 0x3ff;
  if (i < 256) {
    return f_sine_array[i];
  }
  else if(i < 512) {
    return f_sine_array[512 - i];
  }
  else if(i < 768) {
    return -f_sine_array[-512 + i];
  }
  else {
    return -f_sine_array[1024 - i];
  }
}

Although it is not quite clear to me why the float table comes out slower.

The LUT with 129 elements is unfortunately less accurate than the others (as expected):

Starting...
Initializing CORDIC...
CORDIC initialized.


Timing CORDIC vs stdlib sin vs SimpleFOC Sine calculations...

CORDIC:
CORDIC Time (us) for 3217 steps: 6574
Result: 2048.00

SimpleFOC _sin:
SimpleFOC _sin time (us) for 3217 steps: 927
Result: 2047.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 2713
Result: 2048.00

Deku sin:
Deku sin time (us) for 3217 steps: 793
Result: 2047.94

SimpleFOC sin + normalizeAngle:
SimpleFOC + normalizeAngle time (us) for 3217 steps: 3150
Result: 2047.98

Float257 Sine:
Float257 Sine time (us) for 3217 steps: 719
Result: 2048.00

Deku257 Sine:
Deku257 Sine time (us) for 3217 steps: 676
Result: 2048.00

Deku129 Sine:
Deku129 Sine time (us) for 3217 steps: 736
Result: 2047.99

Comparing accuracy...
RMS difference between CORDIC and stdlib: 0.00000046
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku256 Sine and stdlib: 0.00125757
RMS difference between Float Sine and stdlib: 0.00125250
RMS difference between Deku257 Sine and stdlib: 0.00125253
RMS difference between Deku129 Sine and stdlib: 0.00250501
Test complete.

I’ll test some more MCUs tonight if I can.

dekutree64 · March 23, 2023, 11:50pm

Wow, I would have thought pure float table would win by a lot. No float-to-int conversion or multiply. It does still have the negation operation on two cases, but that should only take one cycle to toggle the sign bit. The problem with it is that the table is twice the size. That wouldn’t be a problem in the version with interpolation, but then the integer math is probably enough faster to make up for the conversion to float afterward.

I’m still curious to see the time for the interpolated lookup with multiply by 1/32768 instead of ldexp.

EDIT: I just figured out why I was getting nonsensical code size for _sincos in post #85. I forgot to change the return type from float to void. Apparently the STM32 compiler does not like that… but doesn’t give a warning either. It looks like it takes 270 bytes, so with the space saved by the 17 entry LUT versus the original 200, we should still come out almost 100 bytes ahead. I think this is high enough precision already, but 65 LUT with _sincos should be roughly the same size as the original and much higher precision. Or we could do 33 entries.

I also ran the speed test on Lepton (64MHz STM32G031). I had a bit of trouble keeping the compiler from optimizing everything out, hence the use of volatile variables. And I’m not sure where you came up with the 3217 number, but I went with it. Here is the code:

volatile float s, c;
void TestSpeed()
{
  int i;
  unsigned long startTime, endTime;
  
  startTime = _micros();
  for (i = 0; i < 3217; i++) { s = _sin(i*_2PI/3217); }
  endTime = _micros();
  Serial.print("SimpleFOC _sin time (us) for 3217 steps:");
  Serial.println(endTime - startTime);
  
  startTime = _micros();
  for (i = 0; i < 3217; i++) { s = _sin17(i*_2PI/3217); }
  endTime = _micros();
  Serial.print("Deku _sin17 time (us) for 3217 steps:");
  Serial.println(endTime - startTime);
  
  startTime = _micros();
  for (i = 0; i < 3217; i++) { float a = i*_2PI/3217; s = _sin(a); c = _cos(a); }
  endTime = _micros();
  Serial.print("SimpleFOC _sin, _cos time (us) for 3217 steps:");
  Serial.println(endTime - startTime);
  
  startTime = _micros();
  for (i = 0; i < 3217; i++) { _sincos17(i*_2PI/3217, &s, &c); }
  endTime = _micros();
  Serial.print("Deku _sincos17 time (us) for 3217 steps:");
  Serial.println(endTime - startTime);
  
  startTime = _micros();
  for (i = 0; i < 3217; i++) { _sincos17ldexp(i*_2PI/3217, &s, &c); }
  endTime = _micros();
  Serial.print("Deku _sincos17ldexp time (us) for 3217 steps:");
  Serial.println(endTime - startTime);
}

And here are the results:

SimpleFOC _sin time (us) for 3217 steps:83584
Deku _sin17 time (us) for 3217 steps:127195
SimpleFOC _sin, _cos time (us) for 3217 steps:144529
Deku _sincos17 time (us) for 3217 steps:141850
Deku _sincos17ldexp time (us) for 3217 steps:168877

By the difference between SimpleFOC _sin,_cos and 2x _sin alone, it appears the loop overhead is 22639 microseconds (7 micros per iteration, 450 CPU cycles). Accounting for that, the interpolated sine takes 70% longer than the original, but combined _sincos is almost the same as original _sin and _cos together, 37 microseconds, 2372 CPU cycles. And ldexp is indeed slow.
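
The arithmetic behind those figures, written out as a sketch. It assumes _cos costs about the same as _sin, so that two _sin loops pay the loop overhead twice while the combined loop pays it once; the constant and function names are illustrative:

```cpp
// Timings reported above: microseconds for 3217 iterations at 64 MHz.
const long T_SIN = 83584, T_SIN17 = 127195, T_SIN_COS = 144529, T_SINCOS17 = 141850;
const int STEPS = 3217;
const int CPU_MHZ = 64;

// 2 * (sin + overhead) - (sin + cos + overhead) leaves one overhead.
long loop_overhead_us() { return 2 * T_SIN - T_SIN_COS; }

// Time per call in microseconds once the loop overhead is removed.
double per_call_us(long total_us) {
  return (double)(total_us - loop_overhead_us()) / STEPS;
}

// Same figure expressed in CPU cycles.
double per_call_cycles(long total_us) { return per_call_us(total_us) * CPU_MHZ; }
```

This gives 22639 µs of overhead, about 37 µs per combined _sincos17 call, and a ratio of roughly 1.7 between the interpolated _sin17 and the original _sin.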

runger · March 24, 2023, 12:30pm

I’ve run an interpolated version of the 129 entry LUT in the meantime, on STM32G4:

CORDIC:
CORDIC Time (us) for 3217 steps: 6572
Result: 2048.00

SimpleFOC _sin:
SimpleFOC _sin time (us) for 3217 steps: 926
Result: 2047.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 2714
Result: 2048.00

Deku sin:
Deku sin time (us) for 3217 steps: 792
Result: 2047.94

SimpleFOC sin + normalizeAngle:
SimpleFOC + normalizeAngle time (us) for 3217 steps: 3148
Result: 2047.98

Float257 Sine:
Float257 Sine time (us) for 3217 steps: 719
Result: 2048.00

Deku257 Sine:
Deku257 Sine time (us) for 3217 steps: 676
Result: 2048.00

Deku129 Sine:
Deku129 Sine time (us) for 3217 steps: 736
Result: 2047.99

Deku129i Sine:
Deku129i Sine time (us) for 3217 steps: 931
Result: 2047.93

Comparing accuracy...
RMS difference between CORDIC and stdlib: 0.00000046
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku256 Sine and stdlib: 0.00125757
RMS difference between Float Sine and stdlib: 0.00125250
RMS difference between Deku257 Sine and stdlib: 0.00125253
RMS difference between Deku129 Sine and stdlib: 0.00250501
RMS difference between Deku129i Sine and stdlib: 0.00003220

Looks like it’s the same speed as SimpleFOC’s _sin(), but with a smaller table size and far better accuracy.

This is the code I ran:

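float deku_sin129i(float angle) {
  // 512 table steps per turn, with 8 fractional bits for interpolation
  unsigned int i = (unsigned int)(angle * (128*4*256 /_2PI));
  int a, b, frac = i & 0xff;
  i = (i >> 8) & 0x1ff;
  if (i < 128) {
    a = sine_array3[i]; b = sine_array3[i+1];
  }
  else if(i < 256) {
    a = sine_array3[256 - i]; b = sine_array3[255 - i];
  }
  else if(i < 384) {
    // third quadrant: sine is negative
    a = -sine_array3[-256 + i]; b = -sine_array3[-255 + i];
  }
  else {
    // fourth quadrant: still negative, heading back toward zero
    a = -sine_array3[512 - i]; b = -sine_array3[511 - i];
  }
  // the benchmark sweep only covers [0, π), so it exercises the first two branches
  return (1.0f/32768.0f) * (a + (((b - a) * frac) >> 8));
}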

I didn’t get to it last night, but I will keep on it and try different MCUs as I have time…

dekutree64 · March 24, 2023, 1:22pm

Thanks for performing these tests! Very interesting to see results on other platforms.

I wonder why you’re getting so much faster time on the interpolated one relative to SimpleFOC _sin than I am. And so much lower numbers in general. Can I see your test code that calls the sine functions?

I thought the loop overhead on mine seemed awfully high, so I tried changing the angle calculation from i*_2PI/3217 to i*(_2PI/3217) and surprisingly it seems to have eliminated the overhead almost entirely. SimpleFOC _cos takes a little longer than _sin, so it makes sense that _sin,_cos would be a little more than 2x _sin alone.
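
A likely explanation, offered as an assumption rather than something measured here: i*_2PI/3217 parses as (i*_2PI)/3217, so every iteration performs a float division, which runs in software on a Cortex-M0+ without an FPU. With the parentheses the compiler can fold the division into a constant. A sketch of the two forms (function names are illustrative):

```cpp
static const float TWO_PI_F = 6.28318530718f;

// As first benchmarked: one int-to-float conversion, one multiply and one
// divide at run time on every call.
float angle_divide_each_time(int i) { return i * TWO_PI_F / 3217; }

// Parenthesised: (2*pi / 3217) can be evaluated at compile time, leaving only
// the conversion and a single multiply.
float angle_premultiplied(int i) { return i * (TWO_PI_F / 3217); }
```

Both forms produce the same angle to within float rounding, so the benchmark inputs are effectively unchanged.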

I also tried my 257 entry non-interpolated lookup. Here are the results:

SimpleFOC _sin time (us) for 3217 steps:60023
SimpleFOC _sin, _cos time (us) for 3217 steps:121214

Deku _sin17 time (us) for 3217 steps:103439
Deku _sincos17 time (us) for 3217 steps:117919

Deku _sin257 time (us) for 3217 steps:50552
Deku _sincos257 time (us) for 3217 steps:64727

So at the cost of a few hundred bytes more flash space, non-interpolated _sincos cuts the time almost in half, while giving slightly better precision and no range restriction.

Next best is non-interpolated _sin and _cos separately, about 20% faster than original, but still needs around 100 bytes more flash space.

Third best is interpolated _sincos, which is only slightly faster than original, but higher precision, no range restriction, and should even save some flash space.

runger · March 24, 2023, 5:23pm

Absolutely, it is this:

#include <Arduino.h>
#include <SimpleFOC.h>
#include "common/foc_utils.h"
#include "./hal_cordic.h"
#include "./deku_sine.h"




void setup() {
  Serial.begin(115200);
  while (!Serial);
  delay(3000);
  Serial.println("Starting...");
  Serial.println("Initializing CORDIC...");
  CORDIC_Config();
  Serial.println("CORDIC initialized.");
  Serial.println();
  Serial.println();
}

void loop() {

  Serial.println("Timing CORDIC vs stdlib sin vs SimpleFOC Sine calculations...");
  Serial.println();

  Serial.println("CORDIC:");
  float step = 1/1024.0f;
  float res = 0.0;
  int steps = 0;
  long ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += cordic_sin(i);
    steps++;
  }
  long ts_end = micros();
  Serial.print("CORDIC Time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("SimpleFOC _sin:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += _sin(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("SimpleFOC _sin time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("stdlib sin:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += sin(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("stdlib sin time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);


  Serial.println();
  Serial.println("Deku sin:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += deku_sin256(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("Deku sin time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);



  Serial.println();
  Serial.println("SimpleFOC sin + normalizeAngle:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += _sin(_normalizeAngle(i));
    steps++;
  }
  ts_end = micros();
  Serial.print("SimpleFOC + normalizeAngle time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);




  Serial.println();
  Serial.println("Float257 Sine:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += float_sine257(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("Float257 Sine time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);


  Serial.println();
  Serial.println("Deku257 Sine:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += deku_sin257(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("Deku257 Sine time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);


  Serial.println();
  Serial.println("Deku129 Sine:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += deku_sin129(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("Deku129 Sine time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);


  Serial.println();
  Serial.println("Deku129i Sine:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += deku_sin129i(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("Deku129i Sine time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);





  Serial.println();
  Serial.println("Comparing accuracy...");
  float rmsdiff1 = 0.0f;
  float rmsdiff2 = 0.0f;
  float rmsdiff3 = 0.0f;
  float rmsdiff4 = 0.0f;
  float rmsdiff5 = 0.0f;
  float rmsdiff6 = 0.0f;
  float rmsdiff7 = 0.0f;
  steps = 0;
  for (float i = 0.0; i < _PI; i+=step) {
    float diff1 = 0.0f;
    float diff2 = 0.0f;
    float diff3 = 0.0f;
    float diff4 = 0.0f;
    float diff5 = 0.0f;
    float diff6 = 0.0f;
    float diff7 = 0.0f;
    float res1 = cordic_sin(i);
    float res2 = _sin(i);
    float res3 = sin(i);
    float res4 = deku_sin256(i);
    float res5 = float_sine257(i);
    float res6 = deku_sin257(i);
    float res7 = deku_sin129(i);
    float res8 = deku_sin129i(i);

    diff1 = res3 - res1;
    if (diff1>1.0) {
      Serial.print("CORDIC vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff1, 8);
    }

    diff2 = res3 - res2;
    if (diff2>1.0) {
      Serial.print("SimFOC vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff2, 8);
    }

    diff3 = res3 - res4;
    if (diff3>1.0) {
      Serial.print("  Deku vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff3, 8);
    }

    diff4 = res3 - res5;
    if (diff4>1.0) {
      Serial.print(" Float vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff4, 8);
    }

    diff5 = res3 - res6;
    if (diff5>1.0) {
      Serial.print(" Deku257 vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff5, 8);
    }

    diff6 = res3 - res7;
    if (diff6>1.0) {
      Serial.print(" Deku129 vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff6, 8);
    }

    diff7 = res3 - res8;
    if (diff7>1.0) {
      Serial.print(" Deku129i vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff7, 8);
    }

    rmsdiff1 += diff1*diff1;
    rmsdiff2 += diff2*diff2;
    rmsdiff3 += diff3*diff3;
    rmsdiff4 += diff4*diff4;
    rmsdiff5 += diff5*diff5;
    rmsdiff6 += diff6*diff6;
    rmsdiff7 += diff7*diff7;
    steps++;
  }
  rmsdiff1 = sqrt(rmsdiff1/steps);
  rmsdiff2 = sqrt(rmsdiff2/steps);
  rmsdiff3 = sqrt(rmsdiff3/steps);
  rmsdiff4 = sqrt(rmsdiff4/steps);
  rmsdiff5 = sqrt(rmsdiff5/steps);
  rmsdiff6 = sqrt(rmsdiff6/steps);
  rmsdiff7 = sqrt(rmsdiff7/steps);
  Serial.print("RMS difference between CORDIC and stdlib: ");
  Serial.println(rmsdiff1, 8);
  Serial.print("RMS difference between SimpleFOC and stdlib: ");
  Serial.println(rmsdiff2, 8);
  Serial.print("RMS difference between Deku256 Sine and stdlib: ");
  Serial.println(rmsdiff3, 8);
  Serial.print("RMS difference between Float Sine and stdlib: ");
  Serial.println(rmsdiff4, 8);
  Serial.print("RMS difference between Deku257 Sine and stdlib: ");
  Serial.println(rmsdiff5, 8);
  Serial.print("RMS difference between Deku129 Sine and stdlib: ");
  Serial.println(rmsdiff6, 8);
  Serial.print("RMS difference between Deku129i Sine and stdlib: ");
  Serial.println(rmsdiff7, 8);

  Serial.println("Test complete.");
  while(1);
}

I’m sure this can be greatly improved, and of course for real benchmarking I should be exercising careful control of all the compiler options (which I’m not), and probably should be timing a different way…

But I think it is enough to give the right indication, and I think it’s also the way people will usually run it, with the default compiler options for their environment…

dekutree64 · March 25, 2023, 10:30am

Thanks! I tried similar code:

void TestSpeed()
{
  float i, s, c, res;
  float step = 1/1024.0f;
  int steps;
  long ts, ts_end;

  res = 0;
  steps = 0;
  ts = micros();
  for (i = 0.0; i < _PI; i+=step,steps++) res += _sin(i);
  ts_end = micros();
  Serial.print("SimpleFOC _sin time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);
  
  res = 0;
  steps = 0;
  ts = micros();
  for (i = 0.0; i < _PI; i+=step,steps++) res += _sin(i) + _cos(i);
  ts_end = micros();
  Serial.print("SimpleFOC _sin,_cos time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);
  
  res = 0;
  steps = 0;
  ts = micros();
  for (i = 0.0; i < _PI; i+=step,steps++) res += _sin17(i);
  ts_end = micros();
  Serial.print("Deku _sin17 time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);
  
  res = 0;
  steps = 0;
  ts = micros();
  for (i = 0.0; i < _PI; i+=step,steps++) { _sincos17(i, &s, &c); res += s + c; }
  ts_end = micros();
  Serial.print("Deku _sincos17 time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);
  
  res = 0;
  steps = 0;
  ts = micros();
  for (i = 0.0; i < _PI; i+=step,steps++) res += _sin257(i);
  ts_end = micros();
  Serial.print("Deku _sin257 time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);
  
  res = 0;
  steps = 0;
  ts = micros();
  for (i = 0.0; i < _PI; i+=step,steps++) { _sincos257(i, &s, &c); res += s + c; }
  ts_end = micros();
  Serial.print("Deku _sincos257 time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);
}

and got this:

SimpleFOC _sin time (us) for 3217 steps: 58807
Result: 2047.98
SimpleFOC _sin,_cos time (us) for 3217 steps: 126369
Result: 2048.96
Deku _sin17 time (us) for 3217 steps: 105655
Result: 2046.31
Deku _sincos17 time (us) for 3217 steps: 126432
Result: 2047.64
Deku _sin257 time (us) for 3217 steps: 52643
Result: 2047.99
Deku _sincos257 time (us) for 3217 steps: 73643
Result: 2048.97

And out of curiosity I also tried compiling with -O3 (fastest) as opposed to my usual -Os (smallest), but the times were nearly the same.

The times probably are correct after all, it’s just that 170MHz with FPU really is 60x faster. Although it is surprising that your ATMega328 numbers are less than 4x slower. The clock frequency alone should be that much, and being 8-bit versus 32-bit ARM should make it even worse.

The time differences from my previous post are a bit strange though. I would expect the float addition res+= to increase the time a bit, but not that the sincos cases would be so much slower, and SimpleFOC _sin would be faster.

runger · March 25, 2023, 11:18pm

Here are some results from the original UNO. I’ve removed the 256, 257 and float LUTs, they’re too big for this MCU.

On the ATMega, it seems ldexp() outperforms the float multiplications/divisions. That also explains why the _normalizeAngle() function has a relatively low impact on this MCU.

What’s disappointing is that the interpolated versions are less accurate than the non-interpolated. This is – presumably – due to accuracy issues introduced by 16 bit int overflows on this processor. The bit-shift/multiply by fraction exceeds the capacity of int. I have to see if I can do something about it.

Timing Deku vs stdlib sin vs SimpleFOC Sine calculations...


SimpleFOC _sin:
SimpleFOC _sin time (us) for 3217 steps: 206584
Result: 2047.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 415776
Result: 2048.00

Deku sin:
Deku sin time (us) for 3217 steps: 210444
Result: 2047.88

SimpleFOC sin + normalizeAngle:
SimpleFOC + normalizeAngle time (us) for 3217 steps: 230048
Result: 2047.98

Deku129 Sine:
Deku129 Sine time (us) for 3217 steps: 151420
Result: 2047.99

Deku129i Sine:
Deku129i Sine time (us) for 3217 steps: 205868
Result: 2060.47

Comparing accuracy...
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku65i and stdlib: 0.01027036
RMS difference between Deku129 Sine and stdlib: 0.00250501
RMS difference between Deku129i Sine and stdlib: 0.00693468
Test complete.

Speed winner on ATMega so far:

float deku_sin129_2(float a) {
  unsigned int i = ((unsigned int)(a * (128*8 /_2PI) + 1) >> 1) & 0x1ff;
  if (i < 128) {
    return ldexp(sine_array3[i],-15);
  }
  else if(i < 256) {
    return ldexp(sine_array3[256 - i],-15);
  }
  else if(i < 384) {
    return -ldexp(sine_array3[-256 + i],-15);
  }
  else {
    return -ldexp(sine_array3[512 - i],-15);
  }
}

runger · March 26, 2023, 12:12am

Ok, going to int32_t as the type for the fractional calculation fixes the accuracy issues, and incurs only a small speed penalty. Apparently the ATMega isn’t so slow at 32 bit integers even though it’s an 8 bit MCU.
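To make the overflow concrete, here is a minimal sketch of the interpolation term on its own. The two table values are illustrative assumptions (adjacent entries near the zero crossing of a 129-entry quarter-wave table scaled by 2^15), and the helper names are hypothetical, not taken from the benchmark sketch.

```c
#include <stdint.h>

// Hypothetical adjacent table entries near the zero crossing, where the
// slope of the sine is steepest.
#define ENTRY_A 0    /* sin(0) * 32768 */
#define ENTRY_B 402  /* roughly sin(2*pi/512) * 32768 */

// Interpolation term (b - a) * frac, evaluated in 32 bits so that the
// product itself cannot overflow.
int32_t interp_term(int32_t a, int32_t b, int32_t frac) {
  return (b - a) * frac;
}

// Returns 1 when the value does not fit in a 16-bit signed integer, which
// is the width of `int` on the ATMega.
int exceeds_int16(int32_t v) {
  return v > INT16_MAX || v < INT16_MIN;
}
```

With frac = 255 the term is 402 * 255 = 102510, far beyond the 16-bit limit of 32767. With these values anything above frac = 81 already exceeds it, which would explain why the interpolated results were off until the calculation moved to int32_t.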

The numbers seem to make sense - the ATMega is 200 times slower than the G4 MCU for example.

SimpleFOC _sin:
SimpleFOC _sin time (us) for 3217 steps: 206584
Result: 2047.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 415776
Result: 2048.00

Deku sin:
Deku sin time (us) for 3217 steps: 229888
Result: 2047.85

SimpleFOC sin + normalizeAngle:
SimpleFOC + normalizeAngle time (us) for 3217 steps: 230044
Result: 2047.98

Deku129 Sine:
Deku129 Sine time (us) for 3217 steps: 151420
Result: 2047.99

Deku129i Sine:
Deku129i Sine time (us) for 3217 steps: 228460
Result: 2047.93

Comparing accuracy...
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku65i and stdlib: 0.00006479
RMS difference between Deku129 Sine and stdlib: 0.00250501
RMS difference between Deku129i Sine and stdlib: 0.00003218
Test complete.

The interpolated version:

float deku_sin129i(float angle) {
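  // Note: 2^17 counts per turn do not fit in a 16-bit index, so (i >> 8) never
  // reaches 256 and only angles in [0, π) are handled correctly as written.
  uint16_t i = (uint16_t)(ldexp(angle,17)/_2PI);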
  int32_t a, b, frac = i & 0xff;
  i = (i >> 8) & 0x1ff;
  if (i < 128) {
    a = sine_array3[i]; b = sine_array3[i+1];
  }
  else if(i < 256) {
    a = sine_array3[256 - i]; b = sine_array3[255 - i];
  }
  else if(i < 384) {
    a = -sine_array3[-256 + i]; b = -sine_array3[-255 + i];
  }
  else {
    a = -sine_array3[512 - i]; b = -sine_array3[511 - i];
  }
  return ldexp(a + (((b - a) * frac) >> 8), -15);
}

Actually I think the 65 entry LUT is accurate enough, more than an order of magnitude better than the original _sin().

I think I’m going to suggest that we include the 65 entry interpolated version on ATMega, and the 257 entry non-interpolated version on 32 bit architectures.
This will represent an upgrade in terms of all three: performance, space used (due to uint16_t) and accuracy, on all the MCU types I’ve tried out so far. In addition the functionality is also improved because _sin() and _cos() will now have built-in normalisation and don’t have to be fed values only in the range 0-2PI.
With this improvement, plus the new ability to supply your own _sin() implementation due to weak bindings, I think we can close off the sine/cosine calculation topic again?
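The built-in normalisation comes from the index mask rather than from a separate call. A minimal sketch of just that step, assuming 512 steps per turn as in deku_sin129_2 above (the function name and the TWO_PI constant are hypothetical stand-ins):

```c
#include <stdint.h>

#define TWO_PI 6.28318530718f

// Map a non-negative angle in radians to one of 512 steps per turn.
// The scale factor gives 1024 steps per turn; adding 1 and shifting right
// by one rounds to the nearest of 512 steps. The final mask discards whole
// turns, so 2*pi lands on the same index as 0.
// Assumption: angle >= 0 and small enough for the scaled value to fit in
// 32 bits. Converting a negative float to an unsigned type is undefined in C.
uint32_t angle_to_index(float angle) {
  uint32_t steps = (uint32_t)(angle * (1024.0f / TWO_PI) + 1.0f);
  return (steps >> 1) & 0x1ff;
}
```

An angle of a quarter turn gives index 128, and so does a quarter turn plus any whole number of turns, so the table lookup never sees an out-of-range index for non-negative inputs.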

And as for the CORDIC, which started this whole thing, we’ve now done the testing to confirm it is slower (expected) but far more accurate (unexpected, to me at least) than the LUT version. For anyone who needs to use it, they can do so now easily by supplying their custom _sin() and _cos() functions.
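For anyone unfamiliar with weak bindings, this is roughly how the mechanism works. It is a sketch, not the library source: foc_sin is a hypothetical stand-in for _sin(), the parabola body is a placeholder for the real table lookup, and the attribute is a GCC/Clang extension.

```c
// Library side: a default implementation marked weak.
// The body is a deliberately crude parabola, valid for 0 <= a <= pi only.
__attribute__((weak)) float foc_sin(float a) {
  const float pi = 3.14159265f;
  return (4.0f / pi) * a - (4.0f / (pi * pi)) * a * a;
}

// User side, in a separate source file of the sketch: a normal (strong)
// definition with the same signature replaces the weak one at link time.
//
//   float foc_sin(float a) { return my_cordic_sin(a); }
//
// my_cordic_sin is hypothetical; any function with this signature would do.
```

If no strong definition exists, the linker keeps the weak default, so existing sketches build unchanged.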

dekutree64 · March 26, 2023, 1:14am

Oh right, I had forgotten the potential for overflow in the interpolated version with a 16-bit int.

If you want to go really overboard, you could use inline assembly for the AVR version to do that kind of multiply-and-shift much faster. Here’s a function from my old drone ESC firmware:

// Multiply s16 by u8 and keep the upper 16 bits of the result
// This is equivalent to (s16)((((s32)maxPos) * pulse) >> 8), but much
// faster, because for the C expression the compiler calls the full 32-bit multiply function.
STIN s16 ScaleRCPulse(s16 maxPos, u8 pulse)
{
	s16 scaled;

	asm volatile(
	"push r0\n\t"
	"push r1\n\t"
	"mul %A1, %2\n\t" //r0:r1 = low bits of maxPos * pulse
	"mov %A0, r1\n\t" // low bits of return value = high bits of result
	"clr %B0\n\t" // high bits of return value = 0
	"mulsu %B1, %2\n\t" // r0:r1 = high bits of maxPos * pulse
	// Add result to return value
	"add %A0, r0\n\t"
	"adc %B0, r1\n\t"
	"pop r1\n\t"
	"pop r0\n\t"
	: "=r" (scaled) : "a" (maxPos), "a" (pulse));

	return scaled;
}

As for which version to use, I think 65 interpolated is excessively high precision. 17 interpolated is already around 1/3 the error of the original, so 33 should be more than enough and I think will leave room for the combined _sincos. At the very least I would appreciate it if the library has a stub _sincos that just calls the other two, so I can override it with the fastest 257 non-interpolated _sincos in my own programs. Unless you have a strong preference for keeping the _sin and _cos calls separate, in which case do as you please. As you can probably tell, I have an unhealthy obsession with speed optimization…
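The stub I have in mind would look something like this. All three names are hypothetical, and the parabola is only a placeholder for whichever table lookup the library ends up using.

```c
#define PI_F 3.14159265f

// Hypothetical stand-ins for the library's sine and cosine, valid for
// 0 <= a < 2*pi. A parabola replaces the real lookup table to keep this short.
float approx_sin(float a) {
  float sign = 1.0f;
  if (a >= PI_F) { a -= PI_F; sign = -1.0f; }  // second half turn mirrors the first
  return sign * ((4.0f / PI_F) * a - (4.0f / (PI_F * PI_F)) * a * a);
}

float approx_cos(float a) {
  a += 0.5f * PI_F;                        // cos(a) = sin(a + pi/2)
  if (a >= 2.0f * PI_F) a -= 2.0f * PI_F;  // wrap back into one turn
  return approx_sin(a);
}

// Default combined call: simply forwards to the two separate functions.
// Marked weak (GCC/Clang extension) so a sketch can replace it with a
// version that shares the index computation between sine and cosine.
__attribute__((weak)) void approx_sincos(float a, float *s, float *c) {
  *s = approx_sin(a);
  *c = approx_cos(a);
}
```

The default costs nothing beyond the two existing calls, and a faster override only has to match the signature.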

runger · March 29, 2023, 9:53pm

That’s how I’d do it, but obviously we’d have to then use that function in the code, which would make it that much harder to understand. I’ll discuss it with Antun.

Results from a MKR1000 board (SAMD21 48MHz cortex M0+):

SimpleFOC _sin time (us) for 3217 steps: 84062
Result: 2047.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 298406
Result: 2048.00

Deku sin:
Deku sin time (us) for 3217 steps: 77021
Result: 2047.94

SimpleFOC sin + normalizeAngle:
SimpleFOC + normalizeAngle time (us) for 3217 steps: 104882
Result: 2047.98

Float257 Sine:
Float257 Sine time (us) for 3217 steps: 57642
Result: 2048.00

Deku257 Sine:
Deku257 Sine time (us) for 3217 steps: 76499
Result: 2048.00

Deku129 Sine:
Deku129 Sine time (us) for 3217 steps: 76317
Result: 2047.99

Deku129i Sine:
Deku129i Sine time (us) for 3217 steps: 68874
Result: 2047.93

ARM Sine:
ARM Sine time (us) for 3217 steps: 153212
Result: 2047.97

Comparing accuracy...
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku256 Sine and stdlib: 0.00125757
RMS difference between Float Sine and stdlib: 0.00125250
RMS difference between Deku257 Sine and stdlib: 0.00125253
RMS difference between Deku129 Sine and stdlib: 0.00250501
RMS difference between Deku129i Sine and stdlib: 0.00003220
RMS difference between ARM Sine and stdlib: 0.00000971
Test complete.

They follow the same pattern. Interestingly, here the float variant is best, and the results are between 2x and 4x faster than on the ATMega. But the G4 is still massively faster.

I’m guessing while there will be some variation, the ARM 32 bit MCUs are pretty much going to follow this pattern.

That’s really cool, but that’s what the new weak bindings are for - whoever wants to do it with assembler can now do so in their own code. But I think we won’t take it so far in the library itself, that would really take the simple out of SimpleFOC.

Candas1 · April 3, 2023, 8:19am

Have you guys checked this ?
I had to use it in the past to reduce code size, but it can also be used for speed.

robca · April 26, 2023, 12:23am

I just found this thread and am still digesting it, but wanted to pass along a way to measure performance with much better resolution than micros().

uint32_t start;
uint32_t stop;
uint32_t elapsed;
// enable DWT
CoreDebug->DEMCR |= 0x01000000;
// Reset cycle counter
DWT->CYCCNT = 0;
// enable cycle counter
DWT->CTRL |= 0x1;
start = DWT->CYCCNT;

// code to be measured here

stop = DWT->CYCCNT;
elapsed = stop-start;

That counts the clock cycles used, so it’s as granular as you can get on an ARM processor.

Keep in mind that the CYCCNT counter is 32 bits wide, so over very long measurements it wraps around to 0 and restarts. For anything less than 20 seconds at 170MHz, that is not a problem.
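The arithmetic behind that limit, as a small host-side sketch. The helper names are hypothetical and nothing here touches the DWT registers; it only shows how the 32-bit counter values behave.

```c
#include <stdint.h>

// Elapsed cycles between two readings of a free-running 32-bit counter.
// Unsigned subtraction is modulo 2^32, so the result stays correct across
// a single wrap of the counter.
uint32_t cycles_elapsed(uint32_t start, uint32_t stop) {
  return stop - start;
}

// Seconds until a 32-bit cycle counter wraps at the given core clock.
double wrap_period_s(uint32_t core_hz) {
  return 4294967296.0 / (double)core_hz;
}

// Convert a cycle count to microseconds at the given core clock.
double cycles_to_us(uint32_t cycles, uint32_t core_hz) {
  return (double)cycles * 1e6 / (double)core_hz;
}
```

At 170MHz the counter wraps after roughly 25 seconds, so a single wrap between start and stop still gives the right difference, but a second wrap would go unnoticed.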

robca · April 26, 2023, 10:30pm

One more question/comment, sorry.

In the past, I was looking to execute a convolution filter on various waveforms to time-align sounds on an STM32 processor. After a lot of research, optimization and endless timing loops, it turns out that the CMSIS DSP library is hard to beat (CMSIS DSP Software Library).

Especially when compiled with -O3, the code generated runs almost exclusively in the processor registers and it’s way faster than the C source code would imply. It supports quite a lot of fixed point types, conversions between formats and even has Park and Clarke transforms and their inverses (Controller Functions).

The CMSIS DSP library works on M0, M3 and M4 Cortex cores, and when used on a CM4 processor, fully takes advantage of the FP coprocessor and SIMD instructions. The nice thing is that the same library can be used for M0, M3 and M4 cores, and it optimizes as necessary

Granted, it would break compatibility with Atmega and ESP32 cores, but the speed gain for any ARM core would be significant, especially if the library used a format like Q31 internally

Is this something worth exploring more (CORDIC and CMSIS math, I mean)? Or is the time spent calculating the parameters small enough not to make a meaningful difference?

CORDIC is very specific to STM32G4xx processors, while the CMSIS library is a generic ARM library. But if the FOC library used Q31, then adding CORDIC when the right processor is present could speed things up even further.
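For reference, Q31 is just a signed 32-bit integer with 31 fractional bits, covering [-1, 1). A minimal sketch of the conversions involved, with hypothetical helper names. As far as I recall, the CORDIC peripheral expects angles as a fraction of π in this format, and CMSIS DSP ships its own block conversion (arm_float_to_q31), so treat this as an illustration only.

```c
#include <stdint.h>

// Convert a float in [-1, 1) to Q31, saturating at the ends of the range.
int32_t float_to_q31(float x) {
  if (x >= 1.0f) return INT32_MAX;
  if (x < -1.0f) return INT32_MIN;
  return (int32_t)(x * 2147483648.0f);  // scale by 2^31
}

// Convert Q31 back to float.
float q31_to_float(int32_t q) {
  return (float)q / 2147483648.0f;
}

// Angle in radians, assumed already within [-pi, pi), expressed as a
// fraction of pi in Q31.
int32_t angle_to_q31(float radians) {
  return float_to_q31(radians / 3.14159265f);
}
```

Every float-based call pays for a multiply on the way in and a divide (or multiply) on the way out, which is the overhead that goes away if the whole pipeline stays in Q31.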

Juan-Antonio_Soren_E · April 26, 2023, 10:54pm

That is really interesting. I’ll try to change compiler settings and see how that goes. Thx! I’m not sure the SFOC staff is going to abandon floats for Q31. I did manage to use the CORDIC with SFOC, but the conversion to float makes it ~2 µs slower than the look-up table.

I would love to send you a prototype at some point if you would like to do your worst to optimize execution?