Embedded World 2023 - STM32 CORDIC CO-PROCESSOR

dekutree64 · March 18, 2023, 3:34am

Nice work! That’s more like what I would expect for lookup table versus stdlib sin(). But the CORDIC is surprisingly slow. Even in a fixed-point based SimpleFOC without the float conversion, the good old lookup table would win again Especially since it would be that much faster with fixed-point interpolation and no float-int conversion.

Nintendo DS had a math coprocessor kind of like that too, which did division and square roots. If you waited for the result it was only marginally better than doing them in software. What made it useful is that is was asynchronous so you could do other things while waiting for the computation to complete. But of course that makes the code harder to follow and non-portable.

Juan-Antonio_Soren_E · March 18, 2023, 3:37am

@runger, could you try q1.15

Maybe you can push the clock on the CORDIC with the 24Mhz crystal

Part of the HARDWARE advantage the CORDIC gives is, it takes off calculations from the MCU, thereby decreasing power consumption and ultimately heat development, in a highly specialized manor

Juan-Antonio_Soren_E · March 18, 2023, 1:54pm

I had to look at this again.

cos();
sin();

Iterations:   6284
elapsed_ticks:   41390
time per iterasion:   6

_cos();
_sin():

Iterations:   6284
elapsed_ticks:   15341
time per iterasion:   2

CORDIC w. conversion

Iterations:   6284
elapsed_ticks:   144022
time per iterasion:   22

@runger [Edit] Using micros(); the CORDIC is still faster in my results. Se below.

The question is how it transfers to actual use. How often would we pull this data inside the FOC loop? What about the precision compared to speed in real life applications ?

SimpleFOC _sin() and _cos()

Iterations:   6284
elapsed_ticks:   24432.0000
time in micro_secunds per iteration:   3.8880

CORDIC w. conversion from to float

Iterations:   6284
elapsed_ticks:   5705.0000
time in micro_secunds per iteration:   0.9079

Ok, now im just confused. (The timing seem right. when i swap the stop_ticks-start_ticks, the number is just negative in both instances).

If I only run the _sin() function, timing with micros();:

//angle2 = angle;
//angle2 = _normalizeAngle(angle2);
//float _ca = _cos(angle2);
float _sa = _sin(angle);

Iterations:   6284
elapsed_ticks:   6114.0000
time in micro_secunds per iteration:   0.9729

Is maybe the dev branch faster than the main sFOC branch ?

Validating with twice the iterations

for (angle = -pi*2; angle <= pi*2; angle += 0.001, count++)

SFOC _sin(only):

Iterations:   12567
elapsed_ticks:   12466.0000
time in micro_secunds per iterasion:   0.9920

SFOC w. normalize, _sin() and _cos():

Iterations:   12567
elapsed_ticks:   49023.0000
time in micro_secunds per iterasion:   3.9009

CORDIC w. conversion from/to float:

Iterations:   12567
elapsed_ticks:   11412.0000
time in micro_secunds per iterasion:   0.9081

Juan-Antonio_Soren_E · March 18, 2023, 3:59pm

This is my CORDIC loop:

int32_t float_to_q31(float input) {
    int32_t q31 = (int32_t) (input * 2147483648.0f);
    q31 = (q31 > 0) ? (q31 << 1) : (-q31 << 1);
    return q31;
}

long start_ticks = 0;
long stop_ticks = 0;
float elapsed_ticks = 0.0f;
...

start_ticks = micros();
for (angle = -pi*2; angle <= pi*2; angle += 0.001, count++) {

  q31_value = float_to_q31(angle / (2.0f * pi));

CORDIC->WDATA = q31_value;
cordic_sine = CORDIC->RDATA;
cordic_cosine = CORDIC->RDATA;

value_f32_sine = (float)cordic_sine/(float)0x80000000;
value_f32_cosine = (float)cordic_cosine/(float)0x80000000;

}

stop_ticks = micros(); 

elapsed_ticks = stop_ticks-start_ticks;


Serial.print("Iterations:   ");
Serial.println(count);
Serial.print("elapsed_ticks:   ");
Serial.println(elapsed_ticks, 4);
Serial.print("time per iterasion:   ");
Serial.println(elapsed_ticks / count, 4);

Juan-Antonio_Soren_E · March 18, 2023, 4:09pm

Now, lets put this into perspective

Are these times running without FPU? (Floating Point Unit).

// Function using sine approximation
// regular sin + cos ~300us    (no memory usaage)
// approx  _sin + _cos ~110us  (400Byte ~ 20% of memory)
// The ST CORDIC can do this calculation and conversion to/from float in 1us
void StepperMotor::setPhaseVoltage(float Uq, float Ud, float angle_el) {
  // Sinusoidal PWM modulation
  // Inverse Park transformation

  // angle normalization in between 0 and 2pi
  // only necessary if using _sin and _cos - approximation functions
  angle_el = _normalizeAngle(angle_el);
  float _ca = _cos(angle_el);
  float _sa = _sin(angle_el);
  // Inverse park transform
  Ualpha =  _ca * Ud - _sa * Uq;  // -sin(angle) * Uq;
  Ubeta =  _sa * Ud + _ca * Uq;    //  cos(angle) * Uq;

  // set the voltages in hardware
  driver->setPwm(Ualpha, Ubeta);
}

Juan-Antonio_Soren_E · March 18, 2023, 7:07pm

@runger

Ok this look a bit weird?

Juan-Antonio_Soren_E · March 18, 2023, 10:22pm

This is a reference curve made with sFOC _sin() & _cos()

Here is the CORDIC output I get by wrapping the values.

I must say, that looking at the plotting, SFOC (_sin()/_cos()) lines look visually stept, while the CORDIC output is 100% smooth,

runger · March 18, 2023, 11:45pm

Hey, I was up late last night on this topic, as you saw

The final code I was using is this. I didn’t test it on negative inputs for the CORDIC, since SimpleFOC _sin() only likes positive inputs.

CORDIC functions:


#include "Arduino.h"
#include "stm32g4xx_hal_cordic.h"
#include "common/foc_utils.h"

CORDIC_HandleTypeDef thisCordic;

#define float2q31(v)	\
            ((v < 0.0) ? ((int32_t)(0x80000000 * (v) - 0.5) | 0x80000000) \
            : (int32_t)(0x80000000 * (v) + 0.5))

#define q312float(v)	\
            ((v < 0.0) ? -((float)(v&0x7FFFFFFF) / 0x80000000) \
            : ((float)(v) / 0x80000000))



void CORDIC_Config(void) {
    __HAL_RCC_CORDIC_CLK_ENABLE();
    CORDIC_ConfigTypeDef sConfig;
    thisCordic.Instance = CORDIC;
    if (HAL_CORDIC_Init(&thisCordic) != HAL_OK) {
        Error_Handler();
        Serial.println("CORDIC init error");
    }

    sConfig.Function = CORDIC_FUNCTION_SINE;  
    sConfig.Precision = CORDIC_PRECISION_15CYCLES;
    sConfig.Scale = CORDIC_SCALE_0;
    sConfig.NbWrite = CORDIC_NBWRITE_1;
    sConfig.NbRead = CORDIC_NBREAD_1;
    sConfig.InSize = CORDIC_INSIZE_32BITS;
    sConfig.OutSize = CORDIC_OUTSIZE_32BITS;
    if (HAL_CORDIC_Configure(&thisCordic, &sConfig) != HAL_OK) {
        /* Channel Configuration Error */
        Error_Handler();
        Serial.println("CORDIC config error");
    }
}

float cordic_sin(float input) {
    int32_t input32;
    int32_t output32;

    input32 = float2q31(input/_PI);

    if (HAL_CORDIC_Calculate(&thisCordic, &input32, &output32, 1, 1000) != HAL_OK) {
        /* Processing Error */
        Error_Handler();
        Serial.println("CORDIC calc error");
    }

    return q312float(output32);
}

main.cpp:

#include <Arduino.h>
#include <SimpleFOC.h>
#include "common/foc_utils.h"
#include "./hal_cordic.h"

void setup() {
  Serial.begin(115200);
  while (!Serial);
  delay(3000);
  Serial.println("Starting...");
  Serial.println("Initializing CORDIC...");
  CORDIC_Config();
  Serial.println("CORDIC initialized.");
  Serial.println();
  Serial.println();
}

void loop() {

  Serial.println("Timing CORDIC vs stdlib sin vs SimpleFOC Sine calculations...");
  Serial.println();

  Serial.println("CORDIC:");
  float step = 1/1024.0f;
  float res = 0.0;
  int steps = 0;
  long ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += cordic_sin(i);
    steps++;
  }
  long ts_end = micros();
  Serial.print("CORDIC Time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("SimpleFOC _sin:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += _sin(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("SimpleFOC _sin time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("stdlib sin:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i+=step) {
    res += sin(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("stdlib sin time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("Comparing accuracy...");
  float rmsdiff1 = 0.0f;
  float rmsdiff2 = 0.0f;
  steps = 0;
  for (float i = 0.0; i < _PI; i+=step) {
    float diff1 = 0.0f;
    float diff2 = 0.0f;
    float res1 = cordic_sin(i);
    float res2 = _sin(i);
    float res3 = sin(i);

    diff1 = res3 - res1;
    if (diff1>1.0) {
      Serial.print("CORDIC vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff1, 8);
    }

    diff2 = res3 - res2;
    if (diff2>1.0) {
      Serial.print("SimFOC vs stdlib at i=");
      Serial.print(i,8);
      Serial.print(": ");
      Serial.println(diff2, 8);
    }

    rmsdiff1 += diff1*diff1;
    rmsdiff2 += diff2*diff2;
    steps++;
  }
  rmsdiff1 = sqrt(rmsdiff1/steps);
  rmsdiff2 = sqrt(rmsdiff2/steps);
  Serial.print("RMS difference between CORDIC and stdlib: ");
  Serial.println(rmsdiff1, 8);
  Serial.print("RMS difference between SimpleFOC and stdlib: ");
  Serial.println(rmsdiff2, 8);

  Serial.println("Test complete.");
  while(1);
}

In the end, the HAL_CORDIC_Calculate function waits in a tight loop on the ready register, it doesn’t look like its implementation is that inefficient. That makes me wonder why there is no difference between 3, 6 and 15 iteration accuracy in terms of the timings.

So maybe there is still some fixes that can be made to my code?

But to be honest I’m not confident the CORDIC approach will be faster than the lookup table, even if we optimize the hell out of it…

But it certainly looks like it is more accurate.

What’s interesting is that on this MCU at least the stdlib sin() implementation isn’t that slow - only 3x slower than the lookup table. For someone doing slow-turning positioning and aiming for accuracy, this might be interesting?

Juan-Antonio_Soren_E · March 18, 2023, 11:47pm

You could try the Zero overhead

if (HAL_CORDIC_CalculateZO

and read both sinus and cosinus →

sConfig.NbRead = CORDIC_NBREAD_2;

Juan-Antonio_Soren_E · March 19, 2023, 12:16am

This is the output of your write-up, without wrapping.

 q31_value = float2q31(angle / pi);

Iterations:   6284
elapsed_ticks:   10398.0000
time per iterasion:   1.6547

This is with the conversion by OpenAI and the wrapping for correcting the curve.

Iterations:   6284
elapsed_ticks:   8000.0000
time per iterasion:   1.2731

Juan-Antonio_Soren_E · March 19, 2023, 4:11pm

Visible light is in the range 400 to 700 THz.

One THz is equal to 1.000.000 MHz

Juan-Antonio_Soren_E · March 20, 2023, 12:40pm

Just out of curiosity i ran the sin() and cos() like this:
[Note] Normalization is not needed.

#include <math.h>

...

float _ca = cos(angle);
float _sa = sin(angle);

Iterations:   6284
elapsed_ticks:   4694.0000
time per iterasion:   0.7470

sin() & cos() output:

So, in this scenario, the only reason to use the CORDIC, would be to offload the MCU while doing other time sensitive stuff.

[Remark] Looking close at the curves, it seems as though there is a small glitch at -0.5 ?

Now, this is interesting.

Running _sin() and _cos() without normalization, only with positive numbers, like this:
[Note] This is just half a circle, a radian is -pi to pi.

for (angle = 0; angle <= pi; angle += 0.0005, count++) {

float _ca = _cos(angle);
float _sa = _sin(angle);

}

In micros();

Iterations:   6284
elapsed_ticks:   8330.0000
time per iterasion:   1.3256

To be fair, for future reference, and because it would be un-scientific to conduct these test on a unrealistic premise, I ran just the positive numbers like this, as is the SFOC implementation.

angle = _normalizeAngle(angle);
float _ca = _cos(angle);
float _sa = _sin(angle);

Iterations:   6284
elapsed_ticks:   24390.0000
time per iterasion:   3.8813

Juan-Antonio_Soren_E · March 20, 2023, 1:51pm

To be honest, looking at how SFOC current implementations is w. Normalization, the CORDIC is not only more precise it is also 3 times faster.

Juan-Antonio_Soren_E · March 20, 2023, 8:25pm

I did se that. Thank you for helping me set this up. It was a fun ride, you folks are amazing with code.

The Timers of the g431, with its encoder interface is gonna mess some more with the repo. you better believe it

65536

So, with 16bittimers on each pin, I guess we would have to actually scale down to reach 21 bit ABZ ?

Four states, plus the Zero pin.

Max encoder resolution = Operating Frequency x 60 / Max RPM

Exceeding this number will overwork the encoder’s processing capability, which will result in degraded signal output and cumulative error.

For example, if the encoder’s operating frequency is 125kHz and the maximum shaft speed is 1,000 RPM, the encoder ppr calculation for the maximum resolution the encoder supports is 7,500 pulses per revolution (PPR).

If your encoders standard resolution doesn’t meet your application needs, there is another alternative based on how the signal is decoded through the users drive, PLC or Controller. Assuming the usage of quadrature encoders with bidirectional output (A and B channels), triggering from the rising and falling edges of the A channel and the B channel will generate four times as many pulses, or 4X encoding

VIPQualityPost · March 20, 2023, 9:36pm

This is such an interesting thread. I don’t have any coding proficiency to join in, but it’s great to follow along. I wonder if the VESC folk can also find interest in this. They seem to have lots of time on their hands for motor control optimization already

Juan-Antonio_Soren_E · March 20, 2023, 9:41pm

They are welcome to reach out

runger · March 20, 2023, 11:47pm

This is a good find. One would have to check how a fmod(angle, _2PI) gets compiled, but maybe this is quite inefficient?
It would probably be quite possible to replace the radian based angle values with ones that are not in radians, but which (similar to the CORDIC fixed point values) represent angles as multiples/fractions of π… but I think this would make the code quite hard to understand.
Another approach might be to check how it is used - I think changing the way the function is used might mean we can get rid of the fmod() in favour of a simple subtraction.

But unfortunately at the moment this angle normalisation step is needed regardless of the method used to compute sin/cos. So for doing the comparison, if you add it to the SimpleFOC times, you have to add it to the CORDIC times too.

dekutree64 · March 21, 2023, 1:01am

I think I have a solution… make _sin and _cos accept angles outside 0-2pi range, and inside the function convert to fixed-point before doing the table lookup. It has to do a float-to-int conversion anyway, and by doing it before the normalization step we can use bit shifting/masking instead of fmod and checking for negative.

It will require some care and testing to ensure that there are no off-by-one problems in each quadrant. And the sine array will need to be re-generated with 256 entries instead of 200. I can do the work if you’d like.

EDIT: I went ahead and did it. Here is our new super speedy and slightly higher resolution sine lookup Also cosine can be moved to foc_utils.h as inline float _cos(float a) { return _sin(a + _PI_2); }
I also discovered a minor problem with the old one, the sine_array was type int, which is 16 bits on AVR but 32 bits on ARM, so was wasting 400 bytes of flash.
All the code that calls _sin and _cos will need to be evaluated for whether associated _normalizeAngle calls can be removed.

// int array instead of float array
// 4x256 points per 360 deg
// 2x storage save (int 2Byte float 4 Byte )
// Generation code:
//	FILE *file = fopen("SineFile.txt", "wt");
//	for(int i = 0; i <= 256; i++) fprintf(file, "%i,", (int)(sin(i * M_PI / 512) * 10000 + 0.5));
//	fclose(file);
uint16_t sine_array[257] = {0,61,123,184,245,307,368,429,491,552,613,674,736,797,858,919,980,1041,1102,1163,1224,1285,1346,1407,1467,1528,1589,1649,1710,1770,1830,1891,1951,2011,2071,2131,2191,2251,2311,2370,2430,2489,2549,2608,2667,2726,2785,2844,2903,2962,3020,3078,3137,3195,3253,3311,3369,3427,3484,3542,3599,3656,3713,3770,3827,3883,3940,3996,4052,4108,4164,4220,4276,4331,4386,4441,4496,4551,4605,4660,4714,4768,4822,4876,4929,4982,5035,5088,5141,5194,5246,5298,5350,5402,5453,5505,5556,5607,5657,5708,5758,5808,5858,5908,5957,6006,6055,6104,6152,6201,6249,6296,6344,6391,6438,6485,6532,6578,6624,6670,6716,6761,6806,6851,6895,6940,6984,7028,7071,7114,7157,7200,7242,7285,7327,7368,7410,7451,7491,7532,7572,7612,7652,7691,7730,7769,7807,7846,7883,7921,7958,7995,8032,8068,8105,8140,8176,8211,8246,8280,8315,8349,8382,8416,8449,8481,8514,8546,8577,8609,8640,8670,8701,8731,8761,8790,8819,8848,8876,8904,8932,8960,8987,9013,9040,9066,9092,9117,9142,9167,9191,9215,9239,9262,9285,9308,9330,9352,9373,9395,9415,9436,9456,9476,9495,9514,9533,9551,9569,9587,9604,9621,9638,9654,9670,9685,9700,9715,9729,9743,9757,9770,9783,9796,9808,9820,9831,9842,9853,9863,9873,9883,9892,9901,9909,9917,9925,9932,9939,9946,9952,9958,9963,9968,9973,9977,9981,9985,9988,9991,9993,9995,9997,9998,9999,10000,10000};

// function approximating the sine calculation by using fixed size array
// precision +-0.003108
// a is in radians. No restriction on range.
float _sin(float a) {
  unsigned int i = ((unsigned int)(a * (2048 / _2PI) + 1) >> 1) & 0x3ff;
  if (i < 256)
    return 0.0001f*sine_array[i];
  else if(i < 512)
    return 0.0001f*sine_array[512 - i];
  else if(i < 768)
    return -0.0001f*sine_array[-512 + i];
  else
    return -0.0001f*sine_array[1024 - i];
}

EDIT2: Better yet, we can do sine and cosine with a single float-to-int conversion and series of if statements

void _sincos(float a, float *s, float *c) {
  unsigned int i = ((unsigned int)(a * (2048 / _2PI) + 1) >> 1) & 0x3ff;
  if (i < 256) {
    *s = 0.0001f*sine_array[i];
    *c = 0.0001f*sine_array[256 - i];
  }
  else if(i < 512) {
    *s = 0.0001f*sine_array[512 - i];
    *c = -0.0001f*sine_array[-256 + i];
  }
  else if(i < 768) {
    *s = -0.0001f*sine_array[-512 + i];
    *c = -0.0001f*sine_array[768 - i];
  }
  else {
    *s = -0.0001f*sine_array[1024 - i];
    *c = 0.0001f*sine_array[-768 + i];
  }
}

From my evaluation, the biggest issue is that the velocity/angle PID loops operate on angle values, so pid.cpp/h would have to be converted to fixed-point math or else would be very slow with all the float/int conversions. But the current sense PID would still need to operate on float values, so then it would be slow with conversions…

User code doing arithmetic on angle values would need to be aware of the caveats of fixed-point math, which might intimidate new users who are inexperienced with programming in general. But as long as we have functions or macros for doing multiplication/division, it would probably be ok. And a lot of things would be easier rather than more difficult, as I said back in post #16.

Another factor is that fixed-point division can be even slower than float division, so all the internal code will need to be evaluated for whether divisions can be eliminated or carefully done without overflowing 32 bits or losing too much precision.

runger · March 21, 2023, 9:13am

Woah, that was fast!

Why not change the scale from 10000x to 8192 or 16384 also? Then the multiplication with 0.0001 could become a shift with ldexp()… but I’m not sure if it would really be faster than a multiplication on the FPU…
In fact, if the LUT type is uint_16, shouldn’t we be using 16bits ideally, to keep the most bits of precision?

Also why does your array have 257 elements? That’s one too many?

Did you run any accuracy tests comparing it to the old _sin() and the stdlib sin()?

I will think a bit about the best way to integrate this and consult with Antun also. Its not so simple to go from 200 entries to 256 entries for example, as this will mean many of the examples won’t compile any more for Arduino UNO (or the Lepton), where we are pretty much at the memory limit already…

Yeah, I noted that also, but usually it’s only on the 8-bit MCUs we are tight on memory. However, Lepton and BluePill boards would also benefit from saving a few bytes

To be clear, I don’t have any intention to switch the code-base to fixed point maths, and I doubt Antun would like the idea… the main aim of SimpleFOC is to be simple and understandable, to help people learn about FOC and BLDC control. Making the code-base complex would be counter to that goal.

For people who want highly optimised motor control code, there are offerings from STMicro, TI and others which are already tuned to their MCUs…

dekutree64 · March 21, 2023, 10:23am

As in the comment, the precision measured against stdlib (on PC) is ±0.003108. I only kept the 0.0001 multiplier because that’s how it was before, and is already plenty of precision. But that’s a nice idea using ldexp. I’ll do some speed testing and see which way would be faster.

The array has 257 entries because otherwise the wave has to turn around before it reaches the exact 1.0 at pi/2. The previous entry rounds up to 1.0 anyway, but if you generated the array with higher precision you would see that it is actually slightly below 1. You can change the generation code to sin(i * M_PI/2 / 255) * 10000 + 0.5 to fit in 256 entries and still retain ±0.003109 precision, but the lookup code becomes:

unsigned short sine_array[256] = {0,62,123,185,246,308,370,431,493,554,616,677,739,800,861,923,984,1045,1107,1168,1229,1290,1351,1412,1473,1534,1595,1656,1716,1777,1837,1898,1958,2019,2079,2139,2199,2260,2319,2379,2439,2499,2558,2618,2677,2737,2796,2855,2914,2973,3032,3090,3149,3207,3265,3324,3382,3439,3497,3555,3612,3670,3727,3784,3841,3898,3955,4011,4067,4124,4180,4235,4291,4347,4402,4457,4512,4567,4622,4677,4731,4785,4839,4893,4947,5000,5053,5106,5159,5212,5264,5317,5369,5421,5472,5524,5575,5626,5677,5727,5778,5828,5878,5928,5977,6026,6075,6124,6173,6221,6269,6317,6365,6412,6459,6506,6553,6599,6645,6691,6737,6782,6827,6872,6917,6961,7005,7049,7093,7136,7179,7222,7264,7307,7348,7390,7431,7473,7513,7554,7594,7634,7674,7713,7752,7791,7829,7867,7905,7943,7980,8017,8054,8090,8126,8162,8197,8233,8267,8302,8336,8370,8403,8437,8470,8502,8534,8566,8598,8629,8660,8691,8721,8751,8781,8810,8839,8868,8896,8924,8952,8979,9006,9032,9059,9085,9110,9135,9160,9185,9209,9233,9256,9280,9302,9325,9347,9369,9390,9411,9432,9452,9472,9491,9511,9529,9548,9566,9584,9601,9618,9635,9651,9667,9683,9698,9713,9727,9741,9755,9768,9781,9794,9806,9818,9830,9841,9852,9862,9872,9882,9891,9900,9908,9916,9924,9932,9939,9945,9951,9957,9963,9968,9973,9977,9981,9985,9988,9991,9993,9995,9997,9998,9999,10000,10000};

void _sincos(float a, float *sin, float *cos) {
  unsigned int i = ((unsigned int)(a * (255*8 / (M_PI*2)) + 1) >> 1) & 0x3ff;
  if (i < 255) {
    *sin = 0.0001f*sine_array[i];
    *cos = 0.0001f*sine_array[255 - i];
  }
  else if(i < 511) {
    *sin = 0.0001f*sine_array[510 - i];
    *cos = -0.0001f*sine_array[-255 + i];
  }
  else if(i < 766) {
    *sin = -0.0001f*sine_array[-510 + i];
    *cos = -0.0001f*sine_array[765 - i];
  }
  else {
    *sin = -0.0001f*sine_array[1020 - i];
    *cos = 0.0001f*sine_array[-765 + i];
  }
}

On ARM those constants can’t fit in an 8-bit shifted immediate value so it will be slower and take more than 2 bytes of additional code space to store them. I’m less knowledgeable about the details of compiled code for AVR, so I’m not sure if there would be any difference on it.

You could certainly save some space by skipping the combined _sincos. And the simplified _cos will save a bit compared to the old one. The only way around using a 256 (or 257) entry table would be to interpolate between entries. Then 64 would probably be ok. I’ll give it a try.

It certainly won’t overflow Lepton, since the array is actually 286 bytes smaller with the proper data type now

EDIT: Yep, that works Actually better precision, ±0.000243. And the constants all easily fit into 8 bits now, so you could use a 64 entry table if that 65th one really bothers you…

uint16_t sine_array[65] = {0,245,491,736,980,1224,1467,1710,1951,2191,2430,2667,2903,3137,3369,3599,3827,4052,4276,4496,4714,4929,5141,5350,5556,5758,5957,6152,6344,6532,6716,6895,7071,7242,7410,7572,7730,7883,8032,8176,8315,8449,8577,8701,8819,8932,9040,9142,9239,9330,9415,9495,9569,9638,9700,9757,9808,9853,9892,9925,9952,9973,9988,9997,10000};

uint16_t lerp(uint16_t a, uint16_t b, uint16_t t) { return a + (((b - a) * t) >> 8); }

float _sin(float a) {
  unsigned int i = (unsigned int)(a * 65536 / (M_PI*2));
  unsigned int frac = i & 0xff;
  i = (i >> 8) & 0xff;
  if (i < 64)
    return 0.0001f*lerp(sine_array[i], sine_array[i + 1], frac);
  else if(i < 128)
    return 0.0001f*lerp(sine_array[128 - i], sine_array[127 - i], frac);
  else if(i < 192)
    return -0.0001f*lerp(sine_array[-128 + i], sine_array[-127 + i], frac);
  else
    return -0.0001f*lerp(sine_array[256 - i], sine_array[255 - i], frac);
}

EDIT2: Alternative style. And even better precision, ±0.000161 using for(int i = 0; i <= 64; i++) fprintf(file, "%i,", (int)(sin((i * M_PI/2) / 64) * 32768 + 0.5)); and ldexp.

uint16_t sine_array[65] = {0,804,1608,2411,3212,4011,4808,5602,6393,7180,7962,8740,9512,10279,11039,11793,12540,13279,14010,14733,15447,16151,16846,17531,18205,18868,19520,20160,20788,21403,22006,22595,23170,23732,24279,24812,25330,25833,26320,26791,27246,27684,28106,28511,28899,29269,29622,29957,30274,30572,30853,31114,31357,31581,31786,31972,32138,32286,32413,32522,32610,32679,32729,32758,32768};

float _sin(float angle) {
  int i = (int)(angle * 65536 / (M_PI*2));
  int a, b, frac = i & 0xff;
  i = (i >> 8) & 0xff;
  if (i < 64)
    a = sine_array[i], b = sine_array[i + 1];
  else if(i < 128)
    a = sine_array[128 - i], b = sine_array[127 - i];
  else if(i < 192)
    a = -sine_array[-128 + i], b = -sine_array[-127 + i];
  else
    a = -sine_array[256 - i], b = -sine_array[255 - i];
  return ldexp((a + (((b - a) * frac) >> 8)), -15);
}

EDIT3: lol, even an 8+1 entry table matches the precision of the original, ±0.00489. I’ll go with 16+1, precision ±0.00129. Maybe now we’ll have space for _sincos.

uint16_t sine_array[17] = {0,3212,6393,9512,12540,15447,18205,20788,23170,25330,27246,28899,30274,31357,32138,32610,32768};

float _sin(float angle) {
  int i = (int)(angle * 4*16*256 / (M_PI*2)); // 4 quadrants, 16 points, 256 fraction
  int a, b, frac = i & 0xff;
  i = (i >> 8) & 0x3f;
  if (i < 16)
    a = sine_array[i], b = sine_array[i + 1];
  else if(i < 32)
    a = sine_array[32 - i], b = sine_array[31 - i];
  else if(i < 48)
    a = -sine_array[-32 + i], b = -sine_array[-31 + i];
  else
    a = -sine_array[64 - i], b = -sine_array[63 - i];
  return ldexp((a + (((b - a) * frac) >> 8)), -15);
}

float _sincos(float angle, float *s, float *c) {
  int i = (int)(angle * 4*16*256 / (M_PI*2));
  int sa, sb, ca, cb, frac = i & 0xff;
  i = (i >> 8) & 0x3f;
  if (i < 16) {
    sa = sine_array[i], sb = sine_array[i + 1];
    ca = sine_array[16 - i], cb = sine_array[15 - i];
  }
  else if(i < 32) {
    sa = sine_array[32 - i], sb = sine_array[31 - i];
    ca = -sine_array[-16 + i], cb = -sine_array[-15 + i];
  }
  else if(i < 48) {
    sa = -sine_array[-32 + i], sb = -sine_array[-31 + i];
    ca = -sine_array[48 - i], cb = -sine_array[47 - i];
  }
  else {
    sa = -sine_array[64 - i], sb = -sine_array[63 - i];
    ca = sine_array[-48 + i], cb = sine_array[-47 + i];
  }
  *s = ldexp((sa + (((sb - sa) * frac) >> 8)), -15);
  *c = ldexp((ca + (((cb - ca) * frac) >> 8)), -15);
}