Embedded World 2023 - STM32 CORDIC CO-PROCESSOR

robca · April 26, 2023, 11:17pm

I’d love to help, please. I discovered a long time ago that I’m not a particularly good architect, but when it comes to debugging and optimizing embedded systems, usually I can get good results

One thing you can quickly try is to use arm_float_to_q31() and arm_q31_to_float() to convert values, together with -O3 optimization (just add #include "arm_math.h" to your code; it should pick up the optimized lib file for your processor if everything is set up correctly).
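As a rough picture of what such a conversion involves, here is a portable sketch. It is not the CMSIS-DSP source, and float_to_q31 / q31_to_float are made-up names: a float in [-1, 1) is scaled by 2^31 and saturated, and the reverse direction divides by 2^31.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helpers, not the CMSIS-DSP implementation.
// q31 stores a value in [-1, 1) as a signed 32-bit integer scaled by 2^31.
int32_t float_to_q31(float x) {
    // Scale in double so the rounding and the clamping below are exact.
    // NaN input is not handled in this sketch.
    double scaled = std::round(static_cast<double>(x) * 2147483648.0);
    if (scaled > 2147483647.0) return INT32_MAX;   // saturate just below +1.0
    if (scaled < -2147483648.0) return INT32_MIN;  // saturate at -1.0
    return static_cast<int32_t>(scaled);
}

float q31_to_float(int32_t q) {
    return static_cast<float>(q) / 2147483648.0f;
}
```

Note that +1.0 itself is not representable in q31, which is why it saturates to the largest positive value.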

More than anything, I’d like to understand if it’s even worth looking at optimizing the math in the support files. Not knowing the code yet and not having run a profiler, it’s hard for me to understand if the math used consumes a significant percent of the processor cycles, or if it’s small enough not to be worth looking into. In the past I wanted to optimize everything, but sometimes even improving a function call by 50%, resulted in no real improvement in the overall code.
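The effect described here is Amdahl's law: the overall gain is limited by the share of run time the optimized code accounts for. A tiny sketch, with illustrative names and numbers that are not measurements from this project:

```cpp
// Overall speedup when a fraction `p` of the run time is made `s` times
// faster (Amdahl's law). Both parameter names are illustrative.
double overall_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

For example, if the math were 10% of the loop and became twice as fast, overall_speedup(0.10, 2.0) is about 1.05, so the whole loop would only get roughly 5% faster.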

If it’s worth it, I’d like to look at ways to create an optimized version of the math code for ARM and even for the STM32G4 (if really worth it). CORDIC is great when it can be run asynchronously, which doesn’t seem to be the case here (unless there is a way to pre-calculate the next values while the processor is sending the current data through the GPIOs. If so, CORDIC+DMA math would use no processor cycles, just fill the right variables “magically” while the processor sends GPIO data).

Candas1 · April 27, 2023, 6:40am

This foc implementation uses q31

Juan-Antonio_Soren_E · April 27, 2023, 7:09am

Neat!

I see they use;

// inverse Park transformation
arm_inv_park_q31

We live, we learn. Didn’t know that existed

I’m in no way against the SimpleFOC implementation; it is truly a good learning platform. Sometimes, though, the “Simple” way of doing things is to do it dedicated for the specific MCU. Once we look behind the looking glass, it becomes clear that it is actually quite non-trivial to support this many MCUs. I think we will see lots of projects using SFOC as a starting point, morphing into increasingly optimized and dedicated approaches.

I see there is also a float approach.

https://arm-software.github.io/CMSIS_5/DSP/html/group__inv__park.html

I mean, this is a simple way to set PWM duties:

TIM1->CCR1

robca · April 27, 2023, 4:24pm

Yes, I mentioned that in my previous message. The CMSIS DSP library has support for various transforms, including Clarke and Park, and their inverse, in Q31 and F32 float formats.

Even if the CMSIS library provides the C code used for implementation, in reality the linker uses a precompiled lib file (libarm_cortexMxx_math.a), where Mxx is the processor used, M0, M3 or M4 and its endianness. Those libraries are hand optimized and much faster than the generic C math libraries for the same functions. And, unlike CORDIC, those work for any ARM core, which makes them much more generic. Even for M0 and M3 cores, with no FP unit, the CMSIS calls are still faster than the generic libraries (not as dramatic, though).

Q31 math is fixed point, so it runs on the M4’s integer/DSP instructions rather than the FP unit, and it can be faster than F32 in some cases; most functions are available in both formats. E.g. there is an arm_sin_f32(float32_t x) and an arm_sin_q31(q31_t x)

runger · April 27, 2023, 9:04pm

Hey, thanks for your interest and sharing your knowledge on this topic

I did actually time the arm_sin_f32 method to compare it, and the lookup table is still better. You’ll find the results if you scroll upwards a bit.

I’ve made the _sin() function that SimpleFOC uses overrideable (weakly bound) so now everyone can use their favorite MCU optimized method if they want to.
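For anyone unfamiliar with weak binding, here is a minimal sketch of the mechanism, assuming a GCC or Clang toolchain. fast_sin and my_cordic_sin are placeholder names, not the actual SimpleFOC symbols.

```cpp
#include <cmath>

// Library side: a default implementation marked weak (GCC/Clang attribute).
// `fast_sin` is a placeholder name for this sketch.
__attribute__((weak)) float fast_sin(float a) {
    return std::sin(a);
}

// Application side, in a different source file: an ordinary (strong)
// definition with the same signature replaces the weak one at link time:
//
//   float fast_sin(float a) { return my_cordic_sin(a); }  // hypothetical
```

If no strong definition is linked in, the weak default is used, so the library itself stays platform-neutral.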

But in terms of the project, we aim to be cross platform, supporting as many of the commonly used Arduino MCUs as we can.
So we’d prefer to avoid ARM specific code where we can…

robca · April 27, 2023, 11:52pm

That’s what I assumed, thanks for the info.

I was wondering, though, if you’d be open to having contributions or examples for a specific family of processors. Let’s suspend disbelief and assume that it’s possible to, say, improve math performance on ARM by 50%. In a case like that, how would one contribute code that runs either with an #ifdef or replacing a weak implementation? I mean, I know I could fork the project, but I’m talking about a contribution to the project akin to the SimpleFOC drivers, which are by definition specific to a device/processor (similar to Arduino-FOC-drivers/src/encoders/stm32hwencoder at master · simplefoc/Arduino-FOC-drivers · GitHub).

While STM32G4xx code might be overly specialized, ARM covers quite a lot of Arduino boards.

Reason I’m asking, is that if there is an opportunity to contribute more broadly, I’d be looking to make the code more generic and documented. Otherwise, I’d just hack away on my own . And, to be clear, I have no idea if an improvement is even possible, just something I plan to look into on my own.

Juan-Antonio_Soren_E · April 28, 2023, 7:29am

Just testing the optimization build flags to see how performance trades off against memory.

Both scenarios are using the CORDIC with conversion from/to float.

Without specifying optimization flag:

RAM: [== ] 18.2% (used 5956 bytes from 32768 bytes)
Flash: [======= ] 60.9% (used 79772 bytes from 131072 bytes)

time per iterasion:   22.8924
time per iterasion:   22.8226
time per iterasion:   23.0469

With;

build_unflags = -Os
build_flags = 
    -O3

Timing with micros();

time per iterasion:   21.0512
time per iterasion:   21.0108
time per iterasion:   20.8874

RAM: [== ] 18.8% (used 6168 bytes from 32768 bytes)
Flash: [======= ] 72.3% (used 94708 bytes from 131072 bytes)

Not too bad. It does use some extra memory, but my code is bloated at the moment. I’ll try to incorporate arm_math.

Using the SFOC implementation with the -O3 flag:

time per iterasion:   19.0401
time per iterasion:   18.8373
time per iterasion:   19.2136

Juan-Antonio_Soren_E · April 28, 2023, 8:13am

If I’m not mistaken, we should be able to use the uint16_t encoder value with the arm_f16_to_q15 directly?

The CORDIC also takes 16-bit arguments and gives 16-bit results. I’m not sure it can take 16-bit arguments and output 32-bit results.

Sorry, we need the electrical angle. Not the encoder angle.

Candas1 · April 28, 2023, 8:26am

As we are at it, you can even place some of the functions/code in ram instead of flash.
That might help on chips for which reading the flash is slower than the ram.
For example, the GD32F103RCT6 is a clone of the STM32F103RCT6 but has zero wait states for flash access.

So you can spend a lot of time optimizing for a given chip, but why do you need faster code if I may ask ?

Juan-Antonio_Soren_E · April 28, 2023, 8:28am

Just curious. Now that I was made aware of this arm_math.h lib, I’m giving it a spin.

Juan-Antonio_Soren_E · April 28, 2023, 10:36am

There is a way to integrate the q31_t or q15_t from the setPhaseVoltageCORDIC(voltage.q, voltage.d, electrical_angle); and onwards, without wrapping the values, but it requires that the setpwm(); uses q31/q15 instead of floats. Since this returns q31: arm_inv_park_q31 (q31_t Id, q31_t Iq, q31_t *pIalpha, q31_t *pIbeta, q31_t sinVal, q31_t cosVal)

This is the CORDIC output without wrapping.

We can also convert the q31 value. I’ll try that.

Nope, wait a sec. Id and Iq are not q31… hmm

Ok, I see now that I previously made a mistake declaring the CORDIC result as a float. With the arm_math.h lib we can declare it as a q31_t. I’ll try to compare the conversion to float recommended by ST and the arm_math.h one…

Edit: can’t use the arm_math.h conversion. There is some mixup with const q31_t and q31_t, I think.
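To make the q31 path concrete, here is a portable sketch of an inverse Park transform that stays in q31 end to end. The argument order follows the arm_inv_park_q31 signature quoted above, but inv_park_q31_sketch and its helpers are hypothetical names, and the real CMSIS routine may round and saturate differently.

```cpp
#include <cstdint>

// Clamp a 64-bit intermediate back into the q31 range.
static int32_t saturate_q31(int64_t v) {
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return static_cast<int32_t>(v);
}

// q31 * q31 gives a q62 product; shifting right by 31 returns to q31.
// (Right-shifting a negative value is arithmetic on the usual ARM/x86 compilers.)
static int64_t q31_mul(int32_t a, int32_t b) {
    return (static_cast<int64_t>(a) * b) >> 31;
}

// alpha = d*cos - q*sin, beta = d*sin + q*cos, all in q31.
void inv_park_q31_sketch(int32_t id, int32_t iq, int32_t *alpha, int32_t *beta,
                         int32_t sin_val, int32_t cos_val) {
    *alpha = saturate_q31(q31_mul(id, cos_val) - q31_mul(iq, sin_val));
    *beta  = saturate_q31(q31_mul(id, sin_val) + q31_mul(iq, cos_val));
}
```

With sin_val and cos_val taken straight from the CORDIC result register, no float conversion would be needed until the PWM duty is computed.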

Juan-Antonio_Soren_E · April 28, 2023, 11:32am

Ok,

Here is the output from:

arm_inv_park_f32 (0.4f, 0.2f, &alpha, &beta, value_f32_sine2, value_f32_cosine2);

Juan-Antonio_Soren_E · April 28, 2023, 11:45am

time per iterasion:   2.3153
time per iterasion:   2.3127
time per iterasion:   2.3192
time per iterasion:   2.3140

@robca

434.78 kHz loop freq!

How can this be so slow, in comparison?

 Ualpha =  _ca * Ud - _sa * Uq;  // -sin(angle) * Uq;
 Ubeta =  _sa * Ud + _ca * Uq;    //  cos(angle) * Uq;

Sorry, my mistake. I tricked myself: it was running open loop…

time per iterasion:   21.3596
time per iterasion:   21.3598
time per iterasion:   21.3595
time per iterasion:   21.3596
time per iterasion:   21.3596

robca · April 30, 2023, 6:24pm

I finally found some time to look into the CORDIC unit and run some tests on my Nucleo-G474RE board (same as the G431 in the ESC, from a clock and CORDIC point of view). Interestingly enough, the sine and cosine functions are run at the same time, and with one execution both values can be read. CORDIC can be used in something called “zero overhead mode”, where the execution is started and, if the code tries to read the result register before the result is ready, the processor execution is suspended until the result is ready. That means that there is no need to poll for the result.
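Why both values come out of a single run is easier to see in software. Below is the textbook rotation-mode CORDIC, written in floating point for readability; the peripheral works in fixed point and handles the full angle range itself, so this is only an illustration of the principle, not a model of the hardware. SinCos and cordic_sincos are made-up names.

```cpp
#include <cmath>

struct SinCos { double sine; double cosine; };

// Rotation-mode CORDIC: rotate the vector (K, 0) towards `angle` using only
// shift-sized steps. When the residual angle z reaches ~0, x holds the cosine
// and y the sine, so both results fall out of the same iterations.
// Valid for |angle| <= pi/2 in this sketch (no range reduction).
SinCos cordic_sincos(double angle, int iterations = 24) {
    // Pre-compensate the gain that the pseudo-rotations introduce.
    double k = 1.0;
    for (int i = 0; i < iterations; ++i)
        k /= std::sqrt(1.0 + std::ldexp(1.0, -2 * i));

    double x = k, y = 0.0, z = angle;
    for (int i = 0; i < iterations; ++i) {
        double d = (z >= 0.0) ? 1.0 : -1.0;   // rotation direction
        double p = std::ldexp(1.0, -i);       // 2^-i, a bit shift in hardware
        double x_next = x - d * y * p;
        double y_next = y + d * x * p;
        z -= d * std::atan(p);                // a table lookup in hardware
        x = x_next;
        y = y_next;
    }
    return {y, x};
}
```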

The other added benefit in zero overhead mode is that the processor can execute ~20 instructions between the time CORDIC starts and the results are ready with no impact on overall timing. Looking at some of the SimpleFOC functions using _sin and _cos, there are opportunities to leverage this (for example, if you look at my code, converting the cosine value before reading the second register, saves 20 usec for the loop)

I also got rid of the horrendous HAL coding (*) and used direct register access.

Reusing the code posted here before and calculating both _sin and _cos, here’s what it looks like now with default optimizations. Some improvement is still possible, I haven’t spent much time on it. Most of the time for CORDIC is spent converting back and forth from float. Even so, CORDIC is significantly faster and offers better precision

Starting...
Initializing CORDIC...
CORDIC initialized.

Timing CORDIC vs stdlib vs SimpleFOC Sine and Cosine calculations...

CORDIC:
CORDIC Time (us) for 3217 steps: 1351
Result: 2048.99

SimpleFOC _sin _cos:
SimpleFOC _sin _cos time (us) for 3217 steps: 2035
Result: 2048.97

stdlib sin:
stdlib sin time (us) for 3217 steps: 5940
Result: 2049.00

Comparing accuracy...
RMS difference between CORDIC and stdlib: 0.00000059
RMS difference between SimpleFOC and stdlib: 0.00161161
Test complete.

If some of the FOC transforms can be fully executed in Q31 math (which is relatively simple to implement), the CORDIC gains can be much bigger. With minimal back and forth q31<->float, CORDIC can be 2-3 times faster. At the cost of branching, though

It’s pretty clear, on the other hand, that the ARM math library offers no value compared to what SimpleFOC does, kudos on a well done optimization.

(*) HAL is the most pointless abstraction layer I ever saw. It’s bloated, wasteful and more importantly doesn’t really abstract most peripherals: the STM32 family offers many variations of the DAC, for example, and each processor has its own HAL abstraction, forcing you to change the code anyway. I hate HAL with a passion and use the LL libraries or direct register access whenever I can. CORDIC is especially easy to use, having 3 registers: one configuration that never changes, one to enter values, one to read results (the write register can be used once or twice, depending on the function, and the read register also once or twice based on the function)

Here’s the code (poorly written, I’m using global variables for the results, returning the values in a result structure would be best, but I just wanted to see what’s possible)

#include <Arduino.h>
#include <SimpleFOC.h>
#include "common/foc_utils.h"

#include "stm32g4xx_ll_cordic.h"
#include "stm32g4xx_ll_rcc.h"
#include "stm32g4xx_ll_bus.h"
#include "arm_math.h"

#define PI32f 3.141592f

float cordic_sin_value;
float cordic_cos_value;

void CORDIC_Config(void)
{
  LL_AHB1_GRP1_EnableClock(LL_AHB1_GRP1_PERIPH_CORDIC);

  /* Configure CORDIC peripheral */
  LL_CORDIC_Config(CORDIC, LL_CORDIC_FUNCTION_COSINE, /* cosine function */
                   LL_CORDIC_PRECISION_6CYCLES,       /* max precision for q1.31 cosine */
                   LL_CORDIC_SCALE_0,                 /* no scale */
                   LL_CORDIC_NBWRITE_1,               /* One input data: angle. Second input data (modulus) is 1 after cordic reset */
                   LL_CORDIC_NBREAD_2,                /* Two output data: cosine, then sine */
                   LL_CORDIC_INSIZE_32BITS,           /* q1.31 format for input data */
                   LL_CORDIC_OUTSIZE_32BITS);         /* q1.31 format for output data */
}

void cordic_calc(float angle)
{
  // convert angle float to CORDIC q31 format
  int32_t angle31 = (q31_t)((angle / PI32f) * 0x80000000);

  /* Write angle and start CORDIC execution */
  CORDIC->WDATA = angle31;

// code here can be executed in parallel with CORDIC with no impact on timing

  /* Read cosine */
  q31_t cosOutput = (int32_t)CORDIC->RDATA;

  // convert q31 result to float
  cordic_cos_value = (float)cosOutput / (float)0x80000000;

  /* Read sine */
  q31_t sinOutput = (int32_t)CORDIC->RDATA;

  // convert q31 results to float
  cordic_sin_value = (float)sinOutput / (float)0x80000000;
}

void setup()
{
  Serial.begin(115200);
  while (!Serial)
   ;
  delay(3000);
  Serial.println("Starting...");
  Serial.println("Initializing CORDIC...");
  CORDIC_Config();
  Serial.println("CORDIC initialized.");
  Serial.println();
  Serial.println();
}

void loop()
{

  Serial.println("Timing CORDIC vs stdlib vs SimpleFOC Sine and Cosine calculations...");
  Serial.println();

  Serial.println("CORDIC:");

  float step = 1 / 1024.0f;
  float res = 0.0;
  int steps = 0;
  long ts = micros();

  for (float i = 0.0; i < _PI; i += step)
  {
   cordic_calc(i);
   res += cordic_sin_value;
   res += cordic_cos_value;
   steps++;
  }
  long ts_end = micros();
  Serial.print("CORDIC Time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("SimpleFOC _sin _cos:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i += step)
  {
    res += _sin(i);
    res += _cos(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("SimpleFOC _sin _cos time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("stdlib sin:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0; i < _PI; i += step)
  {
    res += sin(i);
    res += cos(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("stdlib sin time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("Comparing accuracy...");
  float rmsdiff1 = 0.0f;
  float rmsdiff2 = 0.0f;
  steps = 0;
  for (float i = 0.0; i < _PI; i += step)
  {
    float diff1 = 0.0f;
    float diff2 = 0.0f;
    cordic_calc(i);
    float res1 = cordic_sin_value;
    float res2 = _sin(i);
    float res3 = sin(i);

    diff1 = res3 - res1;
    if (diff1 > 1.0)
    {
      Serial.print("CORDIC vs stdlib at i=");
      Serial.print(i, 8);
      Serial.print(": ");
      Serial.println(diff1, 8);
    }

    diff2 = res3 - res2;
    if (diff2 > 1.0)
    {
      Serial.print("SimFOC vs stdlib at i=");
      Serial.print(i, 8);
      Serial.print(": ");
      Serial.println(diff2, 8);
    }

    rmsdiff1 += diff1 * diff1;
    rmsdiff2 += diff2 * diff2;
    steps++;
  }
  rmsdiff1 = sqrt(rmsdiff1 / steps);
  rmsdiff2 = sqrt(rmsdiff2 / steps);
  Serial.print("RMS difference between CORDIC and stdlib: ");
  Serial.println(rmsdiff1, 8);
  Serial.print("RMS difference between SimpleFOC and stdlib: ");
  Serial.println(rmsdiff2, 8);

  Serial.println("Test complete.");
  while (1)
    ;
}

runger · April 30, 2023, 7:07pm

Hey @robca ,

Thanks so much for trying this out, and for your interest in this

The version of SimpleFOC _sin() currently in the library is actually still the older version, not the one optimised by Deku (mainly) and myself.

You could try this version if you wanted:

unsigned short sine_array3[129] = {0, 402, 804, 1206, 1608, 2009, 2411, 2811, 3212, 3612, 4011, 4410, 4808, 5205, 5602, 5998, 6393, 6787, 7180, 7571, 7962, 8351, 8740, 9127, 9512, 9896, 10279, 10660, 11039, 11417, 11793, 12167, 12540, 12910, 13279, 13646, 14010, 14373, 14733, 15091, 15447, 15800, 16151, 16500, 16846, 17190, 17531, 17869, 18205, 18538, 18868, 19195, 19520, 19841, 20160, 20475, 20788, 21097, 21403, 21706, 22006, 22302, 22595, 22884, 23170, 23453, 23732, 24008, 24279, 24548, 24812, 25073, 25330, 25583, 25833, 26078, 26320, 26557, 26791, 27020, 27246, 27467, 27684, 27897, 28106, 28311, 28511, 28707, 28899, 29086, 29269, 29448, 29622, 29792, 29957, 30118, 30274, 30425, 30572, 30715, 30853, 30986, 31114, 31238, 31357, 31471, 31581, 31686, 31786, 31881, 31972, 32058, 32138, 32214, 32286, 32352, 32413, 32470, 32522, 32568, 32610, 32647, 32679, 32706, 32729, 32746, 32758, 32766, 32768};
float deku_sin129(float a) {
  unsigned int i = ((unsigned int)(a * (128*8 /_2PI) + 1) >> 1) & 0x1ff;
  if (i < 128) {
    return (1/32768.0f)*sine_array3[i];
  }
  else if(i < 256) {
    return (1/32768.0f)*sine_array3[256 - i];
  }
  else if(i < 384) {
    return -(1/32768.0f)*sine_array3[-256 + i];
  }
  else {
    return -(1/32768.0f)*sine_array3[512 - i];
  }
}

I’m planning to replace the current version with the optimised one in an upcoming release, but I want to test it on a few more MCU types first.

Regarding the questions from your previous post:

I think we’d be very open to that, and very happy to see MCU-specific optimisations, as long as we can keep them out of the main library, for the reasons already mentioned.

But there are several ways we can make such optimisations available to people without major impact on the code of the main library:

like for the _sin() function, we can make things weakly bound so users can bring their own implementations of certain functions
much of the code is C++, so we can introduce virtual functions and sub-classes, so to make a STM32BLDCMotor class, for example.
we can create alternative, optimised implementations of certain sensors, or other drivers, if needed.
such classes can find a home in the SimpleFOC Drivers library - which is intended for hardware specific code

We would try to make such contributions available, for sure - but as mentioned, in a way that is optional for the users and doesn’t complicate the main codebase.

I think if you pick a MCU family, then very significant improvement is possible, really!

robca · April 30, 2023, 8:42pm

I just tried, after slightly modifying cordic_calc to return both values [void cordic_calc(float angle, result_t * output)] thus improving the total execution time from 1351usec to 1196

Deku runs in 1562, which is a significant improvement compared to the existing one, but still significantly slower than CORDIC. And, unless I did something wrong, deku_sin129() seems to have a slightly larger error than the current implementation, making CORDIC much more precise. Another minor advantage is that CORDIC works from -PI to PI, the SimpleFOC ones only from 0 to PI, which could help with minimizing the need for normalization in some cases (?)
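On the normalization point: since the q31 input is the angle in units of PI, 32-bit wraparound can do the normalization for free. A minimal sketch, assuming a hypothetical helper angle_to_q31 (not part of SimpleFOC or the ST libraries):

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: map an angle in radians to q31 in units of pi, so that
// [-pi, pi) becomes [-1, 1). Angles outside that range wrap around through the
// modular conversion to uint32_t, so no explicit normalization is needed.
int32_t angle_to_q31(float angle_rad) {
    const double kPi = 3.14159265358979323846;
    // Work in double and 64-bit integers so nothing overflows before the wrap.
    long long scaled =
        std::llround(static_cast<double>(angle_rad) / kPi * 2147483648.0);
    // Keep the low 32 bits (well-defined modular arithmetic), then reinterpret
    // them as signed (two's complement on every mainstream compiler).
    return static_cast<int32_t>(static_cast<uint32_t>(scaled));
}
```

With this, 3*PI/2 and -PI/2 map to the same q31 value, and an input of exactly PI becomes -1.0 instead of overflowing the cast.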

Starting...
Initializing CORDIC...
CORDIC initialized.

Timing CORDIC vs stdlib vs SimpleFOC vs Deku Sine and Cosine calculations...

CORDIC:
CORDIC Time (us) for 3217 steps: 1196
Result: 2048.99

SimpleFOC _sin _cos:
SimpleFOC _sin _cos time (us) for 3217 steps: 2112
Result: 2048.97

Deku _sin _cos:
Deku _sin _cos time (us) for 3217 steps: 1562
Result: 2048.98

stdlib sin:
stdlib sin time (us) for 3217 steps: 5364
Result: 2049.00

Comparing accuracy...
RMS difference between CORDIC and stdlib: 0.00000059
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku129 and stdlib: 0.00250501
Test complete.

Once again, I’m running both sin and cos at once and I implemented a deku_cos129() to use the same code for both sin and cos

Thanks for everything else in your reply. Very good points, and if we truly wanted to improve performance when CORDIC is available, probably implementing a custom STM32BLDCMotor class seems the best option, allowing to run some code in parallel with CORDIC, most of the math in q31 and avoiding conversions (e.g. the transforms can be calculated in q31 with no need to convert the sin and cos values back to float)

Adding the code I ran for this test

unsigned short sine_array3[129] = {0, 402, 804, 1206, 1608, 2009, 2411, 2811, 3212, 3612, 4011, 4410, 4808, 5205, 5602, 5998, 6393, 6787, 7180, 7571, 7962, 8351, 8740, 9127, 9512, 9896, 10279, 10660, 11039, 11417, 11793, 12167, 12540, 12910, 13279, 13646, 14010, 14373, 14733, 15091, 15447, 15800, 16151, 16500, 16846, 17190, 17531, 17869, 18205, 18538, 18868, 19195, 19520, 19841, 20160, 20475, 20788, 21097, 21403, 21706, 22006, 22302, 22595, 22884, 23170, 23453, 23732, 24008, 24279, 24548, 24812, 25073, 25330, 25583, 25833, 26078, 26320, 26557, 26791, 27020, 27246, 27467, 27684, 27897, 28106, 28311, 28511, 28707, 28899, 29086, 29269, 29448, 29622, 29792, 29957, 30118, 30274, 30425, 30572, 30715, 30853, 30986, 31114, 31238, 31357, 31471, 31581, 31686, 31786, 31881, 31972, 32058, 32138, 32214, 32286, 32352, 32413, 32470, 32522, 32568, 32610, 32647, 32679, 32706, 32729, 32746, 32758, 32766, 32768};
float deku_sin129(float a)
{
  unsigned int i = ((unsigned int)(a * (128 * 8 / _2PI) + 1) >> 1) & 0x1ff;
  if (i < 128)
  {
    return (1 / 32768.0f) * sine_array3[i];
  }
  else if (i < 256)
  {
    return (1 / 32768.0f) * sine_array3[256 - i];
  }
  else if (i < 384)
  {
    return -(1 / 32768.0f) * sine_array3[-256 + i];
  }
  else
  {
    return -(1 / 32768.0f) * sine_array3[512 - i];
  }
}

float deku_cos129(float a)
{
  float a_sin = a + _PI_2;
  a_sin = a_sin > _2PI ? a_sin - _2PI : a_sin;
  return deku_sin129(a_sin);
}

#include <Arduino.h>
#include <SimpleFOC.h>
#include "common/foc_utils.h"

#include "stm32g4xx_ll_cordic.h"
#include "stm32g4xx_ll_rcc.h"
#include "stm32g4xx_ll_bus.h"
#include "arm_math.h"

#define PI32f 3.141592f   // max precision in (float) due to implementation

typedef struct results
{
  float sin;
  float cos;
} result_t;

void CORDIC_Config(void)
{
  LL_AHB1_GRP1_EnableClock(LL_AHB1_GRP1_PERIPH_CORDIC);

  /* Configure CORDIC peripheral */
  LL_CORDIC_Config(CORDIC, LL_CORDIC_FUNCTION_COSINE, /* cosine function */
                   LL_CORDIC_PRECISION_6CYCLES,       /* max precision for q1.31 cosine */
                   LL_CORDIC_SCALE_0,                 /* no scale */
                   LL_CORDIC_NBWRITE_1,               /* One input data: angle. Second input data (modulus) is 1 after cordic reset */
                   LL_CORDIC_NBREAD_2,                /* Two output data: cosine, then sine */
                   LL_CORDIC_INSIZE_32BITS,           /* q1.31 format for input data */
                   LL_CORDIC_OUTSIZE_32BITS);         /* q1.31 format for output data */
}

void cordic_calc(float angle, result_t * output)
{
  /* Write angle and start CORDIC execution */
  CORDIC->WDATA = (q31_t)((angle / PI32f) * 0x80000000);

  // code here can be executed in parallel with CORDIC with no impact on timing

  /* Read cosine */
  q31_t cosOutput = (int32_t)CORDIC->RDATA;

  // convert q31 result to float
  output->cos = (float)cosOutput / (float)0x80000000;

  /* Read sine */
  q31_t sinOutput = (int32_t)CORDIC->RDATA;

  // convert q31 results to float
  output->sin = (float)sinOutput / (float)0x80000000;
}

void setup()
{
  Serial.begin(115200);
  while (!Serial)
   ;
  delay(1000);
  Serial.println("Starting...");
  Serial.print("Initializing CORDIC...    ");
  CORDIC_Config();
  Serial.println("CORDIC initialized.");
  Serial.println();
}

void loop()
{
  result_t cordic;
  Serial.println("Timing CORDIC vs stdlib vs SimpleFOC vs Deku Sine and Cosine calculations...");
  Serial.println();

  Serial.println("CORDIC:");

  float step = 1 / 1024.0f;
  float res = 0.0;
  int steps = 0;
  long ts = micros();

  for (float i = 0.0f; i < _PI; i += step)
  {
   cordic_calc(i, &cordic);
   res += cordic.sin;
   res += cordic.cos;
   steps++;
  }
  long ts_end = micros();
  Serial.print("CORDIC Time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("SimpleFOC _sin _cos:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0f; i < _PI; i += step)
  {
    res += _sin(i);
    res += _cos(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("SimpleFOC _sin _cos time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("Deku _sin _cos:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0f; i < _PI; i += step)
  {
    res += deku_sin129(i);
    res += deku_cos129(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("Deku _sin _cos time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("stdlib sin:");
  steps = 0;
  res = 0.0f;
  ts = micros();
  for (float i = 0.0f; i < _PI; i += step)
  {
    res += sin(i);
    res += cos(i);
    steps++;
  }
  ts_end = micros();
  Serial.print("stdlib sin time (us) for ");
  Serial.print(steps);
  Serial.print(" steps: ");
  Serial.println(ts_end - ts);
  Serial.print("Result: ");
  Serial.println(res);

  Serial.println();
  Serial.println("Comparing accuracy...");
  float rmsdiff1 = 0.0f;
  float rmsdiff2 = 0.0f;
  float rmsdiff3 = 0.0f;
  steps = 0;
  for (float i = 0.0f; i < _PI; i += step)
  {
    float diff1 = 0.0f;
    float diff2 = 0.0f;
    float diff3 = 0.0f;
    cordic_calc(i, &cordic);
    float res1 = cordic.sin;
    float res2 = _sin(i);
    float res3 = sin(i);
    float res4 = deku_sin129(i);

    diff1 = res3 - res1;
    if (diff1 > 1.0)
    {
      Serial.print("CORDIC vs stdlib at i=");
      Serial.print(i, 8);
      Serial.print(": ");
      Serial.println(diff1, 8);
    }

    diff2 = res3 - res2;
    if (diff2 > 1.0)
    {
      Serial.print("SimFOC vs stdlib at i=");
      Serial.print(i, 8);
      Serial.print(": ");
      Serial.println(diff2, 8);
    }

    diff3 = res3 - res4;
    if (diff3 > 1.0)  // was testing and printing diff2 by mistake
    {
      Serial.print("Deku vs stdlib at i=");
      Serial.print(i, 8);
      Serial.print(": ");
      Serial.println(diff3, 8);
    }

    rmsdiff1 += diff1 * diff1;
    rmsdiff2 += diff2 * diff2;
    rmsdiff3 += diff3 * diff3;
    steps++;
  }
  rmsdiff1 = sqrt(rmsdiff1 / steps);
  rmsdiff2 = sqrt(rmsdiff2 / steps);
  rmsdiff3 = sqrt(rmsdiff3 / steps);
  Serial.print("RMS difference between CORDIC and stdlib: ");
  Serial.println(rmsdiff1, 8);
  Serial.print("RMS difference between SimpleFOC and stdlib: ");
  Serial.println(rmsdiff2, 8);
  Serial.print("RMS difference between Deku129 and stdlib: ");
  Serial.println(rmsdiff3, 8);

  Serial.println("Test complete.");
  while (1)
    ;
}

robca · April 30, 2023, 9:50pm

One more quick note. I recompiled using -Ofast, and here’s what I see: CORDIC 594 usec, current SimpleFOC 1988, Deku_sin 1606. ~2.7 times faster than deku_sin(&cos)

I was sure that -Ofast would optimize the float conversion, but did not expect such a dramatic impact. If anything, I expected -Ofast to benefit more the functions with more code (-Ofast does nothing for the CORDIC itself)

And, yes, surprisingly -Ofast slows down deku_sin by a trivial amount, while it speeds up the current implementation. It’s not the first time that I’ve seen hand-optimized code perform slightly worse when optimizations are turned on.

Starting...
Initializing CORDIC...    CORDIC initialized.

Timing CORDIC vs stdlib vs SimpleFOC vs Deku Sine and Cosine calculations...

CORDIC:
CORDIC Time (us) for 3217 steps: 594
Result: 2048.99

SimpleFOC _sin _cos:
SimpleFOC _sin _cos time (us) for 3217 steps: 1988
Result: 2048.96

Deku _sin _cos:
Deku _sin _cos time (us) for 3217 steps: 1606
Result: 2048.97

stdlib sin:
stdlib sin time (us) for 3217 steps: 5712
Result: 2048.99

Comparing accuracy...
RMS difference between CORDIC and stdlib: 0.00000062
RMS difference between SimpleFOC and stdlib: 0.00161161
RMS difference between Deku129 and stdlib: 0.00250501
Test complete.

If you want to repro (I might very likely have messed something up), add the following to your platformio.ini for the project

build_unflags = -Os
build_flags = -Ofast

dekutree64 · April 30, 2023, 10:10pm

Interesting! So Runger’s bad CORDIC performance was only from HAL library bloat, not from the hardware itself?

FYI, deku_cos129 can return deku_sin129(a + _PI_2); since the wraparound is handled internally.
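A host-side sketch of that simplification. The table here is generated at startup rather than copied from the thread (the thread's sine_array3 holds precomputed constants), and sin129 / cos129 / init_table are names chosen for this example.

```cpp
#include <cmath>
#include <cstdint>

static const float kTwoPi = 6.28318530718f;
static const float kHalfPi = 1.57079632679f;

// Quarter-wave table, 129 entries scaled to 0..32768.
static uint16_t quarter_sine[129];

void init_table() {
    for (int i = 0; i <= 128; ++i)
        quarter_sine[i] = static_cast<uint16_t>(
            std::lround(32768.0 * std::sin(i * 3.14159265358979 / 256.0)));
}

// Same indexing as deku_sin129: 512 steps per turn. The & 0x1ff mask wraps
// any non-negative angle back into a single turn.
float sin129(float a) {
    unsigned int i = ((unsigned int)(a * (128 * 8 / kTwoPi) + 1) >> 1) & 0x1ff;
    if (i < 128) return (1 / 32768.0f) * quarter_sine[i];
    if (i < 256) return (1 / 32768.0f) * quarter_sine[256 - i];
    if (i < 384) return -(1 / 32768.0f) * quarter_sine[i - 256];
    return -(1 / 32768.0f) * quarter_sine[512 - i];
}

// No manual wrap: a + pi/2 may exceed 2*pi, and the mask takes care of it.
float cos129(float a) {
    return sin129(a + kHalfPi);
}
```

For instance, cos129(6.0f) goes through an angle of about 7.57 rad, which the mask folds back into the first quadrant of the table.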

It’s no surprise that the accuracy is lower on that one since it’s just a 129 entry lookup table versus the 200 entry table of SimpleFOC _sin. The 257 version is better, and only causes flash space trouble on Uno. The interpolated version gives much higher precision with less flash space, at the cost of taking 1.5-2x as long.

In addition to RMS error, maximum error is also worth measuring. The comment in foc_utils.cpp says the precision of SimpleFOC _sin is ±0.005, and my non-interpolated 257 entry lookup was ±0.003108 when I measured it against stdlib.
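A small sketch of how both figures could be collected in one pass. ErrorStats and measure_error are illustrative names; the absolute value matters because a signed comparison would miss negative errors.

```cpp
#include <cmath>

struct ErrorStats { double max_abs; double rms; };

// Compare `approx` against `reference` on [lo, hi) in steps of `step`.
ErrorStats measure_error(float (*approx)(float), float (*reference)(float),
                         float lo, float hi, float step) {
    double max_abs = 0.0, sum_sq = 0.0;
    long n = 0;
    for (float x = lo; x < hi; x += step) {
        double diff = static_cast<double>(approx(x)) - reference(x);
        double mag = std::fabs(diff);      // count negative errors too
        if (mag > max_abs) max_abs = mag;
        sum_sq += diff * diff;
        ++n;
    }
    return {max_abs, n ? std::sqrt(sum_sq / n) : 0.0};
}
```

On the target this would be called with the lookup function under test and the stdlib sine as the reference, over the same angle range as the timing loops.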

Another thing to note is that you’re doing a single call to cordic_calc, so you should compete it against the _sincos functions I posted earlier in the thread. With the 257 non-interpolated lookup from post #78, it should be more or less the same time as CORDIC.

It certainly is frustrating that we can’t do super speed optimization without complicating the codebase. I’m half tempted to create a new project for highly optimized STM32-specific FOC with fixed-point angles and Q31 math. But on the other hand it works well enough as it is, and if we want better performance we can just spend a couple bucks more on faster MCUs and brute force it.

robca · April 30, 2023, 10:45pm

I have to admit I haven’t studied all the posts on this thread. HAL definitely adds a ton of unneeded overhead, especially for something as simple as CORDIC. It might be worth it for something as complex as the USB virtual serial port…

I also optimized the q31 ↔ float conversions to take advantage of the FPU optimization. Considering that CORDIC runs in ~10 to 20 CPU clock cycles, all the crud added by HAL adds way more cycles than it saves

Good point! After all, sin and cos are both needed at the same time. As I said, I did not read every single message, sorry.

Assuming I have done things right, I have replaced the SimpleFOC implementation with the sincos from post #78 (and used -Ofast). The new _sincos() is much faster than deku129, but still almost twice as slow as CORDIC

Timing CORDIC vs stdlib vs sincos vs Deku Sine and Cosine calculations...

CORDIC:
CORDIC Time (us) for 3217 steps: 594
Result: 2048.99

sincos _sin _cos:
sincos _sin _cos time (us) for 3217 steps: 1102
Result: 2048.97

Deku _sin _cos:
Deku _sin _cos time (us) for 3217 steps: 1607
Result: 2048.97

stdlib sin:
stdlib sin time (us) for 3217 steps: 5665
Result: 2048.99

Comparing accuracy...
RMS difference between CORDIC and stdlib: 0.00000062
RMS difference between sincos and stdlib: 0.00125244
RMS difference between Deku129 and stdlib: 0.00250501
Test complete.

runger · May 1, 2023, 12:29am

Maybe HAL to some degree, but I think the difference is also the use of “zero overhead mode”, although I thought I had tried this also. So maybe it is HAL.
What surprises me is how simple the code is without HAL - it doesn’t really make sense to use it when the non-HAL code is so straightforward.

That’s pretty impressive performance now.

We should put the CORDIC code in the examples for people who have such MCUs to use it.

@robca Is there any reason you’re using a struct rather than a float[2] as the result holding type?