Embedded World 2023 - STM32 CORDIC CO-PROCESSOR

No idea :slight_smile:

Sometimes my brain works in weird ways. I must have been thinking about returning two values in one variable, and ended up with a struct. The code runs equally well using void cordic_calc(float angle, float * csin, float * ccos) or any similar variation. I’m glad to see you find this useful; please feel free to implement it any way you like in line with the SimpleFOC coding conventions (happy to test/benchmark it).

Rant follows
As for HAL, I’m an old-school embedded programmer, used to writing and reading registers. But I like well-written abstractions like the nRF52 SDK ones: not as fast as direct access, but lightweight, logical and easier to use than direct access. For the STM32 HAL, the interesting thing is that for almost everything, accessing the registers directly gives much faster and cleaner code than going through the HAL. Sometimes you follow a chain of HAL calls to discover that after setting and changing a lot of internal HAL variables and doing a lot of pointless checks, it only does a single register read or write. And, as I said, the HAL is implemented differently for different versions of the same peripheral (say, ADC or timer), forcing changes to the code anyway when moving to a similar processor.

But of all the STM32 peripherals, CORDIC must be the simplest one to manage directly and the one where HAL makes no sense at all. Even using the DMA, it’s easier to write directly to the DMA and CORDIC than to use the HAL. I tried using the DMA, btw, but it’s slower than zero overhead mode for a single calculation. Only when processing a vector while other code runs independently do you see some gain. I haven’t seen anything in SimpleFOC where using DMA/CORDIC would be faster than direct register access and zero overhead mode.

@robca

Just ran it with -Ofast; it does actually give a significant performance increase.

Timing FOCloop w. micros();

time per iterasion:   17.8392
time per iterasion:   17.8374
time per iterasion:   17.8371
time per iterasion:   17.8372

We are now at about 56 kHz.

Note: The timing is done by averaging 10,000 iterations. It is running closed loop using the Hardware_encoder interface on a 16-bit timer, and velocity is obtained by another timer running at the system clock (168 MHz), capturing encoder pulses on one of the encoder inputs.

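  // Time 10,000 iterations of the control loop, then print the average in microseconds
  int count = 0;
  start_ticks = micros();
  for (int i = 0; i < 10000; i++) {
    // main FOC algorithm function
    motor.loopFOC();

    // Motion control function
    motor.move(target_velocity);

    commander.run();

    count++;
  }

  stop_ticks = micros();
  elapsed_ticks = stop_ticks - start_ticks;
  Serial.print("time per iterasion:   ");
  // cast so the division is not truncated if elapsed_ticks is an integer type
  Serial.println((float)elapsed_ticks / count, 4);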

Here is the old SimpleFOC implementation with the -Ofast build flag. I will try to compare it with the new optimized one later.

time per iterasion:   17.3991
time per iterasion:   17.3952
time per iterasion:   17.3643
time per iterasion:   17.3493

Hey,

I’m puzzled as to what this does. Is this the FPU optimization you mentioned?

It looks like this is just for enabling peripherals. [66.1]
https://www.st.com/resource/en/user_manual/um2570-description-of-stm32g4-hal-and-lowlayer-drivers--stmicroelectronics.pdf

Something is not right with your conversion to q31_t.

This works:

// convert float angle to CORDIC q31 format
  uint32_t angle31 = (uint32_t)(angle2 * (1UL << 31) / (1.0f * PI));

I like the rest though. Now the wrapping is seemingly not needed. Perhaps by using the CORDIC cosine mode?

Awesome!

time per iterasion:   16.9273
time per iterasion:   16.9274
time per iterasion:   16.8856
time per iterasion:   16.8800

As @VIPQualityPost mentions, it’s the header file with all the STM32G4 family memory-mapped peripherals. In the STM32 world, every peripheral is memory mapped to a specific address. To access its registers, you read from and write to the right memory address. That would get confusing really quickly, so the header file assigns mnemonics to peripherals and registers. I needed that file to enable the CORDIC clock (by default most peripherals are disabled at boot) and to be able to write CORDIC->RDATA.
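
To make the header's role concrete, here is a minimal host-side sketch of the pattern it uses. The struct mirrors the three CORDIC registers; `CordicRegs`, `fake_cordic_block` and `FAKE_CORDIC` are names invented for this example so that it can run on a PC. On the real chip the device header supplies the type and a `CORDIC` macro that points at the peripheral's fixed address instead.

```cpp
#include <cstddef>
#include <cstdint>

// Register layout of the CORDIC peripheral: three 32-bit registers in a row.
// Hypothetical re-declaration for illustration; the real one comes from the device header.
struct CordicRegs {
    volatile uint32_t CSR;    // control/status register, offset 0x00
    volatile uint32_t WDATA;  // argument register,       offset 0x04
    volatile uint32_t RDATA;  // result register,         offset 0x08
};

// On target the header casts a fixed base address to a pointer of this kind.
// Here an ordinary variable stands in for the peripheral (assumption for host use).
static CordicRegs fake_cordic_block = {0, 0, 0};
#define FAKE_CORDIC (&fake_cordic_block)

// Writing an argument and reading a result are plain memory accesses.
inline void write_argument(uint32_t q31_angle) { FAKE_CORDIC->WDATA = q31_angle; }
inline uint32_t read_result() { return FAKE_CORDIC->RDATA; }
```

The mnemonic form (`FAKE_CORDIC->RDATA` here, `CORDIC->RDATA` on hardware) compiles down to a single load from a known address, which is why direct register access has so little overhead.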

I’m converting my CNC build to PNP. I know, total overkill. The goal is to make it fast to switch between PNP and CNC.

My conversion works only for -PI to PI, but the SimpleFOC code assumes 0 to 2PI. As I said, I was just testing the concept. If you look at my code, it only uses 0 to PI, where everything works and produces the same float results for sine and cosine as the standard C libraries. In the final version, it would be best to use q31 everywhere for a real speedup. The SimpleFOC code calls a normalize function to ensure that the angle is between 0 and 2PI. If we only wanted to use the CORDIC trig functions without using q31 everywhere, it would be best to write an inline normalize_q31(float angle) that takes a float angle in any range and converts it in one step to a q31 between -PI and PI.
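
One way such a normalize_q31() could look is sketched below. This is an illustration of the idea in the post, not code from SimpleFOC or from the CORDIC test program; the constant name and the clamping strategy are assumptions. It treats the angle as a fraction of a turn, wraps it to half a turn either side of zero, then scales so that the full int32_t range covers -PI to PI.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper sketched from the description above; not part of SimpleFOC.
// Maps a float angle in radians (any sign, any magnitude) to q31, where the
// full int32_t range represents [-PI, PI).
inline int32_t normalize_q31(float angle) {
    const float INV_2PI = 0.15915494f;      // 1 / (2*PI) as a 32-bit float
    float turns = angle * INV_2PI;          // angle expressed in full turns
    turns -= std::floor(turns + 0.5f);      // wrap to [-0.5, 0.5]
    // One turn spans 2^32 counts; go through int64_t so the clamp below is safe.
    int64_t q = static_cast<int64_t>(turns * 4294967296.0f);
    if (q > INT32_MAX) q = INT32_MAX;       // exactly +0.5 turn would overflow int32_t
    if (q < INT32_MIN) q = INT32_MIN;
    return static_cast<int32_t>(q);
}
```

With this, 3PI/2 comes out as roughly -2^30 (that is, -PI/2), and angles beyond one turn wrap around. Because the input is a float, the result loses resolution as the angle grows, so very large accumulated angles would still benefit from being kept in a different representation.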

Please note that using the predefined “PI”, at least in my case, slowed the code significantly. PI is usually defined as a double (64 bit). That’s why I added the 32 bit float version #define PI32f 3.141592f (a 32 bit float holds only about 7 significant decimal digits, so additional digits are lost anyway). By using a 32 bit float version of PI, the code was much faster.
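
A likely explanation for the slowdown, assuming the target has a single-precision-only FPU (as Cortex-M4 parts do), is type promotion: an unsuffixed literal is a double, so any float expression that touches it is evaluated in double precision, which such an FPU cannot do in hardware. The sketch below checks this at compile time. `PI_DOUBLE` is a stand-in name for an Arduino-style PI definition, and `extra_digits_are_lost` is a helper made up for the example.

```cpp
#include <type_traits>

// Arduino-style PI: a literal with no suffix is a double (stand-in name).
#define PI_DOUBLE 3.1415926535897932384626433832795
// The 32-bit definition from the post above.
#define PI32f 3.141592f

// float * double is evaluated in double precision...
static_assert(std::is_same<decltype(1.0f * PI_DOUBLE), double>::value,
              "mixing a float with an unsuffixed PI promotes to double");
// ...while float * float stays in single precision.
static_assert(std::is_same<decltype(1.0f * PI32f), float>::value,
              "with an f-suffixed PI the expression stays float");

// The long literal and a short one collapse to the same float value.
inline bool extra_digits_are_lost() {
    return static_cast<float>(PI_DOUBLE) == 3.1415927f;
}
```

So the gain comes from keeping the arithmetic in float, not from the number of digits. As a side note, 3.1415927f is the float nearest to PI; 3.141592f lands a few representable steps below it, a difference of well under a millionth of a radian.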

Please note that CORDIC uses q31_t, which is an int32_t, not a uint32_t. I know that the code I wrote works between 0 and PI. Using signed int32 is important when using q31 math, even if it doesn’t make much of a difference in this conversion.
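
The signed type starts to matter as soon as the angle goes negative: converting a negative float straight to an unsigned integer is undefined behaviour in C++, whereas a signed q31 value can be reinterpreted as a 32-bit pattern safely. A minimal sketch follows; the helper names and the `Q31_PER_RAD` constant are made up for the example, and the angle is assumed to be already within -PI to PI.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only.
// Scale factor from radians to q31 counts: 2^31 counts correspond to PI radians.
constexpr float Q31_PER_RAD = 2147483648.0f / 3.1415927f;

// Convert an angle already within [-PI, PI] to signed q31, clamping at the edges.
inline int32_t angle_to_q31(float angle) {
    int64_t q = static_cast<int64_t>(angle * Q31_PER_RAD);
    if (q > INT32_MAX) q = INT32_MAX;
    if (q < INT32_MIN) q = INT32_MIN;
    return static_cast<int32_t>(q);
}

// Bit pattern to hand to a 32-bit register; signed-to-unsigned conversion is well defined.
inline uint32_t q31_bits(int32_t q) { return static_cast<uint32_t>(q); }

// Convert a q31 result (sine or cosine) back to a float in [-1, 1).
inline float q31_to_float(int32_t q) {
    return static_cast<float>(q) / 2147483648.0f;
}
```

For example, an angle of -PI/2 becomes roughly -2^30, whose bit pattern has the top bit set, and a q31 result of 2^30 converts back to 0.5.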

Wow yes, that I must say. This is with your PI32f definition →

time per iterasion:   14.5107
time per iterasion:   14.5110
time per iterasion:   14.5109

I was using pi = 3.1415926535897932384626433832795, as I thought it would increase precision.

Now we’re getting somewhere…

My conversion works only for -PI to PI, but the SimpleFOC code assumes 0 to 2PI.

True… Maybe we can break the 10 µs mark by using pure q31_t and by placing some of the code in the section where the CORDIC is calculating.

We are now in the 70 kHz vicinity
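
As a quick check on the loop rates quoted in this thread, the conversion from average loop time to frequency is a one-liner. The function name is made up for the example.

```cpp
// Convert an average loop time in microseconds to a loop rate in kHz.
constexpr double loop_rate_khz(double us_per_iteration) {
    return 1000.0 / us_per_iteration;
}
```

With the measurements above, 17.8372 µs per iteration corresponds to roughly 56 kHz and 14.5109 µs to roughly 69 kHz.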

Note: Another thing to consider is that the encoder output is 16-bit, so perhaps it makes more sense to feed the CORDIC q15_t.

The test code went from 0 to PI because that was the common input range of all the different functions being tested :slight_smile:

SimpleFOC currently needs the normalizeAngle before calling the trig functions, but the aim is to eliminate this requirement with the switch to the optimised lookup table version of _sin().

The SimpleFOC codebase should be using _PI, which is defined as a float.

We’re really getting somewhere with the STM32 optimisations :slight_smile:
But keep in mind that running the FOC loop any faster than the PWM frequency is without actual benefit, since the output rate is limited by the PWM frequency. You also can’t exceed the sensor’s bandwidth / sample rate since it is a required input.

So the benefit of optimising the FOC loop beyond the PWM frequency is to have more time to do other stuff on the MCU in parallel to running the motor. This could be things like cogging compensation, voltage measurement, calculating averaged outputs or something else. But just running FOC in a tight loop won’t improve things, in fact it could also make things worse.

Yes, I knew I was cheating a bit, and I needed to figure out how to normalize angles between -PI and PI, as CORDIC requires. But I wanted to quickly share the fastest way to use CORDIC (zero overhead and direct access).

Speaking of which, if I wanted to write a better conversion/normalization, what should I consider reasonable values of angle?

I see that _sincos() can only handle positive angles (of any magnitude), no negative, is that by design? Realistically, what would be the max value for angle in SimpleFOC?

Yeah, I saw it after I played with the code a bit, I missed it at the beginning. And just to add some pointless trivia, there is a limit on how many digits can be represented with float. And the best representation for PI as a float is 3.141592 (Embedded Wednesdays: Floating Point Numbers — Embedded). It doesn’t hurt to define it with more digits, but it doesn’t change the actual precision. You can see it by repeatedly adding 0.0000001 to _PI, it stays the same until it jumps to 3.141593.

Yes, and that’s what I was asking when I joined this thread. I just enjoy optimizing code, though, and couldn’t resist trying :slight_smile: Plus I wanted to learn about CORDIC, and there’s nothing like trying to solve an actual problem to help learn. The nice thing about using CORDIC is that it can be run in parallel with other stuff (for only a few processor cycles), and if needed it can free up time for other stuff. Thanks for all the additional info provided.

No, that’s a bug. Forgot it has to be masked again after the offset for cosine.

Good point. This is where the motion profiling comes in. I will say a good benchmark for the stepper scenario is 50 kHz PWM & FOCloop with the USB properly initialized and some basic motion handling.

I performed the test with 38 kHz PWM frequency.

This is all very dependent on the FETs’ thermal performance.

One such motion planner could be similar to this →

It is inspired by this paper.

Moreover, the very low calculation time (less than 1 microsecond) makes it possible to easily control a multi-DOF system during one control cycle (classically about 1 millisecond), while preserving time for other computer processing.

Do you see a potential use case with SFOC? Does the MIT license prohibit use if we credit the authors? Maybe it’s possible to integrate it in a novel way…

@robca can this be placed in the section where the CORDIC does its thing? By the way it’s written, it looks a lot like how SFOC drivers are built.

ODrive/trapTraj.cpp at master · odriverobotics/ODrive (github.com)

Not sure the TIME-OPTIMAL JL TRAJECTORY is implemented here. That could be a future dev. goal.

@Antun_Skuric

Are you working on the feed forward concept / motion planning?

Looking at the ODrive implementation, they use a combination of velocity, position and torque →

   case INPUT_MODE_TRAP_TRAJ: {
            if(input_pos_updated_){
                move_to_pos(input_pos_);
                input_pos_updated_ = false;
            }
            // Avoid updating uninitialized trajectory
            if (trajectory_done_)
                break;
            
            if (axis_->trap_traj_.t_ > axis_->trap_traj_.Tf_) {
                // Drop into position control mode when done to avoid problems on loop counter delta overflow
                config_.control_mode = CONTROL_MODE_POSITION_CONTROL;
                pos_setpoint_ = axis_->trap_traj_.Xf_;
                vel_setpoint_ = 0.0f;
                torque_setpoint_ = 0.0f;
                trajectory_done_ = true;
            } else {
                TrapezoidalTrajectory::Step_t traj_step = axis_->trap_traj_.eval(axis_->trap_traj_.t_);
                pos_setpoint_ = traj_step.Y;
                vel_setpoint_ = traj_step.Yd;
                torque_setpoint_ = traj_step.Ydd * config_.inertia;
                axis_->trap_traj_.t_ += current_meas_period;
            }

Kinda like a real nice mix…

Yes, but… I measure CORDIC execution times using the DWT->CYCCNT cycle counter I mentioned before. It takes 33 processor cycles (at 170 MHz) for the cordic_calc() function to execute, roughly 20 of which are spent waiting for the CORDIC execution. The rest are conversions, even if, in the function I shared here, the first q31->float result conversion is already happening while waiting for the second CORDIC result. Keep in mind that all of the above is compiled with -Ofast, which could reorder some of the executed code so that it’s not included in the timed portion. But, order of magnitude, CORDIC sin & cos takes ~25 cycles.

So by running code in parallel with CORDIC, you can only execute roughly 20-25 machine language instructions “for free”, as I mentioned before. It’s not nothing, but it’s not massive either. If the code also uses other operations that can be executed by CORDIC (e.g. sqrt, exp, atan, log, etc), CORDIC execution can be pipelined in zero overhead mode, gaining further parallel execution. Especially if all the math can be executed in q31 with conversions only on entry and exit.
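
To put those cycle counts into time units, a small helper (name made up for the example) converts cycles to nanoseconds at a given core clock:

```cpp
#include <cstdint>

// Convert a cycle count measured with a cycle counter to nanoseconds.
constexpr double cycles_to_ns(uint32_t cycles, double core_hz) {
    return static_cast<double>(cycles) * 1e9 / core_hz;
}
```

At 170 MHz, the 33 cycles of cordic_calc() come to about 194 ns, and the roughly 20 cycles of waiting to about 118 ns, which is the window available for running other code in parallel.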

Right, we will have to time the trapTraj.cpp in any case.

@JorgeMaker I tried your trapezoid driver but the motor just spins out of control. Does it use the same approach as the paper above?

It does look similar in some ways. I’ll test it again… is it in the repo w. Examples?

How do you declare the limits? In the snippet you provided (really nice work, btw) there are no limits set?

Hi @Juan-Antonio_Soren_E, what I implemented is based on what @jlauer implemented, which is based on the following:

Ok, just setting limits like this →

    int plannerPeriod = 1;  // 1000 / this number = Hz, i.e. 1000 / 100 = 10 Hz, 1000 / 10 = 100 Hz, 1000 / 5 = 200 Hz, 1000 / 1 = 1000 Hz
    float Vmax_ = 30.0f;    // Velocity max (rad/s)
    float Amax_ = 10.0f;    // Acceleration max (rad/s^2)
    float Dmax_ = 10.0f;    // Deceleration max (rad/s^2)
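
To show what these three limits control, here is a minimal sketch of a rest-to-rest trapezoidal profile. It is not the planner's or ODrive's actual code; all type and function names are made up for the example, and it assumes the move starts and ends at zero velocity. It splits the move into accelerate, cruise and decelerate phases, falling back to a triangular profile when the distance is too short to reach Vmax.

```cpp
#include <cmath>

// Hypothetical types and names for illustration; not the planner's actual API.
struct TrapProfile {
    float dir;         // +1 or -1, direction of travel
    float dist;        // absolute distance to cover (rad)
    float vpeak;       // peak velocity actually reached (rad/s)
    float acc, dec;    // acceleration and deceleration limits (rad/s^2)
    float Ta, Tv, Td;  // durations of accelerate, cruise and decelerate phases (s)
};

struct TrapStep { float pos, vel, acc; };

// Plan a rest-to-rest move of `distance` rad under the given limits.
inline TrapProfile plan_trapezoid(float distance, float vmax, float amax, float dmax) {
    TrapProfile p;
    p.dir = (distance < 0.0f) ? -1.0f : 1.0f;
    p.dist = std::fabs(distance);
    p.acc = amax;
    p.dec = dmax;
    p.vpeak = vmax;
    float d_acc = 0.5f * vmax * vmax / amax;  // distance needed to reach vmax
    float d_dec = 0.5f * vmax * vmax / dmax;  // distance needed to stop from vmax
    if (d_acc + d_dec > p.dist) {
        // Too short to reach vmax: triangular profile with a lower peak velocity.
        p.vpeak = std::sqrt(2.0f * p.dist * amax * dmax / (amax + dmax));
        p.Tv = 0.0f;
    } else {
        p.Tv = (p.dist - d_acc - d_dec) / vmax;
    }
    p.Ta = p.vpeak / amax;
    p.Td = p.vpeak / dmax;
    return p;
}

// Evaluate position, velocity and acceleration t seconds after the start.
inline TrapStep eval_trapezoid(const TrapProfile& p, float t) {
    TrapStep s;
    float pos_a = 0.5f * p.acc * p.Ta * p.Ta;  // position at the end of acceleration
    if (t <= 0.0f) {
        s = {0.0f, 0.0f, 0.0f};
    } else if (t < p.Ta) {
        s = {0.5f * p.acc * t * t, p.acc * t, p.acc};
    } else if (t < p.Ta + p.Tv) {
        s = {pos_a + p.vpeak * (t - p.Ta), p.vpeak, 0.0f};
    } else if (t < p.Ta + p.Tv + p.Td) {
        float td = t - p.Ta - p.Tv;            // time spent decelerating so far
        s = {pos_a + p.vpeak * p.Tv + p.vpeak * td - 0.5f * p.dec * td * td,
             p.vpeak - p.dec * td, -p.dec};
    } else {
        s = {p.dist, 0.0f, 0.0f};              // move finished, hold the target
    }
    s.pos *= p.dir; s.vel *= p.dir; s.acc *= p.dir;
    return s;
}
```

With the limits above, a 100 rad move accelerates for 3 s, cruises at 30 rad/s for a third of a second, then decelerates for 3 s. The three outputs per step correspond to the position, velocity and torque feed-forward setpoints seen in the ODrive snippet (there the acceleration is multiplied by the inertia).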

Should it be in velocity or angle mode?

Is this the way to use it?


  motor.loopFOC();

  // Motion control function
  motor.move(target);

  commander.run();

  planner.runPlannerOnTick();

It is intended to work with angle mode both open and closed loop.

Yes, this is what you have to place in the loop function.

Do not forget to add:

void doPlanner(char *cmd) {
  planner.doTrapezoidalPlannerCommand(cmd);
}

and this code in your setup:

// GCode move Gxx, GVxx, or GAxx - Example: G30 moves to position in rads.
// GV10 sets velocity to 10 rads/s. GA5 sets acceleration to 5 rads/s/s.
planner.linkMotor(&motor);
commander.add('G', doPlanner, "Motion Planner");