Angle Normalization Code - really slow!

Hey all! I’ve been optimizing my control loop and found that the code normalizing the (electrical) angle is ridiculously slow:

// normalizing radian angle to [0,2PI]
__attribute__((weak)) float _normalizeAngle(float angle){
  float a = fmod(angle, _2PI);
  return a >= 0 ? a : (a + _2PI);
}

I rewrote it to this, calculating the whole amount of revolutions by casting the total to an int to ‘floor’ it, and subtracting that from the total. Also changed the float divide to a multiplication, as they’re way faster to compute.

// normalizing radian angle to [0,2PI]
__attribute__((weak)) float _normalizeAngle(float angle){
	float a = angle - _2PI* ((int)(angle * _1_OVER_2PI));
	return (a < 0) ? (a + _2PI) : a;
}

This runs hunnnndreds of clock cycles faster, and is a lot less variable in execution time. :slight_smile:

Does anyone see any drawbacks to this method?

1 Like

I figured the int cast would be slower than the fmod, but if it’s faster then great.

You could also try bitmasking to eliminate the branch.

__attribute__((weak)) float _normalizeAngle(float angle){
	return (1.0f/(1<<31)) * ( (int32_t)(angle*(1<<31)) & ((1<<31)-1) );
}

Or possibly this version to eliminate the bitmask, though I’m not 100% sure that negative float values will convert to uint with the desired wrapping behavior, or that the 1<<32 constants will convert to float without truncating to 0 first (you could use 4294967296.0f to be sure):

__attribute__((weak)) float _normalizeAngle(float angle){
	return (1.0f/(1<<32)) * (uint32_t)(angle*(1<<32));
}

Of course any new version will need rigorous testing of positive and negative values and edge cases near multiples of 2pi. And potential loss of accuracy with my version due to converting the whole value back and forth, whereas yours only does the whole rotations.

I’m wondering if this is dependent of the MCU architecture, which MCU type are you using?

You could maybe write a small test program that we could use to benchmark the execution time difference, that way we have a bit more info about how much better it is.

But yeah, we’ve established the fmod as a bottleneck in past as well and minimized its use as much as possible in the code.

This is not a bit change, but as @dekutree64 says, a very critical one so we should make sure to test it well. :slight_smile:
Maybe this is the moment in life of SimpleFOC to add some unit testing.

Wrote a bit of cpp code to test the accuracy of these. Can be copied over to Online C++ Compiler - online editor and executed there.

First tests show it is a bit less accurate than the fmod itself - max difference of 0.006 (0.1%) instead of 0.002 compared to a ‘clean’ calculation using doubles. It is also faster here, but these numbers don’t really hold when compared to running it on an MCU.

I couldn’t get the ones from dekutree64 to work, will have another go at that later.

#include <iostream>
#include <cmath>  // For fmod
#include <ctime>  // For timing
#include <algorithm>
#include <cstdint> // For int32_t

const double _2PI = 2.0 * M_PI;  // Double precision 2π
const float _2PIf = 2.0f * M_PI;  // Single precision 2π
const float _1_OVER_2PIf = 1.0f / _2PIf;

// Normalizing with float using fmod
__attribute__((weak)) float _normalizeAngle_fmod(float angle) {
    float a = fmod(angle, _2PIf);
    return a >= 0 ? a : (a + _2PIf);
}

// Normalizing with float using subtraction method
__attribute__((weak)) float _normalizeAngle_subtraction(float angle) {
    float a = angle - _2PIf * ((int)(angle * _1_OVER_2PIf));
    return (a < 0) ? (a + _2PIf) : a;
}

// Normalizing with double using fmod
__attribute__((weak)) double _normalizeAngleDouble(double angle) {
    double a = fmod(angle, _2PI);
    return a >= 0 ? a : (a + _2PI);
}

// Function to calculate the mean
double mean(const double* arr, int size) {
    double sum = 0.0;
    for (int i = 0; i < size; ++i) {
        sum += arr[i];
    }
    return sum / size;
}

// Function to calculate the standard deviation
double standard_deviation(const double* arr, int size, double mean) {
    double sum = 0.0;
    for (int i = 0; i < size; ++i) {
        sum += (arr[i] - mean) * (arr[i] - mean);
    }
    return sqrt(sum / size);
}

int main() {
    const int numTests = 1000;
    float angles[numTests];
    double floatDifferences_fmod[numTests];
    double floatDifferences_subtraction[numTests];
    double doubleNormalization[numTests];

    // Seed random number generator
    srand(time(0));

    // Generate 1000 random angles between -10000 and 10000 for both float and double
    for (int i = 0; i < numTests; ++i) {
        angles[i] = static_cast<float>(rand()) / RAND_MAX * 200000.0f - 100000.0f;
    }

    clock_t start, end;

    // Measure time for double normalization
    start = clock();
    for (int i = 0; i < numTests; ++i) {
        doubleNormalization[i] = _normalizeAngleDouble(static_cast<double>(angles[i]));
    }
    end = clock();
    double time_double = double(end - start) / CLOCKS_PER_SEC;

    // Measure time for fmod method
    start = clock();
    for (int i = 0; i < numTests; ++i) {
        floatDifferences_fmod[i] = static_cast<double>(_normalizeAngle_fmod(angles[i])) - doubleNormalization[i];
    }
    end = clock();
    double time_fmod = double(end - start) / CLOCKS_PER_SEC;

    // Measure time for subtraction method
    start = clock();
    for (int i = 0; i < numTests; ++i) {
        floatDifferences_subtraction[i] = static_cast<double>(_normalizeAngle_subtraction(angles[i])) - doubleNormalization[i];
    }
    end = clock();
    double time_subtraction = double(end - start) / CLOCKS_PER_SEC;

    // Calculate statistics
    auto compute_stats = [](const double* differences, int numTests, const std::string& method, double time_taken) {
        double mean_value = mean(differences, numTests);
        double stddev_value = standard_deviation(differences, numTests, mean_value);
        double max_value = *std::max_element(differences, differences + numTests);
        double min_value = *std::min_element(differences, differences + numTests);

        std::cout << method << " vs double:";
        std::cout << "Mean: " << mean_value << "\n";
        std::cout << "Standard Deviation: " << stddev_value << "\n";
        std::cout << "Max Difference: " << max_value << "\n";
        std::cout << "Min Difference: " << min_value << "\n";
        std::cout << "Time Taken: " << time_taken << " seconds\n";
        std::cout << "------------------------------------------------------\n";
    };

    std::cout << "Statistics for the differences between float and double normalizations:\n\n";
    compute_stats(floatDifferences_fmod, numTests, "fmod (float)", time_fmod);
    compute_stats(floatDifferences_subtraction, numTests, "Subtraction (float)", time_subtraction);

    std::cout << "Double normalization time: " << time_double << " seconds\n";
    return 0;
}

@dekutree64 You were correct - on an arduino the int cast is way slower than the fmod. Arduino also doesn’t seem to have a double precision option.

This is what I’m getting:

Starting Angle Normalization Benchmark
=== PERFORMANCE BENCHMARK RESULTS ===
Timing results for angle normalization methods:
fmod method: 256 microseconds
Subtraction method: 496 microseconds
Double precision method: 252 microseconds

=== ACCURACY BENCHMARK RESULTS ===
Differences compared to double precision method (baseline):

fmod method error statistics:
Mean Error: 0.000000
Standard Deviation: 0.000000
Maximum Error: 0.000000
Minimum Error: 0.000000

Subtraction method error statistics:
Mean Error: 0.000088
Standard Deviation: 0.000079
Maximum Error: 0.000231
Minimum Error: 0.000003

With this piece of code, running on an Arduino Nano.

#include <Arduino.h>
#include <math.h>  // For fmod
#include <stdint.h> // For int32_t

const float TWO_PI_FLOAT = 2.0f * PI;  // Single precision 2π
const float ONE_OVER_TWO_PI_FLOAT = 1.0f / TWO_PI_FLOAT;
const double TWO_PI_DOUBLE = 2.0 * M_PI;  // Double precision 2π
const double ONE_OVER_TWO_PI_DOUBLE = 1.0 / TWO_PI_DOUBLE;

// Normalizing with float using fmod
float normalizeAngle_fmod(float angle) {
  float normalizedAngle = fmod(angle, TWO_PI_FLOAT);
  return normalizedAngle >= 0 ? normalizedAngle : (normalizedAngle + TWO_PI_FLOAT);
}

// Normalizing with float using subtraction method
float normalizeAngle_subtraction(float angle) {
  float normalizedAngle = angle - TWO_PI_FLOAT * ((int)(angle * ONE_OVER_TWO_PI_FLOAT));
  return (normalizedAngle < 0) ? (normalizedAngle + TWO_PI_FLOAT) : normalizedAngle;
}

// Normalizing with double using fmod
double normalizeAngle_double(double angle) {
  double normalizedAngle = fmod(angle, TWO_PI_DOUBLE);
  return normalizedAngle >= 0 ? normalizedAngle : (normalizedAngle + TWO_PI_DOUBLE);
}

// Function to calculate the mean
double calculateMean(const double* values, int count) {
  double sum = 0.0;
  for (int i = 0; i < count; ++i) {
    sum += values[i];
  }
  return sum / count;
}

// Function to calculate the standard deviation
double calculateStdDev(const double* values, int count, double mean) {
  double sumSquaredDifferences = 0.0;
  for (int i = 0; i < count; ++i) {
    sumSquaredDifferences += (values[i] - mean) * (values[i] - mean);
  }
  return sqrt(sumSquaredDifferences / count);
}

// Function to find the max value in an array
double findMaxValue(const double* values, int count) {
  double maxVal = values[0];
  for (int i = 1; i < count; ++i) {
    if (values[i] > maxVal) {
      maxVal = values[i];
    }
  }
  return maxVal;
}

// Function to find the min value in an array
double findMinValue(const double* values, int count) {
  double minVal = values[0];
  for (int i = 1; i < count; ++i) {
    if (values[i] < minVal) {
      minVal = values[i];
    }
  }
  return minVal;
}

void setup() {
  Serial.begin(115200);
  Serial.println("Starting Angle Normalization Benchmark");

  const int NUM_TESTS = 10;
  float testAngles[NUM_TESTS];
  double normalizedAngles_fmod[NUM_TESTS];
  double normalizedAngles_subtraction[NUM_TESTS];
  double normalizedAngles_double[NUM_TESTS];

  // Arrays to store the differences
  double error_fmod_vs_double[NUM_TESTS];
  double error_subtraction_vs_double[NUM_TESTS];

  // Generate random angles between -10000 and 10000 radians
  for (int i = 0; i < NUM_TESTS; ++i) {
    testAngles[i] = random(-10000, 10000);
  }

  unsigned long startTime, endTime;

  // Measure time for fmod method
  startTime = micros();
  for (int i = 0; i < NUM_TESTS; ++i) {
    normalizedAngles_fmod[i] = normalizeAngle_fmod(testAngles[i]);
  }
  endTime = micros();
  unsigned long executionTime_fmod = endTime - startTime;

  // Measure time for subtraction method
  startTime = micros();
  for (int i = 0; i < NUM_TESTS; ++i) {
    normalizedAngles_subtraction[i] = normalizeAngle_subtraction(testAngles[i]);
  }
  endTime = micros();
  unsigned long executionTime_subtraction = endTime - startTime;

  // Measure time for double method
  startTime = micros();
  for (int i = 0; i < NUM_TESTS; ++i) {
    normalizedAngles_double[i] = normalizeAngle_double(testAngles[i]);
  }
  endTime = micros();
  unsigned long executionTime_double = endTime - startTime;

  // Calculate the differences between methods
  for (int i = 0; i < NUM_TESTS; ++i) {
    error_fmod_vs_double[i] = abs(normalizedAngles_fmod[i] - normalizedAngles_double[i]);
    error_subtraction_vs_double[i] = abs(normalizedAngles_subtraction[i] - normalizedAngles_double[i]);
  }

  // Compute statistics for the errors
  double meanError_fmod = calculateMean(error_fmod_vs_double, NUM_TESTS);
  double stdDevError_fmod = calculateStdDev(error_fmod_vs_double, NUM_TESTS, meanError_fmod);
  double maxError_fmod = findMaxValue(error_fmod_vs_double, NUM_TESTS);
  double minError_fmod = findMinValue(error_fmod_vs_double, NUM_TESTS);

  double meanError_subtraction = calculateMean(error_subtraction_vs_double, NUM_TESTS);
  double stdDevError_subtraction = calculateStdDev(error_subtraction_vs_double, NUM_TESTS, meanError_subtraction);
  double maxError_subtraction = findMaxValue(error_subtraction_vs_double, NUM_TESTS);
  double minError_subtraction = findMinValue(error_subtraction_vs_double, NUM_TESTS);

  // Print benchmark results
  Serial.println("=== PERFORMANCE BENCHMARK RESULTS ===");
  Serial.println("Timing results for angle normalization methods:");
  Serial.print("fmod method: "); Serial.print(executionTime_fmod); Serial.println(" microseconds");
  Serial.print("Subtraction method: "); Serial.print(executionTime_subtraction); Serial.println(" microseconds");
  Serial.print("Double precision method: "); Serial.print(executionTime_double); Serial.println(" microseconds");

  Serial.println("\n=== ACCURACY BENCHMARK RESULTS ===");
  Serial.println("Differences compared to double precision method (baseline):");
  Serial.println();
  Serial.println("fmod method error statistics:");
  Serial.print("Mean Error: "); Serial.println(meanError_fmod, 6);
  Serial.print("Standard Deviation: "); Serial.println(stdDevError_fmod, 6);
  Serial.print("Maximum Error: "); Serial.println(maxError_fmod, 6);
  Serial.print("Minimum Error: "); Serial.println(minError_fmod, 6);
  Serial.println();
  Serial.println("Subtraction method error statistics:");
  Serial.print("Mean Error: "); Serial.println(meanError_subtraction, 6);
  Serial.print("Standard Deviation: "); Serial.println(stdDevError_subtraction, 6);
  Serial.print("Maximum Error: "); Serial.println(maxError_subtraction, 6);
  Serial.print("Minimum Error: "); Serial.println(minError_subtraction, 6);
}

void loop() {
  // Nothing to do in the loop
}

I’m on an STM32G4! Hardware support for floats, but float divides are still quite slow. I can take some normal Arduinos from the office today and give it a spin on those as well, writing the little test program you mention.

Current concept in my mind is filling a large array with a range of values, timing this, doing so for multiple methods, and computing the difference in results afterwards?

Interesting!

So it is architecture dependent. Hmm

And if you run the same experiment with you G4 MCU?

If there is a significant difference we could maybe add a compiler flag that uses the more optimal function depending on the architecture.

I have results on the STM32G4!

Clock cycles for 2000 original function calls: 1.8M
Clock cycles for 2000 ‘substraction’ calls: 1.3M

But - during testing I figured out fmod() on the STM32G4 always uses doubles. The right function to call for floats is fmodf(). Which only takes 759k cycles for the same 2000 operations. :slight_smile:

That said, the substraction method does seem to be more accurate, or at least more similar to the double variant. Not super readable but the information is there:

I’ll be moving to the fmodf() on my board! Saves me 250+ extra clock cycles over the substraction method, and 500+ over the original fmod().

Here’s the code again:

#define NUM_TESTS 2000
#define TWO_PI_FLOAT (2.0f * M_PI)
#define ONE_OVER_TWO_PI_FLOAT (1.0f / TWO_PI_FLOAT)
#define TWO_PI_DOUBLE (2.0 * M_PI)
#define ONE_OVER_TWO_PI_DOUBLE (1.0 / TWO_PI_DOUBLE)


float normalizeAngle_fmod(float angle) {
    float normalized = fmodf(angle, TWO_PI_FLOAT);
    return normalized >= 0 ? normalized : (normalized + TWO_PI_FLOAT);
}

float normalizeAngle_subtraction(float angle) {
    float normalized = angle - TWO_PI_FLOAT * ((int)(angle * ONE_OVER_TWO_PI_FLOAT));
    return normalized < 0 ? (normalized + TWO_PI_FLOAT) : normalized;
}

double normalizeAngle_double(double angle) {
    double normalized = fmod(angle, TWO_PI_DOUBLE);
    return normalized >= 0 ? normalized : (normalized + TWO_PI_DOUBLE);
}

double calculateMean(const double* values, int count) {
    double sum = 0.0;
    for (int i = 0; i < count; ++i) sum += values[i];
    return sum / count;
}

double calculateStdDev(const double* values, int count, double mean) {
    double sumSqDiff = 0.0;
    for (int i = 0; i < count; ++i)
        sumSqDiff += (values[i] - mean) * (values[i] - mean);
    return sqrt(sumSqDiff / count);
}

double findMaxValue(const double* values, int count) {
    double maxVal = values[0];
    for (int i = 1; i < count; ++i)
        if (values[i] > maxVal) maxVal = values[i];
    return maxVal;
}

double findMinValue(const double* values, int count) {
    double minVal = values[0];
    for (int i = 1; i < count; ++i)
        if (values[i] < minVal) minVal = values[i];
    return minVal;
}


typedef struct {
    uint32_t time_fmodf;
    uint32_t time_sub;
    uint32_t time_double;

    double mean_err_fmodf_vs_fmod;
    double stddev_err_fmodf_vs_fmod;
    double max_err_fmodf_vs_fmod;
    double min_err_fmodf_vs_fmod;

    double mean_err_sub_vs_fmod;
    double stddev_err_sub_vs_fmod;
    double max_err_sub_vs_fmod;
    double min_err_sub_vs_fmod;
} BenchmarkResults;

BenchmarkResults benchmarkResults;

float testAngles[NUM_TESTS];
double norm_fmod[NUM_TESTS], norm_sub[NUM_TESTS], norm_double[NUM_TESTS];
double err_fmod_vs_double[NUM_TESTS], err_sub_vs_double[NUM_TESTS], err_sub_vs_fmod[NUM_TESTS];

void fmodTest(void) {

    for (int i = 0; i < NUM_TESTS; ++i) {
        uint32_t randVal;
        HAL_RNG_GenerateRandomNumber(&hrng, &randVal);
        testAngles[i] = ((int32_t)(randVal % 20001) - 10000);
    }

    uint32_t startTime, endTime;

    startTime = DWT->CYCCNT;;
    for (int i = 0; i < NUM_TESTS; ++i) norm_fmod[i] = normalizeAngle_fmod(testAngles[i]);
    endTime = DWT->CYCCNT;
    benchmarkResults.time_fmodf = endTime - startTime;

    startTime = DWT->CYCCNT;
    for (int i = 0; i < NUM_TESTS; ++i) norm_sub[i] = normalizeAngle_subtraction(testAngles[i]);
    endTime = DWT->CYCCNT;
    benchmarkResults.time_sub = endTime - startTime;

    startTime = DWT->CYCCNT;
    for (int i = 0; i < NUM_TESTS; ++i) norm_double[i] = normalizeAngle_double(testAngles[i]);
    endTime = DWT->CYCCNT;
    benchmarkResults.time_double = endTime - startTime;

    for (int i = 0; i < NUM_TESTS; ++i) {
        err_fmod_vs_double[i] = fabs(norm_fmod[i] - norm_double[i]);
        err_sub_vs_double[i] = fabs(norm_sub[i] - norm_double[i]);
    }

    benchmarkResults.mean_err_fmodf_vs_fmod = calculateMean(err_fmod_vs_double, NUM_TESTS);
    benchmarkResults.stddev_err_fmodf_vs_fmod = calculateStdDev(err_fmod_vs_double, NUM_TESTS, benchmarkResults.mean_err_fmodf_vs_fmod);
    benchmarkResults.max_err_fmodf_vs_fmod = findMaxValue(err_fmod_vs_double, NUM_TESTS);
    benchmarkResults.min_err_fmodf_vs_fmod = findMinValue(err_fmod_vs_double, NUM_TESTS);

    benchmarkResults.mean_err_sub_vs_fmod = calculateMean(err_sub_vs_double, NUM_TESTS);
    benchmarkResults.stddev_err_sub_vs_fmod = calculateStdDev(err_sub_vs_double, NUM_TESTS, benchmarkResults.mean_err_sub_vs_fmod);
    benchmarkResults.max_err_sub_vs_fmod = findMaxValue(err_sub_vs_double, NUM_TESTS);
    benchmarkResults.min_err_sub_vs_fmod = findMinValue(err_sub_vs_double, NUM_TESTS);

    while (1);
}

Arghh in practice my motor control loop runs at the same speed with both fmod and fmodf, and I can’t figure out why. There the “subtraction method” is still fastest…

At least on Atmel Arduino, float and double are synonymous. I think it’s the same on STM32, but I haven’t actually verified.

As for my integer versions, I forgot the pi in the conversion factors. The unsigned version does work on PC, at least. The 1<<32 constants are indeed trouble, but you can do it with _PI_2 to keep the integer within range, or just precalculate them.
return (_PI_2/(1<<30)) * (uint32_t)(angle*((1<<30)/_PI_2));
return 0.00000000146291808f * (uint32_t)(angle*683565276.0f);

On ESP32, float is supported in hardware and double is emulated in software. Very different speed.