28 February, 2014

A simple optimization trick: using multiplication instead of division

Multiplication and division are related, similar operators: does it necessarily mean, that they need the same runtime? We will see, that it is worth sometimes to look behind the scenes. The optimization trick described in this post is not really MATLAB related but processor dependent. To understand its background, dig deep into the computer.

Each program consists of functions, functions are built up from sub functions. As going down on the hierarchy tree there are smaller and smaller functions and instructions, until we get to atomic ones. An atomic instruction can be processed directly by the processor, for example a simple addition, multiplication, moving a value from the memory to a register, etc. See the example below:

% a part of the main program
a = 1;
b = 4;
processNumbers(a, b);

% a function of the main program
function x = processNumbers(a, b)
  x = a + b;
  
  % a part of atomic CPU-level instructions for the addition is like this
  % ...
  % mov eax, [esi]  ; Load an integer to eax register from memory
  % mov ebx, [edi]  ; Load an integer to ebx register from memory
  % add eax, ebx    ; Add two registers
  % ...
end

An atomic instruction needs a given amount of clock cycles1 to be processed. The less cycles it needs, the faster it runs. Agner Fog has a wonderful book chapter containing latency and throughput values of different instructions on various processors. If we look at the tables2 in his document, we can see, that atomic instructions for division are almost always3 slower than instructions for multiplication. Multiplication is a faster operation.

1 For more details read Measuring Instruction Latency and Throughput written by Intel.
2 Instruction pairs for multiplication/division are: (I)MUL/(I)DIV, FMUL/FDIV, MULPS/DIVPS.
3 An exception is when the divisor is a power of two.

How to use this when programming in MATLAB?

Let us see a code example below, which scales an input signal into the [0, 1] range:

% generate a random vector
A = rand(1, 10000);         

% do the scaling into the [0,1] range and measure the runtime
m = min(A);
M = max(A);
tic;
A = (A - m) / (M - m);
toc;

The average runtime of the division is 0.199 milliseconds. If we look at the code, we can see, that the whole array is divided by a constant. This can be replaced by multiplication:

$$ \frac{A - m}{M - m} = (A - m)\cdot\frac{1}{M - m} $$

Rewrite our code, eliminate the division, and measure the runtime again:

% generate a random vector
A = rand(1, 10000);         

% do the scaling into the [0,1] range and measure the runtime
m = min(A);
M = max(A);
c = 1 / (M - m);
tic;
A = (A - m) * c;
toc;

The result is 0.113 milliseconds, which is about two times faster, than the code using division. Although the ratio may vary depending on the type of the processor, multiplication never will be slower.

The trick works well in most cases. When the constant is extremely small or big, the reciprocal may cause loss of precision due to floating point representation of numbers.

         

New comment

comments powered by Disqus