Me: Josh Bevan - jbevan@bu.edu
Helper: Ahmed Aly - aaly@bu.edu
Get Help from RCS: help@scc.bu.edu

Our website: rcs.bu.edu
Tutorial eval: rcs.bu.edu/eval
This notebook: http://scv.bu.edu/examples/matlab/Tutorials/PerfOpt/

What is "performance optimization"?

Programs run with finite resources: memory, CPU, disk, time, etc. For small or simple programs these limits are unimportant; the program finishes quickly enough, and within your constraints, that they have no noticeable effect. Most non-trivial programs, however, eventually run up against one of these constraints, usually memory or running time.

Performance optimization, in the context of this tutorial, means improving a program to minimize the effect of our computing environment's limitations.

You can break techniques to "PO" into several categories:

  1. Memory access efficiency
  2. Vectorization
  3. Use of "a priori" knowledge to specialize approach/methods
  4. Making use of/avoiding computer and language strengths/weaknesses
  5. Algorithmic improvement: asymptotic/"big O"

Not 6: Parallelization. It improves performance, but it does not "optimize"; in fact it usually carries an efficiency penalty.

In [ ]:
format compact

Brief sidebar: Recall vectorization... Vectorization is the process of performing the same operation on multiple pieces of data at the same time. The details here are specific to MATLAB's language and implementation, but the idea applies to most other interpreted languages as well.

It often takes the form of converting "tight" loops into operations on vectors/matrices.

Here's an example:

In [ ]:
nums = rand(10000,1);
tic
total = zeros(10000,1);
for t=1:1000
    for i=1:10000
        total(i) = sin(nums(i));
    end
end
toc

tic
for t=1:1000
    total = sin(nums);
end
toc

Depending on your MATLAB version, the two timings may or may not be close. Newer versions are better at JIT (just-in-time) compiling simple patterns. Consider this very similar version and compare:

In [ ]:
nums = rand(10000,1);
tic
total = zeros(10000,1);
for t=1:1000
    for i=1:10000
        sin(nums(i));
    end
end
toc

tic
for t=1:1000
    sin(nums);
end
toc

We can easily make situations where MATLAB has a hard time JITing. Try encapsulating operations inside functions:

In [ ]:
%%file test.m

function out = test(a,b)
    out = a + b;
end
In [ ]:
%%file dummy.m

function out = dummy(in)
    % Intentionally empty: used only to measure bare function-call overhead
end
In [ ]:
nums = randi(100,10000,1);

% Original task:
tic
for t=1:10000
    total = 0;
    for i=1:10000
        total = total + nums(i);
    end
end
toc

% Function call overhead:
tic
for t=1:10000
    total = 0;
    for i=1:10000
        total = total + nums(i);
        dummy(42);
    end
end
toc

% Original task encapsulated in function to prevent JITing:
tic
for t=1:10000
    total = 0;
    for i=1:10000
        total = test(total, nums(i));
    end
end
toc

% Optimized built-ins are even better and should be used when possible:
tic
for t=1:10000
    total = sum(nums);
end
toc

Consider a simple example that shows all of the above categories of optimization:

In [ ]:
%%file csum.m

% Sum the integers from a to b
function total = csum(a,b)
    total = 0;
    for i=a:b
        total = total + i;
    end
end
In [ ]:
% Get the cumulative sums csum(1,i) for i = 1 to 43000
tic
for i=1:(1*43000)
    myresults(i) = csum(1,i);
end
toc
In [ ]:
% Look at the running time as we increase the range of numbers we get the cumsum for
c = 1;
for N=logspace(1,5,13)
    tic
    for i=1:N
        myresults(i) = csum(1,i);
    end
    timings(c) = toc;
    c = c+1;
end
plot(logspace(1,5,13),timings,'o-')
xlabel("N")
ylabel("Runtime (s)")

The plot above shows that our running time does not increase linearly as we increase $N$. We can describe the "asymptotic" behavior of our program using "big O" notation, which expresses how the running time depends on some critical parameter(s). In this case our critical parameter is $N$, and we'll show that our program scales as $\mathcal{O}(N^2)$.

In [ ]:
% Remember "log rules":
log(2^8)
8*log(2)
In [ ]:
% log-log plot of above, with linear regression in asymptotic region
loglog(logspace(1,5,13),timings,'o-')
xlabel("N")
ylabel("Runtime (s)")
% polyfit's first coefficient is the slope of the fitted log-log line, i.e. the
% exponent; a value near 2 confirms the O(N^2) scaling claimed above
polyfit(log(logspace(3,5,7)),log(timings(7:end)),1)

Here's our program again, what can we improve?

  1. Memory access efficiency
  2. Vectorization
  3. Use of "a priori" knowledge to specialize approach/methods
  4. Making use of/avoiding computer and language strengths/weaknesses
  5. Algorithmic improvement: asymptotic/"big O"
In [ ]:
%%file dummy

function total = csum(a,b)
    total = 0;
    for i=a:b
        total = total + i;
    end
end

tic
for i=1:10
    myresults(i) = csum(1,i);
end
toc

Memory access efficiency:

  • preallocation versus resizing
  • memory access patterns (e.g. column vs row "strides"; see the stride sketch below)
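
The cells that follow demonstrate the preallocation point. For the second point, here is a minimal sketch (not from the original notebook) of the effect of access order: MATLAB stores arrays column-major, so walking down a column touches contiguous memory while walking across a row strides through it. How large the gap is depends on your MATLAB version and cache sizes.

In [ ]:
% Hypothetical stride demo: column-order vs row-order traversal of a large matrix
M = rand(5000,5000);

tic
s = 0;
for j = 1:size(M,2)       % outer loop over columns
    for i = 1:size(M,1)   % inner loop walks down a column (contiguous memory)
        s = s + M(i,j);
    end
end
toc

tic
s = 0;
for i = 1:size(M,1)       % outer loop over rows
    for j = 1:size(M,2)   % inner loop walks across a row (strided access)
        s = s + M(i,j);
    end
end
toc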
In [ ]:
mysize=430000;
In [ ]:
clear myresults
tic
myresults(1)=1;
for i=2:mysize
    myresults(i) = myresults(i-1)+i;
end
toc
clear myresults
In [ ]:
clear myresults
tic
myresults = zeros(mysize,1);
myresults(1) = 1;
for i=2:mysize
    myresults(i) = myresults(i-1)+i;
end
toc
clear myresults
In [ ]:
c = 1;
xx = logspace(1,9,13);
for N=xx
    tic
    myresults(1)=1;
    for i=2:N
        myresults(i) = myresults(i-1)+i;
    end
    timings(c) = toc;
    c = c+1;
end
loglog(xx,timings)
polyfit(log(xx(end-5:end)),log(timings(end-5:end)),1)

Vectorization

In [ ]:
% If time permits...
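
As a hedged sketch of where this could go: the csum driver above just builds a cumulative sum, so MATLAB's built-in cumsum computes all of myresults in a single vectorized call.

In [ ]:
% Vectorized equivalent of the csum loop: myresults(i) = 1 + 2 + ... + i
N = 43000;
tic
myresults = cumsum(1:N);
toc

Compare this timing against the loop-based version earlier in the notebook.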

a priori knowledge

In [ ]:
% What if we want to calculate a cumulative sum every time an entry changes?
total = zeros(numel(nums),1);   % preallocate to match the length of nums
total(1)=nums(1);
for i=2:numel(nums)
    total(i) = total(i-1) + nums(i);
end
before=total
In [ ]:
nums(randi(numel(nums)))=99
In [ ]:
total(1)=nums(1);
for i=2:numel(nums)
    total(i) = total(i-1) + nums(i);
end
after=total
In [ ]:
after-before
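
The difference is zero before the changed entry and constant from that entry onward. A hedged sketch of how to exploit that a priori knowledge (k and delta below are made-up names, not from the original notebook): instead of recomputing the whole cumulative sum, shift only the entries at and after the changed index.

In [ ]:
% If nums(k) changes by delta, only total(k:end) changes, each entry by exactly delta
k = randi(numel(nums));
delta = 5;
nums(k) = nums(k) + delta;

tic
total(k:end) = total(k:end) + delta;   % shift the tail instead of re-summing everything
toc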

Language strengths/weaknesses

Fibonacci sequence: 1,1,2,3,5,8,13

In [ ]:
%%file myfib.m

function f = myfib(n)
    if n < 2
        f = n;
        return
    else
        f = myfib(n-1) + myfib(n-2);
    end
end
In [ ]:
tic
myfib(10)
toc
In [ ]:
timer = zeros(40,1);
for i=1:40
    tic
    myfib(i);
    timer(i) = toc;
    if i>30
        i
    end
end
In [ ]:
loglog(timer)
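
Recursion is a poor fit here: MATLAB function calls are relatively expensive, and the naive recursion repeats exponentially many of them. As a hedged sketch (myfib_iter is not part of the original notebook), an iterative version needs only n additions:

In [ ]:
%%file myfib_iter.m

% Iterative Fibonacci: O(n) additions instead of exponentially many recursive calls
function f = myfib_iter(n)
    a = 0;   % fib(0)
    b = 1;   % fib(1)
    for k = 1:n
        t = a + b;
        a = b;
        b = t;
    end
    f = a;
end
In [ ]:
tic
myfib_iter(40)
toc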

Algorithmic improvements

In [ ]:
N=349875;
sum(1:N)

myspecialfunction(N)
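
myspecialfunction is left as a placeholder in the notebook. Presumably the point is that the sum 1 + 2 + ... + N has the closed form $N(N+1)/2$, an $\mathcal{O}(1)$ computation no matter how large $N$ is. A hedged sketch of what it might look like:

In [ ]:
% Hypothetical stand-in for myspecialfunction: Gauss's closed-form sum, O(1) in N
gauss_sum = @(N) N*(N+1)/2;
gauss_sum(N)
sum(1:N)        % agrees with the closed form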

Advanced Vectorization Methods

Sometimes we have two sets of data whose elements all need to interact with one another. Imagine you have a series of partially full boxes and a variety of items you could pack into them, but you can't exceed a max weight.

In [ ]:
myboxes = [9, 11, 7, 20, 1, 19];
items = [1, 3, 4, 10, 19];
max_weight = 20;

res = zeros(numel(myboxes),numel(items));
c = 1;
for i=myboxes
    res(c,:)=(i + items);
    c=c+1;
end
res
res<=max_weight
In [ ]:
bsxfun(@plus,myboxes',items)
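
A side note not in the original notebook: since R2016b, MATLAB's arithmetic operators perform the same implicit expansion as bsxfun, so the call above can usually be written directly:

In [ ]:
% Implicit expansion (R2016b+): a column vector plus a row vector expands to a matrix
myboxes' + items
(myboxes' + items) <= max_weight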

Another "advanced" technique: Use linear algebra when you can (and know how):

In [ ]:
cool

Can we write a short program using loops to find the "neighbor sum" for each element?

In [ ]:
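% The notebook leaves this cell blank for a live answer; this is a hypothetical
% sketch. "cool" is presumably defined live during the tutorial, so the 5x5
% magic square below is only a stand-in.
cool = magic(5);

% Loop version of the "neighbor sum": each output element is the sum of the
% element and its immediate neighbors (including diagonals), clipped at the edges
[m,n] = size(cool);
nsum = zeros(m,n);
for i = 1:m
    for j = 1:n
        rows = max(i-1,1):min(i+1,m);
        cols = max(j-1,1):min(j+1,n);
        nsum(i,j) = sum(sum(cool(rows,cols)));
    end
end
nsum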

Is there a better way?

In [ ]:
A=toeplitz([[1 1] zeros(1,5-2)])   % ones on the main diagonal and first off-diagonals
In [ ]:
cool
A*cool*A   % (A*cool*A)(i,j) sums cool over the 3x3 neighborhood of (i,j)

Instrumentation/metrics

Program bottlenecks and code performance will often defy your intuition. Therefore it's important to empirically measure what's going on.

Consider a simple example: based on what we've learned, which of these is faster?

In [ ]:
tic
for t=1:100000
    total = 0;
    for i=1:10000
        total = total + i;
    end
end
toc

tic
for t=1:100000
    total = sum(1:10000);
end
toc
In [ ]:
total

For non-trivial programs this measurement approach is insufficient. Let's take a look at a "real" program and see how we can profile it in MATLAB.
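
As a minimal sketch of the profiling workflow, applied here to the recursive myfib from earlier (the larger program is explored live):

In [ ]:
% MATLAB's profiler records per-function and per-line timings
profile on
myfib(25);
profile off
profile viewer     % interactive report of where the time went
% Without a desktop, profsave(profile('info'),'profile_results') writes an HTML report
% For micro-benchmarks, timeit(@() myfib(25)) is more robust than a single tic/toc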

Recommended algorithms starting places:

https://en.wikipedia.org/wiki/Introduction_to_Algorithms (CLRS)

Handbook of Mathematical Functions aka Abramowitz and Stegun:

https://en.wikipedia.org/wiki/Abramowitz_and_Stegun and https://dlmf.nist.gov/