Me: Josh Bevan - jbevan@bu.edu
Get Help from RCS: help@scc.bu.edu
Our website: rcs.bu.edu
Tutorial eval: rcs.bu.edu/eval
This notebook: http://scv.bu.edu/examples/matlab/Tutorials/PerfOpt/
Programs are run with finite resources: memory, cpu, disk, time, etc. For small or simple programs these limitations are unimportant; the program finishes running fast enough and within your constraints so that these limitations have no noticeable effect. However, most non-trivial programs eventually reach one of these constraints; usually memory or running time.
Performance optimization in the context of this tutorial then means improvement of a program to minimize the effect of our computing environments limitations.
You can break techniques to "PO" into several categories:
Not 6: Parallelization. Improves performance, but does not "optimize" (actually usually has efficiency penalty).
format compact
Brief sidebar: Recall vectorization... Vectorization is the process of performing the same operation on multiple pieces of data at the same time. This has to do with MATLAB's language/implementation specifics, but is also generally applicable to most other interpreted languages.
It often takes the form of converting "tight" loops into operations on vectors/matrices.
Here's an example:
nums = rand(10000,1);
%============
tic
total = zeros(10000,1);
for t=1:1000
for i=1:10000
total(i) = sin(nums(i));
end
end
toc
%============
tic
for t=1:1000
total = sin(nums);
end
toc
Depending on MATLAB version, they may be close or not. Newer versions are able to better JIT (just-in-time) compile simple patterns. Consider this very similiar version and compare:
nums = rand(10000,1);
%============
tic
total = zeros(10000,1);
for t=1:1000
for i=1:10000
sin(nums(i));
end
end
toc
%============
tic
for t=1:1000
sin(nums);
end
toc
We can easily make situations where MATLAB has a hard time JITing. Try encapsulating operations inside functions:
%%file test.m
function out = test(a,b)
out = a + b;
end
%%file dummy.m
function out = dummy(in)
end
nums = randi(100,10000,1); % Creates a 10000x1 matrix with random integers values 1 to 100 inclusive
% Original task:
tic
for t=1:10000
total = 0;
for i=1:10000
total = total + nums(i);
end
end
toc
% Function call overhead:
tic
for t=1:10000
total = 0;
for i=1:10000
total = total + nums(i);
dummy(42)
end
end
toc
% Original task encapsulated in function to prevent JITing:
tic
for t=1:10000
total = 0;
for i=1:10000
total = test(total, nums(i));
end
end
toc
% Optimized built-ins are even better and should be used when possible:
tic
for t=1:10000
total = sum(nums);
end
toc
Consider a simple example that shows all of the above categories of optimization: We want to calculate cumulative sum (prefix sum) for an array of numbers
nums = randi(100,10,1)
total = 0;
for i=1:numel(nums)
total = total + nums(i);
end
total
%%file csum.m
% Calculate a cumulative sum from a to b
% Does same amount of work as we would for a "real" csum
function total = csum(a,b)
total = 0;
for i=a:b
total = total + i;
end
end
%%file csum2.m
% Calculate a cumulative sum from a to b
function total = csum2(a,b)
total = zeros(numel(a:b),1);
for i=a:b
total(i) = total(i-1) + i;
end
end
% Get all the cumulative sums from 1, for the numbers 1 to 100
tic
for i=1:(1*43000)
myresults(i) = csum(1,i);
end
toc
logspace(1,5,13)
% Look at the running time as we increase the range of numbers we get the cumsum for
c = 1;
for N=logspace(1,5,13)
tic
for i=1:N
myresults(i) = csum(1,i);
end
timings(c) = toc;
c = c+1;
end
plot(logspace(1,5,13),timings,'o-')
xlabel("N")
ylabel("Runtime (s)")
We can see the above plot shows our running time does not increase linearly as we increase $N$. We can talk about the "asymptotic" behavior of our program using "big O" notation. This describes the running time dependent on some critical parameter(s). In this case our critical parameter is $N$ and we'll show that our program scales as: $\mathcal{O}(N^2)$
% Remember "log rules":
log(2^8)
8*log(2)
% log-log plot of above, with linear regression in asymptotic region
loglog(logspace(1,5,13),timings,'o-')
xlabel("N")
ylabel("Runtime (s)")
polyfit(log(logspace(3,5,7)),log(timings(7:end)),1)
What can we improve?
mysize=4300000;
clear myresults
tic
myresults(1)=1;
for i=2:mysize
myresults(i) = myresults(i-1)+i;
end
toc
clear myresults
clear myresults
tic
myresults = zeros(mysize,1);
for i=2:mysize
myresults(i) = myresults(i-1)+i;
end
toc
clear myresults
c = 1;
xx = logspace(1,7,13);
for N=xx
tic
myresults(1)=1;
for i=2:N
myresults(i) = myresults(i-1)+i;
end
timings(c) = toc;
c = c+1;
end
loglog(xx,timings)
polyfit(log(xx(end-5:end)),log(timings(end-5:end)),1)
% If time permits...
nums=randi(100,10,1)
% What if we want to calculate a cumulative sum every time an entry changes?
total = zeros(10,1);
total(1)=nums(1);
for i=2:numel(nums)
total(i) = total(i-1) + nums(i);
end
before=total
nums(randi(numel(nums)))=99
total(1)=nums(1);
for i=2:numel(nums)
total(i) = total(i-1) + nums(i);
end
after=total
after-before
% How can we modify this to take advantage of our *a priori* knowledge?
total = zeros(10,1);
total(1)=nums(1);
for i=2:numel(nums)
total(i) = total(i-1) + nums(i);
end
before=total
Fibonacci sequence: 1,1,2,3,5,8,13
%%file myfib.m
function f = myfib(n)
if n < 2
f = n;
return
else
f = myfib(n-1) + myfib(n-2);
end
end
tic
myfib(10)
toc
timer = zeros(40,1);
for i=1:40
tic
myfib(i);
timer(i) = toc;
if i>30
i
end
end
loglog(timer(10:end))
https://en.wikipedia.org/wiki/Fenwick_tree
"A flat array of N values can either store the values or the prefix sums. In the first case, computing prefix sums requires linear time; in the second case, updating the array values requires linear time (in both cases, the other operation can be performed in constant time)."
nums=randi(100,10,1)
N = numel(nums);
total = nums;
for i=2:N
total(i) = total(i-1) + nums(i);
end
total
Sometimes we have two sets of data etc. that we need to interact across all elements. Imagine you have a series of partially full boxes and a variety of items you can pack in them, but you can't exceed a max weight.
myboxes = [9, 11, 7, 20, 1, 19];
items = [1, 3, 4, 10, 19];
max_weight = 20;
res = zeros(numel(myboxes),numel(items));
c = 1;
for i=myboxes
res(c,:)=(i + items);
c=c+1;
end
res
res<=max_weight
repmat(myboxes',1,numel(items))+repmat(items,numel(myboxes),1)
https://www.mathworks.com/help/matlab/ref/bsxfun.html
bsxfun(@plus,myboxes',items)
Another "advanced" technique: Use linear algebra when you can (and know how):
cool
Can we write a short program using loops to find the "neighbor sum" for each element?
Is there a better way?
A=toeplitz([[1 1] zeros(1,5-2)])
cool
A*cool*A
Program bottlenecks and code performance will often defy your intuition. Therefore it's important to empirically measure what's going on.
Consider a simple example: based on what we've learned which of these is faster?
tic
for t=1:100000
total = 0;
for i=1:10000
total = total + i;
end
end
toc
tic
for t=1:100000
total = sum(1:10000);
end
toc
total
For non-trivial programs this measurement approach is insufficient. Let's take a look at a "real" program and see how we can profile it in MATLAB.
https://en.wikipedia.org/wiki/Introduction_to_Algorithms (CLRS)
Handbook of Mathematical Functions aka Abramowitz and Stegun:
https://en.wikipedia.org/wiki/Abramowitz_and_Stegun and https://dlmf.nist.gov/