{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Me: Josh Bevan - jbevan@bu.edu \n", "Get Help from RCS: help@scc.bu.edu\n", "\n", "Our website: [rcs.bu.edu](http://rcs.bu.edu) \n", "Tutorial eval: [rcs.bu.edu/eval](http://rcs.bu.edu/eval) \n", "This notebook: http://scv.bu.edu/examples/matlab/Tutorials/PerfOpt/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **What is \"performance optimization\"?**\n", "Programs are run with finite resources: memory, cpu, disk, time, etc. For small or simple programs these limitations are unimportant; the program finishes running fast enough and within your constraints so that these limitations have no noticeable effect. However, most non-trivial programs eventually reach one of these constraints; usually memory or running time.\n", "\n", "Performance optimization in the context of this tutorial then means improvement of a program to minimize the effect of our computing environments limitations.\n", "\n", "You can break techniques to \"PO\" into several categories:\n", "1. Memory access efficiency\n", "2. Vectorization\n", "3. Use of \"a priori\" knowledge to specialize approach/methods\n", "4. Making use of/avoiding computer and language strengths/weaknesses\n", "5. Algorithmic improvement: asymptotic/\"big O\"\n", "\n", "Not 6: Parallelization. Improves performance, but does not \"optimize\" (actually usually has efficiency penalty)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "format compact" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Brief sidebar: Recall vectorization...**\n", "Vectorization is the process of performing the same operation on multiple pieces of data at the same time. This has to do with MATLAB's language/implementation specifics, but is also generally applicable to most other *interpreted* languages.\n", "\n", "It often takes the form of converting \"tight\" loops into operations on vectors/matrices.\n", "\n", "Here's an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nums = rand(10000,1);\n", "%============\n", "tic\n", "total = zeros(10000,1);\n", "for t=1:1000\n", " for i=1:10000\n", " total(i) = sin(nums(i));\n", " end\n", "end\n", "toc\n", "%============\n", "tic\n", "for t=1:1000\n", " total = sin(nums);\n", "end\n", "toc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Depending on MATLAB version, they may be close or not. Newer versions are able to better JIT (just-in-time) compile simple patterns. Consider this very similiar version and compare:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nums = rand(10000,1);\n", "%============\n", "tic\n", "total = zeros(10000,1);\n", "for t=1:1000\n", " for i=1:10000\n", " sin(nums(i));\n", " end\n", "end\n", "toc\n", "%============\n", "tic\n", "for t=1:1000\n", " sin(nums);\n", "end\n", "toc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can easily make situations where MATLAB has a hard time JITing. Try encapsulating operations inside functions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%file test.m\n", "\n", "function out = test(a,b)\n", " out = a + b;\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%file dummy.m\n", "\n", "function out = dummy(in)\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nums = randi(100,10000,1); % Creates a 10000x1 matrix with random integers values 1 to 100 inclusive\n", "\n", "% Original task:\n", "tic\n", "for t=1:10000\n", " total = 0;\n", " for i=1:10000\n", " total = total + nums(i);\n", " end\n", "end\n", "toc\n", "\n", "% Function call overhead:\n", "tic\n", "for t=1:10000\n", " total = 0;\n", " for i=1:10000\n", " total = total + nums(i);\n", " dummy(42)\n", " end\n", "end\n", "toc\n", "\n", "% Original task encapsulated in function to prevent JITing:\n", "tic\n", "for t=1:10000\n", " total = 0;\n", " for i=1:10000\n", " total = test(total, nums(i));\n", " end\n", "end\n", "toc\n", "\n", "% Optimized built-ins are even better and should be used when possible:\n", "tic\n", "for t=1:10000\n", " total = sum(nums);\n", "end\n", "toc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider a simple example that shows all of the above categories of optimization:\n", "We want to calculate cumulative sum (prefix sum) for an array of numbers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nums = randi(100,10,1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total = 0;\n", "for i=1:numel(nums)\n", " total = total + nums(i);\n", "end\n", "total" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%file csum.m\n", "\n", "% Calculate a cumulative sum from a to b\n", "% Does same amount of work as we would for a \"real\" csum\n", "function total = csum(a,b)\n", " total = 0;\n", " for i=a:b\n", " total = total + i;\n", " end\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%file csum2.m\n", "\n", "% Calculate a cumulative sum from a to b\n", "function total = csum2(a,b)\n", " total = zeros(numel(a:b),1);\n", " for i=a:b\n", " total(i) = total(i-1) + i;\n", " end\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "% Get all the cumulative sums from 1, for the numbers 1 to 100\n", "tic\n", "for i=1:(1*43000)\n", " myresults(i) = csum(1,i);\n", "end\n", "toc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logspace(1,5,13)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "% Look at the running time as we increase the range of numbers we get the cumsum for\n", "c = 1;\n", "for N=logspace(1,5,13)\n", " tic\n", " for i=1:N\n", " myresults(i) = csum(1,i);\n", " end\n", " timings(c) = toc;\n", " c = c+1;\n", "end\n", "plot(logspace(1,5,13),timings,'o-')\n", "xlabel(\"N\")\n", "ylabel(\"Runtime (s)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the above plot shows our running time does not increase linearly as we increase $N$. We can talk about the \"asymptotic\" behavior of our program using \"big O\" notation. This describes the running time dependent on some critical parameter(s). In this case our critical parameter is $N$ and we'll show that our program scales as: $\\mathcal{O}(N^2)$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "% Remember \"log rules\":\n", "log(2^8)\n", "8*log(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "% log-log plot of above, with linear regression in asymptotic region\n", "loglog(logspace(1,5,13),timings,'o-')\n", "xlabel(\"N\")\n", "ylabel(\"Runtime (s)\")\n", "polyfit(log(logspace(3,5,7)),log(timings(7:end)),1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What can we improve?\n", "1. Memory access efficiency\n", "2. Vectorization\n", "3. Use of \"a priori\" knowledge to specialize approach/methods\n", "4. Making use of/avoiding computer and language strengths/weaknesses\n", "5. Algorithmic improvement: asymptotic/\"big O\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Memory access efficiency\n", "- preallocation versus resizing\n", "- memory access patterns (e.g. column vs row \"strides\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mysize=4300000;" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clear myresults\n", "tic\n", "myresults(1)=1;\n", "for i=2:mysize\n", " myresults(i) = myresults(i-1)+i;\n", "end\n", "toc\n", "clear myresults" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clear myresults\n", "tic\n", "myresults = zeros(mysize,1);\n", "for i=2:mysize\n", " myresults(i) = myresults(i-1)+i;\n", "end\n", "toc\n", "clear myresults" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c = 1;\n", "xx = logspace(1,7,13);\n", "for N=xx\n", " tic\n", " myresults(1)=1;\n", " for i=2:N\n", " myresults(i) = myresults(i-1)+i;\n", " end\n", " timings(c) = toc;\n", " c = c+1;\n", "end\n", "loglog(xx,timings)\n", "polyfit(log(xx(end-5:end)),log(timings(end-5:end)),1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "% If time permits..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *a priori* knowledge" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nums=randi(100,10,1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "% What if we want to calculate a cumulative sum every time an entry changes?\n", "total = zeros(10,1);\n", "total(1)=nums(1);\n", "for i=2:numel(nums)\n", " total(i) = total(i-1) + nums(i);\n", "end\n", "before=total" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nums(randi(numel(nums)))=99" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total(1)=nums(1);\n", "for i=2:numel(nums)\n", " total(i) = total(i-1) + nums(i);\n", "end\n", "after=total" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "after-before" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "% How can we modify this to take advantage of our *a priori* knowledge?\n", "total = zeros(10,1);\n", "total(1)=nums(1);\n", "for i=2:numel(nums)\n", " total(i) = total(i-1) + nums(i);\n", "end\n", "before=total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Language strengths/weaknesses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fibonacci sequence: 1,1,2,3,5,8,13" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%file myfib.m\n", "\n", "function f = myfib(n)\n", " if n < 2\n", " f = n;\n", " return\n", " else\n", " f = myfib(n-1) + myfib(n-2);\n", " end\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tic\n", "myfib(10)\n", "toc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timer = zeros(40,1);\n", "for i=1:40\n", " tic\n", " myfib(i);\n", " timer(i) = toc;\n", " if i>30\n", " i\n", " end\n", "end" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "loglog(timer(10:end))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Algorithmic improvements\n", "\n", "*https://en.wikipedia.org/wiki/Fenwick_tree*
\n", "\"A flat array of N values can either store the values or the prefix sums. In the first case, computing prefix sums requires linear time; in the second case, updating the array values requires linear time (in both cases, the other operation can be performed in constant time).\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nums=randi(100,10,1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = numel(nums);\n", "total = nums;\n", "for i=2:N\n", " total(i) = total(i-1) + nums(i);\n", "end\n", "total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **Advanced Vectorization Methods**\n", "\n", "Sometimes we have two sets of data etc. that we need to interact across all elements. Imagine you have a series of partially full boxes and a variety of items you can pack in them, but you can't exceed a max weight." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "myboxes = [9, 11, 7, 20, 1, 19];\n", "items = [1, 3, 4, 10, 19];\n", "max_weight = 20;\n", "\n", "res = zeros(numel(myboxes),numel(items));\n", "c = 1;\n", "for i=myboxes\n", " res(c,:)=(i + items);\n", " c=c+1;\n", "end\n", "res\n", "res<=max_weight" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "repmat(myboxes',1,numel(items))+repmat(items,numel(myboxes),1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "https://www.mathworks.com/help/matlab/ref/bsxfun.html\n", "\n", "- Binary\n", "- Singleton\n", "- Expansion" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bsxfun(@plus,myboxes',items)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another \"advanced\" technique: Use linear algebra when you can (and know how):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cool" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can we write a short program using loops to find the \"neighbor sum\" for each element?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is there a better way?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A=toeplitz([[1 1] zeros(1,5-2)])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cool\n", "A*cool*A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **Instrumentation/metrics**\n", "Program bottlenecks and code performance will often defy your intuition. Therefore it's important to *empirically measure* what's going on.\n", "\n", "Consider a simple example: based on what we've learned which of these is faster?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tic\n", "for t=1:100000\n", " total = 0;\n", " for i=1:10000\n", " total = total + i;\n", " end\n", "end\n", "toc\n", "\n", "tic\n", "for t=1:100000\n", " total = sum(1:10000);\n", "end\n", "toc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For non-trivial programs this measurement approach is insufficient. Let's take a look at a \"real\" program and see how we can profile it in MATLAB." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# **Recommended algorithms starting places:**\n", "\n", "https://en.wikipedia.org/wiki/Introduction_to_Algorithms (CLRS)\n", "\n", "Handbook of Mathematical Functions aka Abramowitz and Stegun:\n", "\n", "https://en.wikipedia.org/wiki/Abramowitz_and_Stegun\n", "and\n", "https://dlmf.nist.gov/" ] } ], "metadata": { "kernelspec": { "display_name": "MATLAB", "language": "matlab", "name": "imatlab" }, "language_info": { "codemirror_mode": "octave", "file_extension": ".m", "mimetype": "text/x-matlab", "name": "matlab", "pygments_lexer": "matlab", "version": "9.12.0.1884302 (R2022a)" } }, "nbformat": 4, "nbformat_minor": 2 }