MATLAB is an interpreted programming language. It was originally developed for linear algebra and engineering problems, but it now has wide applicability, with toolboxes for areas ranging from medicine and economics to machine learning.
A good way to introduce yourself to a new language is to try to solve a "non-trivial" problem, learning the tools and syntax you need along the way. This motivates the syntax and tools in terms of "why" rather than "what"!
help@scc.bu.edu
jbevan@bu.edu
bgregor@bu.edu
format compact
GitHub repository of data we will use
https://github.com/bu-rcs/bu-rcs.github.io/tree/main/Bootcamp/Data
Citation
University of Wisconsin Population Health Institute. County Health Rankings & Roadmaps 2019. www.countyhealthrankings.org.
Data Source
https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation
How do we go about doing this?
Pseudo-code:
Read-in data file
Format contents how we want
Pre-allocate "data" matrix (to be all zero?)
Loop through formatted data
(Process data to extract features of interest, categorize, etc)
1: Calculate state averages
2: ?
3: ?
end
Output interesting results
Generate visualizations/plots
How do we read in a csv?
When I was first making this tutorial I had forgotten how, since I rarely need to read CSV files in my own work. So I googled...
csvread()
readmatrix()
readtable()
opts=detectImportOptions()
test2=readmatrix("NE_HealthData.csv")
opts=detectImportOptions("NE_HealthData.csv")
readmatrix("NE_HealthData.csv",opts)
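To try the import functions without the course data file, here is a minimal sketch that writes a tiny CSV to a temporary location and reads it back. The table contents and the names `demo`, `fname`, and `T2` are made up for illustration and are not part of NE_HealthData.csv.

```matlab
% Build a small table with one text column and one numeric column (toy data).
demo = table(["Maine"; "Vermont"], [1.5; 2.5], 'VariableNames', {'State', 'Rate'});

% Write it to a temporary CSV file.
fname = [tempname '.csv'];
writetable(demo, fname);

% Inspect what MATLAB detects about the file before importing it.
opts = detectImportOptions(fname);
disp(opts.VariableNames)

% readtable keeps the mixed column types; readmatrix expects numeric data.
T2 = readtable(fname, opts);
M2 = readmatrix(fname);

% Clean up the temporary file.
delete(fname);
```

Since the health data mixes text columns (state and county names) with numeric ones, a table is usually the more natural container than a plain matrix.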
Primitives: Integers/Floating Point/Characters/Strings/Booleans/etc.
Integers have some interesting behavior when you work with unsigned versions or exceed their range:
a=uint8(2)
b=uint8(30)
a-b
c=int8(100)
d=int8(50)
c+d
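The two results above come from saturation: MATLAB integer arithmetic clamps to the limits of the type instead of wrapping around. A short sketch of that behavior (the variable names are only for illustration):

```matlab
% Integer arithmetic saturates at the limits of the type.
low  = uint8(2) - uint8(30);   % mathematically -28, clamped to intmin('uint8')
high = int8(100) + int8(50);   % mathematically 150, clamped to intmax('int8')
disp(low)                      % 0
disp(high)                     % 127

% Use a wider type first if the true result matters.
wide = int16(100) + int16(50);
disp(wide)                     % 150

% Integer division rounds to the nearest integer rather than truncating.
q = uint8(7) / uint8(2);
disp(q)                        % 4
```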
Strings and characters sound the same, but can behave differently:
bp_char='atcg';
bp_string="atcg";
class(bp_char)
class(bp_string)
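A few operations that show where the two types differ. This is a small sketch, not an exhaustive comparison: a char vector is an array of characters, while a string scalar is a single element that holds the whole text.

```matlab
bp_char = 'atcg';
bp_string = "atcg";

% A char vector has one element per character; a string scalar has one element.
n_char = numel(bp_char);          % 4
n_str  = numel(bp_string);        % 1
len    = strlength(bp_string);    % 4

% Indexing picks a character from a char vector, but a whole string from a string array.
first_char = bp_char(1);          % 'a'
first_str  = bp_string(1);        % "atcg"

% Square brackets concatenate chars, but build an array of strings.
joined_char = ['at', 'cg'];       % 'atcg'
str_array   = ["at", "cg"];       % 1x2 string array
joined_str  = "at" + "cg";        % "atcg"

% Converting between the two types.
as_char   = char(bp_string);
as_string = string(bp_char);
```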
format long
a = double(2/3)
b = single(2/3)
eps(a)
eps(b)
format short
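The `eps` values above are the spacing between adjacent representable numbers at that magnitude, which is why exact equality tests on floating point results are fragile. A minimal sketch; the tolerance of `4*eps` is an arbitrary choice for illustration:

```matlab
% Spacing of floating point numbers near 2/3 for each precision.
eps_double = eps(double(2/3));     % 2^-53
eps_single = eps(single(2/3));     % 2^-24

% 0.1 + 0.2 is not exactly 0.3 in binary floating point.
x = 0.1 + 0.2;
exact_match = (x == 0.3);                     % false

% Compare with a tolerance scaled to the magnitude of the values instead.
close_enough = abs(x - 0.3) < 4*eps(0.3);     % true
```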
Structs:
for i=1:10
% What happens if we use bp_string here instead?
patient(i).dna = bp_char(randi(4,1,20));
end
patient(2).dna
class(patient)
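Once a struct array exists, a field can be gathered across all elements at once. The sketch below uses a separate array, `demo_patient`, with a hypothetical `age` field that is not part of the tutorial data.

```matlab
bases = 'atcg';
for i = 1:3
    demo_patient(i).dna = bases(randi(4, 1, 20));  % random 20-base sequence
    demo_patient(i).age = 30 + i;                  % hypothetical field
end

% Collect a numeric field from every element into one vector.
ages = [demo_patient.age];          % [31 32 33]

% Collect a char field from every element into a cell array.
all_dna = {demo_patient.dna};       % 1x3 cell, each entry 1x20 char

% List the field names and the number of records.
names = fieldnames(demo_patient);
n = numel(demo_patient);
```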
Maps:
mymap = containers.Map(["smallest prime","dull number","days in year"],[2,1729,365])
mymap("days in year")
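A few other operations that are handy with `containers.Map`, sketched on a separate map called `facts` so that `mymap` is left untouched. The extra key is made up for illustration.

```matlab
facts = containers.Map(["smallest prime", "days in year"], [2, 365]);

% Add a new key/value pair by assignment.
facts('hours in day') = 24;

% Check for a key before looking it up; a missing key raises an error.
has_weeks = isKey(facts, 'weeks in year');   % false

% keys() returns the keys as a sorted cell array of char vectors.
key_list = keys(facts);

% Count is the number of stored pairs.
n_pairs = facts.Count;                       % 3
```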
Tables:
T = readtable("NE_HealthData.csv");
T(1,1:5)
T(1:11,2)
T.State(1:11)
T(1,4)
T(1,4)+1
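Parentheses on a table return another table, which is why arithmetic on `T(1,4)` may raise an error, depending on the MATLAB release. Curly braces or dot notation return the underlying values instead. A minimal sketch on a toy table (the names and numbers are made up):

```matlab
State = ["Maine"; "Vermont"];
Rate = [1.5; 2.5];
demo = table(State, Rate);

% Parentheses keep the table container.
sub = demo(1, 2);
sub_class = class(sub);        % 'table'

% Curly braces extract the contents as a plain array.
val = demo{1, 2};
val_class = class(val);        % 'double'

% Dot notation extracts a whole column by name.
col = demo.Rate;

% Arithmetic works on the extracted values.
bumped = val + 1;              % 2.5
```

`table2array(demo(:, 2))` returns the same values as `demo{:, 2}`.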
How do we index vectors and matrices (aka "learn to love colons")? A quick sidebar:
% Quick way to make a "test" matrix of any size
M = magic(5)
% MATLAB is "column-major" and "one-indexed"
disp(M(1,1))
disp(M(1,2))
disp(M(3,1))
disp(M(5,5))
% Slices
M(1,:)
M(2,:)
M(:,1)
% Ranges
M(1:3,2)
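Column-major storage becomes visible when a matrix is indexed with a single subscript. The sketch below uses a small matrix `A` (not part of the tutorial data) so the results are easy to check by hand.

```matlab
A = [1 2 3;
     4 5 6];

% A single (linear) index walks down the columns first.
second = A(2);                 % 4
third  = A(3);                 % 2
flat   = A(:).';               % [1 4 2 5 3 6]

% "end" refers to the last index along that dimension.
last_row = A(end, :);          % [4 5 6]

% A vector of indices selects several columns at once.
outer_cols = A(:, [1 3]);      % [1 3; 4 6]

% A logical mask selects the elements where the condition holds.
big = A(A > 3).';              % [4 5 6]

% sub2ind converts (row, column) subscripts to a linear index.
idx = sub2ind(size(A), 2, 3);  % 6
```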
Back to working with our dataset:
size(T)
T(1:10,4)
If we want to do any numerical operations on the columns of our data we need to put it into an appropriate numerical data type...
We can do this with table2array()
table2array(T(1:10,4))
T(1,1:5)
T(1,6:8)
T(1,9:12)
T(1,13:15)
T(1:20,10)
How do we deal with missing values here?
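One possible approach, sketched on toy data: missing numeric entries show up as NaN, and they can be detected, skipped in a calculation, or dropped. The values and the names `x` and `demo` are made up for illustration.

```matlab
x = [10; NaN; 30];

% Find the missing entries.
missing_mask = isnan(x);               % [false; true; false]

% A plain mean is contaminated by NaN...
plain_mean = mean(x);                  % NaN

% ...but the 'omitnan' flag skips the missing entries.
clean_mean = mean(x, 'omitnan');       % 20

% rmmissing drops missing entries, or whole rows when given a table.
x_clean = rmmissing(x);                % [10; 30]
demo = table(["a"; "b"; "c"], x, 'VariableNames', {'County', 'Value'});
demo_clean = rmmissing(demo);          % 2 rows remain
```

Whether to skip or drop missing values depends on the analysis. Dropping whole rows also discards the valid entries in the other columns of those rows.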
We'd like to be able to process on a state-by-state basis:
states=unique(T.State)
How do we find a particular state in the table?
T(strcmp(T.State,"Connecticut"),1:5)
What happens if we need to do this many times? Is there a potential performance issue with repeating the string comparison for every lookup?
If so, how can we mitigate it?
strcmp(T.State,states)
state_inds=cellfun(@(c)strcmp(c,T.State),states,'UniformOutput',false)
state_inds2=cell2mat(reshape(state_inds,1,[]))
mymap = containers.Map(states,state_inds)
T(mymap("Connecticut"),{'State','County'})
table2array(T(mymap("Connecticut"),[4:9,11:15]))
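averages = zeros(numel(states),11);
% states is a column cell array, so "for state = states" would run only once
% with the whole array; loop over an index instead.
for it = 1:numel(states)
    state = states{it};  % char vector used as the map key
    cols = table2array(T(mymap(state),[4:9,11:15]));
    averages(it,:) = mean(cols,1);  % column-wise mean, even if a state has one county
end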
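As an alternative to the map and loop above, MATLAB's `findgroups` and `splitapply` can compute per-group statistics directly. This is a different technique from the one used in the tutorial, sketched here on a toy table with made-up values.

```matlab
State = ["Maine"; "Maine"; "Vermont"; "Vermont"; "Vermont"];
Rate = [10; 20; 3; 6; 9];
demo = table(State, Rate);

% Assign each row a group number; group_names lists the groups in sorted order.
[G, group_names] = findgroups(demo.State);

% Apply mean to the Rate values of each group.
avg = splitapply(@mean, demo.Rate, G);   % [15; 6]

% Pair each state with its average.
result = table(group_names, avg, 'VariableNames', {'State', 'MeanRate'});
disp(result)
```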
mymap.values()