is an interpreted programming language. Originally developed for linear algebra and engineering problems, but now with wide applicability and toolboxes for areas ranging from medicine, economics, and machine learning.
A good way to introduce yourself to a new language is by trying to solve a "non-trivial" problem; learning the tools and syntax necessary to solve the problem along the way. This motivates the syntax/tools in a "why" versus "what" way!
help@scc.bu.edu
jbevan@bu.edu
bgregor@bu.edu
format compact
Github repository of data we will use
https://github.com/bu-rcs/bu-rcs.github.io/tree/main/Bootcamp/Data
Citation
University of Wisconsin Population Health Institute. County Health Rankings & Roadmaps 2019.www.countyhealthrankings.org.
Data Source
https://www.countyhealthrankings.org/explore-health-rankings/rankings-data-documentation
x_FairOrPoorHealth x_Smokers x_AdultsWithObesity x_PhysicallyInactive x_WithAccessToExerciseOpportunities x_ExcessiveDrinking x_SomeCollege x_ChildrenInPoverty x_SevereHousingProblems x_DriveAloneToWork x_LongCommute_DrivesAlone
T = readtable("NE_HealthData.csv");
states=unique(T.State);
state_inds=cellfun(@(c)strcmp(c,T.State),states,'UniformOutput',false);
mymap = containers.Map(states,state_inds);
T(mymap("Connecticut"),{'FIPS','County','x_Smokers'});
Tnum = table2array(T(:,[4:9,11:15]));
colmap = containers.Map(1:11,[4:9,11:15]);
Examples of using our data:
averages = zeros(numel(states),11);
it = 1;
for state = 1:6
averages(it,:) = mean(table2array(T(mymap(states{state}),[4:9,11:15])));
it = it + 1;
end
T.County(T.x_ChildrenInPoverty>10);
Some examples:
Correlation Plots
%plot inline -s 800,800
myscatter = scatter(T.x_ChildrenInPoverty, T.x_Smokers,'.r');
title("Poverty vs Smoking for NE Counties")
xlabel('% Children in Poverty')
ylabel('% Smokers')
ax = ancestor(myscatter,'axes');
ax.YAxis.Exponent=0;
ytickformat('%2.1f')
axis tight
hold on
%Plot linear regression and 95% confidence interval
[coeff, S] = polyfit(T.x_ChildrenInPoverty, T.x_Smokers,1);
xFit = linspace(min(T.x_ChildrenInPoverty), max(T.x_ChildrenInPoverty), 100);
[yFit, delta] = polyval(coeff, xFit, S);
plot(xFit, yFit, 'b-', 'LineWidth', 2);
plot(xFit,yFit+2*delta,'m--',xFit,yFit-2*delta,'m--')
legend('Data','Linear Fit','95% Prediction Interval','Location','northwest')
Crude Geographic Plots
Census Data Source:
https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html
Direct link:
https://www2.census.gov/geo/docs/maps-data/data/gazetteer/2019_Gazetteer/2019_Gaz_counties_national.zip
C=readtable('counties.txt');
lat = C.INTPTLAT;
long = C.INTPTLONG;
%plot -s 800,400
contig = long>-125 & long<-50 & lat>24;
clong = long(contig);
clat = lat(contig);
scatter(clong,clat,[],rand(numel(clong),1)*20+clong,'*')
axis tight equal
%plot inline
dt = delaunayTriangulation(clong,clat);
tri = dt.ConnectivityList;
xy = [clong, clat];
maxEdge = 2.6;
edges = [tri(:,[1 2]);tri(:,[1 3]);tri(:,[2 3])];
tri(any(reshape(sqrt((xy(edges(:,1),1) - xy(edges(:,2),1)).^2 + (xy(edges(:,1),2) - xy(edges(:,2),2)).^2),[],3) > maxEdge,2),:) = [];
figure
h = trisurf(tri,clong,clat,rand(numel(clong),1)*20+clong);
set(h,'edgecolor','none')
%shading interp
view(2)
axis tight
Source data:
https://github.com/nytimes/covid-19-data
Direct link:
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv
% Read-in data file
T=readtable("us-states.csv");
% Format contents how we want
cases = table2array(T(:,4));
date = table2array(T(:,1));
elapsed = days(date - date(1));
states = table2array(T(:,2));
unique_states = unique(states);
mymap = containers.Map(unique_states,1:55);
state_nums=cell2mat(mymap.values(states));
% Pre-allocate "infection" and "day" matrices
infection = zeros(numel(unique_states), elapsed(end)+1);
day = infection;
% Pre-allocate "state_counter" to be 1 for all states
state_counter = ones(1,numel(unique_states));
% Loop through formatted data
for record = 1:numel(cases)
% For next record, get correct row=state and column=state_counter(state)
row = state_nums(record);
column = state_counter(row);
state_counter(row) = state_counter(row) + 1;
% Put # infections in infection(row,column)
infection(row,column) = cases(record);
day(row,column) = elapsed(record);
% Put day in day(row,column)
end
% Find first day where infection>100, use this as starting day in plot
% Plot data for all states, starting from selected day
%plot inline -s 700,500
[vals,inds] = sort(max(infection'));
inds=inds(10000<vals & vals<700000);
for i = numel(inds):-5:1
want = infection(inds(i),:)>100;
h = plot(infection(inds(i), want));
hold on
end
title("I")
xlabel('Relative Day')
ylabel('Infections')
legend(unique_states(inds(numel(inds):-5:1)),'Location','northwest')
ax = ancestor(h,'axes');
ax.YAxis.Exponent=0;
ytickformat('%2.1f')
axis tight
Let's try to do some of the following:
-Correlation analysis. Which columns are dependent on other columns?
Fit some linear models to correlated columns. What's the quality of the fit? P-value?
-Take the age expectancy histogram. Fit a Gaussian distribution to it. What's the quality of the fit?
-Compare different groups in the data set; Urban vs. Rural - distribution of men vs women, life expectancy, % smokers etc.
Are the differences statistically significant? Are they significant in a real sense?
-Zipf's law: create a histogram of county population size.
%plot inline -s 400,400
[R, P] = corrcoef(Tnum);
sprintf([repmat('%4.0f',1,size(R,2)) '\n'], triu(R,1)'*100)
colormap(parula);
imagesc(triu(R,1))
axis square
colorbar
[r,c] = find(triu(R,1)==max(R-eye(size(R)),[],'all'))
T(1,[colmap(3),colmap(4)])
%plot inline -s 800,800
myscatter = scatter(T.x_AdultsWithObesity, T.x_PhysicallyInactive,'.r');
title("Obesity vs Inactivity")
xlabel('% Adults w/ Obesity')
ylabel('% Physically Inactive')
ax = ancestor(myscatter,'axes');
ax.YAxis.Exponent=0;
ytickformat('%2.1f')
axis tight
hold on
%Plot linear regression and 95% confidence interval
[coeff, S] = polyfit(T.x_AdultsWithObesity, T.x_PhysicallyInactive,1);
xFit = linspace(min(T.x_AdultsWithObesity), max(T.x_AdultsWithObesity), 100);
[yFit, delta] = polyval(coeff, xFit, S);
plot(xFit, yFit, 'b-', 'LineWidth', 2);
plot(xFit,yFit+2*delta,'m--',xFit,yFit-2*delta,'m--')
legend('Data','Linear Fit','95% Prediction Interval','Location','northwest')
sprintf([repmat('%6.0f',1,size(P,2)) '\n'], triu(P,1)'*10000)
Testing for Normal Distribution:
Remember our next goal:
FIPS State County LifeExpectancy x_FrequentPhysicalDistress x_LimitedAccessToHealthyFoods MedianHouseholdIncome x_Homeowners x_SevereHousingCostBurden Population x_Black x_AmericanIndian_AlaskaNative x_Asian x_NativeHawaiian_OtherPacificIslander x_Hispanic x_Non_HispanicWhite x_Female x_Rural
USAd = readtable("USA_DemographicsData.csv");
USAd(1:8,1:5)
USAd(72:78,1:4)
Can't keep avoiding our problem now:
https://www.mathworks.com/help/matlab/matlab_prog/clean-messy-and-missing-data-in-tables.html
USAd = readtable("USA_DemographicsData.csv",'TreatAsEmpty',{'NA'});
USAd(72:78,1:4)
We also should remove the 'NA' counties that are actually just state averages (how does this affect correlation from above?):
size(USAd)
USAd(strcmp('NA',USAd.County),:)=[];
size(USAd)
states=unique(USA.State);
state_inds=cellfun(@(c)strcmp(c,USAd.State),states,'UniformOutput',false);
USAdmap = containers.Map(states,state_inds);
USAd(1:8,1:5)
USAdnum = table2array(USAd(:,4:18));
USAdcolmap = containers.Map(1:15,4:18);
%plot inline -s 1000,300
histogram(USAd.LifeExpectancy)
med = nanmedian(USAd.LifeExpectancy);
xline(med,'r');
text(med,235,num2str(med))
How do we go about fitting a normal to this?
Let's take a look at MathWorks documentation to see and then test if this is a good fit.
(Good illustration of my process and how to solve open-ended problems)
Zipf's Law:
https://en.wikipedia.org/wiki/Zipf%27s_law
Remember our goal:
Order and rank county populations. Plot rank vs. population size on a log-log plot. Except for the smallest sizes they will fit a straight line. How should we go about fitting?