Index of /examples/matlab/misc/howto-checkpoint

Icon  Name                    Last modified      Size  Description
[DIR] Parent Directory - [TXT] index.html.scv-helppage 01-Feb-2014 17:51 118 [   ] chkpt.m 16-Dec-2013 15:05 853 [   ] chkin.m 16-Dec-2013 15:05 330 [   ] checkpointing_example.m 16-Dec-2013 15:05 3.5K

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% On a cluster such as SCC, jobs that take hours or days to run to completion    %
% on occasion could suffer from system abort in mid-stream due to a variety of   %
% reasons: power failure, walltime limit, scheduled shutdown, to name a few.     %
% Rerunning failed job means inevitable delay and waste of system resources.     %
% Checkpoint Restarting has long been a common technique with which researchers  %
% turn to tackle this issue. In brief, Checkpoint Restarting essentially means   %
% saving data to disk periodically so that if need be, you can restart the job   %
% where data were last saved. Checkpoint Restarting can either be dealt with     %
% through a batch scheduler (if supported) or do it the old-fashioned way by     %
% writing to disk manually yourself. On the SCC, the OGE batch scheduler         %
% checkpointing feature is not supported. A rather simple MATLAB checkpointing   %
% tool has been developed by RCS to perform manual checkpointing systematically. %
% This MATLAB example code demonstrates the usage of this checkpointing tool to  %
% facilitate rerunning of your job should a system abort occurs.                 %
%                                                                                %
% December 8, 2013                                                               %
% Kadin Tseng, RCS, Boston University  (kadin@bu.edu)                            %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Check list for checkpointing:

1) Before the start of the iteration loop, "check-in" each and every variable
   that should be checkpointed in the event of restarting the job

matfile = 'chkpt.mat';      % mandatory; name of checkpoint mat-file; your name choice
s = struct();               % mandatory; create struct for checkpointing variables
s = chkin(s,{'iter'});      % mandatory; iter is the name of the iteration loop index
s = chkin(s,{'frequency'}); % mandatory; frequency is the period of checkpointing, i.e.,
                            % how often you want to perform a save. A finer grain saving
                            % prevents less lost iteration but costs more because of it
s = chkin(s,{'Niter'});     % mandatory; total number of iterations
s = chkin(s,{'a' 'n' 'A'}); % check-in multiple items at the same time

% continue until all variables are checked in. Note that you are only
% checking in the variables, they don't need to have been already defined

chkNames = fieldnames(s);    % the full list of variables to checkpoint
nNames = length(chkNames);   % number of variables in list

2) Before the end of the iteration loop, checkpoints once every **frequency** steps

if mod(iter, frequency) == 0
   iter    % prints iteration number
   chkpt   % performs checkpointing (save) every **frequency** iterations
end

3) Make a copy of your original m-file and modify it to handle checkpoint restarting
   It should contain this . . .

   matfile = 'chkpt.mat';          % the mat-file name with which data have been checked
   load(matfile);                  % load data saved previously with chkpt.m
   chkiter = s.iter;               % last saved iteration count
   chkNiter = s.Niter;             % total number of iterations
   for iter=chkiter+1, chkNiter    % start from the next step
      % computations in original program . . .
   end

4) Notes:
   A. For this version, the chkpt.m file is written as a script, rather than function, 
   m-file. This makes it simple to implement as everything needed to be checkpointed
   are readily available as it shares the same workspace with the main program.
   Users are advised to exercise care to ensure that variables to be saved have the 
   intended values at the time it is being checkpointed. Most importantly, the struct 
   array name, declared s in the attached example, is not used for other purposes.
   B. Especially if you have a lot of variables to check-in, you may find it
   appealing to incorporate the whole check-in commands into a SCRIPT m-file to avoid
   clutter.
   C. While this checkpointing tool is designed for restarting, you may find it useful
   for other purposes; for instance for the usual output handling.