HRP 223 - Data Management and Statistical Programming - 2013/2014 Edition

 

The materials on this course website are only for the use of students enrolled in this course for purposes associated with this course and may not be retained or further disseminated. The materials on this course website may be protected by copyright; any further use of this material may be in violation of federal copyright law.

 

The goal of the course is to provide hands on instruction in data management and analysis techniques.
Topics discussed include:

  1. Working with large databases - what makes a good database turn bad
  2. Data cleaning techniques
  3. Generating numerical and graphical presentations
  4. Descriptive statistics

Contact information

Professor

Teaching Assistant

Raymond R. Balise 
Redwood Bldg. T213D, MC 5092 
Stanford, California  94305-5405 

Balise at Stanford 
Voice (650) 724-2602 
Fax (650) 725-6951

Mike Hurley

 

 

 

Prerequisites

Admission to Health Research and Policy and a comfortable knowledge of a Windows XP/Vista/7/8.

Lectures                                                                                                             

Monday and Wednesday 11:30 -1:00 in LK208 starting September 23th.

Office Hours

By appointment in Redwood Building T213D. Directions can be found here: www.stanford.edu/~balise/FindBalise.htm

Newsgroup

There is a newsgroup form on the coursework site.

Readings

The Little SAS Book for Enterprise Guide 4.2.

Common Statistical Methods for Clinical Research with SAS Examples

 

Read about SQL vs. data step code here:

http://www2.sas.com/proceedings/sugi31/250-31.pdf

 

Read about numeric precision here:

http://support.sas.com/techsup/technote/ts654.pdf.

 

Grading

Grades will be based on four homework problem sets.   If you take the course for 2 units you must pass at least three of the four homework assignments and you must not violate the virus policy below.  If you take the course for 3 units you must pass all four assignments.  There will be many quick assignments that will not directly affect grades.

Turning in Homework and Viruses

All assignments and homework will be submitted via the coursework website. Any student that sends me or the TA a virus (or any other malicious code) will fail the course.  There will be no exceptions made.  Therefore, you are strongly advised to download the latest version of the Sophos Anti-Virus software. If you need virus protection check here http://www.stanford.edu/services/ess/ and you can download the software for free. If you have any questions ask!

Late policy

Each of the assignments will be due at the beginning of class on the day specified.

That said, there are unforeseen emergencies (illness, bike accidents, disk crashes, network troubles, childbirth, etc.). Instead of having to ask for special allowances on an individual basis, I give each of you the privilege of granting yourself a small extension in case of crisis. You will have two late days which you may use to extend the due dates of any assignments without penalty. To avoid any ambiguity, there are seven days in a week and each day ends at 5:00 PM. Thus, if your assignment was due on Wednesday but turned in the following Monday before 5:00, that assignment would be five days late. After the grace period is up each assignment is down weighted 20% per day.  In all cases, assignments will not be accepted more than one calendar week after the original assignment due date.

Computer Platforms

The programs that you turn in must run on Windows SAS 9.4 and/or Enterprise Guide 6.1.  I can provide good support for Windows or a Mac running parallels (http://www.parallels.com/).

Setting up and configuring SAS and SAS Enterprise Guide:

Pictures showing my SAS 9.4 TS1M1 and EG 6.1 install can be found here.

Notes on configuring SAS Enterprise Guide 6.1 will be found here.

Notes on using an IRT Server for SAS can be found here. Be sure to set up you EG using the notes here.

 

HRP 223 the movie!

You can see the class in color by clicking here.

Core Lecture Material

Note: All dates are approximate and subject to change except HW 4.

Preliminaries of Computing at Stanford and an Introduction to SAS Enterprise Guide (Sept 23st)

Reading: The tutorials A, B, and C in TLSB for EG.

 

Software somebody should have told you about a long time ago

Essential Stanford Software (free stuff)

Tools of the trade

Excel

REDCap

SAS or SAS/Enterprise Guide

R and R Commander

Deducer

Other software I use (not officially endorsed by anybody and ask your IT security person for help)

CrashPlan

Notepad++

UltraEdit

UltraCompare

FileLocator Pro

Using SAS Enterprise Guide as a calculator

 

The PowerPoint slides are here.

Assignment 0 is here. It should be done before class on September 25th.

Data and Data Collection (Sept 25th ֠- Sept 30th)

Using your UNIX space

How

What is a database?

Critical registry teaks for Excel

How to organize data in Excel

            Variable names

            Dummy records

Using Excel

            Making tables

            Validation

            Formulas

            Quick counts

Collecting data in general

            How to score and store answers

            The value/danger of redundancy

            Using REDCap

                        What is it?

                        How to set it up.

 

The PowerPoint slides are here.

The sample .htaccess file is here. (To save the file you may need to right click on the hyperlink and choose something like save Target as or save Link as …)

An Enterprise Guide project to easily load data from REDCap into SAS 9.3 is here or into 9.4 is here.

 Assignment 1 is here . It should be done before class on Sept 30.

Assignment 2 is here. It should be done before class on Oct 2rd.

Working with Data in Windows (Oct 2rd - Oct 7th)

  Reading: Chapters 1, 2 and 4 in TLSB for EG. Skim chapter 3.

                       

How SAS works in Windows

Types of Files / Suffixes

Libraries vs Folders

            Loading data from Excel

            Simple subsets

            Creating new variables

            Converting character variables to numeric with the input function

            Basic numeric summaries

Making better summary tables

 

The PowerPoint slides are here.

An Excel 2007 spreadsheet showing mixed type issues is here and the EG project is here.

An Excel 2007 spreadsheet with fake drowning data is here and the EG project is here.

 

Homework 1 is here and is due Oct 16th.

Windows and EG configuration (Oct 9th)

            Setting up Windows

                        File extensions

                        Folders

                        Office

                        DOS   

            Configuring SAS Enterprise Guide

                        Application options

                        Keyboard macros

            Using SAS

                        Libraries

                        Subsets

                        Toy data sets

                        Loading real data sets

                        Simple descriptive statistics and graphics

           

The PowerPoint slides are here.

Beginning Data Manipulation (Oct 9th-Oct 16th)

Reading: TLSB for EG 4.2 chapter 4.

 

Procedures vs Functions

Code for data step

            Labels

            Formats

Code for SQL

            Labels

            Formats

GUI for SQL

Process flow sheets and project organization

Autoexec flow sheets

Creating Formats with the GUI

Data step if-else

            Data step select-when

            SQL case-when-then

            Showing combinations

 

The slides are here and the completed project is here.  (Open the project after the class.) 

Inventing and tweaking data (Oct 16th)

Reading: TLSB for EG 4.2 chapters 3, 4 and 9.

 

Creating data

Recoding values to make data

Project organization

Fixing bad values

            Using formats

 

The PowerPoint slides here.

An Excel spreadsheet is here and an EG project is here.

Assignment 3 is here and it will be discussed Oct 28th.

Homework 2 is here and the solution is here.

Merging with SQL (Oct 28th)

Reading: Tutorial D TLSB for EG 4.2 chapters 4 and 6.

 

SQL joins

            Inner joins

            Left joins

The coalesce() and coalescec() functions

 

The slides are here.

An Enterprise Guide project that uses SQL to merge files is here.

Assignment 4 is here.

Lab 2 (Oct 30th)

Reading: Tutorial D TLSB for EG 4.2 and chapters 4 and 6.

           

Joins

            Summary functions with SQL

            Fixing bad values

                        By using functions like left

                        By recoding

Finding discrepancies when matches fail

 

The tasks for today are here.

More SQL and querying voodoo is covered in a project here.

This assignment uses information in an Excel workbook that can be found here.

The partially complete project is here.

Combine with SQL (Oct 30th)

Reading: Tutorial D TLSB for EG 4.2 and chapter 4 and 6.

 

            SQL to select subsets of records from multiple tables

                        Outer union and/or outer union corresponding concatenate

Union unique rows from both tables

Except rows that are part of the first query

Intersect rows common to both queries

 

The slides are here.

The Enterprise Guide 5.1 file for today is here and the 6.1 version is here.

Macro Language (Oct 30th / Nov 4th)

EG prompts

macro variables

            %macro %mend

            Examples of macros

                        Dot/bar plot

                        Quantile plot

                        Calculating sensitivity, specificity, PPV, NPV

                        Binomial probabilities

                        Checking survival data

             

The slides are here.

An Enterprise Guide project showing macros is here.

Homework 3 is here and the solution is here.

Idiosyncratic SAS stuff like formats and Grouped Data (Nov 6th)

            proc format library = blah

options fmtsearch = (blah work);

proc cport memtype = all;

proc cimport

options nofmterr;

By processing

            Retain statements

            Finding the labs closest to an event

           

The slides are here.

An Enterprise Guide project showing working with grouped data is here.  

Assignment 5 is here.

Database issues (Nov 11th)

            Queries and corrections when data is not normalized

            Normalizing data with arrays and loops in a data step

 

The slides are here.

An Enterprise Guide project is here for EG 5.1 or here for EG 6.1.

A dataset is here.

 

Output Delivery System (Nov 13th)

            Limiting out output from SAS

How SAS creates different types of output

Writing output to different file types

SAS output table names

 

The slides are here.

An Enterprise Guide project is here.

My SAS header file is here.

Graphics Data (Nov 18th, Nov 20th, and Dec 2nd)

Reading: TLSB for EG chapter 10

 

What makes a bad graphic bad

            The gurus of data visualization

            A good graphic

                        Sgplot vbox inset

                        styleattrs datacolors=(black) datacontrastcolors=(grey grey grey red) datasymbols=(circle circle circle circlefilled);

            Visualizing categorical data

                        Bar charts

                        Dot plot

            Visualizing continuous data

                        Histograms

                        Box plos

                        Violin plot

            How to get image formats for publication

            Working with the SAS ODS

            ODS Statistical Graphics

            SAS Graphics Editor

                        ods listing sge = on;

            Setting up R

                        Tinn-R

                        PERL 

            High quality SAS graphics

                        Sgplot hbox / category =

                        Sgplot scatter

                        Sgplot reg / group =

                        Sgplot histogram

                        Sgpanel panelby histogram

                        Sgpanel panelby scatter

            Lattice plots in SAS

            Ods graphics on

            Selecting a subset of ODS graphics

            ODS statistical graphics with many elements

            Truly custom ODS statistical graphics 9.3

                        proc sgplot data = plot tmplout="c:\blah\template.sas";

                        ods path (prepend) work.template (update);

            Custom appearance/style template:

                        proc template;

                                    define style sexE;

                                    parent = styles.Statistical;

                                                style graphdata1 / contrastColor=pink markersymbol = "star";

                                                style graphdata2 / contrastColor=blue markersymbol = "plus";

                                    end;

run;

ods listing style = sexE;

            Making big pictures:

goptions hsize=11in vsize=8.5in;

ods graphics / width=11in height=8.5in;

 

                       

The slides are here. (Do not print the entire set)

Enterprise Guide project is here.

 

The slides are here for the 20th (Do not print the entire set)

Enterprise Guide project is here for the 20th.

 

The slides are here for the 2nd.

Enterprise Guide project is here for the 2nd.

 

 

SAS code showing ods graphics can be found here.

SAS code showing setting colors in graphics with style templates is here.

R code for common R graphics is here along with data sets here and here.

R code for categorical data is here.

Final Summary (Dec 4th)

The slides are here.

The example program that shows the getLabel macro is here.

Homework 4 is here and is due before 11:31 December 13, 2013.

Other stuff

SAS 2013 keyboard macros can be found here.

A set of useful links can be found here.

Other Books

(You can get many of these from the Stanford library but preview them on Amazon or books.google.com)

A few of my old favorite books are listed here

My favorite scientific visualization books are here.