Getting started

 

1. Introduction

2. Access to user and data repositories

3. Storage

4. Analyzing the data

5. Extracting summarized data

6. Loading new packages (Stata & R)

7. Logging out

 

 

1. Introduction

 

Hello, Population Health Science Member. Welcome to PHS! 

The Data Core has created this wiki to help you navigate accessing and using our various PHS datasets. We are excited to see so many users interested in our data and we hope to provide you with a small toolkit to get started.

 

PLEASE NOTE:

  • PHS is a community of users – We all share space and agree to fair server use. The details below will help ensure an efficient and productive research environment; for now, please (a) monitor your storage, and (b) log out when you finish your work (see your server User Guide for instructions).
  • Data Access, Not Data Download – Data and documentation must stay securely stored on PHS-approved machines. At this time, you are permitted to transfer aggregated data tables only from PHS-Win or PHS-Linux to your own computer. Please revisit the Data Use Agreement (DUA) you signed and submitted to PHS for more information.

 

The Stanford Research Computing Center (SRCC) has set up two different server environments for PHS data: Windows and Linux. Depending on the type of data that you have applied for, you may be directed to one or the other. But in general, choose the server environment you are most comfortable with. 

You are sharing storage space and bandwidth with other PHS users, no matter which server environment you are using. This means that all the processes you run and data you store affect the bandwidth and space available to every other PHS user.

We hope the tips below will help everyone optimize the PHS resources and enable a respectful and productive community of researchers.

 

2. Access to user and data repositories

 

  • Please visit our page phsdata.stanford.edu and locate your dataset of interest by navigating to the dataset tab
  • Click on the tile that contains the dataset you want to use
  • Click on the tab that says “Access Data”: you will see the list of trainings and certifications you need to complete in order to use our data. Alternatively, click the “Use Data” button and select “View Access Requirements”

 

Duo Authentication – This app allows Stanford to confirm your identity and open access to your applications and files. Download the app and check your phone whenever you access Stanford-hosted servers; the prompt appears on your phone, and you will not get any warnings/notifications on your computer. Duo Authentication is also known as “two-factor authentication” or “two-step authentication”.

https://accounts.stanford.edu/ -> Manage -> Two-Step Auth

 

Windows 

  1. Remote Desktop: When you receive access to the Windows server, you will need to install software to allow remote connection. Instructions for connecting to the Windows server can be found here.
  2. Datasets and Documentation: Dataset-specific “Read.Me” files and data documentation are stored under the directory S:\data. We strongly encourage new users to read the “Read.Me” documentation to understand how folders are organized for each dataset.
  3. Folder and File Access: Once you connect to the Windows environment, you will be allowed to access the specific dataset you applied for, as well as your specific user folder.
  • Folders:
    • Individual – This folder will be named after your SUNet and will be found under: S:\users\[SUNet-ID].
    • Group – If you would like a group folder created, please email phs-computing@stanford.edu with the name of your PI. All group folders are created and stored under: S:\group\[SUNet-ID]

Linux

  1. Remote connection: Upon receiving access to the Linux server, you will need to install software to allow for remote connection. Instructions for connecting to the Linux server can be found here.
  2. Navigating Linux Interface: If you have never worked on a Linux server before, the interface may seem a bit different than what you’re used to seeing on a regular Windows environment. Once you connect to Linux, you’ll see the following screen:  

 

 

 

  3. Folder and File Access: You will have access to (a) your user folder and (b) the specific dataset you applied for.

  • Folders:
    • Individual – As seen in the screenshot above, your individual folder will be named: [SUNet]’s home.
    • Group – If you would like a group folder created, please email phs-computing@stanford.edu with the name of your PI. All group folders are created and stored under: “Filesystem” > “mnt” > “phs” > “group”

 

PLEASE NOTE: Only system administrators are allowed to change permissions or create new Shared folders.

 

  • Files
    • Alcoa – If you have been granted access to the Alcoa data, you can simply click on the shortcut that says “alcoa” on the desktop.
    • Other – Click on “Computer” and then click on “Filesystem” > “mnt” > “isilon” to select other datasets.

 


 

3. Storage

 

PLEASE NOTE: Once you hit your quota limit, you will be unable to access the server or delete any files. You will have to contact an administrator to kill any jobs you are currently running.

 

Quotas

The parameters for PHS data storage were developed to enable optimal performance for users.

  1. Individuals: Your storage defaults to a hard limit of 10GB, whether you’re working on Windows or Linux.
  2. Groups: If your research involves at least two individuals, we recommend that you or your PI request a group folder, created under S:\groups on Windows or /mnt/phs/group on Linux; your group’s storage then defaults to a hard limit of 100GB.

 

Monitoring

As you proceed with coding, keep an eye on your storage space by right-clicking on your individual user or group folder, and selecting “Properties”:

Windows:

Linux:
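On PHS-Linux you can also check usage from a terminal instead of the file manager. A minimal sketch using standard commands; `$HOME` is shown as a placeholder, so substitute your own user or group folder path:

```shell
# Show how much space a folder is using (replace "$HOME" with your user
# or group folder, e.g. a path under /mnt/phs/group)
du -sh "$HOME"

# Show total, used, and free space on the filesystem that holds the folder
df -h "$HOME"
```

Running these occasionally while writing intermediate files is an easy way to stay under your quota.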


 

4. Analyzing the data

 

As you probably already know, the process for gaining access to and using our datasets can be summarized as follows:

 

 

How to analyze the data – Start with a 1% sample

Most PHS datasets are very large. If you have never worked with “big data” before, it’s a good idea to start by working off a 1% sample of the data. That’s already plenty of records: Truven’s 1% sample, for example, contains records for over 1 million individuals. A 1% sample lets researchers explore variables and analytic possibilities, and refine code, before working with the full datasets.

We ask that you start with a sample subset of the data in order to:

  • Select only the data you need;
  • Avoid hitting your storage quota and freezing access to your environment;
  • Prevent saturation of resources on the server; and,
  • Allow everyone to access the server and proceed with their work.

 

The samples are made available in SAS, Stata [and R]. We hope this will let you get used to viewing and describing the data in your preferred programming language. We provide SAS sample code under S:/data/Sample code.

PLEASE NOTE: when using the full dataset, you will have to do a rough selection of the data in SAS before you can transfer this rough cut into a language of your choice.

 

 

 

Run analyses on the full dataset

Once you have perfected your code using the 1% sample, pre-select your cohort of interest using rough diagnosis, procedure-code, and enrollment criteria. This lets you subset a chunk of data of manageable size, which you can then convert into your preferred programming language’s format using STAT Transfer, a program available on our server. After that, reuse the code you developed on the 1% sample to further refine your cohort and, eventually, run your statistical analyses in the language of your choice.

 

 

      1. Pre-select your cohort using criteria identified during sample exploration. This reduces the dataset to a manageable size with the relevant information.
      2. Use STAT Transfer to convert your data into your preferred programming language’s format. STAT Transfer is available on all PHS servers.
      3. Use the sample code from your 1% sample to further refine your cohort.
      4. Run your statistical analysis. All analysis should be completed on the PHS servers. Please do NOT download data to your personal computer.

 


 

5. Extracting summarized data

 

PLEASE NOTE: As stated in the Data Use Agreement you signed and submitted to PHS, data and documentation must stay securely stored on PHS-approved machines. At this time, you are permitted to transfer aggregated data tables only from PHS-Win or PHS-Linux to your own computer. 

 

6. Loading new packages (Stata & R)

 

If you need additional packages to run your code, send us an email with the name of the package, what it is for, and, if possible, a link to where it can be downloaded. We will take care of it as soon as possible.

 

PLEASE NOTE: you can save your package-loading commands in your profile.do so that you do not need to re-run them at the beginning of each Stata session. Otherwise, you may need to re-install the package(s) each time you open Stata.
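As a sketch of one way to set this up on PHS-Linux: Stata runs a profile.do it finds at startup, so you can append your setup commands to one in your home directory from a terminal. The ado directory and the `sysdir` line below are illustrative assumptions, not an official PHS configuration:

```shell
# Append a Stata startup command to profile.do (Stata looks for profile.do
# at startup; the ado directory below is a placeholder)
cat >> "$HOME/profile.do" <<'EOF'
* run automatically when Stata starts: point the PLUS ado directory
* (where user-installed packages live) at a folder you control
sysdir set PLUS "~/ado/plus"
EOF
```

Anything you would otherwise type at the start of every Stata session can live in this file.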


 

7. Logging out

 

As mentioned above, both the PHS Windows and Linux servers are shared environments. Please close any applications you are not actively using and log out of the server when you are finished with your work. Please refer to our Windows or Linux user guides to find out how to properly log out.

 

PLEASE NOTE: Disconnecting and logging out of the servers are two different things! Simply closing the Remote Desktop window does not mean that you have exited the Windows or Linux environment. Your session will stay active, running and consuming resources, until you log out.
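If you are unsure whether a disconnected session is still running under your account, a quick check from a PHS-Linux terminal (a sketch using standard tools):

```shell
# List processes still running under your account, with elapsed time; a
# disconnected-but-active session will keep appearing here until you log out
ps -u "$(id -un)" -o pid,etime,comm
```

If long-running jobs show up here after you thought you had finished, log back in and exit them properly before logging out.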

 
