Hello, Population Health Science Member. Welcome to PHS!
The Data Core has created this wiki to help you navigate accessing and using the various PHS datasets. We are excited to see so many users interested in our data, and we hope this small toolkit helps you get started. The best way to explore the PHS datasets is through the PHS Data Portal (powered by Redivis). This is the place to apply for data access, explore the datasets, and make your first cut of the data (i.e., create a smaller dataset containing only the information you need to complete your analysis). Once you have created your custom-made dataset, you can download it to one of our server environments for further statistical analysis.
- PHS is a community of users – We all share space and agree to fair server use. The details below will help ensure an efficient and productive research environment, but for now, please (a) monitor your storage on the Data Portal and on the server, and (b) don’t forget to log out of the server when you are finished (see your server User Guide for instructions).
- Data Access, Not Data Download – Data and documentation must stay securely stored on PHS-approved machines. At this time, you are permitted to transfer aggregated data tables only from PHS-Win or PHS-Linux to your own computer. Please revisit the Data Use Agreement (DUA) you signed and submitted to PHS for more information.
The Stanford Research Computing Center (SRCC) has set up two different server environments for PHS data: Windows and Linux. Depending on the type of data that you have applied for, you may be directed to one or the other. But in general, choose the server environment you are most comfortable with.
You are sharing storage space and bandwidth with other PHS users, no matter which server environment you are using. This means that all the processes you run and data you store impact the available bandwidth and space for all other PHS users.
We hope the tips below will help everyone optimize the PHS resources and enable a respectful and productive community of researchers.
2. Access to user and data repositories
- Here you will find our guide on how to access PHS data. Once you have access, here is how to get started on the PHS Data Portal.
Duo Authentication – This app allows Stanford to confirm your identity and grant access to your applications and files. Download the app and check your phone whenever you access Stanford-hosted servers; your computer will not show any prompt or notification. Duo Authentication is also known as “two-factor authentication” or “two-step authentication”.
https://accounts.stanford.edu/ -> Manage -> Two-Step Auth
- Remote Desktop: When you receive access to the Windows server, you will need to install software to allow remote connection. Instructions for connecting to the Windows server can be found here.
- Datasets and Documentation: Specific “Read.Me” files and data documentation will be stored under the directory: S:\data. We strongly encourage new users to read the “Read.Me” documentation to understand how folders are organized for each dataset.
- Folder and File Access: Once you connect to the Windows environment, you will be allowed to access the specific dataset you applied for, as well as your specific user folder.
- Individual – This folder will be named after your SUNet ID and will be found under: S:\users\[SUNet-ID].
- Group – If you would like a group folder created, please email firstname.lastname@example.org with the name of your PI. All group folders are created and stored under: S:\group\[SUNet-ID]
- Remote connection: Upon receiving access to the Linux server, you will need to install software to allow for remote connection. Instructions for connecting to the Linux server can be found here.
- Navigating Linux Interface: If you have never worked on a Linux server before, the interface may seem a bit different than what you’re used to seeing on a regular Windows environment. Once you connect to Linux, you’ll see the following screen:
3. Folder and File Access: You will have access to (a) your user folder and (b) the specific dataset you applied for.
- Individual – As seen in the screenshot above, your individual folder will be named: [SUNet]’s home.
- Group – If you would like a group folder created, please email email@example.com with the name of your PI. All group folders are created and stored under: “Filesystem” > “mnt” > “phs” > “group”
PLEASE NOTE: Only system administrators are allowed to change permissions or create new Shared folders.
- AMC – If you have been granted access to the AMC data, you can simply click on the shortcut that says “AMC” on the desktop.
- Other – Click on “Computer” and then click on “Filesystem” > “mnt” > “isilon” to select other datasets.
PLEASE NOTE: Once you hit your quota limit, you will be unable to access the server, even to delete files. You will have to contact an administrator to kill any jobs you are currently running.
The parameters for PHS data storage were developed to enable optimal performance for users.
- Individuals: Your storage defaults to a hard limit of 10 GB, whether you’re working on Windows or Linux.
- Group: If your research involves at least two individuals, we recommend that you or your PI request that a group folder be created under S:\groups on Windows or /mnt/phs/group on Linux; your group’s storage then defaults to a hard limit of 100 GB.
As you proceed with coding, keep an eye on your storage space by right-clicking your individual user or group folder and selecting “Properties”.
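On the Linux server you can also check your usage from a terminal. A minimal sketch, assuming the standard GNU `du` and `sort` utilities are available (the group path shown in the comment is the one named earlier in this guide; substitute your own group’s folder name):

```shell
# Total size of your individual (home) folder
du -sh "$HOME"

# Largest items inside it, biggest first, so you know what to clean up
du -h --max-depth=1 "$HOME" | sort -rh | head

# For a group folder, point du at the group path instead, e.g.:
#   du -sh /mnt/phs/group/<group-name>
```

Running this regularly makes it much less likely that you hit your quota mid-analysis.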
4. Analyzing the data
As you probably already know, the process for gaining access to and using our datasets can be summarized as follows:
How to analyze the data – Start with a 1% sample on the Data Portal (Redivis)
Most PHS datasets are very large. If you have never worked with “big data” before, it’s a good idea to start by working off a 1% sample of the data; that is already plenty of records. Truven’s 1% sample, for example, contains records for over 1 million individuals. A 1% sample allows researchers to explore variables and analytic possibilities, and to refine code before working with the full datasets.
We ask that you start with a sample subset of the data in order to
- Select only the data you need (there is a limit of 1 GB per dataset that you create on the Data Portal and download to the server);
- Avoid hitting your storage quota and freezing access to your environment;
- Prevent saturation of resources on the server; and,
- Allow everyone to access the server and proceed with their work.
The samples are made available in SAS and Stata formats. We provide SAS sample code, which can be found under S:\data\Sample code.
PLEASE NOTE: when using the full dataset, you will have to do a rough selection of the data in SAS before you can transfer this rough cut into a language of your choice. Starting this summer, the full dataset will only be available on the Data Portal.
Run analyses on the full dataset
PLEASE NOTE: the full dataset will not be available on the server, but only on the Data Portal, after the summer.
Once you have perfected your code using the 1% sample, pre-select your cohort of interest using rough criteria such as diagnoses, procedure codes and enrollment criteria. This lets you subset a chunk of data of manageable size, which you can then convert into your preferred programming-language format using STAT Transfer, a software package available on our server. After that, you can reuse the code you developed on the 1% sample to further refine your cohort and, eventually, run your statistical analyses in the programming language of your choice.
- Pre-select your cohort using criteria identified during sample exploration. This reduces the dataset to a manageable size containing the relevant information.
- Use STAT Transfer to convert your data into your preferred programming language. STAT Transfer is available on all PHS servers.
- Use sample code from your 1% sample to further refine your cohort.
- Run your statistical analysis. All analysis should be completed on the PHS servers. Please do NOT download data to your personal computer.
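As a rough illustration of the pre-selection step, a SAS program might look like the sketch below. This is only a hypothetical example: the libref path, the table name (`claims`), the variable names (`enrolid`, `dx1`, `svcdate`) and the diagnosis codes are placeholders, not actual PHS dataset names — consult the SAS sample code provided by PHS for the real folder layout and variable names.

```sas
/* Hypothetical example: path, table and variable names are placeholders */
libname phs "S:\data\[dataset]";

/* Rough cut: keep only records matching broad diagnosis and enrollment
   criteria, and only the variables needed downstream */
data work.cohort;
    set phs.claims;
    where dx1 in ("I10", "I11") and enrolled = 1;
    keep enrolid dx1 svcdate;
run;
```

The resulting work.cohort table is the “manageable chunk” you would then convert with STAT Transfer.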
5. Extracting summarized data
PLEASE NOTE: As stated in the Data Use Agreement you signed and submitted to PHS, data and documentation must stay securely stored on PHS-approved machines. At this time, you are permitted to transfer aggregated data tables only from PHS-Win or PHS-Linux to your own computer.
6. Loading new packages (Stata & R)
If you need additional packages to run your code, send us an email with the name of the package, what it is for and, if possible, a link to where it can be downloaded. We will take care of it as soon as possible.
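Once a package is available (or if user-level installation is permitted for your account), loading it in Stata typically looks like the sketch below. This is a hedged example: `estout` is just an illustrative package, and the ado path shown is a hypothetical location following this guide’s folder convention, not a confirmed PHS setting.

```stata
* Illustrative only: install a package from SSC (if permitted on the server)
ssc install estout

* Or make a locally provided package visible by adding its folder to the ado path
adopath + "S:\users\[SUNet-ID]\ado"
```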
PLEASE NOTE: you can save package-loading commands in your profile.do so that you do not need to re-run them at the beginning of each Stata session. Otherwise, you may need to re-install the package(s) each time you open Stata.
7. Logging out
As mentioned above, both PHS Windows and Linux servers are shared environments. Please close any applications you are not actively using and logout of the server when finished with your work. Please refer to our Windows or Linux user guides to find out how to properly log out.
PLEASE NOTE: Disconnecting and logging out of the servers are two different things! Simply closing the Remote Desktop window does not mean that you have exited the Windows or Linux environment. Your session will stay active, running and using resources, until you log out.