Robust workflow for replicability and reliability with Eclipse + StatET

SPSS and Office are easy to learn but lack the power and extensibility of R, SQL, LaTeX/Sweave, BibTeX, and Subversion (SVN).  These open-source technologies were developed based on peer-reviewed code and are designed to facilitate replicability, but they do not just work 'out of the box.'  This page explains how to configure these technologies and set up a workflow for data management, analysis, and word-processing/typesetting with them.  Though I pay special attention to Mac OS, PC/Linux users should be able to implement everything below.

The steps are roughly as follows:
1. Install R.
2. Install LaTeX (long download).
3. Install Eclipse.
4. Install Eclipse plug-in StatET and configure.
5. Configure Sweave and pgfSweave.
6. Install Subversion (SVN) and Subclipse.
7. Install Zotero and configure it to make it play well with BibTeX.

This should take about an hour or so, depending on your Internet connection, distractions, level of familiarity with these kinds of tools, etc.

R for Data Analysis

First, install R. If you don't already know R, learn it as soon as you can.  There are some excellent links to introductory videos and texts on the panel to the right.

Install R from this site:
http://cran.r-project.org/

One common complaint about R is that it is RAM-hungry.  This is true, but you can easily pair R with a SQL database such as SQLite (via the SQLiteDF package) to augment its data processing capabilities.  Another solid option is the R package bigmemory, which is designed to help users analyze datasets larger than 10 gigabytes.  See the bigmemory vignette for details.
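
As a minimal sketch of the database approach, the snippet below uses the DBI and RSQLite packages to push a data frame into SQLite and pull back only an aggregated result.  Note that DBI/RSQLite are my substitution for illustration, not the SQLiteDF package mentioned above, and the table name and query are made up for the example.

```r
# Assumes the DBI and RSQLite packages are installed:
# install.packages(c("DBI", "RSQLite"))
library(DBI)

# An in-memory database keeps the sketch self-contained; pass a file path
# instead of ":memory:" to keep the data on disk rather than in RAM.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data frame into a table (hypothetical table name).
dbWriteTable(con, "cars_table", mtcars)

# Let SQLite do the aggregation; only the small result comes back into R.
mpg_by_cyl <- dbGetQuery(
  con,
  "SELECT cyl, AVG(mpg) AS mean_mpg FROM cars_table GROUP BY cyl ORDER BY cyl"
)
print(mpg_by_cyl)

dbDisconnect(con)
```

With a real project you would point dbConnect() at a database file, so the full dataset never has to fit in R's memory at once.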

If you do large-scale text analysis, check out the tm package, which can automatically store term-document matrices in a database and offers parallelization via Rmpi.  It also plays well with topicmodels, which implements Latent Dirichlet Allocation (LDA) and correlated topic models (CTM).
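
To give a feel for tm, here is a small sketch that builds a document-term matrix from three toy sentences.  The sentences are invented for illustration, and this runs everything in memory on one core; it does not show the database or Rmpi features.

```r
# Assumes the tm package is installed: install.packages("tm")
library(tm)

# Three toy documents (made up for this example).
docs <- c(
  "The campaign released a new attack ad.",
  "Voters responded to the attack ad online.",
  "The senator gave a speech about the economy."
)

# Build a corpus and apply some common cleaning steps.
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Rows are documents, columns are terms, cells are term counts.
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```

A document-term matrix like this is the kind of input that topicmodels' LDA() function expects, so this is typically the step just before fitting a topic model.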

LaTeX (pronounced 'Lay-Tek') for beautiful typesetting

Get a TeX distribution from one of the following sites:
Mac: http://www.tug.org/mactex/
Windows: http://miktex.org/

Eclipse, the editor (for just about everything)

Eclipse is a cross-platform open-source editor based on Java and originally developed by IBM in 2001.  It provides an integrated development environment (IDE), meaning that it provides a source code editor, a compiler or (more critically for our purposes) an interpreter, build automation tools, and a debugger (though debugging must still be done in R and LaTeX via the command line).  Eclipse also features an extensible plug-in system.  Among other things, it was designed with visual programming in mind, meaning that it assists programming tasks by representing code elements visually, and sometimes allows programmers to manipulate program elements graphically instead of merely via text (code).

Eclipse is one of the most widely used editors/IDEs in existence.  One survey suggests that Eclipse is the third most-used IDE (behind MS-Visual Studio and Adobe Macromedia Studio).  It is easily the most widely used open source IDE, and some predict that because of its rapid growth and development it will eventually rise to the number-one spot.  What this means is that the codebase is robust and stable.

Why not just use the default R editor?  Well, in addition to Sweave functionality, Eclipse provides a connection to R with shortcut keys, R & Sweave syntax highlighting, hover functionality, an outstanding graphical object browser, and an outline that links to new objects and functions declared in your code.  It almost never crashes, so when (not if) R crashes, you don't lose your R code.  It also has a find-and-replace function with RegEx support, toggle code commenting (command-shift-C or control-shift-C), content assist, and other very nice features.  Though it is a resource-intensive IDE, it is easier to learn than Emacs or Aquamacs.

Desktop

Install the latest version of Eclipse, which we will also use as an R/LaTeX/Sweave editor:
http://www.eclipse.org/downloads/
(You can download any version; I use the Eclipse IDE for Java Developers.)

Also, you may wish to edit the configuration file to speed up Eclipse.

"Local history": Robust local version control within Eclipse

One thing I recommend immediately upon downloading Eclipse is to set its local history size to unlimited. This means that Eclipse tracks your changes to each file in your project, each time you save it.  Go to Preferences => General => Workspace => Local History and un-check "Limit history size."

If you need to compare your current document to a previous version, simply right-click (or Ctrl-click) on the file and select "Compare with" => "Local history..."  Each saved revision of the document will then appear in a side window.  Double-click on any revision to see the differences between it and the current document.  This works great when you accidentally save a file over the one you were previously using (e.g., your dissertation or journal manuscript), because you can simply use compare with local history to restore all of your previous hard work.

Also, if you deleted a file by accident, you can right click on the project folder and select "Restore from Local History..." and restore your deleted file. 

The StatET plugin for Eclipse

You'll want to install the StatET plugin for R:
http://www.walware.de/goto/statet

Be sure to install the rJava and rj packages for R:
From R, type

install.packages("rJava")
install.packages("rj", repos="http://download.walware.de/rj-0.5")

Install StatET from Eclipse, by selecting Help ==> Install New Software...  and pasting the appropriate url into the "work with:" prompt:

http://download.walware.de/eclipse-3.7 [or your version of Eclipse]

Once you install StatET, you'll want to run the cheatsheets to optimize your Eclipse configuration. From Eclipse, go to Help => Cheat Sheets... and click on the StatET folder.  Run the cheat sheets in order.

Luke Miller has a nice step-by-step walkthrough of this setup on his blog.

Longhow Lam has a short book on the StatET plug-in for Eclipse here.

Basics and Shortcuts

Now that you've run through all of the cheat sheets, StatET should be the default perspective and the R console should run automatically.  You might want to start by making a new R project.  If you want to import an existing project, just make a new project with the same folder name, and copy the folder into your workspace. 

You can send R code that you write to the R console by pressing 'command+R, command+R' - that's command+R twice. 

A full list of shortcut keys is available by pressing 'command+shift+L, command+shift+L'.  I recommend changing the assignment (' <- ') shortcut to 'command + shift + , ' or something similar and changing the add docu comment (' ## ') shortcut to something like 'command + / '.  [The default shortcut combinations seem not to map to any existing keys on my keyboard...].

One more useful shortcut key: content assist: 'ctrl+SPACE'
 

Writing LaTeX documents in Eclipse with Texlipse

StatET also installs Texlipse, which is a rather nice way to compose LaTeX documents, especially if you are collaborating with others.  You'll need to set up a LaTeX project first, which can be done via File, New....  Each time you save the .tex file in the project, Eclipse will compile the LaTeX file for you.  If you want to change the name of the output file, right-click on the project folder, select Properties, then select Latex Project Properties from the menu on the left.

Once you have a LaTeX project going, you'll want to tweak some of the settings to get the most out of Eclipse. Here's how:

Let's take care of spelling first. 
ENABLE SPELL CHECKING for LaTeX:
Open Eclipse Preferences, select General, Editors, Text Editors, Spelling, and make sure spelling is enabled.  You'll also need to specify a dictionary.  I use this one:

Dictionary.txt

You can just download this dictionary into some directory (e.g., the main Eclipse directory), then point to it in the "User defined dictionary" dialogue box.

Now from the Preferences window, go to Texlipse, Spell Checker, and make sure built-in spell-checker is selected.

You can now press 'Command + 1' (or 'Ctrl+1' on a PC) on a word with a squiggly red line under it to pull up a menu to either correct the word or add it to your dictionary.

Now LINE-WRAPPING.  You can either hard-wrap or soft-wrap your lines.  For hard-wrapping, which I recommend if you are collaborating with others (it will be easier to see edits when you are comparing changes in svn), from Preferences go to Texlipse, Editor and select 'Use hard wrapping.'  Eclipse will insert a return character when you reach the number of characters you specify in the 'Number of characters in line (10-1000)' dialogue box.

If you edit something and the line wrapping gets screwed up, just press Esc Q or select Latex > Correct Line Wrap, and Eclipse will make everything pretty for you again.

For soft-wrapping, I recommend the Ahtik plug in.  In Eclipse, go to help ==> Install New Software... and use the following as the url in the "work with:" prompt:

http://ahtik.com/eclipse-update/

Easy TABLE CREATION is facilitated by Eclipse's 'LaTeX Table View.'  Make sure it is visible first.  Click on Window, Open Perspective, and select Other.  Select 'LaTeX' from the menu and hit OK.  If you do not see LaTeX Table View on some tab within Eclipse, go to Window, Show View, and select LaTeX Table View.  You can now create tables in this spreadsheet-like editor, right-click and select 'Export to clipboard,' and it will format your table with &'s and nice spacing.  If you want to edit a table you've produced previously or in R (say via xtable()), right-click on the LaTeX Table View editor and select 'Import selected lines from editor.'  When you are done, just export to clipboard as previously described.
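
If you have not used xtable() before, here is a minimal sketch of producing a LaTeX table from R that you could then paste into your document or pull into the Table View.  The regression model and caption are placeholders of my own, and the booktabs = TRUE option assumes your document loads the booktabs package.

```r
# Assumes the xtable package is installed: install.packages("xtable")
library(xtable)

# A throwaway regression on a built-in dataset, purely for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Convert the coefficient table to an xtable object.
tab <- xtable(fit, caption = "Illustrative regression table", digits = 3)

# Capture the LaTeX source as a character vector, then show it.
# booktabs = TRUE uses \toprule/\midrule/\bottomrule instead of \hline.
latex_lines <- capture.output(print(tab, booktabs = TRUE))
cat(latex_lines, sep = "\n")
```

The printed lines are ordinary LaTeX, so they can be edited by hand or selected in the editor and imported into the Table View as described above.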

Sweave & pgfSweave

Sweave lets you do your data analysis and writing all in the same place.  I prefer pgfSweave, which caches your analysis and graphics, so that you are not re-running that 30-minute Bayesian estimation procedure each time you want to make a small cosmetic change to your document and see the results in PDF.  It also matches the font in R graphics to whatever you are working with in your document, and provides R syntax highlighting for any R code in your document - see the pgfSweave vignette here for details.  First install pgfSweave in the R terminal:

install.packages("pgfSweave")

To get pgfSweave working in Eclipse, go to Run => External Tools ==> External Tools Configurations, and double-click on 'Sweave Document Processing'.  Where it says 'Run command in active R Console:' replace the existing Sweave text with the following:

library(pgfSweave)
pgfSweave(file = "${resource_loc:${source_file_path}}")

Now is also a good time to make sure that your Sweave/pgfSweave build tools are set up correctly, by navigating to the LaTeX tab and clicking on the blue underlined link that reads "Setup build tools..."  If the various tools have directories listed, you're golden.  If not, type the location of your TeX distribution where it reads "Bin directory of TeX distribution:" which on a Mac should be as follows:

/usr/local/texlive/[current tex year]/bin/universal-darwin/

[e.g.] /usr/local/texlive/2010/bin/universal-darwin/

You may also need to add this directory to your system path (I had to).  To do so (on OS 10.6 Snow Leopard) open the terminal and type the following:

sudo pico /etc/paths

You can add this line via the pico editor. 
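
A quick way to check whether the path change took effect is to ask R itself where it finds the TeX tools.  This is just a sketch using base R; the tool names are the usual ones, but your distribution may differ.

```r
# Look up the TeX executables on the PATH that this R session sees.
# An empty string means R could not find that tool.
tex_tools <- Sys.which(c("pdflatex", "bibtex"))
print(tex_tools)

# List any PATH entries that look like a TeX directory.
path_dirs <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
print(path_dirs[grepl("tex", path_dirs, ignore.case = TRUE)])
```

Run this in the R console inside Eclipse rather than in a separate terminal, since the two may not see the same PATH.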

You may also want to tell Mac's Finder to display all files. To do so enter the following in a terminal window:

defaults write com.apple.Finder AppleShowAllFiles YES

You have to restart Finder for it to work. Hold option, click and hold the Finder icon, and select Relaunch. (Thanks to Sean Westwood for this hint).

To use pgfSweave, you'll want to add the following lines to any Sweave (.Rnw) document:

\usepackage{Sweave}
\usepackage{tikz}
\usepackage{pgf}
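
If you have never seen a Sweave document, the sketch below writes a tiny .Rnw file and weaves it from R.  To keep it dependency-free it uses base R's Sweave() rather than pgfSweave, and it works in a temporary directory; the file name and chunk label are made up for the example.

```r
# A tiny .Rnw document: LaTeX text, one inline \Sexpr, and one R code chunk.
rnw_lines <- c(
  "\\documentclass{article}",
  "\\usepackage{Sweave}",
  "\\begin{document}",
  "The mean speed in the cars data is \\Sexpr{mean(cars$speed)}.",
  "<<speed-summary, echo=TRUE>>=",
  "summary(cars$speed)",
  "@",
  "\\end{document}"
)

# Work in a scratch directory so nothing in your project is touched.
work_dir <- tempfile("sweave-demo-")
dir.create(work_dir)
old_wd <- setwd(work_dir)

writeLines(rnw_lines, "minimal.Rnw")

# Weave: runs the R code and writes minimal.tex next to the .Rnw file.
Sweave("minimal.Rnw")
tex_lines <- readLines("minimal.tex")

# If LaTeX is installed and on your path, this would build the PDF:
# tools::texi2pdf("minimal.tex")

setwd(old_wd)
cat(tex_lines, sep = "\n")
```

In the resulting .tex file the \Sexpr call has been replaced by its value and the chunk has become Sinput/Soutput environments, which is why LaTeX needs to be able to find Sweave.sty.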

THE SWEAVE.STY ISSUE - the most frustrating Sweave issue when you are just starting out - Sweave/LaTeX for no good reason cannot find the 'Sweave.sty' LaTeX package (the file is in your ~/R/R-x.x.x/share/texmf directory).  Do yourself a favor and just download this copy of Sweave.sty and put it in your TeX distribution folder or the current directory you're working in:

Sweave.sty
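
Alternatively, you can ask R where its own copy lives and copy that one.  This is only a sketch: it searches R's share directory rather than assuming an exact layout, and project_dir is a placeholder (a temporary directory here) that you would replace with your own project folder.

```r
# Search R's share directory for Sweave.sty instead of hard-coding the path.
sty_candidates <- list.files(
  R.home("share"),
  pattern = "^Sweave\\.sty$",
  recursive = TRUE,
  full.names = TRUE
)
print(sty_candidates)

# Placeholder destination: replace tempdir() with your project directory.
project_dir <- tempdir()

# Copy the first match next to your .Rnw/.tex files, if one was found.
if (length(sty_candidates) > 0) {
  copied <- file.copy(
    sty_candidates[1],
    file.path(project_dir, "Sweave.sty"),
    overwrite = TRUE
  )
  print(copied)
}
```

Copying R's own file has the small advantage that the style file matches the version of R you are actually running.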

You can also use R itself to call LaTeX such that you use R to generate your pdf directly.  You can do this via Run => External Tools ==> External Tools Configurations, and double-click on 'Sweave Document Processing', click on the LaTeX tab, then select the option for "Build tex file using the R command:".  The R command should be filled in for you. 

ENABLE SPELL CHECKING in Sweave/pgfSweave:
Open Eclipse Preferences, select StatET, Source Editors, Sweave Editor and click on "Enable spell checking."  Where it says "Note: On the Spelling preference page..." click on Spelling.  This will take you to the main spelling preferences page. You'll need to specify a dictionary. I use this dictionary:

Dictionary.txt

You can just download this dictionary into some directory (e.g., the main Eclipse directory), then point to it in the spelling preferences page.

ENABLE LINE WRAPPING in Sweave/pgfSweave:
For now the easiest way to enable line wrapping is by installing the Ahtik plug in.  In Eclipse, go to help ==> Install New Software... and use the following as the url in the "work with:" prompt:

http://ahtik.com/eclipse-update/

Examples - Sweave & pgfSweave

Often the best way to learn something like this is by looking at examples.  Below I've posted example code for a few projects in Sweave and pgfSweave:

APSAPoster5.Rnw - (Bias in the Flesh Poster (PDF), presented at the American Political Science Association 2009).

BIAS-AMP-ICA5.Rnw - (Bias in the Flesh writeup (PDF), presented at the International Communication Association 2010, DO NOT CITE THE .RNW FILE).

pgfSweaveEXAMPLE.Rnw - (prelim phase of an NLP project (PDF), DO NOT CITE).

Subclipse (easy to use Subversion interface in Eclipse)

Subversion is version control software, great for collaborating on large projects.  In short, it keeps track of every change made to a series of files and is generally an excellent way to prevent data loss and track changes between documents. 

First you need Subversion 1.6 with Java bindings if you want to use it with Eclipse:

MAC: go to CollabNet and download Subversion 1.6 for OS *10.6* (second option), which has the right Java bindings. You'll have to sign up for access, but it's quick and painless.
http://www.open.collab.net/downloads/community/

PC: TortoiseSVN may be best for Windows. Install guide here: http://www.woodwardweb.com/java/howto_configure.html.

I recommend Subclipse, which seems to play well with StatET (a colleague reported not being able to use some other SVN plugin for Eclipse with StatET installed).  From Eclipse, go to Help ==> Install New Software... and use this as the repository:

http://subclipse.tigris.org/update_1.6.x

Go to Preferences, Team, SVN, and change the SVN interface to 'SVNKit (Pure Java) ... ' as there appear to be problems with the default JavaHL (JNI) interface.

Your institution should have instructions for setting up Subversion; for example, Stanford's SVN setup instructions are here.  You can also set up Subversion locally if you want a nice version-control tool.

If your project is open source, Google Code hosts Subversion repositories for free (up to 4 gigs), and as with most things Google, it is easy to set up and use.  You can use Google Code's svn with Eclipse/Subclipse.  Using Google Code's svn means that your project is publicly available (in read-only form) via svn.  However, by hiding the "source" tab in your project's web view, you can remove references to the svn's url from your Google Code website.

To connect to an existing repository, go to File, New, Other, and expand the SVN options folder, then select 'Checkout Projects from SVN.'  Select 'Create a new repository' (this creates a new repository setting locally, not on the server).  Then enter the svn url.  If you connect via ssh, which is often required for institutions, a typical svn url is formatted as follows:

svn+ssh://username@domain.name.edu/folders/svn/project

There are three main svn commands you'll need: synchronize, update, commit.  Generally, you'll want to synchronize your project before starting work on it to make sure you have the most recent version, and after working on it to make sure that your changes are saved to the server and available to your collaborators.  Right-click on a project, select Team, then Synchronize with Repository.

Eclipse SVN Synchronize.png

Synchronize will open a new view in Eclipse that shows the files that differ between your local copy and the server.  You can examine the specific differences between documents by right-clicking on any document that has been changed and selecting 'Open in Compare Editor'.  If you are ok with the changes, right-click on the project and select Update.

Or, if you prefer, you can simply update your local version with the version on the server without looking at the changes.  To do so, simply right click on the project and select Team, Update to HEAD. 

When you are finished working with a project, go back to the team menu (right click on the project, select Team).  You can either Synchronize with Repository again, or you can just select commit. 

BibTeX for bibliography management

BibTeX is LaTeX's bibliography manager.  The idea is that you provide something like:

According to \citet{messing2009Bias}, the McCain campaign used increasingly ``photoshopped'' images of Obama in their ads as election day approached. 

And your pdf output looks like:

"According to Messing (2009), the McCain campaign used increasingly "photoshopped" images of Obama in their ads as election day approached."

And this reference is automatically inserted into your bibliography (in the correct order and format):

Messing, S., Plaut, E., & Jabon, M. (2009). Bias in the Flesh: Attack Ads in the 2008 Presidential Campaign. In Proceedings of the 2009 American Political Science Association Annual Meeting.

Your references are all stored in a database and are programmatically retrieved behind the scenes by LaTeX.  To set BibTeX up with LaTeX if you use APA format, first download this .bst file:

apaish.bst

Also, you'll want to add the following lines to your LaTeX preamble, which tell latex to utilize the natbib and booktabs packages:

\usepackage{natbib} 
\usepackage{booktabs}

You'll also want to add the following lines where you want your references to be located, e.g., just before \end{document}:

\bibliographystyle{/Users/user01/documents/workspace/apaish}
\bibliography{/Users/user01/documents/workspace/OpenResponse/MyLibrary} 

Replace the paths to these files with the appropriate paths on your machine.  Note that you do not need to give BibTeX the file extensions to specify apaish.bst or MyLibrary.bib.
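
If you are wondering what an entry in the .bib database looks like, you can generate one from R.  The sketch below uses base R's bibentry() and toBibtex() with the reference shown above; the entry type (InProceedings) is my guess at the best fit, and the author initials are all the source gives.

```r
# Describe the reference as a bibentry object; the key is what you
# would put inside \citet{...} or \citep{...}.
entry <- bibentry(
  bibtype   = "InProceedings",
  key       = "messing2009Bias",
  author    = c(person("S.", "Messing"),
                person("E.", "Plaut"),
                person("M.", "Jabon")),
  title     = "Bias in the Flesh: Attack Ads in the 2008 Presidential Campaign",
  booktitle = "Proceedings of the 2009 American Political Science Association Annual Meeting",
  year      = "2009"
)

# Convert to BibTeX source and show it.
bib_lines <- toBibtex(entry)
print(bib_lines)

# To save it as a database file for \bibliography{...}, you could use:
# writeLines(bib_lines, "MyLibrary.bib")
```

In practice you will rarely write entries by hand, since a reference manager can export them, but it helps to know that the .bib file is just plain text in this form.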

Zotero + BibTeX to take the pain out of your lit review

Zotero is a Firefox plug-in that makes building a bibliography database much less painful than it used to be.  If you are browsing and see an article you want to cite on a journal website, Google Scholar, Amazon.com, etc., you simply hit a button in your browser's address bar and Zotero imports the citation into your citation database.  Here's what it looks like:

Zotero.png

Note that you'll want to check citations each time you import them to make sure they have all the relevant entries you need.  I recommend establishing a folder for each project you work on to keep things straight. 

You can also share bibliographies with your collaborators, by making a group library. For example, the Stanford Comm Department has a group library so that we can collaboratively build on each other's citation databases. 

Though Zotero has a plug-in for MS-Word and OpenOffice, LaTeX documents look much nicer and are less fragile than MS-Word and OpenOffice documents (I've had MS-Word drop all of my citations for reasons that I cannot figure out, and then had to manually re-enter them via the graphical user interface).

One really nice thing about Zotero is that you can configure it so that if you drag and drop a bibliography entry into your LaTeX document, Zotero will automatically generate the citation for you.  Go to Zotero Preferences.  Open the Advanced pane.  Click on "Show Data Directory."  This will take you to a "zotero" folder.  The "zotero" folder will contain a "translators" folder.  You should be in this directory:

~/.mozilla/XXXXXXX/zotero/translators/

where XXXXXXXX is some random string. 

Download this BibtexCiteKeyOnly.js file and save it to that directory. (This is a modified version of a bit of code posted to an Ubuntu Forum by 'MartinSzyska'). 

While you're at it, download this BibTeX.js file into the same directory, which will make sure that your in-document LaTeX references (e.g., \citep{messing2009Bias} ) match your bibliography key entries when you export the full bibliography.  [UPDATED FOR ZOTERO 2.11].

[ Or if you'd rather modify the javascript yourself or prefer a different reference id naming scheme, you can open "BibTeX.js" in a text editor like Notepad++ or Xcode. The line to change is:

      var citeKeyFormat = "%a_%t_%y";

For example, I changed it to

      var citeKeyFormat = "%a%y%t";

where %a is first author, %t is first word from title, %y is the year. ]

Then restart Zotero. 

After you restart Zotero, set "BibTex CiteKey-Only Exporter" as the Default Output Format in the Export preferences pane.  Now you can select a reference from Zotero and drag it off the screen into a waiting text editor (e.g., Eclipse).  Alternatively you can use Cmd+Shift+C to copy the \citep{key} to your clipboard. 

When you are ready to export your entire bibliography, right-click or control-click on the folder and select "Export Selection." I recommend downloading this .bib database file to the working directory for your Sweave/LaTeX project.