MrFiles

From VISTA LAB WIKI

mrFiles: neuroimaging database tools

Adapted from a classic MD eastern shore story:

Story        Translation
mrFiles      Them are files
mrNot        Them are not
OSAR         Oh yes they are!
cmnrFaces    See them interfaces?
LIB          Well I'll be
mrFiles      Them are files

Current repository is at http://orange.stanford.edu/mrFiles/?

Overview

There are two basic ways of interacting with the repository: through the web (for example, by clicking the link above) or using client software. Currently, the only client software is implemented as part of the VISTASOFT suite of tools.

The BIG IDEA is the notion of a permanent Uniform Resource Identifier (URI). A URI is intended to uniquely and permanently identify some resource or bundle of information. A URI for something is often the same as a URL, and in the case of mrFiles, this is true 100% of the time. So, if you don't see a distinction between URIs and URLs, don't worry about it! Once something is stored at a URI in this system, that's it. You can never use that URI for anything else.

If you are really interested, the distinction is similar to the distinction between a word token and the semantics of that word (often denoted WORD vs. word). A URI is a logical entity, much like a noun; a URL is a concise way of describing how to obtain the data referred to by a URI. It is usually convenient to use the same string as the URL and the URI, although there are some cases in which this is not so. A common example is that you may wish to uniquely denote some real-world entity but not actually store any data on-line. In this case, you might use a URI but have no URL, as there is no data anywhere to fetch.

Permanence creates a direct conflict when a revision or otherwise new version of an ROI is created with the same label. The solution is that all datasets are stored beneath the label rather than at the label itself. Thus, if you have a URI ending in ../V1, you might have below it ../V1/0, ../V1/1, etc. Users are encouraged to distinguish between these numbered ROIs with the same label using "metadata", or notes about the data proper. Examples of metadata would be the user that created an ROI, the technique used to create it, or a freeform note describing how this ROI was derived from another (in which case, you could use the URI to refer to the previous ROI!). Other examples would include information about how the data should be presented, as in color, filled vs. outline, etc. The data proper would be the list of coordinates.
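The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration, assuming that versions are numbered with consecutive integers starting at 0; the function name `next_version_uri` and the example URI are hypothetical and not part of mrFiles.

```python
def next_version_uri(label_uri, existing_versions):
    """Return the URI for the next version stored beneath a label URI.

    Assumption: versions are non-negative integers, and a new version
    gets a number one higher than the largest one already in use.
    """
    next_number = max(existing_versions, default=-1) + 1
    return f"{label_uri.rstrip('/')}/{next_number}"


# Hypothetical label URI with two versions (0 and 1) already stored.
label = "http://example.org/mrFiles/study.h5/subj01/Inplane/ROIs/V1"
print(next_version_uri(label, [0, 1]))
```

Because an existing number is never reused, a URI such as ../V1/0 keeps pointing at the same data after later versions are added.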

Note that currently, all metadata is stored permanently with a particular version of the data. If you want to change anything at all about an ROI, even something like color, you must create a new one. This may result in a great deal of redundancy. But fear not, storage is cheap! There is also the potential for adding compression at some point if deemed necessary.

Using Matlab

All Matlab functionality is available via the File menu in a mrVista / mrLoadRet session. ROI-specific functionality (loading and saving) is under the ROI sub-menu. Additionally, there is an option in the main File menu to select your repository and specify a username and password. This should be your username and password on white.

Using a Web Browser

Simply point your browser to the repository (try starting with the link at the top of this page), a file in the repository or a "subdirectory" of a file, followed by a question mark. The question mark triggers "info" mode, and the page should be fairly self-explanatory.
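As a small illustration of the "info" mode convention, the following Python sketch builds such URLs. The helper name `info_url` and the file and "subdirectory" names are hypothetical; only the repository address and the trailing question mark come from this page.

```python
def info_url(resource_url):
    """Append the trailing '?' that triggers "info" mode for a resource."""
    # Avoid doubling the question mark if it is already present.
    if resource_url.endswith("?"):
        return resource_url
    return resource_url + "?"


repository = "http://orange.stanford.edu/mrFiles/"
print(info_url(repository))
# Hypothetical file and "subdirectory" names, for illustration only.
print(info_url(repository + "study.h5/subj01"))
```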

For Developers

The system is designed around REST principles. This means:

  1. All interaction with the server:
    1. Is atomic - there are no two-step operations
    2. Uses the basic HTTP/1.1 protocol - a user can only GET, PUT, POST and DELETE to a URL
  2. Unlike XML-RPC, SOAP, or more traditional *DBC, there is no additional standard for specifying variables, functions or anything of the sort for interaction with the server
  3. The same code (mostly) can be used for interacting with a web-browser, a client, or even something like a WebDAV filesystem

A central design point is that HTTP operations can be at the level of a file on the repository, or a subtree. For example, the Matlab client keeps one subject's worth of data in a local file. The client might then PUT that to a "subdirectory" of a server-side file which contains all of the subjects for a study.
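A subtree-level PUT of this kind might look roughly like the following Python sketch. It is not the Matlab client's code: the URL layout (server-side file name followed by the subtree name), the function name `build_subtree_put`, and the example names are assumptions made for illustration. The request is built but not sent.

```python
import urllib.request


def build_subtree_put(server_file_url, subtree, payload):
    """Build (but do not send) a PUT request aimed at a subtree of a server-side file."""
    url = f"{server_file_url.rstrip('/')}/{subtree.strip('/')}"
    return urllib.request.Request(url, data=payload, method="PUT")


# Hypothetical study file and subject name; the payload stands in for the
# bytes of one subject's local HDF5 file.
request = build_subtree_put(
    "http://example.org/mrFiles/study.h5", "subj01", b"...hdf5 bytes..."
)
print(request.get_method(), request.full_url)
# Sending it would be a single call: urllib.request.urlopen(request)
```

The whole operation is one HTTP request against one URL, which is what keeps the interaction atomic in the sense described above.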

The system uses HDF5 (these files should end in .h5) for the storage and transmission of data. HDF5 is a compact way of storing data hierarchically, as in a filesystem. For version 1.0 of the system, the organizational hierarchy used throughout the VISTASOFT suite of tools is adopted for the storage of data in HDF5 files. So, if an ROI was once stored as <base-dir>/<subj-name>/Inplane/ROIs/LFG in the filesystem, then that's also where it will be in the HDF5 file. You can think of the HDF5 file as a kind of archive (like a tar file), where the base dir is simply the "top" or root of the archive.
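The mapping from a filesystem location to a location inside the HDF5 file can be sketched as follows. This is a minimal illustration using only path arithmetic; the function name `hdf5_path` and the example directories are hypothetical, and no HDF5 library is involved.

```python
import posixpath


def hdf5_path(filesystem_path, base_dir):
    """Map a path under base_dir to the matching absolute path inside an HDF5 file."""
    relative = posixpath.relpath(filesystem_path, base_dir)
    if relative == ".." or relative.startswith("../"):
        raise ValueError(f"{filesystem_path!r} is not inside {base_dir!r}")
    if relative == ".":
        # The base dir itself corresponds to the root of the archive.
        return "/"
    # HDF5 paths are rooted at "/", which plays the role of the base dir.
    return posixpath.join("/", relative)


# Hypothetical base dir and subject name.
print(hdf5_path("/data/study/subj01/Inplane/ROIs/LFG", "/data/study"))
```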

HDF5 also allows arbitrary metadata. So, a given ROI may be stored as a matrix, but can have other descriptive information stored with it, such as color, the user that generated it, etc. The system makes no restrictions on metadata, but encourages use of metadata fields that have already been used elsewhere. So, if you've used "color" as a metadata field before, that will be available as an option to specify in new ROIs. But, if you can't find a suitable field, you can always create one. Both the server and the client software should accommodate arbitrary metadata.
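One way to offer previously used metadata fields as options is sketched below. The in-memory representation (a dict with "coords" and "metadata" keys) and the function name `known_metadata_fields` are assumptions for illustration; the actual server and client may organize this differently.

```python
from collections import Counter


def known_metadata_fields(rois):
    """List metadata field names already in use, most frequently used first."""
    counts = Counter(field for roi in rois for field in roi["metadata"])
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts, key=lambda field: (-counts[field], field))


# Hypothetical ROIs: the coordinates are the data proper, the rest is metadata.
rois = [
    {"coords": [(1, 2, 3)], "metadata": {"color": "red", "user": "alice"}},
    {"coords": [(4, 5, 6)], "metadata": {"color": "blue", "note": "derived from V1/0"}},
]
print(known_metadata_fields(rois))
```

A new field that is not in the list can still be added freely; the list only steers users toward names that already exist.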

The Matlab client side basically consists of being able to read and write in this new format, and being able to transfer these files back and forth to a web server. The client can only handle simple operations, and will simply open a web browser for more complex ones.

An arbitrary number of HDF5 files can be in a given local directory, but there are two "special" files: local.h5 and repository.h5. These are discussed in more detail under the user instructions. Files on the server should represent studies or projects. The server can automatically copy subtrees together from individual subjects' hdf5 files into a larger study-sized hdf5 file.

Web server architecture

The server code currently consists of three Python modules:

  1. A class in mrFiles_repository.py which manages the actual http connection and response
    • Uses mod_python to interact with an Apache server
    • Authentication is handled at the Apache level (currently using mod_auth_pam, but could also use any mod_auth_* module, including the one that comes with mod_python)
  2. A class in mrFiles_file.py which handles manipulations inside the actual HDF5 files.
    • Uses the pytables library to handle HDF5.
  3. A module containing functions to display more detailed web pages in the case of a complex query
    • Uses the PSP module from mod_python to handle web-page templating in a very PHP-like way

One idea is that eventually, I'll include some AJAX code in the website to allow HTTP operations not supported by most browsers (e.g. PUT, DELETE, and more general POST operations).

NOTES

Useful documentation:

Things to look out for:

  1. If you include 'PythonEnablePdb On' AND run 'httpd -DONE_PROCESS', it will drop into a python debugger (pdb) prompt at the command line where you launched httpd. If you have Pdb enabled but do NOT run httpd with the -DONE_PROCESS flag, you will get strange errors.
  2. Ensure that the apache user or group is able to read and write the repository directory, a tmp directory beneath it, and all files within them.
  3. Ensure that Apache is able to read the python code files.
  4. Standard setup config files are included in the server_code/apache_conf directory in the mrFiles CVS repository
  5. I am not entirely sure that the python.conf file could be processed before the auth_pam.conf file. Redhat FC4 stock apache processes files in the /etc/httpd/conf.d directory in alphabetical order. If you have strange problems with authentication, this may be a direction to check.

TODO

  1. Mechanism for download and save of current search result set
  2. Add search for ROI by coordinate or by metadata (UI is there)
  3. Test and implement compression filters (zlib + shuffle)
  4. Ability to move / rename repositories? This should perhaps leave a placeholder / redirect behind
  5. Provide for coordinate transforms, e.g. into Talairach
  6. Refine handling of external references (as in index.h5 file)
    • Note that the new HDF5 (1.8.0 - currently in alpha) includes efficient, native support for external refs
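For item 3 in the list above, the idea behind combining a shuffle filter with zlib can be illustrated in plain Python. This is a conceptual sketch only, not the HDF5 or pytables filter implementation; the function names and the example data are hypothetical.

```python
import struct
import zlib


def shuffle(data, item_size):
    """Group byte 0 of every item, then byte 1 of every item, and so on."""
    return b"".join(data[i::item_size] for i in range(item_size))


def unshuffle(data, item_size):
    """Invert shuffle() for data whose length is a multiple of item_size."""
    count = len(data) // item_size
    out = bytearray(len(data))
    for i in range(item_size):
        out[i::item_size] = data[i * count:(i + 1) * count]
    return bytes(out)


# Hypothetical ROI-like data: slowly varying 32-bit integer coordinates.
raw = struct.pack("<300i", *range(1000, 1300))
packed = zlib.compress(shuffle(raw, 4))
assert unshuffle(zlib.decompress(packed), 4) == raw
# Sizes of the raw data, plain zlib output, and shuffle + zlib output.
print(len(raw), len(zlib.compress(raw)), len(packed))
```

Shuffling places the rarely changing high-order bytes of neighboring values next to each other, which tends to give zlib longer runs to work with. Whether it helps on real ROI data is exactly what the TODO item proposes to test.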

Matlab client architecture

The Matlab client actually uses a hybrid of Java and Matlab in the same file for HTTP transactions. This is an example of Bad Design. All Matlab code (hopefully) adopts the convention of prefixing any "Java" variables with j_. If you modify the code, please respect this convention. Eventually, I'd like to see Java code mostly in separate files. The workhorse function is http_transport_file.m. It's pretty straightforward, and for now handles only PUT and GET commands.

The Matlab HDF5 functions are currently used to load and store data. This is also perhaps a bad decision, as these functions are quite impoverished. A rewrite in Java is probably a good move here also. You can find the Java libs here. For now, the files roiLoadHdf5.m and roiSaveHdf5.m are called at the top level (e.g. by mrVista).

NOTES

  1. Matlab always reports full "path" names. Thus, there is a recurring paradigm of doing 'h5name(length(h5loc.Name)+2:end)' to strip off the leading path
  2. By default, the Matlab / Java http handler raises an exception if it doesn't get a 200 OK return code from the server. It would be nice if we could use return codes to communicate usefully.

TODO

  1. Aggregate and store available information from mrVista & server:
    1. mrSESSION information, including alignment matrices, maybe also DATATYPES
    2. experiment name fetched automatically from file, or otherwise choices from server
  2. Modify in-memory storage of ROIs to a cell-array, to allow for heterogeneous structures (i.e. arbitrary metadata) - Bob / Brian should do this
  3. Include hook to call browser for complex actions - include local path so files can be automatically saved there
  4. Check to see if we have a pre-loaded ROI with the same name already and deal with it nicely