In many fields -- Investments among them -- time-series data are of considerable importance. Databases containing time series of returns, values, exchange rates and the like are available from a number of sources. Each such source is likely to have its own database structure and access routines. This makes it difficult to write MATLAB programs that can easily utilize data from one or, a fortiori two or more such databases.
One solution to this problem is to adopt a standard database structure (such as Paradox, Access, or a SQL system) and translate each database from its native format to this common format each time a new version arrives from the source. Programs must, of course, be written or obtained to move data from source formats to the common format. And functions must be written to move data from the common format to MATLAB. While such an approach has a number of advantages, it requires considerable effort to maintain a structure of this type and may prove to be more trouble than it is worth. Moreover, it is complex and platform-dependent and far more than is needed for simple databases, especially those of one's own creation.
Another issue concerns updating. Whenever possible, it is desirable that programs work with data from the original files provided by the relevant sources. For example, one may need monthly values and returns for an Investor's holdings of a particular mutual fund. These are dependent, in turn, on (1) the number of shares held by the Investor each month and cash flows into or out of the account and (2) the net asset value per share for the fund at the end of each month. It is best that these two sets of information be combined to compute returns and values whenever the latter are wanted. Otherwise a separate routine must be run each time one or the other of the two databases is changed and a file created with the resulting values. This takes more time, since every combination needs to be run whenever there is an updated set of data. It also takes more space, since intermediate computations must be stored. Finally, and worst of all, it increases the probability that information will be used that is not completely up to date or, worse yet, in error.
Here we describe a system that can cope reasonably well with all these problems. Among the goals in its design were simplicity and ease of understanding. Hence we begin by describing a very simple version of the system, then proceed to more complex implementations. For ease of exposition we will call it the Time-series Information System, or TSI. This is a slightly pompous title for a relatively simple set of procedures, but the acronym makes it possible to lower the word count of this document.
Consider an ASCII file with the following lines:
Micropal XYZ Growth Fund USD 199101 1.23 199102 2.34 199103 -3.45 etc..
The first line gives the source, the second line the name, the third line the currency and the remaining lines the time series information as a set of two numbers: the date (in the form yyyymm) is followed by the value for the month.
For the present discussion, assume that the file is in directory C:\DBASES\Stox and that its name is XYZGRTH..TSI.
To read file XYZGRTH.TSI efficiently one could use the following MATLAB function:
function [src,nm,curr,dates,rets] = f_stox(sid); % make file name fname = [ 'c:\dbases\stox\' sid '.TSI']; fid = fopen(fname,'r'); % get header information src = fgetl(fid); nm = fgetl(fid); curr = fgetl(fid); % get data v = fscanf(fid,'%g',[2 inf]); dates = v(1,:)'; rets = v(2,:)'; % close file fclose(fid);
In this case the file name indicates the series identifier (sid). Thus the series id for this fund in file XYZGRTH.TSI is XYZGRTH.
To obtain the data for this fund one can give the MATLAB expression:
[src,nm,curr,dates,rets] = f_stox('XYZGRTH');
This will return the source (as a character string), the name (as a character string), the currency (as a character string), the dates (as a column vector) and the returns (as a column vector).
The first section of function f_stox sets up the file name and opens it with the file identifier fid. Note that the function "knows" that the series is in a file in directory C:\DBASES\XYZ and that its file name is the series id followed by .TSI.
The second section of the function retrieves the source, name and currency, using the standard MATLAB function fgetl, which gets the next line from the input file.
The next section reads the remaining information into a matrix with two columns using MATLAB's vectorized input capability. This is extremely fast and makes ASCII files competitive with binary files for speed as well as superior for readability and editability.
The penultimate section places the two columns of the input into two separate vectors representing the dates and the corresponding returns. The final statement closes the input file.
Note that no error checks are included, on the assumption that the files will have been checked (either by humans or some other program) before being entered into the system.
For convenience, function f_stox is included in the mia library.
Function f_stox can be considered a fetch routine. The MATLAB statement:
[src,nm,curr,dates,rets] = f_stox('XYZGRTH');
says, in effect:
Please fetch (f_) me series XYZGRTH in database stox. Send back the source, name, currency, dates and returns ([src,nm,curr,dates,rets]);
The format is thus:
[src,nm,curr,dates,rets] = f_database(series_id)
For each database, there will be a separate fetch routine. Thus if there is another database called Bonds with files in directory C:\DBASES\BONDS and ABCGOVT is a series in this database, it would be fetched with the command:
[src,nm,curr,dates,rets] = f_bonds('ABCGOVT');
This can be a very powerful approach. For example, the bonds database could be very different from the stox database. For example:
Micropal ABC Government Bond Fund USD 198901 199512 -1.23 4.23 -6.79 etc..
The fetch routine f_bonds must be different from f_stox in several respects. First, it must know that the bonds files are stored in directory C:\DBASES\BONDS. Next, it must know that the first and last date are given in lines four and five and that the remaining lines contain only returns. Finally, it must construct the complete dates vector so that it can be returned in the standard format.
While this leads to some work for the person who must write a fetch routine, it completely hides the differences in file formats between, say, the f_stox files and the f_bonds files from the user of the system. Thus the statements
[src,nm,curr,dates,rets] = f_stox('XYZGRTH'); [src,nm,curr,dates,rets] = f_bonds('ABCGOVT');
can be used with alacrity. Once the fetch routines have been written, subsequent programs can be oblivious to such nagging issues as file formats. Notice also that files can be left in their "native" format as long as one can write a fetch routine that will return the standard information.
One can go well beyond the previous examples. Files in a database might be in an idiosyncratic binary format. As long as the format is known (or can be discerned by careful examination) it should be possible to write an efficient MATLAB program that will read the file directly and return the standard information. The .TSD files in the Ibbotson Associates' databases fall in this category.
In some cases there is a single file (or possibly two) that contain(s) information on many different time series in a database. Here a somewhat more complex procedure is required. When the file(s) are received, a program (written, for example, in MATLAB) is run that creates a lookup table. The fetch routine then looks up the series name in this table to find the locations (offsets) in the data table(s) at which the information about the series in question begins. It then goes to the data table(s) and retrieves the information. To do this efficiently it is helpful if the series names, while strings, contain only numeric characters so that the lookup can be performed with a simple find command. This strategy works well with some of the databases provided by Micropal.
The earlier example in which information must be computed from two or more files can also be accommodated. For example, consider a fetch routine named f_invxyzr. If called with the statement:
[src,nm,curr,dates,rets] = f_invxyzr('XYZGRTH');
It could go to file XYZGRTH.HLD in directory C:\DBASES\INV to find the Investor's holdings and cash flows for fund XYZGRTH and to file XYZGRTH.PRC in directory C:\DBASES\PRC to obtain the month-end prices of the fund. The information would be used to compute the desired returns which would then be returned in the normal manner. The user could, if desired, remain totally ignorant of the operations being performed by f_invxyzr.
Another interesting case arises if a single file has both beginning of month values and monthly returns. For example:
Micropal XYZ Growth Fund USD 199101 1500 1.23 199102 2000 2.34 199103 3000 -3.45 etc..
In this case one could write two fetch routines. For example, f_xyzv would return the values, and f_xyzr the returns. Thus one could use the statements:
[src,nm,curr,dates,vals] = f_xyzv('XYZGRTH'); [src,nm,curr,dates,rets] = f_xyzr('XYZGRTH');
to obtain all the information or just one, if only one aspect were desired.
Note that in every case, the dirty work is done when the fetch routine for a database is written. The user need not be concerned with what is "under the hood". Even if the same person is the user and author of the fetch routines, he or she can forget the contents of the latter once they have been debugged.
The more time series one has, the more important it is to organize them in a useful manner. The simplest way to do this involves the creation of a database directory for each database.
Assume that directory C:\DBASES contains an ASCII file with the following lines:
stox STOCK MUTUAL FUNDS stox.ABCVAL ABC Value Fund stox.SPIND S&P Index Fund stox.XYZGRTH XYZ Growth Fund
The format is very simple:
row 1: columns 1-20: database id columns 21- : database name remaining rows: columns 1-20: database.series id columns 21- : series name
Assume that the above file is stored in C:\DBASES as stox.dbd and that for efficiency, every line in the file is of the same length (e.g. 80 characters).
In most cases it will be desirable to have the entries in a database directory sorted so that the series names are in alphabetic order.
Whenever one is interested in a series in the stox database, the database directory can be retrieved with the MATLAB function getsmat.m, as in:
dbdir = getsmat('c:\dbases\stox.dbd');
This function is part of the mia library. It is designed to operate rapidly, and hence requires that the ascii file referenced does indeed have lines of the same length. It returns a string matrix in which each row corresponds to a line in the ascii file.
One can look (directly or through a program) at rows 2:.. of dbdir to find the name of a desired series in this matrix. Say that it is XYZ Growth Fund. The full identifier is stox.XYZGRTH. To get its information we need to execute the statement:
[src,nm,curr,dates,rets] = f_stox('XYZGRTH');
This statement can be constructed, then evaluated automatically. The simplest way is to give the statement:
[src,nm,curr,dates,rets] = tsi('stox.XYZGRTH');
This calls the mia function tsi.m, which can be as simple as:
function [src,nm,curr,dates,rets] = tsi(id); % function [src,nm,curr,dates,rets] = tsi(id); % returns information from tsi id ('database.series') eval(['[src,nm,curr,dates,rets] = ... f_' strrep(id,'.','(''') ''');']);
If desired, the statement itself can be placed directly in a program.
A more complicated version of tsi.m could handle more information.
To make a tsi formatted file easy to import into spreadsheet programs it is often useful to begin and end the source and name lines with standard quotation marks. To accommodate this alternative, the tsi function can be expanded to simply remove any quotation marks found in those lines. The revised version would be:
function [src,nm,curr,dates,rets] = tsi(id); % function [src,nm,curr,dates,rets] = tsi(id); % returns information from tsi id ('database.series') eval(['[src,nm,curr,dates,rets] = ... f_' strrep(id,'.','(''') ''');']); src = strrep(src,'"',''); nm = strrep(nm,'"','');
Note that the details of database formats and underlying computations are now even more hidden from view. A series is now identified by its database name and series name. Whenever its information is required, it is simply requested using the tsi function. This is not really object-oriented programming, but it offers of the same advantages.
One great advantage of this format is the ability to view, edit and rearrange a database directory using any standard word processor. While there is a slight cost in processing time vis-a-vis that associated with MATLAB .mat files, it seems well worth bearing for the far greater flexibility associated with a standard format.
Often one has a directory with a large number of files, each containing a separate time series. To minimize effort and avoid errors of either commission or omission it is desirable to make the associated database directory automatically rather than by hand. This is usually easily done.
Assume that directory C:\DBASES\XYZ has a series of files, all of which have a name of the form xxx.TSI, where xxx is the series identifier. To create a database directory in c:\dbases with a file name of xyz.dbd, call the mia function mkdbd, as in:
mkdbd('xyz','xyz mutual funds','c:\dbases\xyz\','tsi');
This will find all the files with an extension of tsi in directory c:\dbases\xyz\, extract the name field from each one, then build a list, sorted alphabetically by name and place it in file c:\dbases\xyz.dbd. The first line will give the id ('xyz') and name ('xyz mutual funds') of the database.
Some of these routines call others in the mia library (in particular, putl.m and strsort.m).
While some of the operations in these functions are complicated, using them is not. Whenever a database directory is needed for a group of series contained in a specific directory, run mkdbd. Unix users might wish to let the Unix sort command handle the job of sorting the entries in alphabetic order (which strsort does very slowly).
In the TSI system each database has its own directory, which lists all the series "in it". The quotation marks are used here since a database may be more virtual than real, as a database identifier is no more than the identifier for the fetch program to be used to retrieve information.
The remaining ingredient needed to tie everything together is a list of all the databases. For example, assume that in directory C:\DBASES there is a file named tsi.dbl with the following string matrix:
dbl LIST OF DATABASES bonds Fixed Income Mutual Funds stox Stock Mutual Funds
The format is also very simple:
row 1: columns 1-20: database list id columns 21- : database list name remaining rows: columns 1-20: database id columns 21- : database name
Here too, the use of an ASCII format provides flexibility and ease of use in almost any context.
To find a series, one can start with the dbl list to find a likely database, load the associated database directory, then search through the latter to find the desired series. At this point the tsi routine can be used to obtain the associated information.
A database list can be created automatically using the mia function mkdbl.m, as in:
mkdbl('tsi','Time series Databases','c:\dbases\','dbd');
This will search for all the files in c:\dbases\ with an extension of dbd. A list will be created using the first line in each such file, with the name fields (column 20-...) sorted in alphabetic order. A first line will be added giving the list identifier ('tsi') and the list name ('Time series Databases'). The file will then be stored in C:\dbases as tsi.dbl.
The author has used this system with considerable success with databases from several sources comprising several thousand time series. In some cases it was desirable to break one large database into several pieces. Thus, while the Micropal U.S. mutual fund database includes over 5,000 funds, it was organized as a series of separate databases, for example: Equity Growth Funds, Government Bond Funds, etc..
Experience suggests that the approach offers more than adequate speed, very low probability of error, and a minimum of required recollection of file and computational details.