CS346 - Spring 2014
Database System Implementation
RedBase Part 1: The Record Management Component
Due Sunday April 13
|
Project parts are due at 11:59 PM on the due date. For project
parts submitted after 11:59 PM on the due date, there is a 1% penalty
applied to that project part score for each hour late. Each student
is allocated 2 free days and 12 late hours for the entire course with no
penalty. Once the free allowance has been used, penalties begin to be
applied. There will be absolutely no exceptions to this late policy.
It's crucial that students stay on schedule in this course -- RedBase
is a very big project.
The first part of the RedBase system you will implement is the
Record Management (RM) component. The RM component
provides classes and methods for managing files of unordered records.
All class names, return codes, constants, etc. in this component
should begin with the prefix RM. The RM component is a client to the
PF component: your RM methods will make calls to the PF methods we
have provided. The PF interface is described in a separate
document.
Your RM component will store records in paged files provided by the
PF component. To manage file contents conveniently, you will probably
want to use the first page of each file as a special header page.
This page should contain free space information, as well as whatever
other information (related to the file as a whole) you find useful for
your implementation. You also must decide exactly how records will be
laid out on pages of PF files. Your design task is simplified by the
fact that each file will contain a set of records that are all the
same size (although record sizes may differ across files). Fixed size
records make it easier to manage the records and free space on each
page, and fixed size records permit record identifiers within a given
file to be a simple combination of page number and record position.
More detailed implementation suggestions are given
below.
- Your RM component interface should be specified in an "external"
header file, rm.h, which you will later "#include"
in RM component clients. To save you some typing, we have created an
initial header file that declares the classes and public methods of
the RM interface described below. The file is called rm.h,
and it is transfered automatically to your code directory when you run
the setup script for the first part of the project (see the
RedBase
Logistics document for details). You must implement all of the
methods specified in this interface. If you would like to implement
additional functionality in the RM component (either now or later),
you are welcome to extend the interface as you please. However,
please do not add any new #include statements to
rm.h.
- The record identifier (RID) class is declared in a separate
header file rm_rid.h. This declaration is not included in
rm.h since there are some components of RedBase that need the
RID definition but no other RM classes. You must implement all of the
methods specified in this interface.
- For coding convenience and modularity, you may want to create
a separate "internal" header file (rm_internal.h, say)
containing declarations used only by the RM component. For example,
you might choose to put file and page header declarations in this
file, since they will not be used by other components.
- The fourth relevant header file for this component is a
"global" header file, redbase.h. Like rm.h, this
file is transfered automatically to your code directory when you run
the setup script for the first part of the project. You may
want to add declarations to this file as your coding progresses, but
please do not add any new #include statements to
redbase.h. This header file should contain those
declarations that are needed throughout the RedBase system.
The RM interface you will implement consists of five classes: the
RM_Manager class, the RM_FileHandle class, the
RM_FileScan class, the RM_Record class, and the
RID class. In addition, there is an RM_PrintError
routine for printing messages associated with nonzero RM return codes.
Note that we have not specified copy constructors or overloaded
= operators in any of the RM classes (see the
PF_Manager class in the PF Component
document for some discussion of these methods). You may add these
methods if you wish, and we suggest that you use them in any classes
that allocate dynamic memory. Note also that we have specified only
the public methods of each class.
All RM component public methods except constructors and destructors
should return 0 if they complete normally and a nonzero return code
otherwise. Return codes and error handling are discussed in detail in
a later section of this document.
*** RM_Manager Class ***
The RM_Manager class handles the creation, deletion, opening,
and closing of files of records in the RM component. Your program
should create exactly one instance of this class, and all requests for
RM component file management should be directed to that instance.
Below, the public methods of the class declaration are shown first,
followed by descriptions of the methods. The first two methods in the
class declaration are the constructor and destructor methods for the
class. All necessary initialization of the RM component should take
place within the constructor for the RM_Manager class. Note
that this constructor takes as a parameter the instance of the
PF_Manager class, which you should already have created (see
the PF
Component document). Any necessary clean-up in the RM component
should take place within the destructor for the RM_Manager
class. The constructor and destructor methods are not described
further for this class.
class RM_Manager {
public:
RM_Manager (PF_Manager &pfm); // Constructor
~RM_Manager (); // Destructor
RC CreateFile (const char *fileName, int recordSize);
// Create a new file
RC DestroyFile (const char *fileName); // Destroy a file
RC OpenFile (const char *fileName, RM_FileHandle &fileHandle);
// Open a file
RC CloseFile (RM_FileHandle &fileHandle); // Close a file
};
RC CreateFile (const char *fileName, int recordSize)
This method should call PF_Manager::CreateFile to create a
paged file called fileName. The records in this file will
all have size recordSize. This method should initialize the
file by storing appropriate information in the header page. Although
recordSize will usually be much smaller than the size of a
page, you should compare recordSize with
PF_PAGE_SIZE and return a nonzero code if recordSize
is too large for your RM component to handle.
RC DestroyFile (const char *fileName)
This method should destroy the file whose name is fileName by
calling PF_Manager::DestroyFile.
RC OpenFile (const char *fileName, RM_FileHandle &fileHandle)
This method should open the file called fileName by calling
PF_Manager::OpenFile. If the method is successful, the
fileHandle object should become a "handle" for the open RM
component file. The file handle is used to manipulate the records in
the file (see the RM_FileHandle class description below). As
in the PF component, it should not be an error if a client opens the
same RM file more than once, using a different fileHandle
object each time. Each call to the OpenFile method should
create a new instance of the open file. You may assume if a file has
more than one opened instance then each instance of the open file may
be read but will not be modified. If a file is modified while opened
more than once, you need not guarantee the integrity of the file or
the RM component. You may also assume that DestroyFile will
never be called on an open file.
RC CloseFile (RM_FileHandle &fileHandle)
This method should close the open file instance referred to by
fileHandle by calling PF_Manager:: CloseFile.
*** RM_FileHandle Class ***
The RM_FileHandle class is used to manipulate the records in
an open RM component file. To manipulate the records in a file, a
client first creates an instance of this class and passes it to the
RM_Manager::OpenFile method described above. Descriptions of
the constructor and destructor methods are not included for this
class.
class RM_FileHandle {
public:
RM_FileHandle (); // Constructor
~RM_FileHandle (); // Destructor
RC GetRec (const RID &rid, RM_Record &rec) const;
// Get a record
RC InsertRec (const char *pData, RID &rid); // Insert a new record,
// return record id
RC DeleteRec (const RID &rid); // Delete a record
RC UpdateRec (const RM_Record &rec); // Update a record
RC ForcePages (PageNum pageNum = ALL_PAGES) const; // Write dirty page(s)
// to disk
};
RC GetRec (RID &rid, RM_Record &rec)
For this and the following methods, it should be a (positive) error if
the RM_FileHandle object for which the method is called does
not refer to an open file. This method should retrieve the record
with identifier rid from the file. It should be a (positive)
error if rid does not identify an existing record in the
file. If the method succeeds, rec should contain a copy of
the specified record along with its record identifier (see the
RM_Record class description below).
RC InsertRec (char *pData, RID &rid)
This method should insert the data pointed to by pData as a
new record in the file. If successful, the return parameter
&rid should point to the record identifier of the newly
inserted record.
RC DeleteRec (RID &rid)
This method should delete the record with identifier rid from
the file. If the page containing the record becomes empty after the
deletion, you can choose either to dispose of the page (by calling
PF_Manager::DisposePage) or keep the page in the file for use
in the future, whichever you feel will be more efficient and/or
convenient.
RC UpdateRec (RM_Record &rec)
This method should update the contents of the record in the file that
is associated with rec (see the RM_Record class
description below). This method should replace the existing contents
of the record in the file with the current contents of rec.
RC ForcePages (PageNum pageNum = ALL_PAGES) const
This method should call the corresponding method
PF_FileHandle::ForcePages in order to copy the contents of
one or all dirty pages of the file from the buffer pool to disk.
*** RM_FileScan Class ***
The RM_FileScan class provides clients the capability to
perform scans over the records of an RM component file, where a scan
may be based on a specified condition. As usual, the constructor and
destructor methods are not described.
class RM_FileScan {
public:
RM_FileScan (); // Constructor
~RM_FileScan (); // Destructor
RC OpenScan (const RM_FileHandle &fileHandle, // Initialize file scan
AttrType attrType,
int attrLength,
int attrOffset,
CompOp compOp,
void *value,
ClientHint pinHint = NO_HINT);
RC GetNextRec (RM_Record &rec); // Get next matching record
RC CloseScan (); // Terminate file scan
};
RC OpenScan (const RM_FileHandle &fileHandle,
AttrType attrType,
int attrLength,
int attrOffset,
CompOp compOp,
void *value,
ClientHint pinHint = NO_HINT)
This method should initialize a scan over the records in the open file
referred to by parameter fileHandle. During the scan, only
those records whose specified attribute satisfies the specified
condition (a comparison with a value) should be retrieved. If
value is a null pointer, then there is no condition and all
records are retrieved during the scan. If value is not a
null pointer, then value points to the value that attributes
are to be compared with.
Parameters attrType and attrLength indicate the
type and length of the attribute being compared: either a 4-byte
integer, a 4-byte floating point number, or a character string with a
length between 1 and MAXSTRINGLEN bytes.
(MAXSTRINGLEN = 255 is defined in redbase.h.) Type
AttrType is defined in redbase.h as follows:
INT for integer, FLOAT for floating point number,
and STRING for character string. You will need to cast the
value into the appropriate type for the attribute (or, in the
case of an integer or float, copy it into a separate variable to avoid
alignment problems). If a character string has length n, then
the attribute and the value will each be exactly n
bytes long. They will not be <= n bytes, i.e., no "padding"
is required, and they are not null-terminated. Parameter
attrOffset indicates where the attribute is found within the
contents of each record. Parameter compOp indicates the way
that the record's attribute value should be compared with the
value parameter. The different values for compOp
are defined in redbase.h as follows:
EQ_OP | equal (i.e., attribute = value)
|
LT_OP | less-than (i.e., attribute < value)
|
GT_OP | greater-than (i.e., attribute > value)
|
LE_OP | less-than-or-equal (i.e., attribute <= value)
|
GE_OP | greater-than-or-equal (i.e., attribute >= value)
|
NE_OP | not-equal (i.e., attribute <> value)
|
NO_OP | no comparison (when value is a null pointer)
|
Parameter pinHint is included so that higher-level RedBase
components using an RM component file scan can suggest a specific
page-pinning strategy for the RM component to use during the file
scan, to achieve maximum efficiency. (See "The Buffer Pool of Pages"
section of the PF Component document for a
discussion of page pinning.) Type ClientHint is defined in
redbase.h, and you will need to define constants in addition
to NO_HINT if you plan to use it. You are free to implement
only one page-pinning strategy and ignore the pinHint
parameter, or you may implement more than one strategy based on
pinHint values now, or you may implement one strategy now and
consider adding new strategies later when you implement clients of the
RM component. Please note that using pinHint is
optional, and only default value NO_HINT will be passed
to OpenScan in the TA's test suite.
RC GetNextRec (RM_Record &rec)
This method should retrieve a copy of the next record in the file scan
that satisfies the scan condition. If this method succeeds,
rec should contain a copy of the record along with its record
identifier. This method should return RM_EOF (which you
should define) if there are no records left satisfying the scan
condition. You may assume that RM component clients will not close
the corresponding open file instance while a scan is underway.
RC CloseScan ()
This method should terminate the file scan.
*** RM_Record Class ***
The RM_Record class defines record objects. To materialize a
record, a client creates an instance of this class and passes it to
one of the RM_FileHandle or RM_FileScan methods that
reads a record (described above). RM_Record objects should
contain copies of records from the buffer pool, not records in the
buffer pool itself.
class RM_Record {
public:
RM_Record (); // Constructor
~RM_Record (); // Destructor
RC GetData (char *&pData) const; // Set pData to point to
// the record's contents
RC GetRid (RID &rid) const; // Get the record id
};
RC GetData (char *&pData) const
For this and the following method, it should be a (positive) error if
a record has not already been read into the RM_Record object
for which the method is called. This method provides access to the
contents (data) of the record. If the method succeeds, pData
should point to the contents of the copy of the record created by
RM_FileHandle::GetRec() or
RM_FileScan::GetNextRec().
RC GetRid (RID &rid) const
If this method succeeds, rid should contain the identifier
for the record.
*** RID Class ***
The RID class defines record identifier objects. A record
identifier uniquely identifies a record within a given file, based on
the record's page number in the file and slot number within that page.
You may wonder why this class name does not begin with RM. The
RID class actually will be used by several components of
RedBase, but for convenience we have introduced it as part of the RM
component. (Also, it's used so frequently that it's nice to keep the
name short!)
class RID {
public:
RID (); // Default constructor
~RID (); // Destructor
RID (PageNum pageNum, SlotNum slotNum);
// Construct RID from page and
// slot number
RC GetPageNum (PageNum &pageNum) const; // Return page number
RC GetSlotNum (SlotNum &slotNum) const; // Return slot number
};
RID (PageNum pageNum, SlotNum slotNum)
This method should construct a new record identifier object from a
given page number and slot number.
RC GetPageNum (PageNum &pageNum)
For this and the following method, it should be a (positive) error if
the RID object for which the method is called is not a viable
record identifier. (For example, a RID object will not be a
viable record identifier if it was created using the default
constructor and has not been passed to the RM_Record::GetRid
method.) If this method succeeds, it should set pageNum to
the record identifier's page number.
RC GetSlotNum (SlotNum &slotNum)
If this method succeeds, it should set slotNum to the record
identifier's slot number.
*** RM_PrintError
void RM_PrintError (RC rc);
This routine should write a message associated with the nonzero RM
return code rc onto the Unix stderr output stream.
This routine has no return value.
Return Codes and Error Handling
|
In general you should model your return codes and error handling after
the PF component. Please reread the relevant information in the PF
Component document, and take a look at the files pf.h and
redbase.h.
Each component specifies its own sets of nonzero return codes -- a
set of positive codes and a set of negative codes. Recall that
positive return codes indicate non-error exception conditions, or
errors from which the system can recover or exit gracefully; negative
return codes indicate errors from which the system cannot recover, or
after which the database may be corrupted. You will see as the
project progresses that it is very important to separate these types
of return codes. Minimum and maximum values for the positive and
negative return codes for each component are defined in
redbase.h. Each component also provides a routine that takes
one of its return codes and prints an appropriate message. In
addition, you may choose to have a global PrintError routine
that takes any return code and calls the appropriate component's
PrintError routine (based on the code's value); this decision
depends on your style of error handling and error propagation.
There are two ways errors may occur during execution of an RM
component method: because an error was returned from a call to a PF
component method, or because of an error or exception condition within
the RM component. In either case, the RM method should return a
nonzero code to its caller. For handling unexpected return codes from
the PF component, we suggest that you simply pass the PF return code
along, in which case your program structure should be such that an
appropriate message is printed eventually by a global
PrintError routine as described above.
if (rc = pf_obj.Method(...)) {
// unexpected return code
return(rc); }
An alternative and more verbose method, which you might consider
enabling via a compile-time flag, would be to print the message
associated with the PF return code, then provide a different RM return
code to the caller:
if (rc = pf_obj.Method(...)) {
// unexpected return code
PF_PrintError(rc);
return(RM_RETURNCODE); }
In addition to checking that calls to the PF component succeed as
expected, it also is very important that you check the incoming
parameters to RM methods for validity before doing anything to any
file. In general, you should do your best to ensure that a client
cannot corrupt RM component files or crash the RM component by passing
incorrect arguments to an RM component method. (Hint: we may check
this in our grading test suite!)
Implementation Suggestions
|
Here are some suggestions for your implementation. You are entirely
free to use alternative design ideas if you believe they will improve
the structure or performance of your code. You must not alter the RM
methods specified above, since we plan to test your code using these
methods. (In addition, higher-level components are designed with
these methods in mind.) However, as mentioned earlier, you are free
to extend the interface or alter it later if you find it helpful.
File and Page Layout
You will probably want each file to have a file header page on
which you store information about the file as a whole, followed by a
set of data pages. Information that might be stored on the
header page includes the size of the records in the file, the number
of records that may be stored on each page, the current number of
pages in the file, the location of pages with free space, etc. Each
data page will contain some header information and some records.
File Header Management
When the RM_Manager::OpenFile method is called, it should
call PF_Manager::OpenFile to actually open the file. You
will probably then want to copy the file header information into a
private variable in the file handle that refers to the open file
instance. By copying this information, you will subsequently be able
to find out details such as the record size and the number of pages in
the file by looking in the file handle instead of reading the header
page again (or keeping the information on every page). Once you have
read the header information from the header page into the file handle,
you can unpin the header page since there is no need to waste
buffer space by keeping the header page in the buffer pool the entire
time the file is open. Note, however, that any changes made to the
header while the file is open (e.g., the number of pages in the file,
or the location of free space) must be written back when the file is
closed. You can do this by keeping a modified flag in the
file handle and writing the information back when the flag is set.
Record Identifiers
The RID class described above defines unique identifiers for
records within a given file. Record identifiers will serve as tuple
identifiers for higher-level RedBase components. Thus, the identifier
for a given record should be permanent: the components of a record
identifier should not change if the record is updated, or as the
result of an insertion or deletion of a different record.
Keeping Track of Free Space
When inserting records, you are strongly discouraged from performing a
linear search through pages in order to find a page with free space.
One solution is to effectively create a linked list of pages that have
empty slots in them. You can do this by placing appropriate pointers
in page headers, with a pointer to the first page in the list included
in the file header. When you need to insert a record, insert it into
an empty slot in the first page of the list. You will need to modify
the list as records are inserted and deleted.
There are a variety of ways to keep track of free record slots on a
given page. One efficient method is to use a bitmap: If each data
page can hold n records, then you can store an n-bit
bitmap in the page header indicating which slots currently contain
valid records and which slots are available. Note that the size of
the bitmap needed for each page is dependent on the number of records
that can be stored on that page, which in turn is dependent on the
record size for the file. Hence your bitmaps will be the same size
for each page of a given file, but bitmap sizes should be different
for different files. One way to implement variable-size bitmaps is as
arrays of char's or int's, where the individual bits
are manipulated using the operators &, |, and
^.
Note that under no circumstances should you place a limit on the
total number of records that can be stored in a file -- each file
should be able to grow arbitrarily large. (Admittedly there is an
indirect limit because PageNum is defined as an integer, but
it should be possible to change the type of PageNum -- say to
a long integer -- without changing your code.)
File Scans
You may find it helpful to keep in mind how file scans (as implemented
by class RM_FileScan) will be used by higher-level RedBase
components. A file scan will be used either to iteratively return all
tuples of a relation stored in an RM file, or to return all tuples in
the relation that satisfy a given condition. Scans will be used when
processing SQL-like Select, Delete, and
Update statements, so each record returned will either become
part of a query result, will be deleted, or will have a value changed.
The insert statement will add a single tuple at a time, so
records will not be inserted via a file scan.
As described in the RedBase
Logistics document, each student is expected to:
- Produce a 1-2 page description of your design and testing in a
plain text file called rm_DOC.
- Include comments in your code.
The first requirement enables the TA to understand how you have
designed your program, how you tested it, and any known bugs.
Remember, the clearer this description is, the faster the TA will be
able to evaluate your code, and the better mood he'll be in while he
does it! In the code itself you should include enough comments to
delineate and explain the functional units of the code, but you don't
need to go overboard.
Please do not duplicate documentation. Your rm_DOC file
should describe your overall design, novel features, key data
structures, etc., while comments in the code should serve to explain
the lower-level mechanical details. Also, as per the Honor Code
statement provided in the General
Information page, don't forget to cite in rm_DOC any
assistance you received (from the instructor, TA, other students in
the course, or anyone else) in program design or debugging.
As a reminder from the RedBase
Logistics document, the grade for your program will be based on
four aspects:
- Functionality - that the program implements the
specification and passes the TA's test suite
- Documentation - as described above
- Design Choices - based on an email message from the TA
requesting details on particular aspects of the implementation (see RedBase
Logistics)
- Modularity/Correctness - usually full credit if your code
is reasonably modular and doesn't have gaping inefficiencies
For your convenience, we have provided an initial set of tests for
your RM component. The tests are in rm_test.cc, retrieved
when you run the setup script for the first part of the
project. Although these tests exercise some aspects of the RM
component, they are not comprehensive. Thus, we strongly
encourage you to augment them with a number of additional tests of
your own in order to thoroughly exercise your code. During the
grading process, we will use a much more comprehensive test suite in
order to check your component for correctness.
Submission of the RM component -- along with all other project
parts -- will take place electronically. Instructions are in the RedBase Logistics document. Your
rm_DOC documentation file should be submitted with your
code. The TA will be linking test code to your submitted
librm.a
, which should be the result of running
"make
" from within your submission directory. Everything
must link correctly if the test program calls methods in
rm.h
. Also, please ensure that you have not
removed the -DPF_STATS flag from the Makefile prior to
compiling the code that you submit.
RedBase I/O Efficiency Contest
|
If you're angling to win the RedBase Efficiency Contest then it is
important to consider I/O efficiency in every component, starting with
this one. Remember that one I/O is counted each time a page is read
into or written out from the buffer pool. There are a number of
places where you can save one, two, or even more I/O's by careful
coding of RM component methods. Saving a few I/O's now will pay off
greatly as you code the higher-level RedBase components.