Note: one important requirement for this part is, before you make any changes to the tokenize_util.c file, to write and submit a portion of your custom tests for this part. See the testing section below for important information about how to do this.
Overview
Your second task is to test, implement and write manual page documentation for a function scan_token in tokenize_util.c, with the following signature:
bool scan_token(const char **p_input, const char *delimiters,
char buf[], size_t buflen);
scan_token is an improved version of strtok from lab2. A function that tokenizes a string by delimiters is handy to have, but the standard strtok has design flaws that make it difficult to use. The intention of scan_token is to separate a string into tokens in the manner of strtok, but with a significantly cleaner design.
scan_token takes a pointer to a string and the delimiters to use to tokenize it, writes the next token from the string into the specified buffer buf, and returns true; if no more tokens remain, it returns false. If a token does not fit in buf, scan_token writes as much as fits (buflen - 1 characters plus a null terminator); the remaining characters can be read via future call(s) to scan_token. Note that scan_token's first parameter is a double pointer, a pointer to a char *. This is necessary because scan_token needs to change the char * itself to advance it past characters that it has previously scanned. The caller thus calls it several times to tokenize the entire string. Here is an example:
const char *input = "super-duper-awesome-magnificent";
char buf[10];
const char *remaining = input;
while (scan_token(&remaining, "-", buf, sizeof(buf))) {
printf("Next token: %s\n", buf);
}
// once we get here, `remaining` is the empty string
Running the above code produces this output:
Next token: super
Next token: duper
Next token: awesome
Next token: magnifice
Next token: nt
You may assume the following about the parameters to scan_token, and do not need to check to ensure these are true:
- buf is always a valid address to a region of memory that has space for buflen characters
- buflen is always greater than 1
- p_input is always a valid pointer to a pointer
- *p_input is always a well-formed (e.g. null-terminated) C string (it may be the empty string)
- delimiters is always a well-formed C string containing one or more delimiter chars (i.e. it will never be the empty string)
Note that even if you wish to add checking for some of these assumptions, e.g. determining whether p_input is valid, or that buf actually has buflen characters of space, it's tough to do. Determining whether a pointer is valid, for instance, is not solvable in general, and any measure to detect bad pointers will be half-hearted at best. As the implementer, at times you have little choice but to clearly document your assumptions and assume the client will adhere to them, and write your code accordingly.
Testing
Note: for full custom tests credit, you must submit some of your custom tests for scan_token before making ANY changes to tokenize_util.c. See below for details.
For this part, you should add at least 5 additional tests of your own in the custom_tests file that show thoughtful effort to develop comprehensive test coverage. Additionally, to encourage writing some tests before writing code and further tests after, at least 3 of these tests must be written and submitted before making any changes to the tokenize_util.c file, and at least 2 more tests must be added in your final assignment submission.
To submit your initial custom tests, run the submit tool like this: tools/submit custom_tests
This submits all files and marks that submission as the one to review when grading your initial test cases. To receive credit, the tokenize_util.c file must not have been modified at any point leading up to this submission. During grading, we will use the scan_token custom tests included in this submission, confirming via the automatic backups made by the tools that tokenize_util.c was not edited before this point, and we will also grade the work in your final submission. You may run tools/submit custom_tests multiple times if you'd like - we will use the latest custom tests submission to grade your initial custom tests. Then, to submit your final submission, run tools/submit.
This function can be tested in isolation with the provided tokenize.c program, which you do not need to modify or comment, but whose code you should read over to understand how it calls your scan_token function. You can also write sanitycheck custom tests with tokenize.
If you execute ./tokenize, it will use your scan_token function to calculate the number of syllables of various test words. You can also run it by specifying other text you would like to use to test, in this format:
./tokenize <DELIMITERS> <TEXT> <BUFSIZE (OPTIONAL)>
For example, if you would like to tokenize the text "hello I am a C-string" using the delimiters "-" and " ", you could run:
./tokenize " -" "hello I am a C-string"
The first string contains the characters to use as delimiters, and the second string is the text to tokenize. This command should output something like:
./tokenize " -" "hello I am a C-string"
Tokenized: { "hello" "I" "am" "a" "C" "string" }
remaining:
You may optionally specify a third argument which is the size of the buffer to pass when tokenizing. If you do not include this command line argument, the buffer is sized to always have enough space to store the entire token.
Implementation
The function should be implemented as follows, using appropriate string.h functions (see our standard library guide) - the first two steps borrow from how strtok is implemented:
- scan the input string to find the first character not contained in delimiters. This marks the beginning of the next token.
- scan from that point to find the first character contained in delimiters. This delimiter (or the end of the string, if no delimiter was found) marks the end of the token.
- write this token as a valid C string to buf, which has space for buflen characters. scan_token should not write past the end of buf.
    - If a token does not fit in buf, the function should write buflen - 1 characters into buf and write a null terminator in the last slot.
- update the pointer pointed to by p_input to point to the next character in the input that follows what was just scanned.
    - If the scanned token consumed all of the remaining input, *p_input should point to the input's null terminator.
    - If the scanned token was too big to fit entirely in buf, then *p_input should point to the character in the input immediately after the buflen - 1 characters that fit in buf. In other words, the next token scanned will start at the first character that would have overflowed buf.
- return true if a token was written to buf, and false otherwise.
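The steps above can be sketched with the string.h span functions strspn (count leading characters that are all in a set) and strcspn (count leading characters that are all not in a set). This is one possible shape of the implementation, not the required one - names and structure here are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Sketch of scan_token following the steps above. Assumes the documented
// preconditions hold (buflen > 1, valid pointers, null-terminated strings).
bool scan_token(const char **p_input, const char *delimiters,
                char buf[], size_t buflen) {
    // Step 1: skip leading delimiters to find the start of the next token.
    const char *start = *p_input + strspn(*p_input, delimiters);
    if (*start == '\0') {        // only delimiters (or nothing) remain
        *p_input = start;        // leave the pointer at the null terminator
        return false;
    }

    // Step 2: the token runs up to the next delimiter (or end of string).
    size_t ntoken = strcspn(start, delimiters);

    // Step 3: write at most buflen - 1 characters plus a null terminator.
    if (ntoken > buflen - 1) {
        ntoken = buflen - 1;     // truncate; leftover chars stay in the input
    }
    memcpy(buf, start, ntoken);
    buf[ntoken] = '\0';

    // Step 4: advance the caller's pointer past exactly what was consumed,
    // so a truncated token's remainder is picked up by the next call.
    *p_input = start + ntoken;
    return true;
}
```

Note that this sketch uses no static or global state and never writes to the input string, which is exactly what distinguishes scan_token from strtok.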
scan_token should not emulate the bad parts of strtok's design. Specifically, it should not use static or global variables and should not modify the input string's characters.
Before moving on: make sure you have thoroughly tested your scan_token function, making sure to cover various different cases of possible inputs, and that you have written your custom tests. You will use this function later in the assignment, so it's vital to ensure it's thoroughly tested before moving on!
Documenting scan_token
When functions have assumptions, limitations or flaws, it is vital that the documentation makes those clear. Otherwise, developers don’t have the information they need to make good decisions when writing their programs. For example, one of the design flaws of strtok is that it modifies the characters of its first argument. Luckily, this is documented in the BUGS section of the man page (though it should perhaps be emphasized more than just as a minor detail). If we were unaware of this flaw, we might assume the argument wasn't modified, breaking other parts of our program or even introducing potential vulnerabilities.
For this next part of the assignment, your task is to write a "manual page" for your scan_token function. Function documentation like this is different from comments in your actual program code. While header, inline and other comments should be brief and standalone, a manual page reference is more thorough and cohesive. In manual pages with multiple sections, text at the beginning of a section should explain some of the concepts, and should often make some general points that apply to several functions or variables. Additionally, manual page documentation should be written more formally than code comments. As the GNU standard explains, "the only good way to use [code comments] in writing a good manual is to use them as a source of information for writing good text."
In your readme.txt file, we have provided a template outline for your manual page. Fill in the remaining components to fully document your scan_token function. Here is the starter template, for reference:
scan_token DOCUMENTATION
INSTRUCTIONS: Fill in the sections marked with a TODO below.
Your documentation should be original (i.e., please do not copy and paste from the assignment spec).
NAME
scan_token - # TODO write a one-sentence description of scan_token
bool scan_token(const char **p_input, const char *delimiters,
char buf[], size_t buflen);
ARGUMENTS
const char **p_input - #TODO: write one sentence explaining the p_input argument
const char *delimiters - #TODO: write one sentence explaining the delimiters argument
char buf[] - #TODO: write one sentence explaining the buf argument
size_t buflen - #TODO: write one sentence explaining the buflen argument
RETURN VALUE
#TODO: write a 1-3 sentence description of the possible return values of scan_token.
Make sure to include a description of what will be stored in the buf argument upon return.
ASSUMPTIONS
#TODO: write 2-5 sentences explaining the assumptions made by your scan_token function.
Here is an example: The scan_token function assumes that the buf argument
has space for buflen characters.
DESCRIPTION
#TODO: write one paragraph explaining the implementation of your scan_token function.
This section should include (high-level) implementation details. You can use your function-header
comment as a starting point for this section.
Tip: when you need to use scan_token later on in your mywhich program, try referring to just the manual page you wrote here. If you find that you need more information in order to effectively use the function, consider adding what might be missing. Your goal for your manual page reference should be that a client can effectively use your function without seeing the code, just like how you can use string functions without seeing their implementations.