CS107 assign2: mywhich


Your final task is to use your scan_token and get_env_value functions to implement the mywhich.c program, a simplified version of the Unix which command. It takes the names of executables (e.g. make, cat, emacs) and prints their filesystem locations. Read the man page for the Unix version (man which) if you'd like, but note that your mywhich program differs somewhat from the full which behavior.

Try out the provided sample solution, e.g. ./samples/mywhich_soln ls or ./samples/mywhich_soln make. For each command name, it prints the full path to the first matching executable it finds, or nothing if no matching executable was found. The matched executables are listed one per line, in the order the command names were given on the command line. In the example below, two of the three were found; no executable named submit exists in any directory in the user's PATH, so nothing is printed for it.

myth$ ./samples/mywhich_soln emacs submit cp
/usr/bin/emacs
/usr/bin/cp

If no command line arguments are specified, mywhich prints out the directories in the search path, one per line.

This search is intimately related to how commands are executed by the shell. When you run a command such as ls or emacs, the shell searches for an executable program that matches that command name and then runs that program.

Where does it search for executables? You might imagine that it looks for an executable file named emacs in every directory on the entire filesystem, but such an exhaustive search would be both inefficient and insecure. Instead, it looks only directly inside the directories explicitly listed in the user's PATH environment variable. The value of PATH is a sequence of directories separated by colons, such as PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/games. When looking for a command, which considers the directories in the order they are listed in PATH and stops at the first one that contains a matching executable. In other words, here mywhich would first check whether /usr/local/bin/emacs exists. If it does, mywhich prints it and stops. If it doesn't, mywhich checks /usr/bin/emacs, then /bin/emacs, and so on. A library function called access, which can tell you whether a given executable path is valid, will come in handy (more on this later). Note that this process is not an exhaustive search of all files directly or indirectly contained in a directory like /usr/local/bin. It only checks whether the specified executable name, e.g. emacs, exists directly inside the specified directory, e.g. /usr/local/bin/emacs.
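To make the left-to-right search order concrete, here is a minimal sketch of walking a colon-separated path one directory at a time. The helper next_dir is hypothetical and is not the assignment's scan_token; it uses the standard strcspn function as a stand-in and assumes a well-formed path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (a stand-in, not scan_token): copies the next
 * colon-separated directory from *p_path into buf and advances *p_path
 * past it. Returns false once no directories remain. A directory longer
 * than buflen - 1 characters is truncated to fit. */
bool next_dir(const char **p_path, char buf[], size_t buflen) {
    const char *cur = *p_path;
    if (cur == NULL || *cur == '\0' || buflen == 0) {
        return false;
    }
    size_t len = strcspn(cur, ":");            // characters before ':' or end
    size_t ncopy = len < buflen - 1 ? len : buflen - 1;
    memcpy(buf, cur, ncopy);
    buf[ncopy] = '\0';
    cur += len;
    if (*cur == ':') {
        cur++;                                 // step over the separator
    }
    *p_path = cur;
    return true;
}
```

A caller would loop with while (next_dir(&remaining, dir, sizeof(dir))), seeing /usr/local/bin first, then /usr/bin, and so on, in the same order as they appear in PATH.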

PATH is set by default in your environment to include directories such as /usr/local/bin/ and /usr/bin/, which house the executable files for the standard Unix commands. (The name bin is a nod to the fact that executable files are encoded in binary.) For ease of testing, mywhich also supports the environment variable MYPATH: if it is set, mywhich searches it instead of PATH, so you can customize the search path without changing your PATH environment variable (which could break other shell functionality). You can set it with the env command, like this:

myth$ env MYPATH=/tmp:tools ./mywhich submit
tools/submit
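The choice between MYPATH and PATH might be sketched as follows. This is only an illustration: it uses the standard library's getenv as a stand-in for your get_env_value, and the function name search_path is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: returns the value of MYPATH if it is set,
 * otherwise the value of PATH. Uses getenv as a stand-in for
 * get_env_value. Returns NULL if neither variable is set. */
const char *search_path(void) {
    const char *path = getenv("MYPATH");
    if (path == NULL) {
        path = getenv("PATH");     // fall back to the normal search path
    }
    return path;
}
```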

There are several core string tasks in this program:

Getting the value of the MYPATH or PATH environment variable - the starter code uses your get_env_value function to do this
Tokenizing the specified path to get each individual location you need to search - you will use your scan_token function to do this
Creating the full path that you wish to check - e.g. taking an individual location like /usr/local/bin and an executable name like emacs and constructing a path from the individual location, a forward slash, and the executable name: /usr/local/bin/emacs. You can then pass that path as a parameter to the access function to check if it's valid.
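The third task, joining a directory and an executable name, can be sketched with snprintf. The helper name build_path is hypothetical, and snprintf is just one of several ways to do the concatenation; the return-value check reports when the result would not have fit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "dir/name" into buf. Returns false if the
 * result would not fit in buflen bytes, counting the null terminator. */
bool build_path(char buf[], size_t buflen, const char *dir, const char *name) {
    // snprintf returns the length the full string needs, even if truncated
    int needed = snprintf(buf, buflen, "%s/%s", dir, name);
    return needed >= 0 && (size_t)needed < buflen;
}
```

For example, calling build_path with the directory /usr/local/bin and the name emacs leaves /usr/local/bin/emacs in the buffer, provided the buffer is large enough.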

Starter Code

mywhich.c is given to you with an incomplete main function that handles the case when mywhich is invoked with no arguments. You should first read and understand this code and then work out how to change/extend it to suit your needs. We don't require extensive commenting of your implementation unless there's something particularly clever or dense about how something is written.

Note that you can (and are encouraged to!) change code in mywhich as you wish, to decompose it, etc. Your goal should be to have your main function act as a concise summary of your overall program.

Some concepts to think about when looking over the code:

When applied to an array, the sizeof operator conveniently returns the actual size of the array in bytes. However, as soon as that array is passed as a parameter (where it becomes a pointer to the first element), or as soon as we create a pointer to any of its elements, sizeof of that pointer returns 8 instead of the array size, because a pointer is 8 bytes on a 64-bit system. Additionally, note that if the array holds a string, the array size is not necessarily the same as the string length.
If the user's environment does not contain a value for MYPATH, what does mywhich use instead?
How does a client properly use scan_token? (see sample uses in both tokenize.c and mywhich.c)
Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!
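The sizeof point in the first item above can be checked with a small sketch. The function name size_via_pointer is hypothetical and exists only to show what a callee sees.

```c
#include <stddef.h>

/* Hypothetical helper: the caller may pass an array, but inside the
 * function the parameter is only a pointer, so sizeof reports the size
 * of a pointer (8 on a 64-bit system), not the caller's array size. */
size_t size_via_pointer(const char *str) {
    return sizeof(str);
}
```

Given a declaration such as char dir[64] = "/usr/bin"; the caller sees sizeof(dir) as 64 and strlen(dir) as 8, while size_via_pointer(dir) returns sizeof(char *). Three different numbers describe the same buffer, which is why functions that fill a buffer usually take its size as a separate parameter.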

The code we provide has been stripped of its comments and it's your job to provide the missing documentation.

Implementation

The program should be implemented as follows:

If there are no command line arguments, the program prints the directories in the search path, one directory per line. This is already implemented for you in the starter code.
If there are command line arguments, the program searches for a matching executable for each argument in the order they were specified, and prints the full path to the first matching executable it finds or nothing if no match was found. To do this, for each argument:
- Take the search path (the value of MYPATH, if it exists, or of PATH otherwise) and tokenize it using your scan_token function and a buffer of size PATH_MAX. PATH_MAX is the system's limit on the maximum length of a full path (including the null terminator). For each token (which is a single directory path):
  - Use that large buffer to construct the full path: e.g. if the token is /usr/local/bin and the executable name is emacs, you want to construct the path /usr/local/bin/emacs. You may assume the constructed path will fit in the PATH_MAX-sized buffer.
  - Use the access function to check if that executable path is valid.
    - If it is, print out that path and move on to processing the next command line argument.
    - If it's not, try searching again with the next token.

Note that you should not store all the path tokens in an array while tokenizing - you should perform the searches as you tokenize. For this reason, note that if there are multiple command line arguments, you will repeat the tokenization of the search path for each argument, and that's fine. You may assume that the user's MYPATH / PATH variables are always well-formed sequences of one or more paths separated by colons.
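The idea of searching while tokenizing, rather than collecting the directories first, might look roughly like the sketch below. It is not the expected solution: the name find_executable is hypothetical, and the directories are split inline with strcspn instead of your scan_token.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: tries each directory in searchpath, in order, and
 * stops at the first one where "dir/name" is readable and executable.
 * On success the full path is left in result. On failure the contents
 * of result are unspecified. No list of directories is ever stored. */
bool find_executable(const char *searchpath, const char *name,
                     char result[], size_t resultlen) {
    const char *cur = searchpath;
    while (*cur != '\0') {
        size_t dirlen = strcspn(cur, ":");     // length of this directory
        int needed = snprintf(result, resultlen, "%.*s/%s",
                              (int)dirlen, cur, name);
        bool fits = needed >= 0 && (size_t)needed < resultlen;
        if (fits && access(result, R_OK | X_OK) == 0) {
            return true;                       // first match wins
        }
        cur += dirlen;
        if (*cur == ':') {
            cur++;                             // step over the separator
        }
    }
    return false;
}
```

Calling a function like this once per command line argument repeats the walk over the search path each time, which matches the note above that repeated tokenization is fine.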

Here's more information about the access function: access is a function defined by the POSIX standard, which establishes a set of C functions for interacting with the operating system. Whereas the standard C library functions provide only simple file reading/writing, the POSIX functions add more comprehensive services, including access to filesystem metadata (e.g. modification time, who can access files), directory contents, and filesystem operations that are necessary for implementing Unix commands like ls and mkdir, which are themselves just executable programs. The function access has the following signature:

int access(const char *pathname, int mode);

It takes in a path, pathname, and permissions, mode, and returns whether or not you have those permissions for the file at that path. To use access to check if an executable path is valid, we will be asking access to check whether we can read and execute the file at that executable path. If we can, it means an executable exists at that location. Otherwise, we assume none exists there.

Therefore, when you call access, the first parameter should be the constructed executable path you wish to check (e.g. /usr/local/bin/emacs), and the second parameter should be a bitmask that is a combination of the bitwise constants R_OK and X_OK (a value with the bits in both of these constants on). In this way, we specify that we want access to check if we have "read" and "execute" permissions for that file.

Be sure to carefully read the man page so you know how to properly interpret the return value from a call to access!
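As a minimal sketch of such a call (the wrapper name is_executable is hypothetical): per the man page, access returns 0 when every requested permission is granted, and -1 with errno set otherwise.

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: true if the calling user may both read and
 * execute the file at path. R_OK | X_OK sets the bits of both
 * constants, so access checks both permissions in one call. */
bool is_executable(const char *path) {
    return access(path, R_OK | X_OK) == 0;
}
```

Note that a return value of 0 signals success here, so treating the result of access directly as a boolean would invert the check.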

Testing

For this part, you should add at least 3 additional tests of your own in the custom_tests file that show thoughtful effort to develop comprehensive test coverage.

If you want to use the default PATH, you can look in locations specified in PATH and pick executables to test with. For instance, you can ls -l /usr/bin to see a list of things in /usr/bin and pick some to test with.

But you can also test using env and MYPATH to easily specify custom search paths. If you specify locations like tools or . in MYPATH, you can then refer to test files in those locations (like submit, which is in tools, or your own mywhich executable file, or your source file mywhich.c). Sanity check tests I and J do something similar: Test I searches in tools and /bin, and it should find date in /bin and submit in tools (this is the tools folder in your own assignment project folder, which contains an executable called submit). Check out test J for another example.

You can use these ideas in your own tests if you'd like, e.g. to refer to files in your assignment folder that are/aren't executable, etc.