Written by Julie Zelenski
You have read our advice on "Best Practices" for 107 assignments, haven't you?? :)
Our provided CVector is a full-featured, memory-managed, re-sizable array and CMap is a super-efficient hash table. Each is a clean package that abstracts away the low-level details, allowing you to put your energy toward solving more interesting problems. Data abstraction is a power tool, and one you have taken advantage of throughout the CS106 courses. What makes it a bit different in C is that these generic collections have to operate using a void* interface that requires the client stay on their toes at all times.
These are two short programs (our solution is only a total 3 pages of code) and the handy CVector and CMap do a lot of the heavy lifting. However, don't mistakenly assume this will be a breeze. The code can be dense and it's frighteningly easy to make mistakes when handling memory using void* pointers. Go slow, develop in careful steps, do lots of testing along the way, and keep gdb and Valgrind in the loop.
Start by studying the interfaces in cvector.h and cmap.h. Using them as a client requires a solid understanding of how void* is used and steadfast vigilance about details. We strongly recommend doing this fill-in-the-blank warmup exercise before you even think about coding.
char* pointer in more than one collection), but this necessitates that you coordinate the sharing to avoid surprises such as one collection freeing the strings out from under another. Alternatively, you can make distinct copies to avoid aliasing issues. Either is fine with us, but be sure to have a plan and document it.CVector and CMap implementations, so be sure to read the .h header files and leverage those features rather than rolling your own.void* retrieved from the collection back into its specific type pointer) but don't become enamored with sprinkling them about. When you use a typecast, you are taking full responsibility for type-correctness and suppressing any help you might have otherwise gotten from the compiler. Use typecasts exactly and only when you must.const. The const qualifier on the void* is an indication that the pointee contents is only to be read, not written. Never cast a function pointer to get around type mismatch; instead, change the function prototype to correctly match.void* , but you know they are really int* or struct binky * . You could add a typecast to the argument everywhere you use in the function body, but a better strategy declares a local variable of the correct pointer type, assigns it once from the parameter, and uses the strongly-typed local in the rest of the body.The data on student submissions below is from last quarter. The assignment has been re-designed this quarter, but retains similar elements. Your mileage may vary.
Time spent. Student hours spent (self-reported):
A symbolic link (such as the slink we provide in your repos) is a directory entry that is a synonym for another path. It's like the filesystem version of a pointer. You can create a symbolic link using ln -s srcpath dstname. The directory /afs/ir/class/cs107/samples/assign2/loopy contains a variety of nested and circular links that might be useful for testing.
Yes. Your output should always list the same number of files as the sample. However, it might refer to some files by a different full path than the sample when multiple paths lead to that file. The full path chosen is dependent on which path was returned earlier from readdir and readdir makes no guarantee about the order it returns the directory entries. It has stable behavior per-host, so your program and the solution running on the same host should produce the same paths. Running on different hosts could mix things up, but rest assured the permutation in output will not be an issue in grading. A change in order of traversal will never cause a difference in number of files found, so those such mismatches are a sign of programming error.
This little program does a surprisingly hefty amount of work, especially when searching a larger filesystem. The searchdir program makes calls to opendir call to traverse the directory entries. Behind the scenes, opendir makes a large allocation (~32K per call, freed by closedir) causing heap usage to balloon when traversing a deeply-recursive directory structure. A directory with 100 subdirectories is a 3MB hit to traverse. The directory /usr/include has over 1400 subdirectories so be prepared for chewing through ~45MB of memory! Don't be alarmed! As always, measure the sample using time and Valgrind to get a benchmark on what is reasonable. It would be unwise to attempt to search an enormous directory such as /afs/ir/class/ or /. We will not test on such cases and don't expect you to either :-)
C struct declarations are almost, but not exactly, the same as C++. In C, the following declares a new struct type:
struct coord {
int x, y;
}; // don't forget to end with semi-colon, without it you'll be mired in compiler errors
The name of the type is struct coord and you cannot drop the struct keyword; declaring a coord will not compile. You can wrap the struct definition in a typedef to set up a nickname:
typedef struct coord {
int x, y;
} Coord; // need to end with semi-colon here, too
This second version declares struct coord and gives it the nickname Coord. Either name (struct coord or Coord) can be used for the same type. The . (dot) operator is used to access the fields within a struct. If you have a pointer to a struct, your first attempt to access the fields is likely to run afoul of the fact that dot has higher precedence than * (dereference). You can add parentheses to force the desired precedence, or better, use the -> operator which combines . and * for this common need.
Coord origin; // allocates struct on stack
Coord *p = malloc(sizeof(Coord)); // allocates struct in heap (note: sizeof works for correctly for structs)
origin.x = 0; // access field from struct variable
// these next 3 lines access field via struct pointer
*p.x = 0; // this line doesn't compile because default precedence applies . first then *
(*p).x = 0; // this line works, use parens to override default precedence and apply * first, then .
p->x = 0; // this line shows the preferred way to access field via struct pointer
origin = *p; // btw, can copy a struct using ordinary assignment (copies all fields in one go)
*p = origin;
The raw declaration of function types can be a bit goopy so CVector and CMap use typedef nicknames. For example, the cleanup function for CVector takes one void* parameter and returns void. The typedef below establishes the nickname CleanupElemFn as synonymous to the raw declaration of a void (*)(void *) function type.
typedef void (*CleanupElemFn)(void *addr);
This typedef allows parameters and variables to be declared using the typename CleanupElemFn. To write a cleanup function as a client, you merely need to define a function with the correct prototype. The function's name is used to access its function pointer . A function pointer is compatible with any function type matching its prototype.
void byebye(void *ptr)
{
Binky *b = (Binky *)ptr;
// do cleanup of b here7f
}
int main()
{
CVector *all_my_binkys = cvec_create(sizeof(Binky), 10, byebye);
The standard library qsort, bsearch, and lfind/lsearch functions and CVector sort and search all use the same format for the comparison callback. It is a two-argument function that take pointers to elements (const void *) and returns an integer which indicates whether the first element is less than, equal, or greater than the second. Here is a sample callback for comparing double elements:
int cmp_dbl(const void *p1, const void *p2)
{
double d1 = *(double *)p1, d2 = *(double *)p2;
if (d1 < d2) return -1;
if (d1 > d2) return 1;
return 0;
}
initialization discards ‘const’ qualifier from pointer target type. How do I resolve these?The const qualifier on a declaration is an indication that type is only to be read, not written. A const char * and a char * are not quite the same type. You can read characters from either, but only the non-const version allows the characters to be changed. While it is perfectly fine to supply a non-const pointer where a const is expected (no cast required or recommended), the inverse will raise a warning from the compiler. Throwing down a typecast to const would quiet the compiler, but that's not the appropriate fix. If write-access is required, a read-only pointer is not going to work and a cast only serves to cover up the mistake. You should instead be fixing the logic to supply the correct pointer. If write-access is not needed, then the simple and correct choice is to add the const qualifier to the declaration.
If the target word is not in the corpus (for whatever reason), print an empty list of synonyms and move on. No need to report or handle as an error.
If you want to put "widgets" in your CVector/CMap, the preference is to directly store the actual widget itself. Storing pointers when the value itself could have instead been stored is adding an unnecessary level of indirection, resulting in extra overhead and added complexity. For what reasons might it be required/appropriate to instead store a pointer to widget stored elsewhere?
void*!You will always refer to the element by its address (a void*) when adding to or retrieving it from the container, but that does not imply that the element type is itself must be pointer. The CVector and CMap are capable of storing elements of any type: int, struct, fixed-size array, pointer, and so on. When you pass/retrieve the element, you refer to it at a level of indirection away from its base type, thus an int element will be retrieved as a int *, a struct * element as a struct **, and so on.
The internal structure of these types is opaque to the client (or at least will be until you complete assign3 :-), and not well-suited for you to manually dissect in the debugger. Instead, we recommend that you write your own debugging functions print_map and print_vec that iterate and print the elements. You can then call these functions from within the debugger as needed:
(gdb) p print_map(mymap)
Try to use scanf as fully as you can. Consider the requirement that words be in lowercase. If you scan using the loose %s format to accept all non-white chars, it will require additional processing to check that the chars scanned were lowercase. If the format string applies a scanset of lowercase letters, a non-lowercase input simply fails to scan. It is true that scanf can be a little bit goopy to use, but its powerful features provide a lot of leverage. Try to make the most of them!
Given that you need the count temporarily as an input when computing prevalence and then never again, it seems best to compute and use in a local context without permanently storing it. At the time you read a bigram, you can store its frequency, but will not yet be able to assign its prevalence, as the final total count for the word's bigram is pending. After having read the entire file, run a one-time post-processing step to go back and fill in the field. Iterate over a word's bigrams to sum frequencies into total count, then iterate again to fill in the prevalence field based on the ratio of frequency to total count. Do this for every word in the corpus. The idea is to do some upfront work now to enable later efficient computation of similarity scores.
It will take until next week for us to delve into floating point types and learn how naive equality has issues and why the compiler squawks about it. For now, we're good with naive but hate compiler warnings, so will disguise our intentions from the compiler with this little trick: transform if (a == b) <equal> else <notequal> into the equivalent if ((a < b) || (a > b)) <notequal> else <equal>.
During the calculation of similarity, non-terminating binary decimals 1/3, 3/28, etc. have to be rounded. Depending on which data type is used (float versus double) and the order of operations, that rounding is subtly different and can result in a similarity score ever-so-slightly higher/lower when compared to a calculation done in a different manner. The upshot is that the contents of leader board can be jiggered between you and the sample due to floating point discrepancy. Assume such discrepancies will be ignored in grading. To get reliable outcomes for testing yourself, choose words with denominators (total bigram count) that are powers of 2, those ratios are exactly representable and no rounding will interfere. Next week, we'll dig into what's going on with those wacky floating point types!
Add a candidate to the leader board if there are unfilled slots or if the candidate knocks an existing entry off the board because it has higher priority. Although a 0-similarity candidate might get added to the leader board early in the process, it won't often survive to the final list, but either way, don't make a special case out of adding or not adding it, just treat like all other candidates.
The CMap is very efficient as long as optimized for the proper number of entries (set when calling cmap_create), but performance can critically degrade if configured for the the wrong capacity. If you aren't sure of the eventual number of entries for a CMap, use 0 as the capacity hint when creating it. A 0 capacity hint enables a special adaptive sizing behavior that will maintain efficient performance across a wide range of load.
This symptom could indicate that you have specified the wrong size to cvec_create/cmap_create. To create a container to store elements of structname, use sizeof(structname). If you use a size that is too small, the container will only store the first bytes of the struct, not the entire contents.