Assignment Grading

Written by Julie Zelenski, with modifications by Nick Troccoli

We know that you will invest a lot of time into completing the programming assignments and the results of those efforts will be your biggest source of accomplishment in this course. We view grading as an important commitment we are making to help you as a programmer. Our grading process is designed to thoroughly exercise your program and the staff will review your code to provide comprehensive and thoughtful feedback. Our grade report will highlight the strengths of your work, as well as point out the ways in which you can improve. Here is some more information about our standards and grading process.

We evaluate each submission in two broad categories: functionality and style ("code review"). For functionality, we assess your program's effectiveness from an external perspective: we are not looking at the code, but testing its behavior. For style, we apply quality metrics and do an individual review of your code to appraise its design and readability.

Functionality

Functionality measures how successfully the program executes on a comprehensive set of test cases. We create our test suite by working from the original program specification to identify a list of expected behaviors and write a test case for each. We use the autotester to run a submission on each test and award points for each successful result. Thus, the resulting functionality score is a direct reflection of how much observably correct behavior your program exhibited. This process is completely automated; the grader does not search your code to find bugs and deduct for them, nor do they appraise the code and award points for tasks that are attempted or close to correct.

Our functionality test suite is a mix of:

  • sanity
    Sanity is intended to verify basic functionality on simple inputs. These tests are published to you as the sanity check tool.
  • comprehensive
    This is a battery of tests, each targeted at verifying a specific functional requirement, such as case-insensitive handling or the expected response for an unsuccessful search. Taken together, the comprehensive tests touch on all required functionality. We architect these tests to be as independent and orthogonal as possible to avoid interference between features.
  • robustness
    These tests verify the graceful handling of required error conditions and proper treatment of exceptional inputs.
  • stress
    The stress cases push the program harder, supplying large, complex inputs and exercising program features in combination. Anything and everything is fair game for stress, but the priority is on mainstream operation with little focus on extreme/exceptional conditions.

You may ask: does this scheme double-ding some errors? Yes, a program that fails a comprehensive case often fails a stress case due to the same root cause. But such a bug was prevalent in both simple and complex scenarios, which means there was more opportunity to observe it and it was more detrimental to the program's overall correctness. In other words, we are ok with applying both deductions and have configured the point values with this in mind. Contrast this with a bug that only triggers in a narrow context and thus fails just one test. The smaller deduction indicates this bug was more obscure, harder to uncover, and only slightly diminished correctness. The total functionality score is not computed by counting up bugs and penalizing them; we award points for observable, correct functionality. One submission could lose a large number of points for several failures all stemming from a single, critical off-by-one error; another could lose the same number of points due to a plethora of underlying bugs. Both submissions receive the same score because they demonstrated the same amount of "working/not working" behavior when tested.

What about partial credit? Earning partial credit depends on having correctly implemented some observable functionality. Consider two incomplete submissions. One attempts all the required functionality but falls short (imagine it had a tiny flaw in the initial task, such as reading the input file, which causes all output to be wrong even though the rest of the code may have been totally correct). This submission would earn few functionality points. The second attempts only some features, but correctly implements those. This submission can earn all the points for the cases that it handles correctly, despite the broken/missing code for the additional requirements. This strongly suggests a development strategy where you add functionality feature-by-feature rather than attempting to implement all the required functionality in one fell swoop. This is the "always have a working program" strategy. During development, the program will not yet include all the required features, but the code so far works correctly for the features that are attempted. From there, extend the program to handle additional cases to earn additional functionality points, taking care not to break any of the currently working features.

The Autotester

Understanding the process we use can help you properly prepare your project for submission and grading. Each test is identified with a short summary description of what was being tested. If you would like to know more about a particular case, you may ask over email or, better, come to helper hours, where we can manually re-run the test and walk you through it. Note that we do not give out test cases as part of maintaining a level playing field for future students.

Output conformance is required. The autotester is just a fancy version of our sanity check tool. It executes your program and compares its output to what is expected. It is vital that your program produce conformant output in order to be recognized as the exact match the autotester is looking for. If you change the format or leave stray debugging printfs behind, your output may be misread. To avoid this happening to you, be sure to follow the assignment spec exactly, run the sanity check tool, and follow through to resolve any discrepancies. We do not adjust scores that were mis-evaluated because the submission didn't conform to sanity check.
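
To make the exact-match requirement concrete, here is a contrived sketch (not taken from any actual assignment) in which the spec hypothetically calls for a single line of output; the one leftover debugging print is enough to prevent the autotester from recognizing a match.

    #include <stdio.h>

    int main(void) {
        int matches = 3;                      // stand-in for a computed result
        printf("debug: finished search\n");   // stray debugging output -- delete before submitting
        printf("%d\n", matches);              // the only line the (hypothetical) spec asks for
        return 0;
    }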

Pass/Fail Scoring. Each automated test is scored as either passed or failed, without gradations for "more" or "less" wrong, as such distinctions are difficult to automate. Is missing some lines of output better than having the right lines in the wrong order, or than producing the correct output and then crashing? All of these are scored by the autotester as incorrect for that test.

Timeout. The autotester employs a hard timeout to avoid stalling on potentially infinite loops. The hard timeout is generous but not unbounded (typically 10x the runtime of the sample solution). A grossly inefficient program that executes more than an order of magnitude more slowly than the reference runs the risk of losing functionality points due to triggering the hard timeout on tests it was unable to complete in time. We do not re-run tests with longer timeouts to accommodate these programs.

Grader Judgment. Most functionality cases are automatically scored without any involvement from the grader. For robustness cases and other error conditions, the autotester observes whether the program detects the problem, how it reports it to the user, and whether it appropriately handles it. If the assignment allows output that does not match the sample exactly, the autotester defers to the grading TA, who makes the call on whether the feedback is sufficiently informative, accurate, and actionable to earn full credit. However, matching the output exactly guarantees full credit from the autotester.

Style

Automated Tests

In addition to the automated tests for functionality, we also evaluate how well your program meets our standards for clean, well-written, well-designed code. Although good quality code is highly correlated with correct functionality, the two can diverge, e.g. a well-written program can contain a lurking functionality flaw or a poorly designed program can manage to work correctly despite its design. Make it your goal for your submission to shine in both areas!

We use automated tests for these quality metrics:

  • Clean Compile

    We expect a clean build: no warnings, no errors. Any error will block the build, meaning we won't be able to test the program, so build errors absolutely must be addressed before submitting. Warnings are the way the compiler draws attention to a code passage that isn't an outright error but appears suspect. Some warnings are mild/harmless, but others are critically important. If you get in the habit of keeping your code compiling cleanly, you'll never miss a crucial message in a sea of warnings you are casually ignoring. We apply a small deduction if you leave behind unresolved build warnings. (See the short sketch after this list for an example of the kind of issue a warning flags.)

  • Clean Run Under Valgrind

    We look for an error-free, leak-free report from Valgrind. In scoring a Valgrind report, leaks warrant only a minor deduction whereas memory errors are heavily penalized. Anytime you have an error in a Valgrind report, consider it a severe red flag and immediately prioritize investigating and resolving it. Unresolved memory errors can cause all manner of functionality errors due to unpredictable behavior and crashes. Submitting a program with a memory error will not only lose points in the Valgrind-specific test, but runs the risk of failing many other tests that stumble over the memory bug. Leaks, on the other hand, are mostly harmless, and working to plug them can (and should) be postponed until near the end of your development. Unresolved leaks rarely cause failures outside of the Valgrind-specific test. (The sketch after this list shows both kinds of issues.)

  • Reasonable Time and Memory Efficiency

    We measure each submission against the benchmark solution and observe whether it performs similarly in both runtime and memory usage. Our assignment rubric typically designates a small number of points for runtime and memory efficiency. A submission earns full credit by being in the same ballpark as the sample program (i.e. within a factor of 2-3). Our sample is written with a straightforward approach and does not pursue aggressive optimization. Your submission can achieve the same results, and that is what we want to encourage. There is no bonus for outperforming this benchmark and we especially do not want you to sacrifice elegance or complicate the design in the name of efficiency. Note that gross inefficiency (beyond 10x) puts you at risk of losing much more than the designated efficiency points due to the hard timeout on the autotester. If your program is in danger of running afoul of the hard timeout, that is a clear sign you need to turn your attention to correcting the inefficiency; otherwise you will lose points for the functionality tests that exceed the hard timeout, in addition to the regular efficiency deductions.

    One simple way to measure time efficiency is to run time exc, where exc is the program you wish to benchmark, on both the solution executable we provide and your own. Test on large inputs, as those are where inefficient algorithms usually show their worst behavior.

    For memory usage, run the solution under Valgrind and find the total heap usage reported in its HEAP SUMMARY. Run your program under Valgrind on the same test case and compare your total heap usage to that of the solution. If you are in the same ballpark, you're good!
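
To make these checks concrete, here is a contrived C sketch (assuming the build enables warnings, e.g. with -Wall) containing one example of each kind of issue the automated checks flag: a compiler warning, a Valgrind memory error, and a Valgrind leak. None of it comes from an actual assignment; it is only an illustration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        int tally = 0;                  // compiler warning (with -Wall): unused variable 'tally'

        char *copy = strdup("hello");   // heap allocation
        printf("%s\n", copy);
        free(copy);
        printf("%c\n", copy[0]);        // Valgrind error: invalid read of freed memory

        char *extra = malloc(16);       // never freed: Valgrind reports this as a leak
        strcpy(extra, "leak");
        printf("%s\n", extra);
        return 0;
    }

Compiling with warnings enabled and running your own tests under Valgrind before submitting would surface all three of these issues.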

For both the Valgrind and efficiency tests, we try to ignore issues of functional correctness if they don't otherwise interfere with the evaluation. For example, if the program attempts the task and gets the wrong answer, its Valgrind report or runtime can still be evaluated. However, if the program is sufficiently buggy/incomplete (e.g. discards input or crashes), such inconclusive results can lead to loss of points.

Unless otherwise indicated in the problem statement, we do not require recovery, memory deallocation, or other cleanup from fatal errors; you are allowed to simply exit(1).
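
As a minimal sketch of that policy (the helper function and message below are hypothetical, not taken from any assignment), a fatal error can simply report what went wrong and exit:

    #include <stdio.h>
    #include <stdlib.h>

    // Hypothetical helper: open an input file or report a fatal error and exit.
    static FILE *open_input_or_exit(const char *filename) {
        FILE *fp = fopen(filename, "r");
        if (fp == NULL) {
            fprintf(stderr, "Error: cannot open input file '%s'\n", filename);
            exit(1);    // no recovery or deallocation required for fatal errors
        }
        return fp;
    }

    int main(int argc, char *argv[]) {
        if (argc < 2) {
            fprintf(stderr, "Usage: %s <input file>\n", argv[0]);
            exit(1);
        }
        FILE *fp = open_input_or_exit(argv[1]);
        // ... process the file here ...
        fclose(fp);
        return 0;
    }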

Manual Code Review

The most important part of the quality feedback is the CA's commentary from the code review. They will read your code in the role of a team manager giving feedback before accepting the code into the team's repository. Our standards should be familiar from CS106: clear, elegant code that is readable, cleanly designed, well-decomposed, and commented appropriately. The review will identify notable issues found when reading your code and point out the highlights and opportunities for improvement. The CA also assigns a bucket for each of the key tasks being evaluated, from one of:

  • [+]: An outstanding job; reflects code that is notably clean, elegant and readable, and could be used as course example code for good style.
  • [ok]: A good job; reflects code that demonstrates solid effort and is fairly successful at meeting expectations, but also has opportunities for improvement.
  • [-]: Has larger problems, but shows some effort and understanding. There were either large concerns or a multitude of smaller concerns in the submission.
  • [- -]: Shows many significant issues and does not represent passing work.
  • [0]: No work submitted, or barely any changes from the starter assignment.

See the style guide, linked to from the Assignments page, as well as individual homework specifications, for style guidelines to follow for each assignment.

If you have questions about the feedback or need clarifications, please reach out to us via email or helper hours!

Frequently Asked Questions

If I can't get my program working on the general case, can I earn sanity points by hard-coding my program to produce the expected output for the sanity inputs so that it "passes" sanity?

No. Any submission that deliberately attempts to defraud the automated testing process will be rejected from grading and receive a 0 score for all tests.

Can I get a regrade? I think the autotester results are wrong.

We have invested much time in our tools to try to ensure they evaluate submissions correctly and consistently, and we are fairly confident in them. But given that we're human and it's software, mistakes can happen, and it is important to us to correct them. If you believe there is a grading error due to a bug in our tools/tests, please let us know so we can investigate further. If there is a problem, we will be eager to fix it and correct the scores for all affected submissions.

My program worked great on sanity check, but still failed many grading tests. Why?

The information from sanity check is only as good as the test cases being run. The default cases supplied with sanity check are fairly minimal. Passing them is necessary, but not sufficient, to ensure a good outcome. You gain broader coverage by creating your own custom tests. The more comprehensive your own custom tests, the more confidence you can have in the results.

I ran Valgrind myself and got a clean report but my assignment grade docked me for errors/leaks. What gives?

We run Valgrind using one of our larger comprehensive test cases. You should do the same. Getting a clean Valgrind report on a simple test case confirms the correctness of that particular code path, but only that path. If you run Valgrind on a diverse set of code paths, you'll additionally confirm the absence or presence of memory issues lurking in those parts and will not be caught unaware by what we find there.

My indentation looked fine in my editor, but misaligned when displayed for code review. What gives?

If your file has mixed tabs and spaces, the expansion of tabs into spaces can change when loaded into an editor/viewer with settings different from your own. Make sure you have set up your editor correctly following the guides on the course website, including any needed configuration files.

Which TA grades my submissions?

We randomly shuffle which TA grades you for each assignment. The functionality tests are all autoscored; the grading TA handles any judgment calls and the code review. The TA who reviewed your submission is shown in the header of the grade report. All TAs work from a shared rubric and we do meta-reviews to ensure consistent application and calibration across graders.

My gradebook entry for each assignment shows a functionality score. Does that score include late penalties or points for code review?

No. The functionality total is the points earned for autotested cases. Late penalties are factored in at the end of the quarter, and your code review is given as buckets (ok, +, etc). Both components, as well as any late penalty, contribute to the assignment portion of your overall course grade.