Due date: Mon Apr 11 11:59 pm - Hard deadline: Wed Apr 13 11:59 pm
Assignment by Julie Zelenski, with ideas borrowed from Rich Pattis (UC Irvine) and Owen Astrachan (Duke)
[Quick Links: Implementation details, Advice page, Grading]
Completing this assignment will jump-start your proficiency in:
You love The C Programming Language so much that you own 10 copies of it, but your evil nemesis has cruelly taken a pair of scissors to your entire collection and shredded them all. Amidst your tears, you attempt to reassemble the fragments to recreate the original. Little did you know that solving this problem would prep you for a career in computational biology and international espionage!
You are to write a program to read a input of text fragments and reassemble them. The fragments were created by duplicating a document many times over and chopping each copy into pieces. The process of reconstruction finds overlapping sequences and aligns them to match.
This technique is used for genome shotgun sequencing in the Human Genome Project. A DNA strand is cloned many times and randomly cut into chunks of a manageable size that can be sequenced. The reassembly process aligns those sequenced fragments to reconstruct the original. Nova had a fascinating show on Cracking the Code of Life about the race to decode the human genome. In addition to biology, a number of intelligence-gathering tasks involve reassembly. The carpet weavers piecing together photos in Argo are based in truth, archivists are reassembling Stasi trash, and the team "All Your Shreds are Belong to U.S." (great name!) won the DARPA-sponsored shredder challenge. All of these tasks are based on a reassembly process that finds overlap and merges into a unified whole, just as you will do for this assignment.
The program reads a collection of fragments and then processes them in a series of rounds. Each round examines all possible pairings and for each pair considers all possible alignments. This identifies the pair with the longest overlap. Those two fragments are then aligned at the point of maximal overlap and merged into a superstring that has the two original fragments as substrings. Each round decreases the count of fragments by one. The process repeats until it reaches a single result.
Consider a collection with the four fragments shown below (extra spaces were inserted between letters for clarity):
a l l i s w e l l
e l l t h a t e n
h a t e n d
t e n d s w e l l
On the first round, the longest overlap is a 6-character overlap between the second and third fragments:
e l l t h a t e n
h a t e n d
These two fragments merge to the result below. This merged fragment is added to the collection; the two fragments it was composed from are discarded.
e l l t h a t e n d
On the next round, the longest overlap is 5 characters when these two fragments are aligned as below:
e l l t h a t e n d
t e n d s w e l l
These fragments are removed and replaced with their merged result:
e l l t h a t e n d s w e l l
The last round merges the two remaining fragments:
a l l i s w e l l
e l l t h a t e n d s w e l l
This is the final result:
a l l i s w e l l t h a t e n d s w e l l
A match is also possible when one fragment is completely contained within another. Consider:
s w e l l t h a
e l l
The second fragment is entirely contained within the first. When these two fragments are merged, the result is simply the longer string.
Ties are broken arbitrarily. If more than one pair have the same maximal length overlap, either can be selected as the winner for that round. If a pair has two equally-maximal alignments (e.g. abxy
and xyab
can merge to either abxyab
or xyabxy
), either can be chosen. If a pair has no overlap, they are merged by concatenation, with your choice of order (e.g. ab
and xy
can merge to either abxy
or xyab
). Either option is considered a correct reassembly.
A little computer science theory aside. Finding the optimal reassembly is equivalent to the shortest common superstring, a classic problem in theoretical computer science. Given a set of strings S, find the shortest string that contains all the strings in S as substrings. The shortest superstring program is NP-hard, which means no efficient, polynomial-time solution is believed to exist. We are not asking your program to find an optimal result--- this difficult problem is much too computationally expensive. Instead, we approximate using a greedy strategy that finds a local maximum and optimistically pursues this locally optimal choice in the hopes that it will lead to the globally optimal result. This will find a common superstring, but it is not guaranteed to be the shortest.
Usage. The program can be invoked either with a single argument or no arguments. If an argument is given, it is the name of the fragment file. If the named file cannot be opened, the program responds with a helpful error message and exits. If no argument is given, the program reads fragments from stdin; this usage allows the user to enter fragments at the command-line. If the fragments are successfully read (whether from file or stdin), the program reassembles and prints the merged result. (Note: the sample solution responds to a -r
flag that causes it to read the fragments and stop without reassembling. This was a little aid for assign0; your assign1 is not expected to implement this feature.)
Fragment input format. A collection of fragments follows this format:
{fragment 1}{fragment 2} {fragment 3} ... {fragment N}
Each fragment is a sequence of characters wrapped in curly braces. In between fragments (outside the braces), there can be arbitrary amounts of whitespace (tab, space, newline). The body of a fragment can contain any character other than braces. A fragment must contain at least 1 character and at most 1000 characters. A fragment file contains at most 5,000 fragments.
Malformed inputs. A robust program behaves reasonably even in the presence of incorrect inputs. Specific invalid inputs we may test on include a malformed fragment that doesn't fit the required format {chars}
or a fragment longer than the 1000-character maximum. Your program should detect these situations, report the problem to the user, and exit. An error message is expected to be specific, informative, and actionable. We will not test on files that exceed the maximum number of fragments, i.e., you may ignore the possibility of file containing more than 5,000 fragments. An input that consists of zero or one fragment reassembles to itself; these are considered valid inputs and should not result in an error.
Output conformance. The expected output is the correctly reassembled result with no other additional printing. Use our sanity check tool to verify that your output is conforming. Your error messages are not required to exactly match the sample solution, but any alternate wording must be equally informative and actionable.
Memory. Any normal execution path is expected to run cleanly under Valgrind with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind.
Efficiency. We have modest requirements for efficiency. Use our sample solution as the benchmark of what constitutes reasonable use of time and memory. The sample is implemented in a straightforward manner and employs no trickery. If your performance is roughly comparable to the sample, you're golden.
Design/style/readability. From CS106, you should be well-versed in good coding practice and we expect your solutions to show thoughtful design and attention to readability. You know the drill: decompose into manageable functions, comment where appropriate, factor common code to avoid duplication, no global variables, strive for readability in all ways, and so on. See Nick Parlante's stolen prose on landmarks in coding quality.
We wrote a companion page of advice and hints for this assignment -- be sure to check it out! Go to advice/FAQ page
Please read our page on how we grade assignments to acquaint yourself with the grading system we use. Below is the tentative grading rubric for this assignment. The autograder functionality tests are scored in points; the human grader's code review is scored into buckets to emphasize the qualitative features of the review over the quantitative.
Functionality (105 points)
Code review (buckets together weighted to contribute ~20 points) Here we will read and evaluate your code in areas such as:
Style and readability We expect your code to be clean and readable. We will look for descriptive names, helpful comments, and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
Use of pointers and memory We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, and so on.
Program design We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. When the C library provides needed functionality, you should leverage the existing library functions rather than re-implement that functionality.
Submissions received by the assignment deadline will receive an on-time bonus equal to 5% of the functionality points earned by the submission.
The assign1 project contains the source file reassemble.c
and a Makefile. These files are distributed as a Mercurial repository. Get the starter project by cloning your cs107 repository using the command
hg clone /afs/ir/class/cs107/repos/assign1/$USER assign1
The $USER shell variable automatically expands to your sunet id. If there is no repo for your sunet, this means you were not registered when the repos were distributed, please register asap and send email to cs107@cs asking us to create your repo. There is a public 'guest' repo that can be accessed by auditors, the guest repo is read-only and does not have submit privileges.
Add your code in reassemble.c. You should not need to edit our given Makefile. Given the small scope of this project, maintaining all code in one file is perfectly appropriate, do not divide the code across multiple files.
The directory samples/assign1
in our class directory contains a sample executable and fragment files. You can access these via their full path (e.g. /afs/ir/class/cs107/samples/assign1/reassemble_soln)
, but it is a bit unwieldy. Your repo contains slink
, a symbolic link to the directory, which you can use to more easily refer to it (e.g. slink/reassemble_soln
). Note that the samples directory is shared and read-only, so you will not be able to add or edit files within that directory.
Your final step is to submit your finished code for grading. We strongly recommend you do a trial submit well in advance of the deadline to familiarize yourself with the process and allow time to work through any snags. You can replace that submission with a subsequent one if desired.
How CS107 handles deadlines and late work may differ from other classes, be sure you have read our complete late policy. Submissions received before the due date earn the on-time bonus. If you miss the due date, late work may be submitted up until the hard deadline without penalty. No submissions will be accepted after the hard deadline, please plan accordingly!
How did it go for you? Review the post-task self-check.
Break a vase, and the love that reassembles the fragments is stronger than that love which took its symmetry for granted when it was whole.
---Derek Walcott