Project 1

Bayesian Structure Learning

Expected Due Date: by 5 pm on Friday, October 18th.

This project is a competition to find Bayesian network structures that best fit some given data. The fitness of the structures will be measured by the Bayesian score (described in the course textbook DMU 2.4.1).

Three CSV-formatted datasets have been provided in AA228Student/workspace/project1/. The first row indicates variable names. These datasets are taken from titanicwine and a secret blackbox, respectively. We have discretized the data so that you only have to deal with discrete variables in this assignment.

  1. small.csv 8 variables
  2. medium.csv 12 variables
  3. large.csv 50 variables

The files can be accessed from AA228Student/workspace/project1/ (follow step 1 for Project 0 if you do not have the AA228Student repository). You will try to find the structure for each dataset yielding the highest Bayesian score. The student receiving the highest score will win the competition. The competition results will be posted on the course website after the due date.

Rules

  • Your program should output a file containing the network structure. The output filename should be the same as the input filename, but with a .gph extension, e.g., “small.gph”. Below is an example of what your output structure should look like.
  • A generic example example.gph is provided to you in AA228Student/workspace/project1/.
  • A specific example of a graph for Titanic dataset with only 3 edges (numsiblings ➝ numparentschildren, numsiblings ➝ passengerclass, numparentschildren ➝ sex) will look like titanicexample.gph provided in AA228Student/workspace/project1/.
  • You can use any programming language but you cannot use any package directly related to structure learning. You can use general optimization packages so long as you discuss what you use in your writeup and make it clear how it is used in your code. Recommended packages:
  • Discussions are encouraged, and we expect similar approaches to the project from multiple people, but you must write your own code. Otherwise, it violates the Stanford Honor Code.
  • Submit a README.pdf describing your strategy (please use uppercase for naming). This should not be more than 1 or 2 pages with a description of your algorithm, the time taken for each graph, and the graph plots. Only brief explanations are necessary.
  • Grading Rubric:
    • Small Graph – 10%
    • Medium Graph – 20%
    • Large Graph – 30%
    • README.pdf – 30%
      • Description of algorithm – 10%
      • Running time for each problem – 10%
      • Visualization of each graph – 10%
    • Code – 10%

See this Piazza post for more explicit expectations regarding the grading rubric: https://piazza.com/class/jzrv3rrh5d97q?cid=427

Submission

  1. Ensure you have the AA228Student repository locally—if not, follow step 1 for Project 0.
  2. Navigate to the AA228Student/workspace/project1/ directory.
    • Run git pull to get any updates before starting work.
  3. You should find the three CSV filessmall.csvmedium.csv and large.csv in your Project 1 workspace.
    You’ll also find some starter code for Julia in project1.jl and Python in project1.py as well as some example graphs (example.gph and titanicexample.gph).
  4. Work on the project in the AA228Student/workspace/project1/ directory, or copy your .gph files there.
    • The following files need to be submitted using submit (and located in the project1 directory):
      • small.gphmedium.gph and large.gph.
  5. Run the submit command to record graph scores (from AA228Student/workspace/).
    • submit project1 <YOUR_EMAIL>@stanford.edu <NICKNAME>
      • Note, use ./submit on Linux/Mac OS X and submit on Windows.
      • If you do not provide a <NICKNAME> your SUNet ID will be used (i.e. <YOUR_EMAIL> without “@stanford.edu”)
    • It will complain if any files are missing and tell you your score for each .gph file. It will also submit your .gph files for evaluation and place it on the leaderboard.
    • Check your submission status with:
      • submit status <YOUR_EMAIL>@stanford.edu
  6. Important, submit supplemental material using Canvas.
    1. Login to http://canvas.stanford.edu and navigate to this course.
    2. Use the sidebar to go to Assignments. Click on Project 1.
    3. Upload the following files to Project 1:
      • Code file(s), README.pdf, small.gphmedium.gph, and large.gph.

Note: If you are having trouble submitting through AA228Student, we put together a video demonstration.

Supplementary Video Tutorial – We’ve put together a more detailed walkthrough for computing the Bayesian score that may be useful to some of you. The video is available here. Please note that it is not perfectly rehearsed or post-processed, in the interest of getting it out ASAP.

Leaderboard

Click here to view the live leaderboard.

FAQs

This list has been extended from last year to reflect common queries made on Piazza. You may find our query answered here without even needing to wait on Piazza!

  • What programming languages can I use?
    • You may use any language you like! We are only looking for your source code and the output .gph files.
  • I like the competition and leaderboard aspect. Can we use late days for the competition?
    • No. You can only use late days for the general project grading. Any submissions after the deadline will not be considered for the leaderboard.
  • Can we use the bayesian_score function in BayesNets.jl?
    • No. You can’t use any structure learning related packages, so you’ll have to implement your own score function.
  • What’s the higher Bayes score value: -2345.6 or -3456.7?
    • -2345.6
  • What priors are we using?
    • We are using a Uniform Dirichlet Prior (all pseudo-counts \(\alpha_{ijk}=1\)).
  • Can you please explain what’s in the CSV file?
    • The header line in the CSV file gives you the names of all the nodes of the graph. You’ll use them for creating your .gph file. Each row of the CSV file represents a sample from the graph, i.e. the value for each discrete variable. Different variables might have a different number of discrete outcomes. That number is determined by the maximum value for that variable found in the dataset, and the minimum value is 1 for all variables. More explicitly, if the variable takes on values 1, 2, and 5 in the dataset, then the variable has 5 different discrete outcomes.
  • Can we make multiple submissions?
    • YES! But remember, your last submission will be scored and show up on the leaderboard.
  • Can you point us to a survey of structure learning algorithms?
  • Do you have some general advice for the competition?
    • This competition boils down to combining various algorithms and strategies, including algorithms outside of the textbook, while making your code efficient and tuning any hyper-parameters for optimal performance.
  • Are there any runtime constraints on the code submitted for project 1?
    • No, there are no runtime constraints. As long as there is a reasonable attempt for the solution, we expect to give you full credit. Also, you are welcome to use whatever resources you have access to. You should submit code that we could run (but we won’t necessarily run it). If you want to run a long time and get an extra good graph, that’s fine. You will need to report how long you ran your code in your write-up.
  • Do you check for cyclic graphs?
    • Yes, our tester script checks for that.
  • What is the grading criteria, a.k.a. what do I need to do to get full credit?
    • You have to implement your own scoring function. If you implement some structure learning algorithm or a variant, then you’ll get full credit — so long as you fulfill the other requirements on the write-up.
  • Can we submit multiple code files with different algorithm implementations?
    • Yes. Please mention how the graphs compare to each other in your README, and to ensure the best chances in the competition, please make sure that the better performing graphs are most recently submitted through submit.
  • Do we need a specific name for code files, like we do for README and solution files?
    • No specific file name needs to be used. However, making title names clear and mentioning them in your write-up is super helpful for the grader!
  • Does the Bayes score computed through submit include the \(\ln P(G)\) term?
    • No, it does not.
  • What does idx2names mean in the write_gph method?
    • idx2names is the ordering of the node names that you use. Basically, a dictionary that can map the node index to the node name.
  • How do I convert linear indices to subscripts and vice versa?
  • How do I plot the graphs?
    • For Python, Networkx has draw functions. NetworkX also has a function write_dot, which would allow you to use GraphViz to generate the plots: dot -Tpng input.dot > output.png. For Julia, you can use TikzGraphs.jl.