Project 1 - Bayesian Structure Learning

Due date: October 18th at 5pm

This project is a competition to find Bayesian network structures that best fit some given data. The fitness of the structures will be measured by the Bayesian score (described in the course textbook DMU 2.4.1).
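
For reference, the Bayesian score decomposes over variables and parent instantiations. In the standard Cooper–Herskovits form (the form used in DMU 2.4.1; double-check the notation against the textbook), and up to an additive constant from a uniform prior over graphs,

\[
\operatorname{score}(G \mid D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \frac{\Gamma(\alpha_{ij0})}{\Gamma(\alpha_{ij0} + m_{ij0})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(\alpha_{ijk} + m_{ijk})}{\Gamma(\alpha_{ijk})} \right]
\]

where \(n\) is the number of variables, \(q_i\) is the number of parent instantiations of variable \(i\), \(r_i\) is its number of outcomes, \(m_{ijk}\) counts the samples in which variable \(i\) takes value \(k\) under parent instantiation \(j\), \(\alpha_{ijk}\) are the Dirichlet pseudocounts, \(\alpha_{ij0} = \sum_k \alpha_{ijk}\), and \(m_{ij0} = \sum_k m_{ijk}\).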

Three CSV-formatted datasets are provided; the first row of each file gives the variable names. The datasets are derived from the Titanic, wine, and a secret dungeon, respectively. We have discretized the data so that you only have to deal with discrete variables in this assignment.

  1. small.csv (8 variables)

  2. medium.csv (12 variables)

  3. large.csv (50 variables)

You will try to find the structure for each dataset yielding the highest Bayesian score. The student receiving the highest score will win the competition. The competition results will be posted on the course website after the due date.

Rules

  • Your program should output a file containing the network structure. The output filename should be the same as the input filename, but with a .gph extension, e.g., “small.gph”. Below is an example of what your output structure should look like.

  • A generic example, example.gph, is provided to you on Vocareum.

  • A specific example of a graph for the Titanic dataset with only 3 edges (numsiblings ➝ numparentschildren, numsiblings ➝ passengerclass, numparentschildren ➝ sex) is provided as titanicexample.gph on Vocareum; see the sketch after this list for one way such a file can be written.

  • You can use any programming language, but you cannot use any package directly related to structure learning. You can use general optimization packages so long as you discuss what you use in your writeup and make it clear how it is used in your code. Recommended packages:

  • Discussion is encouraged, but you must write your own code; submitting code that is not your own violates the honor code.

  • Submit a README describing your strategy.
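
As a concrete illustration of the expected output, below is a minimal sketch in Python of writing a .gph file, assuming the format is one directed edge per line written as parent,child using the variable names from the CSV header (verify this against example.gph on Vocareum). The function name write_gph and the edge list are illustrative only, not part of the starter code.

```python
# Minimal sketch of writing a .gph file, assuming one "parent,child"
# line per directed edge (verify the exact format against example.gph).

def write_gph(edges, filename):
    """Write a list of (parent, child) variable-name pairs to a .gph file."""
    with open(filename, "w") as f:
        for parent, child in edges:
            f.write(f"{parent},{child}\n")

if __name__ == "__main__":
    # Edge list matching the 3-edge Titanic example above.
    titanic_edges = [
        ("numsiblings", "numparentschildren"),
        ("numsiblings", "passengerclass"),
        ("numparentschildren", "sex"),
    ]
    write_gph(titanic_edges, "titanicexample.gph")
```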

Submission

  1. Log in to http://canvas.stanford.edu

  2. Use the sidebar to go to Assignments and click on Assignment 1.
    Use the button at the bottom labeled Load Project 1 - Bayesian Structure Learning to go to Vocareum.

  3. You should find the three CSV files (small.csv, medium.csv, and large.csv) in your workspace, along with starter code for Julia in project1.jl and for Python in project1.py.
    Either use Vocareum’s editor to work on the project, or upload your code along with the following files to your workspace:

    • README, small.gph, medium.gph and large.gph.
    • The README should briefly describe your strategy. It should be no more than 1 or 2 pages and should include a description of your algorithm, the time taken for each graph, and the graph plots.
  4. Click on the Submit button.
    It will complain if any files are missing, tell you your score for each .gph file, and submit your .gph files for evaluation.

FAQs

  • Can we use the bayesian_score function in BayesNets.jl?

    • No. You can’t use any packages related to structure learning, so you’ll have to implement your own score function (see the sketch at the end of this FAQ).
  • Which is the higher Bayesian score: -2345.6 or -3456.7?

    • -2345.6. The score is a log probability, so the less negative value is the higher score.
  • What priors are we using?

    • We are using a Uniform Dirichlet Prior (all pseudocounts \(\alpha_{ijk}=1\)).
  • Can you please explain what’s in the CSV file?

    • The header line in the CSV file gives you the names of all the nodes of the graph; you’ll use them when creating your .gph file. Each row of the CSV file represents a sample from the network, i.e., a value for each discrete variable. Different variables might have different numbers of discrete outcomes. That number is determined by the maximum value for that variable found in the dataset, and the minimum value is 1 for all variables. More explicitly, if a variable takes on the values [1, 2, 5] in the dataset, then it has 5 different outcomes.
  • Can we make multiple submissions?

    • Yes! But remember, only your last submission is scored and shown on the leaderboard.
  • Can you point us to a survey of structure learning algorithms?

  • Do you have some general advice for the competition?

    • Success in this competition comes down to how well you combine various algorithmic strategies, including algorithms beyond the textbook, make your code efficient so that each iteration is cheap, and find the right hyperparameters for your algorithm. See the sketches below for one possible starting point.
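
Since you must implement the score yourself (see the BayesNets.jl FAQ above), here is a minimal sketch in Python of the Bayesian score under the uniform Dirichlet prior \(\alpha_{ijk}=1\). The representation (a parents list per variable, 1-based integer data) and the names read_data and bayesian_score are illustrative choices, not part of the starter code.

```python
# Minimal sketch of the Bayesian score with a uniform Dirichlet prior
# (alpha_ijk = 1). The data layout and function names are illustrative.

import csv
import math
from collections import Counter

def read_data(filename):
    """Read a CSV file; return (variable names, list of integer-valued samples)."""
    with open(filename) as f:
        rows = list(csv.reader(f))
    names = rows[0]
    data = [[int(v) for v in row] for row in rows[1:]]
    return names, data

def bayesian_score(parents, data):
    """Bayesian score of a structure given data.

    parents[i] is the list of parent indices of variable i.
    data is a list of samples; variable values are integers starting at 1.
    """
    n = len(data[0])
    # r[i]: number of outcomes of variable i (largest value seen in the data).
    r = [max(row[i] for row in data) for i in range(n)]
    score = 0.0
    for i in range(n):
        counts = Counter()         # (parent instantiation j, value k) -> m_ijk
        parent_counts = Counter()  # parent instantiation j -> m_ij0
        for row in data:
            j = tuple(row[p] for p in parents[i])
            counts[(j, row[i])] += 1
            parent_counts[j] += 1
        # With alpha_ijk = 1: lgamma(alpha_ijk) = 0 and alpha_ij0 = r_i.
        # Unobserved parent instantiations contribute zero, so summing
        # only over observed counts gives the full score.
        for m_ijk in counts.values():
            score += math.lgamma(1 + m_ijk)
        for m_ij0 in parent_counts.values():
            score += math.lgamma(r[i]) - math.lgamma(r[i] + m_ij0)
    return score

if __name__ == "__main__":
    names, data = read_data("small.csv")
    empty_graph = [[] for _ in range(len(names))]  # score of the edgeless graph
    print(bayesian_score(empty_graph, data))
```

Because the score decomposes over variables, a faster implementation would cache each family's contribution and recompute only the families whose parent sets change between candidate structures.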
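As one illustration of the advice above, here is a minimal sketch of randomized hill climbing over single-edge additions and removals, starting from the empty graph. It takes the scoring function as a parameter (for example, the illustrative bayesian_score above) and checks acyclicity with a depth-first search; treat it as a starting point, not a prescribed method.

```python
# Minimal sketch of local search (hill climbing) over graph structures.
# Graphs are represented as parents[i] = list of parent indices of variable i,
# matching the illustrative bayesian_score sketch above.

import random

def is_acyclic(parents):
    """Check acyclicity with a depth-first search over the parent lists."""
    n = len(parents)
    state = [0] * n  # 0 = unvisited, 1 = on the current path, 2 = done
    def visit(u):
        if state[u] == 1:
            return False          # back edge found: cycle
        if state[u] == 2:
            return True
        state[u] = 1
        for p in parents[u]:
            if not visit(p):
                return False
        state[u] = 2
        return True
    return all(visit(u) for u in range(n))

def hill_climb(data, score_fn, iterations=10000, seed=0):
    """Randomized hill climbing over single-edge additions and removals."""
    rng = random.Random(seed)
    n = len(data[0])
    parents = [[] for _ in range(n)]          # start from the empty graph
    best = score_fn(parents, data)
    for _ in range(iterations):
        i, j = rng.sample(range(n), 2)        # propose toggling edge j -> i
        candidate = [list(ps) for ps in parents]
        if j in candidate[i]:
            candidate[i].remove(j)            # remove an existing edge
        else:
            candidate[i].append(j)            # add a new edge
        if not is_acyclic(candidate):
            continue                          # reject cyclic proposals
        s = score_fn(candidate, data)
        if s > best:                          # keep only improving moves
            parents, best = candidate, s
    return parents, best
```

Edge reversals, random restarts, simulated annealing, and incremental (per-family) rescoring are natural next steps if plain hill climbing stalls or runs too slowly on large.csv.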