Graphical Models

Forensic Statistics and Graphical Models

Spring Semester, 2016

"FSG", Tuesdays, 13:45--15:30, Snellius building room 405, Nielsbohrweg 1, Leiden.

Advanced Bachelor's level --- Master's level.

* * * * * * * * * * * * * * * *

Slides of the first lecture (introduction) (2015 version)
Slides of the second lecture (some graph theory) (2015 version)
Slides of the third lecture (part 1: forensic DNA background) (2014 version)
Slides of the third lecture (part 2: the rare Y haplotype problem) (2014 version)
Slides of the fourth lecture (conditional independence) (2014 version)
Slides of the fifth lecture (computation on undirected graphs) (2014 version)
Slides of the sixth lecture (hierarchy of propositions, and idioms for argument) (2014 version)

Homework (do this before lecture 2)

1) Install GeNIe and/or Hugin Lite on your computer and play with some simple Bayes nets / graphical models. Investigate how to do graphical models in R and install the necessary packages. You need: gRbase, gRain, igraph, graph from the CRAN repository; Rgraphviz and RBGL (Bioconductor repository); and RHugin from its own development page http://rhugin.r-forge.r-project.org. GeNIe is distributed by a commercial company, BayesFusion. Follow their links to "academia" and/or "academic downloads" to find the free download of GeNIe.
Here is a little test program in R: gRain.R. You can see the results as an R html notebook at http://rpubs.com/gill1109/gRain. As input it needs to read the Bayes net DawidEvett.net, in Hugin format. Alternatively you might prefer to use the file DawidEvettSN.net, in which the nodes have short names. That file was actually created using GeNIe, and the same graphical model in GeNIe's own file format is DawidEvett.xdsl. The model in question comes from Dawid and Evett's classic (1997) paper.

Further exercises are illustrated in http://rpubs.com/gill1109/network and http://rpubs.com/gill1109/conversions. In one of these the interface with Hugin is tested, for which you need to install Hugin or Hugin Lite, and RHugin. You will also need a small data file with graphical layout, http://rpubs.com/gill1109/layout.dat. My R code is hardly commented and won't make much sense unless you study, at the same time, chapter 1 of the book "Graphical Models with R" by Højsgaard, Edwards and Lauritzen.

2) Refresh your background understanding of DNA, and learn about forensic DNA profiles using PCR (polymerase chain reaction) and STR (short tandem repeat) loci, e.g. from Wikipedia. You need to understand the basic difference between autosomal DNA and Y-chromosome DNA, and while you're at it, learn about mitochondrial DNA too.

3) Read about the Meredith Kercher case (Perugia, 2007), e.g. from Wikipedia. The following text is taken from a recent book by Peter Gill ("Misleading DNA Evidence: Reasons for Miscarriages of Justice"): There are numerous Web sites about this case. Some support innocence and others support guilt of the defendants. For example, http://themurderofmeredithkercher.com/ campaigns for conviction of the defendants. For a Web site that campaigns for the acquittal of the defendants, see http://knoxdnareport.wordpress.com/. In both Web sites there are very useful links to English translations of the various judgments which the reader may ponder at leisure.
Photographs of the evidence (the knife alleged to be the murder weapon and the bra-clasp) are also made available on the Web sites along with the DNA profiles and various reports.
Chronologically the key reports are:
1. The Massei report (the judges' reasoning for the original conviction). English translation at: http://themurderofmeredithkercher.com/PDF/Massei_Report.pdf.
2. Conti-Vecchiotti Report (a report by the defense experts appointed by the court). English version at: http://knoxdnareport.wordpress.com/.
3. The Hellmann report (the judges' reasoning for the acquittal). English translation at: http://hellmannreport.wordpress.com/contents/.
4. The Galati-Costagliola appeal (the prosecution argument against the acquittal). English translation at: http://galatiappeal.wordpress.com/.
5. The Supreme Court of Cassation Motivation Report (judges overturn the acquittal and a retrial is ordered). Not available.
At the time of writing, the defendants have been reconvicted and we await the judges' reasoning.
The author says that he will comment on his Web site when that last report becomes available: https://sites.google.com/site/peterdgill/.

* * * * * * * * * * * * * * * *

All the text below here belongs to the 2011 version of the course.

* * * * * * * * * * * * * * * *

Forensic statistics is the art and science of doing statistics in the context of criminal investigation or prosecution. Especially in the latter context, it makes particular demands on the statistician, who is called to communicate to the court the meaning of statistical data with respect to the questions of interest to the court. Judges, jury, defence, prosecution, ... all have different interests and different information. Testimony of a scientific expert, such as a statistician, has to be neutral and ... scientific. Many of the consumers (jury, public, lawyers) have no prior understanding of probability or statistics at all. (For that matter, many have no understanding of science at all.)

The present "dogma of forensic statistics", as I would call it, contends that the task of the statistician is to impart to the court the meaning of a piece of evidence, thought of as statistical, i.e., partly formed by chance processes. This is done by stating its likelihood ratio with respect to, typically, two important and competing hypotheses, usually referred to as the hypothesis of the prosecution and the hypothesis of the defence. For instance, suppose we see a measured DNA profile, obtained from some trace of human cells at the scene of a crime, as the result of a chance process involving measurement errors, the probability laws of genetics, and so on. We might then like to report the ratio:

Prob(observed profile | the organic material comes from the suspect) : Prob(observed profile | the organic material comes from an unknown person, thought of as a random member of the population at large)
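As a rough illustration of such a ratio, here is a minimal Python sketch under strong simplifying assumptions that are mine, not part of the course material: no measurement error (so the numerator equals 1 when the suspect's profile matches), independent loci, and Hardy-Weinberg proportions in the population. The allele names and frequencies are hypothetical.

```python
def genotype_probability(allele_1, allele_2, freqs):
    """Hardy-Weinberg probability of a genotype at one locus.

    freqs maps allele names to (hypothetical) population frequencies.
    """
    p, q = freqs[allele_1], freqs[allele_2]
    # Homozygote: p^2; heterozygote: 2pq.
    return p * p if allele_1 == allele_2 else 2 * p * q


def likelihood_ratio(profile, freqs_per_locus):
    """Ratio Prob(profile | suspect is source) : Prob(profile | random person).

    Assumes the numerator is 1 (perfect measurement, matching suspect)
    and that loci are independent, so the denominator is a product.
    """
    prob_random_person = 1.0
    for (allele_1, allele_2), freqs in zip(profile, freqs_per_locus):
        prob_random_person *= genotype_probability(allele_1, allele_2, freqs)
    return 1.0 / prob_random_person


# Hypothetical two-locus profile and allele frequencies.
example_profile = [("12", "14"), ("9", "9")]
example_freqs = [{"12": 0.1, "14": 0.2}, {"9": 0.5}]
lr = likelihood_ratio(example_profile, example_freqs)
print(lr)
```

In a real case neither the numerator nor the denominator is this simple, which is one reason for turning to graphical models.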

Graphical models or Bayes nets are probability models for the dependence structure of a collection of random variables, thought to be related to one another through a directed acyclic graph, each node or vertex representing one of the random variables in question. Their joint probability distribution is built up as follows. Arrange the graph in two dimensions with all arrows (directed connections between nodes) pointing downwards. First generate all variables corresponding to root nodes (nodes with no arrows pointing to them) by drawing them independently according to some specified marginal distributions. Then move down the graph, each time drawing the random variable corresponding to a given node from some specified conditional probability distribution, conditional on the values of the variables corresponding to that node's graph parents, i.e. the nodes with arrows pointing directly to it.
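The generative recipe just described can be sketched in a few lines of Python. This is only an illustration: the four binary nodes A, B, C, D and all the probabilities are hypothetical, and the nodes are assumed to be listed so that parents come before their children (the "arrows pointing downwards" ordering).

```python
import random

# Hypothetical network A -> C <- B, C -> D, all variables binary.
# Each entry: (node name, parent names, table), where the table maps a
# tuple of parent values to the probability that the node is True.
# Nodes are listed with parents before children.
NETWORK = [
    ("A", (), {(): 0.3}),
    ("B", (), {(): 0.6}),
    ("C", ("A", "B"), {(True, True): 0.9, (True, False): 0.5,
                       (False, True): 0.4, (False, False): 0.1}),
    ("D", ("C",), {(True,): 0.8, (False,): 0.2}),
]


def sample_network(network, rng):
    """Draw one joint sample by moving down the graph."""
    values = {}
    for name, parents, table in network:
        # Look up the conditional distribution given the parents' values;
        # for root nodes the parent tuple is empty (marginal distribution).
        p_true = table[tuple(values[parent] for parent in parents)]
        values[name] = rng.random() < p_true
    return values


rng = random.Random(0)  # fixed seed so that runs are reproducible
draws = [sample_network(NETWORK, rng) for _ in range(10000)]
freq_c = sum(draw["C"] for draw in draws) / len(draws)
print("Relative frequency of C = True:", freq_c)
```

The relative frequency printed at the end is a Monte Carlo estimate of the marginal probability of C being True, which for these particular numbers is 0.418.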

Graphical models turn out to have wonderful probabilistic and computational properties. A beautiful algorithm, which will be one of the highlights of the course, helps us to compute, rapidly and with high accuracy, conditional probability distributions of some of the variables in the model given values of some of the other variables. Because of their graphical representation, they also lend themselves very well to communication between experts from different fields, and laypersons, about the model for the phenomenon at hand.
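To make concrete what is being computed, here is a brute-force Python sketch that obtains a conditional probability by summing the joint distribution over all configurations. This is emphatically not the efficient algorithm treated in the course: enumeration takes time exponential in the number of nodes, and is only feasible for toy examples. The network and its numbers are hypothetical.

```python
from itertools import product

# Hypothetical binary network A -> C <- B, C -> D.
# Each entry: (node name, parent names, table mapping parent values
# to the probability that the node is True).
NETWORK = [
    ("A", (), {(): 0.3}),
    ("B", (), {(): 0.6}),
    ("C", ("A", "B"), {(True, True): 0.9, (True, False): 0.5,
                       (False, True): 0.4, (False, False): 0.1}),
    ("D", ("C",), {(True,): 0.8, (False,): 0.2}),
]


def joint_probability(network, assignment):
    """Probability of one complete assignment: product of the node tables."""
    prob = 1.0
    for name, parents, table in network:
        p_true = table[tuple(assignment[parent] for parent in parents)]
        prob *= p_true if assignment[name] else 1.0 - p_true
    return prob


def conditional_probability(network, query, evidence):
    """Prob(query | evidence) by summing over all configurations."""
    names = [name for name, _, _ in network]
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # configuration contradicts the evidence
        p = joint_probability(network, assignment)
        denominator += p
        if all(assignment[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


p_a_given_d = conditional_probability(NETWORK, {"A": True}, {"D": True})
print("Prob(A = True | D = True):", p_a_given_d)
```

Observing D = True raises the probability of A = True above its prior value of 0.3, because evidence at the bottom of the graph flows back up to the root nodes.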

They are being used more and more in the forensic statistical context, for many reasons (good and bad), as we will see.

The course will be based on two main books, one on graphical models, the other on forensic statistics. These are:

Cowell, Dawid, Lauritzen and Spiegelhalter, Probabilistic Networks and Expert Systems

Aitken and Taroni, Statistics and the Evaluation of Forensic Evidence (2nd edition).

Computation and data analysis are important issues. I will make use of the graphical models package gRain in R, as well as the specialized GeNIe and Hugin Lite applications (some links can be found below). In fact, it will be important to move back and forth between the possibilities for graphical interaction offered by the latter specialized applications, and the general statistical computing possibilities of R. Participants may prefer to use Matlab or yet other systems.

We will often make use of Bayes' rule in odds form: posterior odds equals prior odds times likelihood ratio, in the following way. Two separate graphical models are constructed which both contain nodes representing the evidence at hand. One of the models belongs to the prosecution case, the other to the defence case. By adding an artificial root node corresponding to the binary variable "is the prosecution right, or the defence?" we merge the two models into one. As a matter of convenience, we assign the just mentioned root node the marginal (prior) probability distribution of equal odds, 50-50. We next use the graphical model to compute the conditional distribution of this variable given the actual evidence observed in the case. Because of Bayes' rule and because of our artificial choice of a uniform prior, the posterior odds equals the likelihood ratio, and that is the number which we must communicate to the court.
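The role of the artificial root node can be checked numerically. In the Python sketch below, each of the two models is reduced to a hypothetical toy: one hidden binary variable and one binary evidence variable, with made-up probabilities. The function names are mine and do not correspond to any package.

```python
# Hypothetical toy sub-models. Each has a hidden binary variable and gives
# the probability of observing the evidence for each value of that variable.
PROSECUTION = {"p_hidden": 0.95, "p_evidence_given_hidden": {True: 0.9, False: 0.01}}
DEFENCE = {"p_hidden": 0.02, "p_evidence_given_hidden": {True: 0.9, False: 0.01}}


def evidence_likelihood(model):
    """Prob(evidence | model), summing out the hidden variable."""
    p = model["p_hidden"]
    table = model["p_evidence_given_hidden"]
    return p * table[True] + (1.0 - p) * table[False]


def posterior_odds(prior_prosecution, model_p, model_d):
    """Posterior odds of the root node 'prosecution is right' given the evidence.

    The merged model has a root node H; given H, the evidence follows
    the prosecution model or the defence model respectively.
    """
    joint_p = prior_prosecution * evidence_likelihood(model_p)
    joint_d = (1.0 - prior_prosecution) * evidence_likelihood(model_d)
    posterior_p = joint_p / (joint_p + joint_d)
    return posterior_p / (1.0 - posterior_p)


lr = evidence_likelihood(PROSECUTION) / evidence_likelihood(DEFENCE)
odds_uniform = posterior_odds(0.5, PROSECUTION, DEFENCE)
print("Likelihood ratio:", lr)
print("Posterior odds with 50-50 prior:", odds_uniform)
```

With the 50-50 prior the two printed numbers agree up to rounding error; with any other prior the posterior odds are the likelihood ratio multiplied by the prior odds.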

Of course, it is not at all as simple as this... The defence is not obliged to offer a detailed theory "explaining" the evidence which is brought to the court; and anyway, both prosecution and defence will rarely find themselves in the situation that they "know" their models exactly. At best, there will be unknown nuisance parameters all over the place. How to deal with that problem? So far, there is no answer...

Exercise. Given two graphical models, where we identify some of the nodes as representing the same random variables according to two different probabilistic mechanisms, show how to merge the two models into one, so that we can compute the likelihood ratio for the evidence, namely the values of certain of the common variables, in the way just described.

Notes. The book of Cowell et al. does not contain the proof of the Hammersley-Clifford theorem but refers to Lauritzen's 1996 book. Because the proof is so neat I have written it out here.

* * * * * * * * * * * * * * * *

Below is some further material originally written for the 2007 version of the course, which concentrated on graphical models with some forensic applications. Now however I am going to start by explaining what the special nature of forensic statistics is in general, and then teach graphical models as just one popular and important tool in the field. But much of the information may still be useful.

Here are two sets of slides of introductory talks:

forensic_statistics.pdf, talk by RDG giving overview "what is forensic statistics"
Lauritzen_EMS.pdf, talk by Steffen Lauritzen at European Meeting of Statisticians at Toulouse, 2009, about graphical models for analyzing DNA mixtures.
In the first (introductory) lecture of the present course I referred to two specific recent Dutch cases where the analysis of DNA mixtures was crucial: "The Deventer Murder Case (the widow Wittenberg)", and "The case of Tamara Wolvers (Alphen aan den Rijn)". In both cases I am pretty sure that a miscarriage of justice followed from a wrong analysis of a DNA mixture. To be more precise: I believe that the wrong conclusions were drawn from the DNA evidence.

Here is a link to a preliminary (nearly finished) version of a paper by Lauritzen and collaborators on DNA mixture analysis.

Old course description, to be rewritten. In the course we will study theory and applications of graphical models (wikipedia/Graphical_model). In statistics, a graphical model specifies conditional independence relations among a set of random variables, some observable, some unobservable. It thereby provides statistical models for the joint distribution of the observed variables. The graph not only provides an attractive visual representation of the model but also serves as a computational tool.

For applications, we will focus on genetics and forensic science, where graphical models have proven to be particularly effective, since the laws of genetic inheritance are very neatly expressed in graphical models.

From the point of view of probability theory, conditional independence is a Markov property, and graphical models are "just" Markov fields.

In computer science, the same graphs are used to represent causality and are there called Bayes nets.

Literature:

The definitive resource for the mathematical foundations of the theory of graphical models (a number of chapters of which are essential reading) is the book

S.L. Lauritzen (1996), Graphical Models, Clarendon Press, Oxford, United Kingdom.

A very nice introduction built around applications in genetics is

Lauritzen and Sheehan (2003), Graphical Models for Genetic Analyses, Statistical Science 18, 489--514.

See also George and Thompson (2003), Discovering Disease Genes, Statistical Science 18, 515--531.

Another really important book, but with a very different approach, is Judea Pearl (2000), Causality -- Models, Reasoning, and Inference, Cambridge University Press.

Slides of the lectures so far: html, pdf

Workform, examination:

The course will include assignments, papers, and presentations by the students; the final evaluation will be an oral ("mondeling") examination. Incidentally, since the topic can be approached with many different emphases (probabilistic, statistical, algorithmic, ...), the participants will also be able to influence the choice of topics.

Web resources:

On the internet you will easily find a wealth of material on graphical models and/or Bayes nets. Here are just a few links.

Tutorial on Graphical Models: Kevin Murphy's tutorial.

Interesting course on graphical models, many useful links and resources: Helsinki course.

Free computer package for Bayes nets, unfortunately only available for Windoze, GeNIe.
GeNIe runs well via WINE on Linux (Intel machines), including the fantastic new Intel Macs -- you can use DARWINE and stay inside OS X if you are not into Parallels Virtual Desktop or dual booting with Windoze.

Much work is being done to give us graphical models in R.


gill@math.leidenuniv.nl