Forensic Statistics and Graphical Models
Spring Semester, 2016
"FSG", Tuesdays, 13:45--15:30, Snellius building room 405, Nielsbohrweg 1, Leiden.
Advanced Bachelor's level --- Master's level.
* * * * * * * * * * * * * * * *
Slides of the first lecture (introduction) (2015 version)
Slides of the second lecture (some graph theory) (2015 version)
Slides of the third lecture (part 1: forensic DNA background) (2014 version)
Slides of the third lecture (part 2: the rare Y haplotype problem) (2014 version)
Slides of the fourth lecture (conditional independence) (2014 version)
Slides of the fifth lecture (computation on undirected graphs) (2014 version)
Slides of the sixth lecture (hierarchy of propositions, and idioms for argument) (2014 version)
Homework (do this before lecture 2)
1) Install GeNIe and/or Hugin Lite onto your computer,
play with some simple Bayes nets / graphical models.
Investigate how to do graphical models in R and install the necessary packages.
You need: gRbase, gRain and igraph from the CRAN repository;
graph, Rgraphviz and RBGL (Bioconductor repository);
and RHugin from its own development page
http://rhugin.r-forge.r-project.org.
GeNIe is distributed by a commercial company, "BayesFusion".
Follow their links to "academia" and/or "academic downloads" to find the free download of GeNIe.
Here is a little test program in R gRain.R.
You can see the results as an R html notebook
http://rpubs.com/gill1109/gRain.
As input it needs to read the Bayes net
DawidEvett.net, Hugin format. Alternatively
you might prefer to use the file DawidEvettSN.net
in which the nodes have short names.
That file was actually
created using GeNIe and the same graphical model in GeNIe's own file format is
DawidEvett.xdsl.
The model in question comes from Dawid and Evett's
classic (1997) paper.
Further exercises are illustrated in
http://rpubs.com/gill1109/network
and
http://rpubs.com/gill1109/conversions.
In one of these the interface with Hugin is tested, for which you need
to install Hugin or Hugin Lite, and RHugin.
You will also need a small data file with graphical layout,
http://rpubs.com/gill1109/layout.dat.
My R code is hardly commented and won't make much sense unless you study, at the
same time, chapter 1 of the book "Graphical Models with R"
by Højsgaard, Edwards and Lauritzen.
2) Refresh your background understanding of DNA, and learn about forensic
DNA profiles using PCR (polymerase chain reaction) and STR (short tandem repeat)
loci, e.g. from wikipedia. You need to understand the basic difference between
autosomal DNA and Y-chromosome DNA and while you're at it,
learn about mitochondrial DNA too.
3) Read about the Meredith Kercher case (Perugia, 2007) e.g. from wikipedia.
The following text is taken from a recent book by Peter Gill
("Misleading DNA Evidence: Reasons for Miscarriages of Justice"):
There are numerous Web sites about this case.
Some support innocence and others support guilt of the defendants.
For example,
http://themurderofmeredithkercher.com/
campaigns for conviction of the defendants.
For a Web site that campaigns for the acquittal of the defendants,
see http://knoxdnareport.wordpress.com/.
In both Web sites there are very useful links to English translations
of the various judgments which the reader may ponder at leisure.
Photographs of the evidence (the knife alleged to be the murder weapon and the bra-clasp)
are also made available on the Web sites along with the DNA profiles and various reports.
Chronologically the key reports are:
1. The Massei report (the judges' reasoning for the original conviction).
English translation at: http://themurderofmeredithkercher.com/PDF/Massei_Report.pdf.
2. Conti-Vecchiotti Report (a report by the defense experts appointed by the court).
English version at: http://knoxdnareport.wordpress.com/.
3. The Hellmann report (the judges' reasoning for the acquittal).
English translation at: http://hellmannreport.wordpress.com/contents/.
4. The Galati-Costagliola appeal (the prosecution argument against the acquittal).
English translation at: http://galatiappeal.wordpress.com/.
5. The Supreme Court of Cassation Motivation Report (judges overturn the acquittal and a retrial is ordered).
Not available.
At the time of writing, the defendants have been reconvicted and we await the judges' reasoning.
The author says that he will comment on his Web site when that last report becomes available:
https://sites.google.com/site/peterdgill/.
* * * * * * * * * * * * * * * *
All the text below here belongs to the 2011 version of the course.
* * * * * * * * * * * * * * * *
Forensic statistics is the
art and science of doing statistics in the context of criminal
investigation or prosecution. Especially in the latter context, it
makes particular demands on the statistician, who is called to
communicate to the court the meaning of statistical data with respect
to the questions of interest to the court. Judges, jury, defence,
prosecution, ... all have different interests and different
information. Testimony of a scientific expert, such as a statistician,
has to be neutral and ... scientific. Many of the consumers (jury,
public, lawyers) have no prior understanding of probability or
statistics at all. (For that matter,
many have no understanding of science at all.)
The present "dogma of forensic statistics", as I would call it,
contends that the task of the statistician is to impart to the court
the meaning of a piece of evidence, thought of as statistical, i.e.,
partly formed by chance processes, by stating its likelihood ratio with
respect to typically two important and competing hypotheses, usually
referred to as the hypothesis of the prosecution and the hypothesis of
the defence. For instance, if we see a measured DNA profile found from
some trace of human cells at the scene of a crime as the result of a
chance process involving measurement errors, the probability laws of
genetics, and so on, we might like to report the ratio:
Prob(observed profile | the organic material comes from the suspect) :
Prob(observed profile | the organic material comes from an unknown
person, thought of as a random member of the population at large)
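To make the ratio concrete, here is a minimal sketch in Python. It rests on strong simplifying assumptions that are mine and not part of the course text: no measurement error (so the numerator is 1 when the suspect has the observed profile), independence between loci, and Hardy-Weinberg proportions in the population. The locus names and allele frequencies are invented for illustration only.

```python
from fractions import Fraction

# Hypothetical allele frequencies at two STR loci (invented numbers).
allele_freqs = {
    "locus1": {"12": Fraction(1, 10), "14": Fraction(1, 5)},
    "locus2": {"9": Fraction(1, 4)},
}

# Hypothetical observed profile: one genotype (pair of alleles) per locus.
profile = {"locus1": ("12", "14"), "locus2": ("9", "9")}


def genotype_probability(genotype, freqs):
    """Genotype probability under assumed Hardy-Weinberg proportions."""
    a, b = genotype
    if a == b:
        return freqs[a] ** 2          # homozygote: p^2
    return 2 * freqs[a] * freqs[b]    # heterozygote: 2pq


def likelihood_ratio(profile, allele_freqs):
    """Prob(profile | suspect is source) : Prob(profile | random person).

    Assumes the suspect has exactly this profile and there is no
    measurement error, so the numerator equals 1.
    """
    random_match_probability = Fraction(1)
    for locus, genotype in profile.items():
        random_match_probability *= genotype_probability(
            genotype, allele_freqs[locus]
        )
    return 1 / random_match_probability


lr = likelihood_ratio(profile, allele_freqs)
print(lr)
```

In real casework none of these assumptions can be taken for granted, which is exactly why richer probability models are needed.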
Graphical models or
Bayes nets are
probability models for the dependence structure of a collection of
random variables, thought to be related to one another through a
directed acyclic graph, each node or vertex representing one of the
random variables in question. Their joint probability distribution is
built up as follows. Arrange the graph in two dimensions with arrows
(connections between nodes) only pointing downwards. First generate all
variables corresponding to root nodes (nodes with no connections
to them)
by drawing them independently according to some specified marginal distributions.
Then move down
the graph, each time drawing the random variable corresponding to a
given node from some specified conditional probability distribution,
conditional on the values of the variables corresponding to that node's
graph parents --
the nodes with arrows pointing directly to it.
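The generation recipe just described can be sketched in a few lines of Python. The three-node network below (A and B are root nodes, C has parents A and B), its node names and all its probabilities are hypothetical, chosen only to illustrate the idea; the nodes are binary to keep the tables short.

```python
import random

# Hypothetical network: A -> C <- B. For each node we store its parents and
# the probability that it is True, given each combination of parent values.
network = {
    "A": {"parents": [], "cpt": {(): 0.3}},
    "B": {"parents": [], "cpt": {(): 0.6}},
    "C": {
        "parents": ["A", "B"],
        "cpt": {
            (True, True): 0.9,
            (True, False): 0.5,
            (False, True): 0.4,
            (False, False): 0.1,
        },
    },
}

# An ordering in which every node comes after its parents
# ("arrows only pointing downwards").
order = ["A", "B", "C"]


def sample_once(network, order, rng):
    """Draw one joint realisation by moving down the graph."""
    values = {}
    for node in order:
        spec = network[node]
        # Parents were drawn earlier, so their values are available.
        parent_values = tuple(values[p] for p in spec["parents"])
        p_true = spec["cpt"][parent_values]
        values[node] = rng.random() < p_true
    return values


rng = random.Random(2016)  # fixed seed so that runs are reproducible
print(sample_once(network, order, rng))
```

Root nodes have an empty parent tuple, so their "conditional" table is just their marginal distribution.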
Graphical models turn out to have wonderful probabilistic and
computational properties. A beautiful algorithm, which will be one of
the highlights of the course, helps us to rapidly and highly accurately
compute conditional probability distributions of some of the variables in
the model given values of some of the other variables. Because of
their graphical representation they lend themselves very well to
communication between experts from different fields, and laypersons,
about the model for the phenomenon at hand.
They are being used more and more in the forensic statistical context,
for many reasons (good and bad), as we will see.
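To see what "computing a conditional distribution given evidence" means, here is a brute-force sketch in Python. It is emphatically not the efficient algorithm referred to above: it simply sums the joint distribution over all configurations, which is only feasible for tiny networks. The network, node names and numbers are hypothetical.

```python
from itertools import product

# Hypothetical binary network A -> C <- B; each table gives P(node is True)
# for every combination of parent values.
network = {
    "A": {"parents": [], "cpt": {(): 0.3}},
    "B": {"parents": [], "cpt": {(): 0.6}},
    "C": {
        "parents": ["A", "B"],
        "cpt": {
            (True, True): 0.9,
            (True, False): 0.5,
            (False, True): 0.4,
            (False, False): 0.1,
        },
    },
}


def joint_probability(network, assignment):
    """Product over nodes of P(node value | values of its parents)."""
    prob = 1.0
    for node, spec in network.items():
        parent_values = tuple(assignment[p] for p in spec["parents"])
        p_true = spec["cpt"][parent_values]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


def conditional_probability(network, query, evidence):
    """P(query | evidence) by summing over all joint configurations."""
    nodes = list(network)
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if any(assignment[n] != v for n, v in evidence.items()):
            continue  # configuration contradicts the evidence
        p = joint_probability(network, assignment)
        denominator += p
        if all(assignment[n] == v for n, v in query.items()):
            numerator += p
    return numerator / denominator


print(conditional_probability(network, {"A": True}, {"C": True}))
```

The cost of this enumeration doubles with every extra binary node, which is why an algorithm exploiting the graph structure matters so much in practice.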
The course will be based on two main books, one on graphical models,
the other on forensic statistics. These are:
Cowell, Dawid, Lauritzen and Spiegelhalter, Probabilistic Networks and Expert Systems
Aitken and Taroni,
Statistics and the Evaluation of Forensic Evidence (2nd edition).
Computation and data analysis are important issues. I will make use of
the graphical models package gRain in R, as well as the specialized GeNIe
and Hugin Lite applications (some links can be found below). In fact,
it will be important to move back and forth between the graphical
interaction offered by the latter specialized applications and the
general statistical computing facilities of R. Participants may
prefer to use Matlab or yet other systems.
We will often make use of Bayes' rule in odds form (posterior odds
equals prior odds times likelihood ratio) in the following way. Two
separate graphical models are constructed which both contain nodes
representing the evidence at hand. One of the models belongs to the
prosecution case, the other to the defence case. By adding an artificial
root node corresponding to the binary variable: "is the prosecution
right, or the defence?" we merge the two models into one. As a matter
of convenience, we assign the just mentioned root node the marginal
(prior) probability distribution of equal odds, 50-50. We next use the
graphical model to compute the conditional distribution of this
variable given the actual evidence observed in the case. Because of
Bayes' rule and because of our artificial choice of a uniform prior,
the posterior odds equals the likelihood ratio, and that is the number
which we must communicate to the court.
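The arithmetic behind this trick can be checked with a small Python sketch. Each side's model is reduced here to the single number it assigns to the observed evidence; these two probabilities, and the function name `posterior_odds`, are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

# Probability of the observed evidence under each side's model
# (hypothetical numbers standing in for the output of two full models).
p_evidence_given = {
    "prosecution": Fraction(95, 100),
    "defence": Fraction(2, 1000),
}


def posterior_odds(prior_prosecution, p_evidence_given):
    """Posterior odds of the artificial root node, given the evidence."""
    prior = {
        "prosecution": prior_prosecution,
        "defence": 1 - prior_prosecution,
    }
    # Joint probability of (hypothesis, evidence) in the merged model.
    joint = {h: prior[h] * p_evidence_given[h] for h in prior}
    total = sum(joint.values())
    posterior = {h: joint[h] / total for h in joint}
    return posterior["prosecution"] / posterior["defence"]


likelihood_ratio = (
    p_evidence_given["prosecution"] / p_evidence_given["defence"]
)

# With the artificial 50-50 prior, the posterior odds equal the
# likelihood ratio.
odds_uniform = posterior_odds(Fraction(1, 2), p_evidence_given)
print(likelihood_ratio, odds_uniform)

# With any other prior: posterior odds = prior odds times likelihood ratio.
odds_other = posterior_odds(Fraction(1, 10), p_evidence_given)
print(odds_other == Fraction(1, 9) * likelihood_ratio)
```

Exact fractions are used so that the equalities hold exactly rather than up to rounding error.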
Of course, it is not at
all as simple as this... The defence is not obliged to offer a detailed
theory "explaining" the evidence which is brought to the court; and
anyway, both prosecution and defence will rarely find themselves
in the situation that they "know" their models exactly. At best, there
will be unknown nuisance parameters all over the place. How to deal
with that problem? So far, there is no answer...
Exercise. Given two graphical
models, where we identify some of the nodes as representing the same
random variables according to two different probabilistic mechanisms,
show how to merge the two models into one, so that we can compute the
likelihood ratio for evidence, namely the values of certain of the
common variables, in the way just described.
Notes. The book of Cowell et al. does not contain the proof of the
Hammersley-Clifford theorem but refers to Lauritzen's 1996 book.
Because the proof is so neat I have written it out
here.
* * * * * * * * * * * * * * * *
Below is some further material originally written for the 2007 version of the course, which
concentrated on graphical models with some forensic applications. Now however
I am going to start by explaining what the special nature of forensic statistics is
in general, and then teach graphical models as just one popular and
important tool in the field. But much of the information may still be useful.
Here are two sets of slides of introductory
talks:
forensic_statistics.pdf, talk by
RDG giving overview "what is forensic statistics"
Lauritzen_EMS.pdf, talk by Steffen
Lauritzen at European Meeting of Statisticians at Toulouse, 2009,
about graphical models for analyzing DNA mixtures.
In the first (introductory) lecture of the present course I referred to
two specific recent Dutch cases where the analysis of DNA mixtures
was crucial: "The Deventer Murder Case (the widow Wittenberg)",
and "The case of Tamara Wolvers (Alphen aan den Rijn)". In both cases
I am pretty sure that a miscarriage of justice followed from a wrong
analysis of a DNA mixture. To be more precise: I believe that the
wrong conclusions were drawn from the DNA evidence.
Here is a link to a preliminary (nearly finished) version of a paper
by Lauritzen and collaborators
on DNA mixture analysis.
Old course description, to be rewritten.
In the course we will study theory and applications of graphical models
(wikipedia/Graphical_model).
In statistics, a graphical model specifies conditional independence relations
among a set of random variables, some observable, some unobservable.
It thereby provides statistical models for the joint distribution of
the observed variables. The graph not only provides an attractive visual
representation of the model but also serves as a computational tool.
For applications, we will focus on genetics and forensic science, where
graphical models have proven to be particularly effective, since
the laws of genetic inheritance are very neatly expressed
in graphical models.
From the point of view of probability theory, conditional independence is
a Markov property, and graphical models are "just" Markov fields.
In computer science, the same graphs are used to represent causality
and are there called Bayes nets.
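As a small illustration of a conditional independence relation encoded by a graph, the following Python sketch takes a chain X -> Y -> Z of binary variables and checks numerically that X and Z are independent given Y, although they are not independent marginally. All probabilities are hypothetical, and exact fractions are used so that the equalities can be tested exactly.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical chain X -> Y -> Z of binary (0/1) variables.
p_x1 = F(1, 3)                            # P(X = 1)
p_y1_given_x = {0: F(1, 4), 1: F(3, 4)}   # P(Y = 1 | X = x)
p_z1_given_y = {0: F(1, 5), 1: F(4, 5)}   # P(Z = 1 | Y = y)


def bernoulli(p_one, value):
    """Probability of `value` for a 0/1 variable with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one


# Joint distribution built from the factorisation along the chain.
joint = {
    (x, y, z): bernoulli(p_x1, x)
    * bernoulli(p_y1_given_x[x], y)
    * bernoulli(p_z1_given_y[y], z)
    for x, y, z in product((0, 1), repeat=3)
}


def marginal(keep):
    """Marginal over the coordinates in `keep` (0 = X, 1 = Y, 2 = Z)."""
    table = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        table[sub] = table.get(sub, 0) + p
    return table


p_y = marginal((1,))
p_xy = marginal((0, 1))
p_yz = marginal((1, 2))
p_xz = marginal((0, 2))
p_x = marginal((0,))
p_z = marginal((2,))

# X independent of Z given Y:  P(x, y, z) P(y) = P(x, y) P(y, z) for all x, y, z.
conditionally_independent = all(
    joint[(x, y, z)] * p_y[(y,)] == p_xy[(x, y)] * p_yz[(y, z)]
    for x, y, z in joint
)

# X independent of Z marginally:  P(x, z) = P(x) P(z) for all x, z.
marginally_independent = all(
    p_xz[(x, z)] == p_x[(x,)] * p_z[(z,)] for x, z in p_xz
)

print(conditionally_independent, marginally_independent)
```

The conditional independence holds for any choice of the three tables, because it follows from the factorisation itself; the marginal dependence depends on the particular numbers chosen.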
Literature:
The definitive resource for the mathematical foundations
of the theory of graphical models
(a number of chapters of which are essential reading)
is the book
S.L. Lauritzen (1996),
Graphical Models,
Clarendon Press, Oxford, United Kingdom.
A very nice introduction built around applications in genetics is
Lauritzen and Sheehan (2003),
Graphical Models for Genetic Analyses,
Statistical Science
18, 489--514.
See also George and Thompson (2003),
Discovering Disease Genes,
Statistical Science
18, 515--531.
Another really important book, but with a very different approach,
is Judea Pearl (2000),
Causality -- Models, Reasoning, and Inference,
Cambridge University Press.
Slides of the lectures so far: html,
pdf
Workform, examination:
The course will include assignments, papers, presentations by the students;
the final evaluation will be an oral ("mondeling", viva voce) examination. Incidentally,
since the topic allows many different accents to be made (probabilistic,
statistical, algorithmic, ...) the participants will also be able to
influence the choice of topics.
Web resources:
On the internet you will easily find a wealth of material on graphical models
and/or Bayes nets. Here are just a few links.
Tutorial on Graphical Models:
Kevin Murphy's
tutorial.
Interesting course on graphical models, many useful links
and resources:
Helsinki course.
Free computer package for Bayes nets, unfortunately only available for
Windoze,
GeNIe.
GeNIe runs well via WINE on Linux (Intel machines), including the
fantastic new Intel Macs -- you can use DARWINE and stay inside OS X if
you are not into Parallels Virtual Desktop or dual booting with Windoze.
Much work is being done to give us
graphical models in R.
gill@math.leidenuniv.nl