1 Introduction
Given trace objects assumed to originate from a single source, and control objects from a known source, we want to infer whether all objects originate from the same source. Formally, we want to test the following two propositions:

H0: the set of trace objects and the set of control objects are two simple random samples from the source of the control objects;

H1: the set of trace objects is a simple random sample from another source in a population of potential sources.
In forensic science, differentiating between these two propositions cannot be reduced down to a simple classification or model selection problem that can be directly addressed by machine learning or other similar techniques:

The required approach needs to address the two competing hypotheses, H0 and H1, presented above, from the comparison of two groups of objects (and not by learning the characteristics of the source of the control objects from a training sample and performing multiple dependent tests to determine if each individual trace object can be associated with that source, as proposed by ASTM E2926-13 (2013) and Park and Carriquiry (2018)).

The required approach needs to account for the potentially limited number of samples available to the forensic scientist (e.g., 3-10 observations); hence techniques requiring extensive training are not an option.

Material of forensic interest, such as paint, is often subject to alteration due to its exposure to environmental conditions (e.g., sun, heat); furthermore, it is not reasonable to expect to exhaustively survey the population of paint (or glass, or fibres). This implies that classifiers relying on background training data would need to be retrained every time a new source is considered in casework.

As explained below, the inference process must account for the many sources in the population that are potentially indistinguishable from the source of the control objects.
Legal and scientific scholars advocate for the use of a Bayes factor to quantify the support of the observations made on the trace and control objects in favour of one of these two propositions (see Aitken et al. (2010) for a comprehensive discussion). Unfortunately, it is often impossible to assign the necessary probability measures to perform likelihood-based inference for the high-dimensional and complex data commonly encountered in forensic science. For example, the random vectors associated with the chemical spectra characterising glass, paint, fibre or dust evidence may have thousands of dimensions and may include different types of data (e.g., discrete, continuous, compositional). Without these probability measures, assigning Bayes factors, or performing any other likelihood-based inference, is not possible. Consequently, forensic scientists reporting these types of evidence are left without means to support their assessment of the probative value of the evidence.
In this paper, we revisit the two-stage inference framework formally introduced by Parker (1966; 1967) and Parker and Holford (1968) by leveraging the properties of kernel functions (Schoelkopf and Smola, 2001) and the results presented by Armstrong et al. (2017). The proposed inference framework relies on a kernel function and, therefore, is particularly suited for high-dimensional, complex and heterogeneous data. The framework is generic and can easily be tailored to any type of data by modifying the kernel function. Our solution involves algorithms that handle the uncertainty on the model’s parameters, and permit rapid and efficient sampling from the posterior distributions of these parameters. Furthermore, it relies on a single main assumption, which can be satisfied through the design of the kernel function.
Despite some well-known shortcomings of the two-stage approach, discussed later in this paper, we believe that the proposed approach can provide a helpful and rigorous statistical framework to support the inference of the identity of source of trace and control objects described by high-dimensional heterogeneous random vectors, such as chemical spectra. We used the proposed approach to study the probative value of traces of paint (characterised by Fourier-Transform Infrared spectroscopy (FTIR)) that may be observed in connection with crimes (e.g., paint present on tools used to force open doors or windows).
2 Overview of Parker’s two-stage approach
The general framework of the two-stage approach was first briefly mentioned by Kirk (1953) and Kingston (1965), and was formally described by Parker (1966; 1967) and Parker and Holford (1968). Parker breaks down the forensic inference process into two stages, which he describes as the similarity stage and the discrimination stage. In the similarity stage, the goal is to compare the characteristics of the trace and control objects and determine whether they are distinguishable. As the difference between the sets of characteristics increases, the hypothesis that the trace and control objects originate from the same source is weakened to the point that it can be rejected. However, establishing that the two sets of characteristics are indistinguishable is not sufficient in itself to conclude that the two sets of objects share a common source. Intuitively, the value of finding that the set of characteristics describing the trace objects is indistinguishable from that of the control objects is a function of the number of sources whose characteristics would also be deemed indistinguishable from the trace objects using the same analytical technique: the lack of distinguishability between trace and control objects is more valuable in cases where very few sources in a population of potential sources share the same characteristics as the trace objects. Thus, the goal of the discrimination stage is to determine the rarity of the characteristics observed in the first stage in the population of potential sources if the two sets of objects are found to be indistinguishable from one another. The level of rarity of the trace characteristics in a population of potential sources is often called a match probability or probability of coincidence.
While occasional early uses of Bayesian inference in the judicial system have been reported before the 1960s (Taroni et al., 1999), the two-stage approach appears to be an initial attempt to formally frame the problem of the inference of the identity of source in forensic science in a logical manner, and to propose a statistically rigorous method to support this inference process. Today, the two-stage approach naturally arises as a proxy for the Bayes factor in situations where the measurements made on the trace and control objects are discrete and can easily be compared, such as in single-source forensic DNA profiling: the DNA profile of a trace is compared with that of a known individual, and, if found similar, the match probability of that profile in a population of potential donors is determined (Butler, 2015).
The two-stage approach was refined in the context of glass evidence in a series of papers starting in 1977 (Evett, 1977). Today, ad hoc implementations of the two-stage approach can notably be found in relation to glass, paint and fibre evidence (see for example Curran et al. (1997); Champod and Taroni (1997); Aitken and Lucy (2004); Massonnet et al. (2014); Muehlethaler et al. (2014)). In most cases, the decision to reject the hypothesis that trace and control objects are indistinguishable during the first stage is based on the training and experience of the forensic analyst performing the examination, and the second stage is not considered (Kaye, 2017). When it is considered, the determination of the match probability relies, in the best situation, on frequency estimates obtained by determining the size of an ill-defined set of objects that are considered to have the “same characteristics” as those of the trace (Kaye, 2017). Outside of trivial situations with discrete data (e.g., blood typing) or low-dimensional continuous data (e.g., refractive index of glass), we have not found a rigorous implementation of the two-stage approach that is capable of handling high-dimensional and complex forms of evidence, such as chemical spectra or impression and pattern evidence, and we have to agree with the arguments brought forward by Kaye (2017).
Below, we propose a formal statistical method to test the hypothesis that two high-dimensional and complex sets of observations are indistinguishable (Parker’s similarity stage). We extend the work published by Armstrong et al. (2017) to develop a generic α-level test for comparing sets of high-dimensional, heterogeneous random vectors, in which we account for the uncertainty on the model’s parameters, and we propose a computationally efficient algorithm that makes it possible to increase the number of objects considered and to improve the reliability of the test. Because our test relies on kernel functions that can be tailored to any type of data, the same test can be used in multiple situations, irrespective of the type of evidence considered. Finally, our method’s main assumption can be satisfied through the design of the kernel function.
Our method improves upon existing pattern recognition methods that could be considered for addressing this type of problem, such as Support Vector Machines, Artificial Neural Networks, or Random Forests: our method does not require a training set; it allows for comparing sets of objects to each other in a single test (as opposed to comparing individual objects in multiple dependent tests); it permits likelihood-based inference; and it enables formal statistical hypothesis testing in high-dimensional, complex and heterogeneous vector spaces.
In this paper, we apply the proposed statistical test to Fourier-Transform Infrared (FTIR) spectra of paint fragments and we propose a strategy to assess the type-I and type-II errors of the test. We also extend the method to the second stage (Parker’s discrimination stage) and discuss how to assign match probabilities to sets of spectra. Finally, we discuss the benefits and limitations of the two-stage approach in the context of making inference on the source of high-dimensional complex forms of forensic evidence.
3 First stage: testing indistinguishability
In the first stage of our approach, we wish to test whether a set of trace objects is indistinguishable from a set of control objects. We use an α-level test to address H0 and H1. Given the nature of the test, we can only reach one of two conclusions:

The characteristics of the sets of trace and control objects are considered to be sufficiently different. Thus, the decision is that the objects cannot originate from the same source and H1 is accepted. This decision is associated with a rate α of erroneously rejecting the hypothesis of common source;

The characteristics of the sets of trace and control objects are within some level of tolerance of each other. Thus, we do not have enough evidence to reject the possibility that the sets of trace and control objects originate from the same source, and so we fail to reject H0 at the chosen α level.
We want to reiterate that, in the forensic context, the latter conclusion does not directly imply that the trace and control objects originate from the same source: it merely implies that the sources of the trace and control objects are indistinguishable from each other, based on the considered characteristics and the chosen level. As mentioned above, the value of finding that these sources are indistinguishable can be assessed only in light of the number of sources that would also be found to be indistinguishable from the trace source. Assessing the rarity of the trace’s characteristics is the purpose of the second of the two stages, and is discussed later in this paper.
To statistically test H0 and H1 in the presence of high-dimensional, heterogeneous and complex data, we extend the results presented by Armstrong et al. (2017) (and summarised in Appendix A) to develop a statistical test using vectors of scores resulting from the cross-comparisons of the trace and control objects.
3.1 α-level test for vectors of scores
Given two vectors of measurements and representing the observations made on two objects, , a kernel function, , is used to measure their level of similarity and report it as a score, . We note that the kernel function at the core of the proposed model, , can be designed to accommodate virtually any type of data, and should satisfy only two requirements: it must be a symmetric function, that is , and it must ensure that the distribution of the vector of scores
is normally distributed to satisfy the assumption made on the score model by Armstrong et al. (2017). This assumption is reasonable for high-dimensional objects and can be satisfied through careful design of the kernel function (Armstrong, 2017).
Given trace objects and control objects, we define the vector of scores , where represents the scores calculated between all pairs of control objects, and represents the scores calculated between all pairs of objects involving at least one of the trace objects.
Since all control objects are known to originate from a single source, we use the results in Armstrong et al. (2017) to assume with , and parameter , where is the expected value of the score between any two objects from the same considered source, and are the variances of the two random effects, and is an design matrix where each row represents an combination of objects and consists of ones in the and positions and zeros elsewhere.
Furthermore, under , all trace and control objects are assumed to originate from the same source, therefore
(1)  
where is a design matrix of the same construction as , but with dimensions corresponding to the vector . Under , this distribution has the same parameter, , as the distribution of , since the only differences between the distributions are the length of the mean vectors and the dimensions of the design matrices and .
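The design matrix described above can be built mechanically. The following is a minimal sketch, assuming that objects are indexed from 0 to n-1, that rows follow the lexicographic order of the pairs, and that the covariance of the scores takes the form var_a * P P^T + var_e * I (one reading of a model with two random effects; the exact expression is given in Armstrong et al. (2017) and is not reproduced here). The function names `pair_design_matrix` and `score_covariance` are hypothetical.

```python
import itertools

import numpy as np


def pair_design_matrix(n):
    """One row per unordered pair (i, j), with ones in positions i and j."""
    pairs = list(itertools.combinations(range(n), 2))
    P = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        P[row, i] = 1.0
        P[row, j] = 1.0
    return P


def score_covariance(n, var_a, var_e):
    """ASSUMED covariance form: var_a * P P^T + var_e * I."""
    P = pair_design_matrix(n)
    return var_a * (P @ P.T) + var_e * np.eye(P.shape[0])


# Four objects give six pairwise scores.
P = pair_design_matrix(4)
Sigma = score_covariance(4, var_a=0.5, var_e=0.2)
print(P.shape, Sigma.shape)
```

Under this assumed form, two scores that share one object are correlated through the object-level effect, while scores computed on disjoint pairs are uncorrelated.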
We begin designing the test statistic of our α-level test by defining the conditional likelihood of the vector of scores involving at least one trace object, given the vector of scores involving only control objects, . We then define our test statistic as the function
(2) 
where is a random vector of scores calculated between pairs of objects involving at least one trace object when the trace objects truly originate from the same source as the control objects. The distribution of , obtained using the structure of the covariance matrix defined in (1), is
(3) 
Using this test statistic, we decide to reject at a specific level if
(4) 
where is a constant chosen to satisfy
(5) 
For a well-behaved test, by construction of . In practice, there is uncertainty about and the distribution of under is not necessarily uniform. Thus, enables us to formally control the type-I error rate of our test. The chosen test statistic has some interesting properties:

decreases as the level of dissimilarity between the trace and control objects increases; hence, will tend to 0 as the dissimilarity between trace and control objects increases. Therefore, is a strictly positive function and the test defined in (4) is a left tail test;

only requires to be random and considers fixed. This enables the test statistic to be “anchored” on the characteristics observed on the control objects sampled from the source considered under . In the forensic context, this critical property renders the test specific to the source suspected to have generated the trace fragments.
3.2 Accounting for the uncertainty on under
In most situations, is not known and must be learned from . Armstrong et al. (2017) show that an analytical solution to estimate from exists. Instead of replacing by a point estimate, , in (2), we integrate out the uncertainty associated with the model parameters by considering the posterior distributions of , , and , given . In this context, we decide to reject if
(6) 
The posterior distribution is not a typical Normal-Inverse-Gamma distribution due to the coupling of and in the covariance matrix of . It is trivial enough to develop a Gibbs sampler to obtain a sample from the distribution. Nevertheless, as we will see in Section 4.1, it is not necessary. The integral in (6) is easily estimated by simulation using Algorithm 1.
The output of Algorithm 1 converges to as , since
(7)  
3.3 Determining
In most situations, the distribution of may not be uniform since is unknown. Therefore, we must determine empirically. This can be achieved in several ways depending on whether we want to condition on , or have a decision point that will ensure an average type-I error rate across all possible sources in a population.
Conditioning on implies that the test is specific to the source of the observed control objects. It also presents the advantage that can be entirely determined by resampling scores using (3) and the vector of scores calculated using the observed control objects. However, in this situation, relies heavily on the assumption of normality of the distribution of the scores calculated between objects from the considered source. Furthermore, this strategy assumes that is a typical sample from its distribution. When is far from the expectation of its distribution, or when the distribution is not normal, the type-I and type-II errors of tests conducted using will vary in unpredictable ways. Alternatively, a source-specific can be determined by obtaining a very large number of objects from the considered source and using disjoint subsets of these objects to study the empirical distribution of under . This process has to be repeated for each new test. In most situations, this alternative strategy will be cost-prohibitive.
The unconditional , obtained using Algorithm 2, has the main advantage that it can be determined for a type of evidence based on a large validation experiment prior to the introduction of the method in casework. By construction, using an unconditional guarantees that the average type-I error for the considered type of evidence is . However, the type-I error rate cannot be finely controlled for a given specific source. Determining using this strategy requires samples from a large number of sources. We note that these samples are required to calculate the power of the test, as well as the match probability in the second stage of the approach, and therefore, should be acquired anyway.
As discussed above, there is a fair possibility that the conditional obtained for a specific source does not correspond to the desired level of the test. While this possibility also exists with the unconditional , the guarantee that the size of the test is on average for the considered evidence type and the ability to determine from a large empirical experiment prompts us to recommend the second approach.
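As a rough illustration of the unconditional strategy, the decision point can be read off the empirical distribution of the test statistic computed on same-source comparisons pooled over many sources. The sketch below is not Algorithm 2: the statistic values are simulated from an arbitrary non-uniform distribution as a stand-in for real same-source results, and the names `null_stats`, `alpha` and `c` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for test-statistic values obtained under the same-source
# hypothesis across many sources. ASSUMPTION: a Beta(2, 1.5) draw is used
# only to produce non-uniform values in (0, 1); real values would come
# from a validation experiment.
null_stats = rng.beta(2.0, 1.5, size=10_000)

alpha = 0.05

# Left-tail test: reject when the statistic is <= c, so c is taken as the
# empirical alpha-quantile of the same-source statistic values.
c = np.quantile(null_stats, alpha)

# Average type-I error achieved on these same-source comparisons.
empirical_size = np.mean(null_stats <= c)
print(f"c = {c:.4f}, empirical size = {empirical_size:.4f}")
```

Because the decision point is a quantile of the pooled distribution, the rejection rate matches the target level on average over sources, but not necessarily for any single source.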
3.4 Power of the test
The power of the test introduced in Sections 3.1 and 3.2 cannot be derived analytically given the dimension of the considered objects and the parameter space of the test statistic. However, it can be determined empirically using a reference library of sources that are known to have different characteristics in the input space (e.g., the same collection of sources that is used to determine in Algorithm 2). Using this library, it is possible to empirically determine the power of the test for fixed numbers of trace and control objects, using Algorithm 3.
We stress that the power of our test for a specific level is not equivalent to the match probability assigned during the second of the two stages of our approach. The power of the test informs on the average probability of erroneously concluding that two sets of objects are indistinguishable as a function of the level of dissimilarity between these two sets. It is determined using sources that are known to have characteristics that are different from each other. The second stage of the approach informs on the case-specific probability that a randomly selected source from a population of potential sources will be a plausible source for the trace objects considered in a case.
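The empirical power described in this section amounts to a rejection rate over comparisons between sets of objects that are known to come from different sources. The following sketch is not Algorithm 3; it assumes the test statistic has already been computed for each different-source comparison, and the function name `empirical_power`, the listed statistic values and the decision point are all hypothetical.

```python
def empirical_power(stats_different_sources, c):
    """Fraction of different-source comparisons rejected by the left-tail test."""
    rejections = sum(1 for t in stats_different_sources if t <= c)
    return rejections / len(stats_different_sources)


# Hypothetical statistic values from five different-source comparisons.
stats = [0.001, 0.2, 0.003, 0.0005, 0.04]
power = empirical_power(stats, c=0.0065)
print(power)
```

In practice the comparisons would be grouped by level of dissimilarity between the two sources to obtain a power curve rather than a single number.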
4 Computational considerations
Calculating in Algorithm 1 requires posterior samples from , and at each iteration of the algorithm. We face three challenges when calculating for large , , or :

Using a Gibbs sampler to obtain a sample from involves a large number of iterations to obtain a reasonable sample size for due to the need to account for the burn-in period and thinning;

Sampling from requires calculating the determinant and inverse of for each new value of and ; this may quickly become cumbersome depending on the dimension of and the number of samples needed;

Similarly, sampling from requires calculating the determinant and inverse of the conditional covariance matrix in (3) for each new sample ; again, this may become a challenge as the dimension of , the dimension of , and the number of samples required increase.
In the following sections, we propose solutions that allow for removing these computational bottlenecks, and enable us to use Algorithm 1 with large values for , and .
4.1 Posterior sample from
Rather than using a Gibbs sampler to obtain posterior samples from , we capitalise on the fact that the sums of squares, and , used in the estimation of and in Armstrong et al. (2017) are independent, such that
Defining and , we can sample from
(8)  
Assuming Inverse-Gamma conjugate prior distributions for and , we have that
(9)  
Finally, we can obtain a joint sample of and from a sample of and using
(10) 
Similarly, we can obtain a posterior sample for from a joint sample of and by assuming a Normal prior with mean and variance parameters and
(11) 
The covariance matrix is a function of and (Section 3.1). The parameters and of the posterior distribution of are equal to
(12) 
Note that we are not concerned with the choice of the hyperparameters, and that different choices of prior for may be considered (e.g., subjective, uninformative, or obtained from the empirical study of a large sample from a population of objects).
This approach allows us to directly generate samples from . It does not require a burn-in period or thinning, and therefore does not waste computational resources. However, this approach still requires calculating the determinant and inverse of for each sample of and to obtain a new sample of .
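For a fixed covariance matrix, a Normal prior on the common mean of the scores is conjugate, and the posterior can be sampled directly. The sketch below assumes a vector of scores with a common mean and a known covariance matrix, and applies the standard conjugate update: the posterior precision is the prior precision plus 1^T Sigma^{-1} 1, and the posterior mean is the precision-weighted combination of the prior mean and the data. It is offered as the textbook result, not as a transcription of (12); the function name `theta_posterior` and its arguments are hypothetical.

```python
import numpy as np


def theta_posterior(scores, cov, prior_mean, prior_var):
    """Posterior mean and variance of a common mean under a Normal prior."""
    ones = np.ones_like(scores)
    w = np.linalg.solve(cov, ones)            # Sigma^{-1} 1
    post_prec = 1.0 / prior_var + ones @ w    # prior precision + 1^T Sigma^{-1} 1
    post_var = 1.0 / post_prec
    post_mean = post_var * (prior_mean / prior_var + w @ scores)
    return post_mean, post_var


rng = np.random.default_rng(0)
scores = np.array([1.0, 2.0, 3.0])
cov = np.eye(3)  # ASSUMPTION: identity covariance, purely for illustration
post_mean, post_var = theta_posterior(scores, cov, prior_mean=0.0, prior_var=1.0)

# Direct draws from the posterior: no burn-in or thinning is involved.
draws = rng.normal(post_mean, np.sqrt(post_var), size=1000)
print(post_mean, post_var)
```

In the setting of this section, the covariance matrix would be rebuilt from each joint draw of the two variance components before the mean is sampled.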
4.2 Determinant and inverse of
We avoid the computational cost of repeatedly inverting by taking advantage of its spectral decomposition. Armstrong et al. (2017) show that
has three different eigenvalues
(13) 
with respective multiplicities 1, , and . They also show that
(14) 
where and the
are eigenvectors orthogonal to
, such that(15)  
Since is fixed, the general structure of is fixed. Thus, to obtain for any new values of and , only the eigenvalues need to be recalculated. This enables us to efficiently obtain the new value for the determinant of and the inverse of that matrix at each iteration of Algorithm 1.
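The saving described above can be checked numerically. The sketch below assumes, as one reading of the two-random-effects model, that the covariance matrix has the form var_a * P P^T + var_e * I, with P the pairwise design matrix. Under that assumption the eigenvectors depend only on P and can be computed once, the eigenvalues of P P^T are 2(n-1), n-2 and 0 with multiplicities 1, n-1 and n(n-3)/2, and each new pair of variances only rescales them. All names (`pair_design_matrix`, `logdet_and_inverse`, `d`, `V`) are hypothetical.

```python
import itertools

import numpy as np


def pair_design_matrix(n):
    """One row per unordered pair (i, j), with ones in positions i and j."""
    pairs = list(itertools.combinations(range(n), 2))
    P = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        P[row, i] = 1.0
        P[row, j] = 1.0
    return P


n = 6
P = pair_design_matrix(n)

# Done once: eigen-decomposition of the fixed structural part P P^T.
d, V = np.linalg.eigh(P @ P.T)


def logdet_and_inverse(var_a, var_e):
    """Log-determinant and inverse of var_a * P P^T + var_e * I (ASSUMED form)."""
    lam = var_a * d + var_e        # only the eigenvalues change
    logdet = np.sum(np.log(lam))
    inverse = (V / lam) @ V.T      # V diag(1 / lam) V^T
    return logdet, inverse


logdet, inverse = logdet_and_inverse(var_a=0.5, var_e=0.2)
print(logdet)
```

Each call costs one matrix product rather than a fresh factorisation, which is the point of reusing the fixed eigenvectors inside a sampling loop.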
4.3 Resampling from
To generate samples of scores from in (3), we exploit the properties of the Cholesky decomposition of . We define , where is a lower triangular matrix. It follows that any vector , where and , has mean and covariance , and thus is a sample from . While has to be recalculated for each new sample of , calculating the Cholesky decomposition of is significantly faster than determining its inverse by other methods.
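The Cholesky construction described above can be sketched as follows. The mean vector and covariance matrix are arbitrary illustrative values, not quantities from the model, and the function name `sample_mvn_cholesky` is hypothetical.

```python
import numpy as np


def sample_mvn_cholesky(mean, cov, size, rng):
    """Draw `size` vectors from N(mean, cov) as mean + L z, with cov = L L^T."""
    L = np.linalg.cholesky(cov)                        # lower triangular factor
    z = rng.standard_normal((size, mean.shape[0]))     # rows are z ~ N(0, I)
    return mean + z @ L.T                              # each row is mean + L z


rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0, 0.5])
cov = np.array([[2.0, 0.6, 0.3],
                [0.6, 1.0, 0.2],
                [0.3, 0.2, 0.5]])

draws = sample_mvn_cholesky(mean, cov, 200_000, rng)
print(draws.mean(axis=0))
print(np.cov(draws, rowvar=False))
```

With a large number of draws, the sample mean and sample covariance should be close to the inputs, which gives a quick sanity check on the factorisation.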
5 Second stage: assigning the probability of match
The focus of the second stage of the approach is to assess the value of finding that the sources of the trace and control objects are indistinguishable from one another (the second stage is not performed when the first stage results in the rejection of the hypothesis of common source at the selected level). This value is a function of the number of sources, in a population of potential sources, that are also indistinguishable from the source of the trace objects. Thus, the second stage aims at assigning a so-called probability of match. Ideally, assigning this probability would require some knowledge of how the characteristics observed on the trace are distributed over the population of potential sources; in turn, this would require defining a likelihood function, which, as mentioned previously, may not exist for most forensic evidence types.
Instead, for the time being, we propose to follow Parker (1967), and repeatedly test whether each source from a sample of sources from a population is indistinguishable from the source of the trace objects using Algorithm 1, keeping fixed. This process is reflected in Algorithm 4. It allows for using the relative frequency of indistinguishable sources as a proxy for the match probability, as is already typically done for single contributor forensic DNA profiles (Butler, 2015). This process consists in performing multiple dependent α-level tests, since the trace objects are common to all tests. The quality of the relative frequency of indistinguishable sources as an estimate of the match probability depends on how the type-I and type-II errors combine across these different tests. We are currently working on methods to propose a solution to the issue of multiple dependent testing in the context of the proposed model.
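The relative frequency used as a proxy for the match probability can be sketched as below. This is not Algorithm 4: it assumes that the first-stage statistic has already been computed once per source in the population sample, with the trace objects held fixed, and the function name `relative_match_frequency`, the statistic values and the decision point are hypothetical.

```python
def relative_match_frequency(stats_per_source, c):
    """Fraction of sources that the left-tail test fails to reject."""
    not_rejected = [t > c for t in stats_per_source]
    return sum(not_rejected) / len(not_rejected)


# Hypothetical statistic values, one per source in the population sample.
stats_per_source = [0.30, 0.002, 0.0001, 0.05, 0.0004, 0.003, 0.0009, 0.12]
rmp = relative_match_frequency(stats_per_source, c=0.0065)
print(rmp)
```

Since every test reuses the same trace objects, the individual outcomes are dependent, so this frequency should be read as a descriptive summary rather than as a binomial proportion with independent trials.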
6 Application of the proposed method to paint evidence
6.1 Data
In this section, we apply the proposed approach to Fourier-Transform Infrared spectroscopy (FTIR) spectra of paint chips from cans of common household paint. The paint chips in this example come from 166 different paint cans. For each paint source, we observe seven replicates. Each of the replicates corresponds to a new, distinct observation and is not a repeated measurement on a single paint chip; that is, the seven replicates correspond to seven exchangeable FTIR spectra. Each spectrum represents the absorbance of the paint material for a range of wavenumbers (from 550 cm⁻¹ to 4,000 cm⁻¹) and is captured by a 7000-dimensional vector.
Since we only observe seven spectra per source, for the purpose of this example, we treat the spectra as functional data and express each one as a linear combination of 300 B-spline bases. We assume that the vectors of basis coefficients are Multivariate Normal, and we use the sample mean and covariance matrix of the coefficients for the seven spectra as point estimates for the parameters of their distribution. This strategy is fit-for-purpose in the context of this example, and enables us to “resample” new spectra from a considered source to study the behaviour of our model under different conditions. Figure 1 shows the reasonableness of this approach. It presents seven observed spectra overlaid with seven simulated spectra from the same can of household paint.
6.2 Kernel function
Our kernel function measures the dissimilarity between two spectra/vectors by considering their cross-correlation (lags -10 to 10) and the Euclidean norm of their difference. As part of the comparison process, the kernel function filters out uninformative areas in a pair of spectra (see Figure 2 for an example of the results of the filtering process). Appendix B shows the marginal distributions of score vectors resulting from our kernel function.
6.3 Determination of
Figure 3 shows the distribution of the test statistic under , using the unconditional scenario described in Section 3.3. The three curves correspond to three scenarios where we consider , and control objects and trace objects. Although each distribution of the test statistic diverges from , we can use these distribution functions to empirically control the level of the test (see Table 1).
α level    0.05      0.10      0.25      0.50      0.75      0.90      0.95
           0.001895  0.006500  0.044075  0.217650  0.621125  0.872610  0.948015
           0.006495  0.022760  0.112025  0.361600  0.697725  0.907480  0.966925
           0.007095  0.024390  0.116750  0.395150  0.784650  0.943900  0.984020
NOTES: Corresponding ’s for various values of α associated with the ECDFs of Figure 3 when , , and . Bolded values correspond to those used throughout this example.
6.4 Stage One: Power
Figure 4 presents the power curves associated with the test when , , and as a function of the level of dissimilarity between the average spectra for each source in a pair of sources, , for when . Each curve uses the corresponding bolded value of in Table 1 to determine the power as in Algorithm 3.
Figure 4 exhibits three typical behaviours of the power function. First, the power of the test approaches one as the distance between two sources increases. Thus, as the characteristics of the trace source become increasingly different from the characteristics of the control source, the test is increasingly able to detect a difference between the sources of the two sets of objects. Second, the power of the test approaches one at a faster rate as the number of control objects increases. Considering a larger set of control objects allows for more precisely assigning the distribution for the within-source comparisons of the control objects, . This consequently improves the ability of the test to differentiate between sets of trace and control objects originating from different sources. Finally, Figure 4 verifies that the average type-I error for the test is indeed .
6.5 Stage two: random match probability (RMP)
The random match probability of any given fixed set of trace objects can be estimated using Algorithm 4. This algorithm considers a fixed set of trace objects and a randomly sampled set of control objects for each of the sources representing the population of potential sources. For each source, taken in turn, Algorithm 1 is used to test whether the sources of the and objects are indistinguishable. The result of each test will be influenced by the random selection of the control samples for the considered source; hence the RMP estimate for a fixed set of trace objects may vary with different random sets of control samples from a fixed set of sources. Figure 5 shows the variability between RMP estimates of a unique set of trace objects originating from the source indicated by the x-axis when sets of , and 15 control objects are repeatedly sampled from the other 165 sources. Each boxplot represents 20 repetitions of Algorithm 4 for the same set of trace objects.
Several conclusions can be drawn from the data presented in Figure 5. First, the RMP of trace samples from different sources is not the same (i.e., the locations and spreads of the boxplots vary between different sources). This indicates that some sources of paint appear to have characteristics that are less common in a population of paint than others. Evidence represented by an association between a set of trace objects and control objects from a source displaying such “rare” characteristics will carry more weight. Second, the median RMP estimate for a unique set of trace objects from a given source is much smaller when sources from the population are represented by and 15 control objects, than when they are represented by control objects. This is a direct result of the observation, in the previous section, that the power of our test increases with the number of control objects. Third, the spread of the RMP estimates for a unique set of trace objects is also much smaller when the sources from the population are represented by and 15 control objects. Greater numbers of observed samples per source imply less uncertainty on the test statistic’s parameters, which in turn results in greater precision of the RMP estimates. Finally, increasing the number of control samples from 10 to 15 does not appear to drastically improve the quality of the RMP estimates, despite the significant increase in computational cost.
Since RMP estimates are conditioned on the observed trace objects, it is possible that the variability in the locations of the RMP estimates presented in Figure 5 is only due to the particular choice of sets of trace objects used in the experiment, and not to the respective rarity of the characteristics of their sources in a population of paint. To test whether the apparent variability in rarity of paint characteristics indicated by the data in Figure 5 is genuine, we repeated the experiment that led to Figure 5, except that we did not keep the trace objects fixed. The trace objects were randomly sampled for each of the 20 repetitions of Algorithm 4. The results of this experiment are presented in Figure 6 and confirm the conclusions drawn from the data in Figure 5.
7 Benefits and limitations of the two-stage approach
While the two-stage approach was proposed several decades ago, very little work has been done to formalise and develop it. Nevertheless, this approach has several advantages over the more commonly advocated Bayes factor, and it is not surprising that the value of many evidence types is assessed using some form of (possibly informal) two-stage approach (Aitken and Lucy, 2004).
Firstly, the flow of the two-stage approach appears natural to forensic scientists, legal practitioners and lay individuals: (1) the trace and control objects are compared to determine if they could come from the same source; (2) if the two sets of objects are considered similar, the implications of this finding are assessed. Each stage focuses on its own specific question. Because the two stages appear so well separated, yet logically connected, they are easy to explain, and even easier to understand by a lay audience, such as a jury (Neumann et al., 2016). Scientists can discuss their conclusions for each stage in turn, and how they fit into the overall inference problem. In addition, the issue of error rates arises naturally in relation to the decision that has to be made at the end of the first stage. The clarity of the two-stage approach to lay individuals has to be put in perspective with the confusion that usually occurs, even among scientists and legal practitioners, when they are asked to use a Bayes factor to update their prior beliefs on the source of trace samples.
Secondly, performing Bayesian model selection using high-dimensional complex data requires using likelihood functions that rely on intractable probability measures in the input space of the data. In addition, data dimension reduction techniques, such as principal component analysis, may engender loss of information that may impact the weight of the evidence in unpredictable ways (e.g., the wrong model may end up being supported at an unknown rate) and may not be applicable to heterogeneous complex data types. In contrast, it is almost always possible to design a test statistic, study its distribution empirically using a large sample of pairs of objects from the same source, and control the level of the test, even in the context of high-dimensional complex data.
In this paper, we propose a semiparametric model that offers several major advantages over a fully empirical approach:

Assuming that our decision criterion is an unconditional , for the reasons discussed in the previous section, and that the value of for the desired level has been obtained based on a large scale experiment prior to using the test in an operational situation, using the test only requires considering the observations made on the trace and control objects;

The test accounts for the specific characteristics of the source of the control objects;

The same test statistic can be used for any type or dimension of data. It can be tailored to the data through the use of multiple kernel functions that can be combined (Schoelkopf and Smola, 2001) to maximise the power of the test and ensure that the main assumption of the model, the normality of the score distribution under the same-source hypothesis, is satisfied;

By construction, the test statistic requires only three parameters to be considered, irrespective of the type and dimension of the raw data. We have shown in the previous sections that this enables us to implement efficient computational strategies to calculate the test statistic and study its distribution under the same-source hypothesis and its power given a suitable sample of sources from a population.
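For comparison, the fully empirical route mentioned before this list (study the statistic on many same-source pairs, then fix the level of the test) can be sketched as follows. This is a minimal illustration under stated assumptions: larger values of the statistic are taken to indicate greater dissimilarity, the same-source statistics are simulated from a chi-square distribution purely as a stand-in, and the names `empirical_threshold` and `first_stage_rejects` are hypothetical.

```python
import numpy as np

def empirical_threshold(same_source_stats, alpha=0.05):
    """Threshold exceeded by roughly a fraction alpha of same-source statistics."""
    return float(np.quantile(same_source_stats, 1.0 - alpha))

def first_stage_rejects(stat, threshold):
    """Exclude the source when the statistic lies beyond the threshold."""
    return stat > threshold

rng = np.random.default_rng(1)

# Stand-in for a large calibration experiment on same-source comparisons
calibration = rng.chisquare(df=3, size=100_000)
thr = empirical_threshold(calibration, alpha=0.05)

# Check the achieved level on fresh same-source comparisons
fresh = rng.chisquare(df=3, size=100_000)
print(thr, np.mean(first_stage_rejects(fresh, thr)))
```

The threshold obtained this way depends on the calibration sample as a whole rather than on the source of the control objects in a given case, which is the contrast drawn by the list above.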
The two-stage approach suffers from several limitations, and, when possible, Bayes factors are preferred to support the inference process. The main objection to the two-stage approach is that evidence evaluated in this way cannot be combined with other pieces of evidence in a logical and coherent manner (see Robertson and Vignaux (1995) for a discussion). Another major flaw has been described by Robertson and Vignaux (1995) as the “fall-off-the-cliff” effect (similar to Lindley’s paradox (Lindley, 1957)): the decision to reject the hypothesis of a common source during the first stage relies only on whether the value of the test statistic is smaller or larger than a given threshold, and not on the magnitude of the distance between the test statistic and the threshold. Values of the test statistic just beyond the decision threshold will result in a drastically different decision (i.e., exclusion of the considered source) than values just before the threshold (i.e., association of the trace and control objects). In practice, this implies that the source of the control objects is either unequivocally excluded as the source of the trace objects (if the same-source hypothesis is rejected), or that the inference process exclusively favours the same-source hypothesis over the alternative (since the match probability of the trace objects in a population of sources will always be lower than or equal to 1). By design, the two-stage approach cannot result in a situation where the alternative is favoured over the same-source hypothesis without the latter being entirely excluded. A further issue with the two-stage approach is related to the power of the test as the quality of the information contained in the trace and control objects decreases. Decreasing quantity and quality of information result in failing to reject the same-source hypothesis at a higher rate. For most applications of statistical hypothesis testing, this would be considered conservative. However, the situation in the forensic context is reversed: failing to reject the same-source hypothesis implies that the suspected source cannot be excluded, and critically, that the inference process will favour the hypothesis that the considered source is in fact the source of the trace over the hypothesis that the trace material originates from another source in a population of potential sources. In other words, traces with lower quality and quantity of information (i.e., bearing less discriminating features) will be easier to associate with any given suspected source. In the context of the criminal justice system, this behaviour of the test is clearly biased in favour of the prosecution.
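The “fall-off-the-cliff” behaviour can be made concrete with a small sketch. It assumes a statistic for which larger values indicate greater dissimilarity, a match probability that has already been estimated, and a summary of the second stage as the reciprocal of that probability; the function name `two_stage_value` and all numbers are hypothetical.

```python
def two_stage_value(stat, threshold, rmp):
    """Return 0.0 if the first stage excludes the source, else 1 / rmp (always >= 1)."""
    if not 0.0 < rmp <= 1.0:
        raise ValueError("rmp must lie in (0, 1]")
    if stat > threshold:
        return 0.0       # exclusion: the same-source hypothesis is rejected outright
    return 1.0 / rmp     # association: support can only favour the same source

threshold, rmp = 7.8, 0.02
for stat in (7.7, 7.79, 7.81, 7.9):
    print(stat, two_stage_value(stat, threshold, rmp))
```

Statistics on either side of the threshold, however close to it, map to outcomes at opposite extremes, and no value of the statistic yields an outcome strictly between 0 and 1.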
8 Conclusion
In this paper, we develop and formalise a two-stage framework for the inference of the source of trace objects in a forensic context. Our approach is particularly useful when the objects are characterised by high-dimensional and complex data, such as chemical spectra, for which likelihood-based inference is not possible.
Although it is not without limitations, the two-stage approach presented in this paper has several major advantages. First, our method provides a framework that enables structured and statistically rigorous inferences in forensic science. The proposed approach may not be as logical and coherent as a fully Bayesian inference framework; however, because its two stages address different and well-defined issues related to the inference process, cognitive research suggests that the two-stage approach is a more natural reasoning framework for forensic practitioners and lay individuals alike.
Second, the test statistic and associated likelihood structure proposed in this paper are invariant to the type and dimension of the considered data. The test statistic relies on a kernel function that can be tailored to suit any situation. Thus, the same test statistic can be used in almost any situation where high-dimensional, complex and heterogeneous data are considered. In addition, the test’s only major assumption can be satisfied through the design of the kernel function, and is naturally satisfied as the dimension of the objects increases.
Much work remains to be done before implementing this methodology in forensic practice. For example, we are developing a model to estimate the match probability in the second stage of the approach and replace the current empirical strategy originally proposed by Parker. Furthermore, large reference collections of different types of evidence (e.g., paint, fibres, glass) need to be gathered.
The application of our method to FTIR data of paint shows two important results: FTIR spectra of paint contain highly specific information that enables discrimination of paint samples from different sources; and the characteristics of some paint sources are rarer than others and will carry more probative value. Our results show not only that paint evidence is very probative in general, but also that our approach works well with the number of samples typically encountered in casework. Our approach can easily be implemented to determine the probative value of paint evidence in any given case, hence addressing the recurrent criticisms related to the lack of quantitative support for forensic conclusions. Finally, our approach can easily be extended to other evidence types, such as transferred automotive paint in road accidents, transferred glass fragments during burglaries, assaults or shootings and transferred fibres from items of clothing during assaults.
References
Aitken, C. and D. Lucy (2004). Evaluation of trace evidence in the form of multivariate data. Applied Statistics 53(4), 109–122.
Aitken, C., P. Roberts, and G. Jackson (2010). 1. Fundamentals of probability and statistical evidence in criminal proceedings. In Communicating and Interpreting Statistical Evidence in the Administration of Criminal Justice, Guidance for Judges, Lawyers, Forensic Scientists and Expert Witnesses. Royal Statistical Society.
Armstrong, D. (2017). Development and properties of kernel-based methods for the interpretation and presentation of forensic evidence. Dissertation, South Dakota State University. https://openprairie.sdstate.edu/etd/2175/.
Armstrong, D., C. Neumann, C. Saunders, D. Gantz, J. Miller, and D. Stoney (2017). Kernel-based methods for source identification using very small particles from carpet fibers. Chemometrics and Intelligent Laboratory Systems 160, 99–209.
ASTM E2926-13 (2013). Standard Test Method for Forensic Comparison of Glass Using Micro X-ray Fluorescence (μ-XRF) Spectrometry. ASTM International, West Conshohocken, PA.
Butler, J. (2015). Advanced Topics in Forensic DNA Typing: Interpretation. Elsevier Academic Press.
Champod, C. and F. Taroni (1997). Bayesian framework for the evaluation of fibre transfer evidence. Science and Justice 37, 75–83.
Curran, J., C. Triggs, J. Almirall, J. Buckleton, and K. Walsh (1997). The interpretation of elemental composition measurements from forensic glass evidence: I. Science and Justice 37, 241–244.
Evett, I. (1977). The interpretation of refractive index measurements. Forensic Science 9, 209–217.
Kaye, D. (2017). Hypothesis testing in law and forensic science: A memorandum. Harvard Law Review Forum 130(5), 127–136.
Kingston, C. (1965). Applications of probability theory in criminalistics. Journal of the American Statistical Association 60, 70–80.
Kirk, P. L. (1953). Crime Investigation (Second ed.). John Wiley and Sons Ltd.
Lindley, D. (1957). A statistical paradox. Biometrika 44, 187–192.
Massonnet, G., L. Gueissaz, and C. Muehlethaler (2014). Paint: interpretation. Wiley Encyclopedia of Forensic Science.
Muehlethaler, C., G. Massonnet, and P. Esseiva (2014). Discrimination and classification of FTIR spectra of red, blue and green spray paints using a multivariate statistical approach. Forensic Science International 244, 170–178.
Neumann, C., D. Kaye, G. Jackson, V. Reyna, and A. Ranadive (2016). Presenting quantitative and qualitative information on forensic science evidence in the courtroom. CHANCE 29, 37–43.
Park, S. and A. Carriquiry (2018). Learning algorithms to evaluate forensic glass evidence.
Parker, J. (1966). A statistical treatment of identification problems. Journal of FSSoc 6, 33–39.
Parker, J. (1967). The mathematical evaluation of numerical evidence. Journal of FSSoc 7, 134–144.
Parker, J. and A. Holford (1968). Optimum test statistics with particular reference to a forensic science problem. Journal of the Royal Statistical Society. Series C 17(3), 237–251.
Robertson, B. and G. Vignaux (1995). Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley and Sons Ltd.
Schoelkopf, B. and A. Smola (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimisation and Beyond (Adaptive Computation and Machine Learning) (1 ed.). The MIT Press.
Taroni, F., C. Champod, and P. Margot (1999). Forerunners of Bayesianism in early forensic science. Journal of Forensic Identification 49, 285–305.
Appendix A: Summary of the main results from Armstrong et al. (2017)
Given two vectors of measurements and representing the observations made on two objects, , sampled from a common source, a kernel function, , is used to measure their level of similarity and report it as a score, . The score is represented by a linear random effects model
(A.1) 
where is the expected value of the score between any two objects from the same considered source; , are random effects representing the contributions of the th and th objects, and is a lack of fit term, such that and , and .
We note that the kernel function at the core of the model can be designed to accommodate virtually any type of data. It needs to satisfy only two requirements: it must be a symmetric function, that is, its value must not depend on the order of its two arguments; and it must ensure that the marginal distribution of the score is Normal to satisfy the assumption made on the score model in (A.1). The assumption of normality is the main assumption made by Armstrong et al. (2017) when developing their model; it is reasonable for high-dimensional objects and can be satisfied through careful design of the kernel function (Armstrong, 2017).
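As a minimal sketch of the symmetry requirement, the code below combines two standard kernels into one and builds the vector of all pairwise scores for a small set of objects. The choice of kernels, the weights, the simulated objects and the names `combined_kernel` and `score_vector` are assumptions made for illustration; the sketch does not check the normality requirement.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel on the squared Euclidean distance."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(d, d)))

def linear_kernel(x, y):
    """Inner product of the two measurement vectors."""
    return float(np.dot(x, y))

def combined_kernel(x, y, weights=(0.5, 0.5)):
    """Non-negative weighted sum of kernels; symmetric because each term is."""
    return weights[0] * rbf_kernel(x, y) + weights[1] * linear_kernel(x, y)

def score_vector(objects, kernel):
    """All pairwise scores for i < j, ordered by (i, j)."""
    n = len(objects)
    return np.array([kernel(objects[i], objects[j])
                     for i in range(n) for j in range(i + 1, n)])

rng = np.random.default_rng(2)
objects = rng.normal(size=(4, 50))   # 4 simulated objects, 50 features each
s = score_vector(objects, combined_kernel)
print(s.shape)  # (6,)
```

Because the kernel is symmetric, each unordered pair of objects contributes a single score, so 4 objects yield 6 scores.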
The vector of all possible pairwise comparisons between reference objects can be represented by a vector, of objects, given by . The multivariate extension of the model in (A.1) to is given by
(A.2) 
where is a one vector of length , is an design matrix (where each row represents an combination, consisting of ones in the and columns and zeros elsewhere), is the vector of random effects for the considered objects, and is the vector of corresponding to each pair of objects. By construction,
(A.3) 
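As a reading aid, the model described in (A.1) to (A.3) can be written out in one possible notation. Every symbol below (the score s_ij, the mean θ, the random effects a_i, the lack-of-fit term ε_ij, the variances σ_a² and σ_ε², the design matrix P, the number of objects n and the number of pairs N) is an assumption chosen to match the verbal description, and may differ from the notation used by Armstrong et al. (2017).

```latex
% Score between objects i and j sampled from the same source
s_{ij} = \theta + a_i + a_j + \varepsilon_{ij}, \qquad
a_i \sim N(0, \sigma_a^2), \quad
\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)

% Stacked vector of the N = n(n-1)/2 pairwise scores
\mathbf{s} = \theta \mathbf{1}_N + P \mathbf{a} + \boldsymbol{\varepsilon}

% Implied distribution when the a_i and \varepsilon_{ij} are mutually independent
\mathbf{s} \sim N\!\left(\theta \mathbf{1}_N,\;
  \sigma_a^2 P P^{\top} + \sigma_\varepsilon^2 I_N\right)
```

In this notation, each row of P contains exactly two ones, so two scores are correlated only when they share an object.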
Armstrong et al. (2017) show that has three different eigenvalues
(A.4) 
with multiplicity 1, , and respectively. Armstrong et al. (2017) also show that
(A.5)  
where and are eigenvectors orthogonal to . Importantly, Armstrong et al. (2017) note that
(A.6)  
Using these results, Armstrong et al. (2017) show that the likelihood function, , can be rewritten as an independent sum of squares
(A.7)  
where is the average of the elements in , and
(A.8)  
where is the average of the elements in involving object .
Finally, Armstrong et al. (2017) show that closed-form estimates for the parameters of the model exist and can be derived from Table A.1 to obtain
(A.9) 
Source  df  SS  MS  E(MS) 

A  
Error  
Total 
At this point, we simply note that we presented the results obtained by Armstrong et al. (2017) for a set of objects known to come from a single source and that it is trivial to scale these results for a vector containing the pairwise scores resulting from the cross-comparisons of objects, if they are assumed to originate from the same source.
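The three-eigenvalue structure summarised in this appendix can be checked numerically. The sketch below assumes that the covariance of the score vector is var_a · P Pᵀ + var_e · I, where P has ones in the two columns of each pair, var_a is the variance of the object effects and var_e is the variance of the lack-of-fit term; these names, the helper functions and the numerical values are illustrative assumptions.

```python
import itertools
import numpy as np

def design_matrix(n):
    """N x n matrix with ones in columns i and j for each pair (i, j), i < j."""
    pairs = list(itertools.combinations(range(n), 2))
    P = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        P[row, i] = P[row, j] = 1.0
    return P

def score_covariance(n, var_a, var_e):
    """Covariance of the pairwise scores under the assumed random effects model."""
    P = design_matrix(n)
    return var_a * P @ P.T + var_e * np.eye(P.shape[0])

n, var_a, var_e = 6, 2.0, 0.5
Sigma = score_covariance(n, var_a, var_e)

# Eigenvalues of the symmetric covariance matrix, grouped by value
eigvals = np.linalg.eigvalsh(Sigma)
values, counts = np.unique(np.round(eigvals, 8), return_counts=True)
print(values, counts)
```

Under these assumptions, the distinct eigenvalues are var_e, (n − 2)·var_a + var_e and 2(n − 1)·var_a + var_e, with multiplicities N − n, n − 1 and 1, where N = n(n − 1)/2. For n = 6, var_a = 2.0 and var_e = 0.5, this gives 0.5, 8.5 and 20.5 with multiplicities 9, 5 and 1.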
Appendix B: Multivariate normality of scores used in the application of our method
To examine whether the scores considered in this paper approximately satisfy the assumption of multivariate normality of the score model proposed by Armstrong et al. (2017), we consider 332 triplicates of spectra originating from the same source. These 332 triplicates consist of two disjoint triplicates of spectra from each of the 166 paint sources described in Section 6. The top row of Figure A.1 portrays the marginal distributions of the scores in their original space. By expressing the original 3-dimensional vectors of scores as a function of the space defined by the eigenvectors of their sample covariance matrix, we can observe the marginal distributions of the score vectors along orthogonal axes, and better determine if the marginal distributions follow a Normal distribution. Figure A.1 (bottom row) shows that, although the data are approximately spherical in the first two dimensions of the eigenspace, there is a rather significant departure from normality when eigendimensions 2 and 3 are plotted against each other. Given the results in Section 6, we purport that this deviation from multivariate normality does not affect the ability of the model to correctly classify and differentiate spectra, and thus testifies to the robustness of the model: despite the lack of normality, the model is still able to correctly associate and differentiate spectra originating from the same and different sources, respectively.
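The change of basis described in this appendix can be sketched as follows. The score triplets are simulated here as a stand-in for the 332 observed triplicates, and the function names `to_eigenspace` and `skew_kurtosis` are hypothetical; sample skewness and excess kurtosis are used only as rough numerical indicators of departure from normality, not as the diagnostics behind Figure A.1.

```python
import numpy as np

def to_eigenspace(scores):
    """Centre score vectors and express them in the eigenbasis of their sample covariance."""
    centred = scores - scores.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return centred @ eigvecs[:, order], eigvals[order]

def skew_kurtosis(x):
    """Sample skewness and excess kurtosis; both are near 0 for Normal data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(3)
n_triplets = 332

# Simulated same-source scores for pairs (1,2), (1,3), (2,3) of three objects
a = rng.normal(0.0, 1.0, size=(n_triplets, 3))   # object effects
e = rng.normal(0.0, 0.5, size=(n_triplets, 3))   # lack-of-fit terms
scores = 10.0 + a[:, [0, 0, 1]] + a[:, [1, 2, 2]] + e

projected, variances = to_eigenspace(scores)
for k in range(3):
    print(k + 1, variances[k], skew_kurtosis(projected[:, k]))
```

After projection, the coordinates are uncorrelated by construction, so each eigendimension can be inspected on its own or plotted against another.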