

Choosing burn-in

The MCMC algorithm that underlies the treebuilding program starts from a random tree and improves it over successive iterations until it converges to a set of trees that are likely given the marker data. This process is called burn-in. Each burn-in step consists of 10,000 updates to the tree and takes between 3 and 40 seconds on a 2.4 GHz processor with 512 MB of RAM. To ensure convergence of the MCMC, a sufficiently high number of burn-in steps must be selected. This number depends mostly on the size of the analyzed dataset, but other factors, such as the informativeness of the selected markers, can also influence it.

The following table shows the approximate numbers of burn-in steps needed for convergence that we have observed for datasets we analyzed. These numbers provide only a starting point for the analysis of individual datasets.


Table 1: Expected number of burn-in steps until the treebuilding algorithm reaches a sufficient degree of burn-in. Each step consists of 10,000 updates. Sample size is given as the number of diploid individuals in the sample.

Sample size   Number of markers   Burn-in steps
100           25                  100
100           50                  250
100           100                 325
250           25                  300
250           50                  650
250           100                 825
500           25                  600
500           50                  1125
500           100                 1400
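
As a rough aid for planning a run, the table values can be combined with the figures given earlier (10,000 updates per step, 3 to 40 seconds per step on the reference machine). The following Python sketch is only an illustration: the names BURN_IN_STEPS and burn_in_cost are hypothetical and not part of the program, and the timing range is the one quoted for the reference hardware, so actual run times may differ.

```python
# Burn-in steps from Table 1, keyed by (sample size, number of markers).
BURN_IN_STEPS = {
    (100, 25): 100, (100, 50): 250, (100, 100): 325,
    (250, 25): 300, (250, 50): 650, (250, 100): 825,
    (500, 25): 600, (500, 50): 1125, (500, 100): 1400,
}

UPDATES_PER_STEP = 10_000    # stated in the text above
SECONDS_PER_STEP = (3, 40)   # range quoted for a 2.4 GHz processor


def burn_in_cost(sample_size, n_markers):
    """Return (steps, total updates, (min seconds, max seconds)) for a table entry."""
    steps = BURN_IN_STEPS[(sample_size, n_markers)]
    low, high = (steps * s for s in SECONDS_PER_STEP)
    return steps, steps * UPDATES_PER_STEP, (low, high)


steps, updates, (low, high) = burn_in_cost(500, 100)
print(f"{steps} steps = {updates:,} updates, "
      f"roughly {low / 60:.0f} to {high / 60:.0f} minutes")
```

For 500 individuals and 100 markers this gives 1400 steps, or 14,000,000 updates, taking somewhere between about 70 minutes and about 15.5 hours depending on where in the quoted range the per-step time falls.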


To monitor convergence of a Markov Chain Monte Carlo run, a time plot of the likelihood of the estimated variables conditional on the haplotype data can be displayed by right-clicking on the trees of interest in the Tree Display and selecting Show MCMC time series plot in the pop-up menu. This plot displays the probability of observing the marker data conditional on the tree, Pr(Data | Tree). As long as the Markov chain is still converging, the plot increases; after convergence it should move horizontally.
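
The "still increasing versus moving horizontally" criterion can also be checked numerically. The sketch below is a simple heuristic and not the program's own test: it assumes the plotted values are available as a list of numbers (the manual does not say whether they can be exported), and the function name looks_converged and the tolerance parameter are hypothetical.

```python
from statistics import mean, stdev


def looks_converged(trace, tolerance=1.0):
    """Heuristic plateau check on a likelihood trace.

    Compares the mean of the final quarter of the trace with the mean of the
    preceding quarter. Returns True when the increase is at most `tolerance`
    standard deviations of the final quarter, i.e. the trace is no longer
    climbing noticeably.
    """
    quarter = len(trace) // 4
    if quarter < 2:
        raise ValueError("need at least 8 values")
    previous = trace[-2 * quarter:-quarter]
    last = trace[-quarter:]
    spread = stdev(last) or 1e-12  # guard against a perfectly constant tail
    return (mean(last) - mean(previous)) / spread <= tolerance


still_rising = [float(i) for i in range(40)]
plateau = [-100.0 + 10 * i for i in range(10)] + [0.0, 0.5, -0.5, 0.2, -0.2] * 6

print(looks_converged(still_rising))  # False
print(looks_converged(plateau))       # True
```

A check like this only supplements visual inspection of the plot; a flat trace does not by itself prove that the chain has converged.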

If, during treebuilding, the tree at a neighboring locus rather than a random tree is chosen as the starting tree of the MCMC, then the main purpose of burn-in is to make sure that the sampled trees are independent of the starting tree. (This is the case if an unequal number of burn-in steps is chosen during treebuilding; see 6.2.) In this case, the time series plot does not indicate whether burn-in was sufficient, because the probability of the starting tree conditional on the data may not be much smaller than the probability of a tree sampled after convergence.

If this analysis indicates that the selected burn-in was insufficient, it is possible to restart the MCMC from the last generated tree, as described in 6.6.

Incomplete convergence results in a reduced signal from the data. If individual markers show association but the posterior distribution generated by the treepeeling step is essentially flat, this may be a sign that the burn-in is insufficient.


Sebastian Zoellner 2005-01-27