A Segment-ology TIDBIT
Are Match Clusters based on the genealogy or the DNA? Our Matches share both with us. In other words, do Match Clusters tend to focus around a Common Ancestor (CA) with most of the Matches or on a Triangulated Group (TG) with most of the Matches? Do we have an Ancestor first (who points to TGs) or do we have a Cluster TG first (which points to a CA)?
Some have opined that a Match Cluster is the same as, or similar to, a TG. I think Match Clusters form around Ancestors, and that each of our Ancestors can be associated with only certain TGs. In other words, the Ancestors come from Clusters, and Clusters may have multiple TGs.
Let’s look at what we “see” with clustering. With the Leeds Method the focus is on the 4 grandparents. I can assure you that each of our grandparents is associated with multiple TGs. I currently have 380 TGs covering 98% of my 45 chromosomes. Let’s round that to 400 TGs. This means that each grandparent would have roughly 100 TGs (or alternatively, each of those TGs would come from a distant ancestor, down to me through the one grandparent). If we have a Clustering Matrix with 4 large clusters, each one would almost certainly be from a different grandparent and maybe 100 TGs.
When we look at various clustering results, we usually see 4 clusters, or 8 clusters, or 16 clusters, etc. This is a tip-off that each cluster represents an Ancestor at some generation. NB: Clustering is not as precise as Triangulation, and not every Match in a cluster will be from the same Ancestor. Some Matches will share multiple Ancestors with us – sometimes from both sides (Paternal and Maternal) of our ancestry. The Clustering process has a hard time dealing with that – it is an imperfect system. And not all of the Clusters will be from the same generation – but most will be. However, taken as a whole, the Clustering process does a good job of grouping Matches by Ancestors. Each Cluster will represent an Ancestor, even if not every Match in it has that particular Ancestor.
Another way to look at this is to remember that each TG comes from one specific ancestral line (from a distant ancestor down to you). But the reverse is not true – we cannot say that each Ancestor passes down only one TG. Clearly each parent, grandparent, great grandparent, etc. passes down multiple TGs.
With the above in mind, let’s look at Clusters and Match thresholds. The Leeds Method uses second cousins (2C) and above. The 2C are from a Great grandparent couple and represent their child, who is your grandparent. This is why the Leeds Method tends to result in 4 Clusters of Matches – one for each grandparent. The “threshold” here is about 200cM (to accommodate most 2C, which average about 230cM).
If we drop the threshold to about 60cM, we’d pick up mostly 3C (average 74cM) and closer, and we’d wind up with roughly 8 main Clusters in a Matrix – one for each of eight Great grandparents. At AncestryDNA, they use 20cM as a threshold for 4C, but we’ve experienced many actual 5C-8C in this mix. From the Shared cM Project we have the average for a 6C at 21cM, so I’m pretty sure the 20cM threshold will pick up some 7C and 8C, too.
In any case, when I ran the Genetic Affairs Match Clustering program (using a 20cM threshold download from DNAGedcom Client), I got 3571 Matches and 158 clusters. That’s pretty close to 128 clusters, which would be one for each of our 128 5xG grandparents (the Ancestors we share with 6C). This means to me that there were a number of 7C in the mix from 6xG grandparent couples, with a few more 8C Matches, which split some of the clusters and brought the total up to 158. This seems to be a reasonable outcome, as some of the clusters have only 3 Matches.
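The arithmetic behind "158 is pretty close to 128" can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and picking the nearest power of 2 (by rounding the base-2 logarithm) is my own assumption about how to read "pretty close", not part of any clustering tool.

```python
import math


def ancestors_in_generation(g: int) -> int:
    """Number of ancestors g generations back (1 = parents, 2 = grandparents)."""
    return 2 ** g


def generation_label(g: int) -> str:
    """Label a generation the way this post does, e.g. 7 -> '5xG grandparents'."""
    if g == 1:
        return "parents"
    if g == 2:
        return "grandparents"
    if g == 3:
        return "Great grandparents"
    return f"{g - 2}xG grandparents"


def nearest_generation(cluster_count: int) -> int:
    """Generation whose ancestor count (a power of 2) is closest, on a log scale,
    to the observed number of clusters. A rough rule of thumb only."""
    return max(1, round(math.log2(cluster_count)))


g = nearest_generation(158)  # the 158 clusters from the 20cM run
print(g, ancestors_in_generation(g), generation_label(g))
```

For 4, 8 or 16 clusters the same functions point to grandparents, Great grandparents and 2xG grandparents respectively.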
So my conclusion is that Match Clustering results in clusters around your Ancestors. And each cluster may include more than one TG. By selecting a threshold, you can roughly target the generation you want – 200cM for 4 grandparents; 60cM for 8 Great grandparents; 30cM for 16 2xG grandparents; and 20cM will get you roughly 128 5xG grandparents. It gets fuzzier with lower thresholds, because these lower thresholds can be from a range of cousinships.
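As a rough sketch, the rule-of-thumb thresholds above can be written down as a small lookup in Python. The table holds only the numbers quoted in this post; the names (THRESHOLD_TARGETS, matches_at_or_above) are hypothetical, and the three sample "Matches" are simply the average cM values cited above for 2C, 3C and 6C, not real Matches.

```python
# Threshold (cM) -> (approximate number of main clusters, generation targeted),
# taken directly from the rule-of-thumb figures in this post.
THRESHOLD_TARGETS = {
    200: (4, "grandparents"),
    60: (8, "Great grandparents"),
    30: (16, "2xG grandparents"),
    20: (128, "5xG grandparents"),
}


def matches_at_or_above(shared_cm: dict, threshold_cm: float) -> list:
    """Return the names of Matches sharing at least threshold_cm with us."""
    return [name for name, cm in shared_cm.items() if cm >= threshold_cm]


# Stand-in Matches using the average shared cM quoted above for each cousinship.
averages = {"average 2C": 230, "average 3C": 74, "average 6C": 21}

for threshold, (clusters, generation) in THRESHOLD_TARGETS.items():
    kept = matches_at_or_above(averages, threshold)
    print(f"{threshold}cM -> about {clusters} clusters ({generation}); keeps {kept}")
```

Real Matches vary widely around these averages, which is why the lower thresholds pull in a range of cousinships.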
The next, very important, step is to tag each cluster with the most logical CA (most from one generation). Over 230 of my AncestryDNA Matches are in known TGs (from uploads to GEDmatch or tests at other companies). In these cases some TGs might also be tagged to each cluster. This is exciting, because each Cluster can then point to a CA and selected TGs that the other Matches in the Cluster will likely have. I’ve already used the CA clues to find CAs for other Matches, and noticed that some Clusters tend to have only one or two TGs… Very important, and useful, clues!
[edited info on the Leeds Method 2/1/19]
[22Z] Segment-ology: Match Clusters – Chicken or Egg? TIDBIT (11 Feb 2019)