Match Clusters – Chicken or Egg?

A Segment-ology TIDBIT

Are Match Clusters based on the genealogy or the DNA? Our Matches share both with us.  In other words, do Match Clusters tend to focus around a Common Ancestor (CA) with most of the Matches or on a Triangulated Group (TG) with most of the Matches? Do we have an Ancestor first (who points to TGs) or do we have a Cluster TG first (which points to a CA)?

Some have opined that a Match Cluster is the same as, or similar to, a TG. I think Match Clusters form around Ancestors, and that each of our Ancestors can be associated with only certain TGs. In other words, the Ancestors come from Clusters, and Clusters may have multiple TGs.

Let’s look at what we “see” with clustering. With the Leeds Method the focus is on the 4 grandparents. I can assure you that each of our grandparents is associated with multiple TGs. I currently have 380 TGs covering 98% of my 45 chromosomes. Let’s round that to 400 TGs. This means that each grandparent would have roughly 100 TGs (or alternatively, each of those TGs would come from a distant ancestor, down to me through the one grandparent). If we have a Clustering Matrix with 4 large clusters, each one would almost certainly be from a different grandparent and maybe 100 TGs.
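
To make that arithmetic concrete, here is a minimal Python sketch. It assumes, purely for illustration, that TGs divide evenly among the Ancestors in a generation; real inheritance is uneven, and the function name `avg_tgs_per_ancestor` is hypothetical, not part of any tool.

```python
def avg_tgs_per_ancestor(total_tgs: int, generation: int) -> float:
    """Average number of TGs per Ancestor in a generation.

    Assumption: TGs are spread evenly across all Ancestors of that
    generation (generation 1 = parents, 2 = grandparents, ...).
    """
    ancestors = 2 ** generation  # the number of Ancestors doubles each generation
    return total_tgs / ancestors


# Roughly 400 TGs, as rounded in the text above.
for generation in (2, 3, 4, 8):
    print(generation, 2 ** generation, avg_tgs_per_ancestor(400, generation))
```

With 400 TGs this gives 100 per grandparent (generation 2), and by generation 8 (256 Ancestors) the average falls to between one and two TGs per Ancestor.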

When we look at various clustering results, we usually see 4 clusters, or 8 clusters or 16 clusters, etc. This is a tip-off that each cluster represents an Ancestor at some generation. NB: Clustering is not as precise as Triangulation, and not every Match in a cluster will be from the same Ancestor. Some Matches will share multiple Ancestors with us – sometimes from both sides (Paternal and Maternal) of our ancestry. The Clustering process has a hard time dealing with that – and it is an imperfect system. And not all of the Clusters will be from the same generation – but most will be. However, taken as a whole, the Clustering process does a good job of grouping Matches by Ancestors. Each Cluster will represent an Ancestor, even if not every Match in it has that particular Ancestor.

Another way to look at this is to remember that each TG comes from one specific ancestral line (from a distant ancestor down to you). But turning this around is not true – we cannot say that each Ancestor passes down one TG. Clearly each parent, grandparent, great grandparent, etc. passes down multiple TGs.

With the above in mind, let’s look at Clusters and Match thresholds. The Leeds Method uses second cousins (2C) and above. The 2C are from a Great grandparent couple and represent their child, who is your grandparent. This is why the Leeds Method tends to result in 4 Clusters of Matches – one for each grandparent. The “threshold” here is about 200cM (to accommodate most 2C, which average about 230cM).

If we drop the threshold to about 60cM, we’d pick up mostly 3C (average 74cM) and closer, and we’d wind up with roughly 8 main Clusters in a Matrix – one for each of eight Great grandparents.  At AncestryDNA, they use 20cM as a threshold for 4C, but we’ve experienced many actual 5C-8C in this mix. From the Shared cM Project we have the average for a 6C at 21cM, so I’m pretty sure the 20cM threshold will pick up some 7C and 8C, too.

In any case, when I ran the Genetic Affairs Match Clustering program (using a 20cM threshold download from DNAGedcom Client), I got 3571 Matches and 158 clusters. That’s pretty close to 128 clusters, with one from each 5xG grandparent. This means to me that there were a number of 7C in the mix from 6xG grandparent couples, with a few more 8C Matches which split some of the clusters and brought the total up to 158. This seems to be a reasonable outcome as some of the clusters are only 3 Matches.
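
One rough way to read a cluster count is to find the generation whose number of Ancestors (a power of two) is closest to it. The sketch below does that on a log scale. The function names are hypothetical, and the rounding rule is an assumption for illustration, not a feature of Genetic Affairs or any other clustering tool.

```python
import math


def nearest_generation(cluster_count: int) -> int:
    """Generation whose Ancestor count (2**generation) is closest, on a
    log scale, to the observed number of clusters."""
    if cluster_count < 1:
        raise ValueError("cluster_count must be positive")
    return max(1, round(math.log2(cluster_count)))


def generation_label(generation: int) -> str:
    """Label a generation (1 = parents, 2 = grandparents, ...)."""
    if generation == 1:
        return "parents"
    if generation == 2:
        return "grandparents"
    if generation == 3:
        return "Great grandparents"
    return f"{generation - 2}xG grandparents"


gen = nearest_generation(158)
print(2 ** gen, generation_label(gen))  # 128 5xG grandparents
```

A count of 158 sits between 128 and 256 but is much closer to 128, which is consistent with reading the extra clusters as splits caused by more distant Matches.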

So my conclusion is that Match Clustering results in clusters around your Ancestors. And each cluster may include more than one TG. By selecting a threshold, you can roughly target the generation you want – 200cM for 4 grandparents; 60cM for 8 Great grandparents; 30cM for 16 2xG grandparents; and 20cM will get you roughly 128 5xG grandparents. It gets fuzzier with lower thresholds, because these lower thresholds can be from a range of cousinships.
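
As a sketch of how a threshold selects the pool of Matches to cluster, the Python below encodes the rough thresholds from this post and filters a match list by total shared cM. The match names and cM values are placeholders, not real data, and `select_matches` is a hypothetical helper; actual clustering tools apply their own cutoffs and options.

```python
# Rough thresholds quoted in the post (cM) and the Ancestors they tend to target.
THRESHOLDS_CM = {
    "4 grandparents": 200,
    "8 Great grandparents": 60,
    "16 2xG grandparents": 30,
    "128 5xG grandparents": 20,
}


def select_matches(matches, target):
    """Return the names of Matches sharing at least the threshold cM
    associated with the chosen target generation."""
    threshold = THRESHOLDS_CM[target]
    return [name for name, shared_cm in matches if shared_cm >= threshold]


# Hypothetical match list: (name, total shared cM). Placeholder values only.
example_matches = [
    ("Match A", 240),
    ("Match B", 75),
    ("Match C", 33),
    ("Match D", 21),
]

print(select_matches(example_matches, "8 Great grandparents"))  # ['Match A', 'Match B']
```

Lowering the threshold only ever adds Matches to the pool, which is why the lower settings pull in a wider and fuzzier range of cousinships.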

The next, very important, step is to tag each cluster with the most logical CA (most from one generation). Over 230 of my AncestryDNA Matches are in known TGs (from uploads to GEDmatch or tests at other companies). In these cases some TGs might also be tagged to each cluster. This is exciting, because each Cluster can then point to a CA and selected TGs that the other Matches in the Cluster will likely have. I’ve already used the CA clues to find CAs for other Matches, and noticed that some Clusters tend to have only one or two TGs… Very important, and useful, clues!

[edited info on the Leeds Method 2/1/19]

[22Z] Segment-ology: Match Clusters – Chicken or Egg? TIDBIT (11 Feb 2019)

21 thoughts on “Match Clusters – Chicken or Egg?”

  1. Hi Jim: really appreciate the insight you have been sharing with us all these years. I have recently been grappling with an issue that is a bit off topic from this post that I’m hoping for some insight on. It concerns multiple segment matches in Ancestrydna and 23andme. I’ve noticed that say a 0.18% 2 segment match is considered more closely related than a 0.17% 1 segment match. I don’t understand why it should be the case that 2 segments that are just above threshold ( and so noisy ) should be considered more significant than a single segment that is much over the threshold. Looking forward to your thoughts on this.

    • river, I’m not sure, but maybe the fact that you actually share two different segments is more significant than just one. Most of our Matches, by far, only share one DNA segment with us, and are relatively distant cousins. The closer Matches tend to share more segments and more total cMs.
      On the other hand – don’t worry about it, be thankful for the opportunity to work with two TGs. It’s much more important to communicate with every Match you can and try to figure out how you are really related. The main thing is to focus on CAs and TGs. Jim

  3. Jonathan,

    My view comes from thinking of the data as being on a curve, not in chunks with discontinuous jumps from one “level” to another. At each cousinship, we get some kind of a bell curve for the cMs in shared segments. As we look at IBD vs non-IBD segments – there is a curve of cM vs % IBD (below about 15cM). There is a curve comparing shared segment thresholds (cM) vs the number of TGs that naturally occur in a genome.
    With Shared Match Clusters, as the cM threshold decreases the number of Clusters increases. However, I think this is more of a “bumpy” curve (maybe like smoothed out stair steps), with a tendency toward 4, 8, 16, etc. clusters.
    In any case, I have 380 TGs (roughly based on a 7-10cM threshold for shared segments) – spread over 45 chromosomes, that’s about 20 cM per TG. But – there is a curve here also – my TGs range from 7 to 74cM (TGs cannot be formed by just dividing our DNA into so many equal blocks – the randomness of recombination over a number of generations dictates where the crossover points are and where the TGs will form).
    As the number of clusters increases (based on lowering the cM of shared segments in the pool), the number of TGs per Cluster will decrease.
    If we accept the one TG per Cluster concept, then we have to assume that with, say, 200 Clusters, 180 of my TGs are not accounted for. We would have to conclude that the Clusters don’t cover a large part of my DNA.
    To my mind, we need to be able to make a statement that applies across the Clustering curve – it should apply to a few large Clusters, all the way down to 380 Clusters for 380 TGs (in my case). I think the only thing that applies consistently is that Clusters form around (mainly) Ancestors, and that for the smallest Clusters, we have distant Ancestors associated with only one TG.

  4. Sorry, I disagree with large parts of this.

    First of all, there is a maximum theoretical number of clusters that anyone can have. I don’t know what that number is. If we say that each shared segment has to be at least 6 cM in size, then the maximum _possible_ number of clusters would be around 1100, but you can only get that number if the segments were all lined up next to each other with no gaps between them. If we suppose there’s an average of 3 cM between each shared segment, then the maximum would be about 700 segments. If the average segment size is closer to 20 cM, then we’re talking about more like 200-ish segments, which is in the ballpark of the 158 number that you report. I suspect those numbers could be calculated in some statistical sense, although I don’t have the details to do that calculating myself.

    In fact, everyone I’ve looked at has roughly the same number of clusters. Say between 75-200 total clusters whether they have 500 matches or 5,000. More matches lead to larger clusters, not especially to more clusters. This also supports the idea of a theoretical maximum number of clusters.

    Secondly, it’s easy to DISprove the idea that each cluster represents one of 128 5xG grandparents. One approach is to manually assign 50 or more clusters to earlier generations; then you don’t have 128 clusters left for the 5xG grandparents who were not assigned. I can’t do that myself with my own ancestors; my tree doesn’t go back that far. But there are plenty of people who have deep trees with many relatives known to have MRCAs at least that far back.

    It IS meaningful to say that each segment originated with some specific ancestor. At some distant generation, some ancestor inherited two adjacent segments, one from each parent, then transmitted a smaller segment that contained the overlap point. The transmitted segment would be associated with the earliest ancestor who inherited both halves; it truly would not be possible to trace the segment back further than that. It feels to me that this could also be estimated in a statistical sense by someone who knows more than I do. Without that statistical analysis, I cannot conclude that the starting point of each segment is in the 10-12 generation maximum range of most people’s tree research.

    I disagree that it provides any useful insight to assign clusters to specific ancestors. Clusters are associated with ancestral lines. The earliest _identified_ ancestor within that line will necessarily have the greatest number of living descendants. As a result, that ancestor will appear to be the “source” of that cluster, until you identify an earlier generation and adjust your assumptions. Until you identify every member of the cluster, you have no idea if the earliest identified ancestor is indeed the earliest MRCA for all members of the cluster. Indeed, the more unidentified members in the cluster, the more likely that you’re missing an earlier ancestor who ties them all together.

    I also disagree that clusters may have multiple TGs. Or rather, everything I’ve seen supports the idea that each cluster is defined by a single TG. Of course, clusters could _have_ multiple TGs, otherwise matches with more than one shared segment could never be a member of any cluster. But there is one TG that “defines” each cluster, and any other TGs are along for the ride. (I do make a distinction for very small clusters, which are not well-defined in the first place. Assume that I’m talking about clusters of 10+ members here.)

    I also agree that very strong matches are special. The Leeds Method works because very strong matches have enough segments shared with the test taker that they can effectively span clusters, creating “clusters of clusters” if you like. Very, very strong matches, say over 600 cM, are in a category of their own and must be excluded from the Leeds Method. Weaker matches, say below 60 cM or so, don’t work with the Leeds Method. All of that is true and all of that is useful, as long as you limit your research to the grandparent or great-grandparent level. The success of the Leeds Method in analyzing strong matches doesn’t carry over to weaker matches, and neither do the conclusions drawn from the strong matches carry over to the weaker ones.

    When you cluster your matches above 60 cM, you _should_ have a maximum of 8-ish clusters. You have a very good idea that a 60 cM match is a third cousin with maybe a few fourth cousins mixed in. You can’t draw the same conclusions at the 20 cM level. An arbitrary 20 cM match could be anywhere from a 3C to a 12C and the clusters at that level can vary just as widely.

    Clusters are associated with shared segments, and therefore with the ancestral lines of descent for each of those segments. The data does not support any other conclusion.

    • Jonathan – I thought about not approving your comments – long, lots of disagreement, etc.; but I’ve never before blocked adverse comments. We’ll see where this goes.
      I’m sorry you disagree, but I’ve done a lot of work and analysis with Clusters and I stand by my post.
      One thing I tried to convey is the various levels available for clustering. If you use only close cousins with large segments you tend to get fewer clusters – usually 4 – which represent the 4 grandparents. As you rerun the Cluster Matrix with a smaller threshold, you’ll bring in more cousins and get a larger matrix – by fine tuning it you’ll get about 8 clusters – one for each Great grandparent, roughly… The example I used dropped the threshold down to 20cM, which picked up what AncestryDNA uses for a 4C designation (even though there are a number of more distant cousins in that group). I have done a lot of work with my 4C Matches – trying to enter Notes in every one of them (that group is growing faster than I can keep up). In any case I have a lot of Matches in that group with a known CA – many with multiple CAs, not on the same ancestral line. And I have a number of Matches in that group for whom I have segment data. Any Match I have with segment data gets a TG ID (I have over 98% of my 45 chromosomes “covered” – literally side-by-side – with TGs). And closer Cousins will have multiple shared segments and TGs. So a Matrix with 4 or 8 Clusters, representing grandparents or great grandparents, will have close cousins in each Cluster, with multiple shared segments/TGs.
      Note: I use MEDBetterDNA and “see” these TG IDs and CA IDs in my Shared Match lists as well as in a Cluster Matrix.
      In my case there are basically no more TGs possible, unless and until I start breaking existing TGs apart (for that I need to lower the threshold even more – and I want to “walk the ancestors back” on my 380 TGs and get them solid, before I try to fine tune it any more). So if a Matrix includes a group of Matches that roughly covers my DNA, then the 4 or 8 Clusters would tend to include all of my 380 TGs, somehow.
      I think I’m seeing multiple TGs (2 or 3 or 4 depending on how close/distant) that go back “through” the Cluster CA.
      One point to make is that once you pick a threshold, you have a set of Matches. Hopefully they will roughly cover all your DNA – and if so they will form clusters that roughly correspond to a generation (4, 8, 16, 32, etc. Clusters, representing your Ancestors at that generation). Of course some of your closer cousins may span more than one Cluster – your Matches will be on some kind of continuum (from close to distant – out to the threshold). But with some tweaking of the threshold, I think you’ll get a nice matrix picture.
      I’d invite others to comment on this – are you tending to see Clusters for most of the Ancestors in a generation, or something else? We now have enough folks using Clusters and analyzing the data, that we should be able to form a consensus. Not just my observations/opinions (or yours either Jonathan).
      I think it is important to try to understand what Clusters are telling us – so that we can interpret the data better.

      • To Jim and Christopher (and anyone else): Good. Now provide data.

        I’ve provided data that contradicts your conclusion. One of the key examples comes from your own clusters, Jim, which you allowed me to share a few months ago (https://www.facebook.com/groups/geneticgenealogytipsandtechniques/permalink/545834895880215/). In that example you had three large clusters where each cluster was very convincingly associated with one particular TG. Two out of the three clusters were associated with the same ancestral pair. Each of the clusters contained matches related to you at multiple genetic distances (5C, 6C, 7C, 8C, etc). That’s pretty strong support for the idea that cluster = TG.

        I do agree that some of the people in each cluster also had other TGs. That doesn’t change my position. There’s still one TG that “defines” each cluster. Any person who shares more than one TG will naturally appear in some cluster associated with one of those TGs, and the other TGs for that person will come along for the ride.

        Brian Schuck also posted a beautiful analysis recently, showing how changes in the length of a shared segment produce finer-level detail within a cluster (https://www.facebook.com/groups/1210379019118379/permalink/1217693721720242/) (reposted: https://www.facebook.com/groups/geneticgenealogytipsandtechniques/permalink/577384866058551/). That’s a very direct relationship between the appearance or “shape” of the cluster and the underlying DNA data.

        I have other examples with data that I don’t have permission to share, so not helpful here.

        If you have counterexamples, that’s great. Please share them so that we all can learn from them. To support your position, I’d be looking to see a cluster that:

        * Has a decent number of members, say 10 or more
        * Is generated from data down to 20 cM (not artificially truncated at 40 or 60 or higher)
        * Demonstrably does NOT have a single TG (or DNA segment) that ties the cluster together

        FWIW, I’ve been working with cluster data for about 9 months now, and I spent a LONG time trying to prove exactly what you are saying, that cluster = ancestor. I wanted that to be true. It would be a much more useful result for people researching their own ancestry. The data doesn’t support it. If you are finding otherwise, then please provide the data.

      • Jonathan,
        How do you explain a Cluster Matrix using close cousins that results in 4 Clusters – one for each grandparent? And the close cousins each have multiple TGs. Same with a Matrix of 8 Clusters – only 8 TGs in all that data? If a cluster = TG, then it should work for any threshold.
        I agree as the threshold is lowered the Clusters appear to approach one Ancestor and one TG, but that is only at the outer limit. Most folks are using higher thresholds, and I think they will see what I see: closer Ancestors with multiple TGs.

      • Jim,

        There are three very different divisions of match data. The closest matches, say 600+ cM (not a hard dividing line) match “everyone” and are basically not useful for resolving clusters. The instructions for the Leeds Method explicitly exclude matches at this level. (I think the Leeds Method says to exclude over 400 cM. Let’s please not argue about the numeric cutoff — I’m already agreeing that the number isn’t exact.)

        The next level matches have a lot of segments shared with the test taker and each other. The matches at this level do resolve into clusters, and if you restrict your clusters to this level then the multi-way overlap between various segments will pull the matches together into groups that represent grandparents or great-grandparents or so. This is the data that works with the Leeds Method, for exactly the same reason. This level spans 600 cM to 80 cM (or so; still not a hard dividing line).

        The third level has people that mainly share only 1-2 segments with the test taker. At this level, there isn’t enough commonality in the data to pull people together at the ancestor level. You’re talking about individual segment / TG inheritance.

        Three fundamentally different patterns of data distribution, three fundamentally different conclusions to be drawn from different levels of the data.

  5. I absolutely agree with you that for large cutoffs (like 200cM) what we see are ancestors. But my instinct is that at 20cM we’re much more likely to be seeing TGs of *distant* cousins or endogamous populations. This seems empirically checkable (but I don’t have enough of my DNA in TGs to really tell…). Out of curiosity (and I realize this might be a lot of work, but I think the answer would be very interesting), of your 158 clusters at 20cM, how many contain more than 1 TG? How many correspond exactly to a TG?

    • I put my TG ID and CA ID into the Notes box at AncestryDNA when I know them. The clustering program includes these Notes in the matrix. So I did read down the 3,000+ Matches to come to my conclusion. I am in the midst of trying to quantify this. NB, the Matrix also shows other Clusters each Match has some link to – Many Matches had 5-10 other clusters listed, so I think there is some amount of judgment as to how Matches are grouped into clusters. I’ll also state that some clusters have multiple CAs, too. I prefer the exactness of Triangulation, but that’s not an option at AncestryDNA. So I use Clustering there.

      • Taryn – I should add that as you get to fairly distant CAs of a TG, they may correspond to each other one to one. At the generation where we have 256 or 512 ancestors, and about 400 TGs (using a 7cM threshold), then there may well be one TG per CA for many of these. But no matter what, each TG goes back on a line of your ancestors – from a parent to a grandparent to a great grandparent, etc. on – out to some Ancestor who first created the full segment represented by the TG (beyond that, the TG will split into two smaller TGs, which keep going back on that Ancestor’s mother and father lines). If that Ancestor is distant enough, he/she may only have this one TG; but (particularly with closer Ancestors) there is nothing to preclude that Ancestor from having passed a different DNA segment down along the same path, but on a different chromosome. Your Great grandparents probably passed down several DNA segments to you.

  6. Guess I am not grasping this. If I have a TG on chromo 3, overlapping on a segment with numerous other matches with our CA named George Smith, is this to say this is the “designated” George Smith Segment, and all the other people who are descended from him will also overlap at this specific place?

    Thank you.

    • Caith –

      Yes! If you are referring to a TG, and George Smith is a distant ancestor. In a TG, Matches will have a CA which is on George Smith’s ancestral line.

      No! If you are referring to a Cluster, and you are sure this Cluster is from Ancestor George Smith. Then most of the other Matches in this cluster will also descend from an Ancestor on George Smith’s ancestral line. Depending on which generation George Smith is in, there could be several TGs in this Cluster. If George Smith is your grandfather, there may be many TGs; if George Smith is a 7xG grandparent, then there will probably only be 1 or 2 TGs. In other words: closer CAs have more TGs than distant CAs in a cluster. A Cluster and a TG are not the same.

      Jim Bartlett – atDNA blog: http://www.segmentology.org

    • Caith,

      As I re-read your question – No! All Matches with a George Smith CA may not have the same TG. George Smith may have passed down multiple segments to you – each one in a separate TG. If George is a 2xG grandparent, he probably did pass down several segments (TGs) to you.

      Jim Bartlett – atDNA blog: http://www.segmentology.org

      • Hypothetically, I have 10 matches in a TG and we all have the same tree match, George Smith. At Gedmatch, 6 of us match in a one-to-one; but the other 4 do not match in a one-to-one. Are you saying the 4 people who do not match in a one-to-one with us could actually be descended from George Smith, but they just got a different segment? If so, would it be on the same chromo, just different segments?

        Thank you, Jim

      • Caith
        I’m confused. You say 10 Matches, but only 6 match each other…. then you really only have 6 in the TG. The other 4 must be in a different TG. (I presume you have segment data for all 10?) Use the one-to-one to form a TG or TGs for those 4. Are you forming ICW Clusters? By hand or through a program? Maybe I haven’t talked enough about Match Clustering… it’s different than TGs.
