Match Cluster Report 1

Here is a report on my first Match Clustering effort.

Background info:

  1. I used a download of my AncestryDNA Matches above 20cM (I only had a few real 3rd cousins (3C) and below, and I just left them in).
  2. I have made extensive use of the Notes for as many Matches as I can – all of my almost 1,000 Hints; and maybe 1/4, so far, of all my 4C and closer. NB: AncestryDNA uses 20cM as the threshold for 4C designations, but many Matches in this group are 5C and 6C and I’ve found some who are 7C and 8C, with larger than average shared segments over 20cM.
  3. For every Match I can, I put the Shorthand CA ID and/or Shorthand TG ID in the Note box for that Match. See the Explanation of Header row below for links that explain these IDs.
  4. For each Match, I also put a line in each Note which includes a summary of the CA and TG IDs found in all the Match’s Shared Matches (SM). So even a Match with a Private Tree, or No Tree, or scrawny Tree, or can’t-find-anything-in-it large Tree, will get a line summarizing their SMs. This summary often provides a very specific “pointer” to a CA and/or TG. And this added info is very helpful in analyzing Clusters.
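As a rough illustration of the SM summary line in item 4, here is a minimal Python sketch. It assumes the Notes of the Shared Matches already contain Shorthand IDs typed as plain text (such as 36P/4C1R or 01S24). The regular expressions, the function name `summarize_shared_matches`, and the "SM:" output format are hypothetical, chosen only for illustration; nothing here is an AncestryDNA feature.

```python
import re
from collections import Counter

# Hypothetical patterns for the Shorthand IDs described in this blog.
# CA ID: Ahnentafel number plus P or M, e.g. 36P (optionally followed by /4C1R).
CA_RE = re.compile(r"\b(\d+[PM])\b")
# TG ID: 2-digit chromosome, start letter, then side or grandparent, e.g. 01S24.
TG_RE = re.compile(r"\b(\d{2}[A-Z](?:24|25|36|37|2|3))\b")


def summarize_shared_matches(notes):
    """Count the CA and TG IDs found in the Notes of a Match's Shared Matches."""
    cas, tgs = Counter(), Counter()
    for note in notes:
        cas.update(CA_RE.findall(note))
        tgs.update(TG_RE.findall(note))

    def fmt(counter):
        return ", ".join(f"{k} x{v}" for k, v in counter.most_common()) or "none"

    return f"SM: CA {fmt(cas)}; TG {fmt(tgs)}"


# Made-up Notes for four Shared Matches of one Private-Tree Match.
notes = ["36P/4C1R 01S24 38.7cM", "18P/3C", "36P/4C 01S24", ""]
print(summarize_shared_matches(notes))
# SM: CA 36P x2, 18P x1; TG 01S24 x2
```

The resulting line can be pasted into the Note of a Match with no usable Tree, where the most frequent CA and TG act as the "pointer" described above.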

 

When I ran the Cluster Matrix, I developed this summary report:

22ACa Summary of Match Cluster 1

Next is a spreadsheet with the 86 Clusters, re-sorted on the CA.

Explanation of Header row:

Cluster – the Cluster # in the Cluster Spreadsheet presented to me.

First & Last – the Match # range included in this Cluster (Matches go from 1 to 3571)

SMs – the number of Shared Matches in each Cluster – a wide range…

CA – the CA ID (an Ahnentafel # – see this blogpost). When various Matches had CAs from different generations, but all on the same line, I used the most distant CA – Walking the Ancestor Back. A few Clusters had multiple CA lines, but I used CAs that Walked Back or were repeated several times.

Gen – as a convenience, I noted the generations back to the CA

TGs – the TG ID (see this blogpost). In all cases (I think) the last two numbers in each TG ID (being the TG grandparent) are in agreement with the CA ID. A number of Clusters have multiple TGs.

NB: The CAs and TGs come from my typed Notes for some Matches (I just haven’t gotten to all 3,571 of them, yet). The Notes are based on valid data – from the Match or GEDmatch (i.e. not guesses by me), but I’m fully aware that some of it is not conclusive; and another, closer and/or different, CA may be found. The TGs should not change, but often a Match will have multiple TGs, and only one would apply to the specific Cluster or CA.
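The way the CA and Gen columns are filled in can be sketched in a few lines of Python. This is only an approximation of the manual process described above: it assumes the CAs noted for a Cluster are available as plain Ahnentafel numbers, and the function names and tie-breaking rule are hypothetical.

```python
def walk_down(ahnentafel):
    """Ahnentafel numbers from an ancestor down to the test taker (1)."""
    line = [ahnentafel]
    while line[-1] > 1:
        line.append(line[-1] // 2)
    return line


def generations_back(ahnentafel):
    """Generations from the test taker (1) back to this ancestor."""
    return ahnentafel.bit_length() - 1


def pick_cluster_ca(ca_numbers):
    """Pick the CA whose line down to 1 passes through the most CAs in the Cluster.

    Ties go to the more distant ancestor (Walking the Ancestor Back).
    """
    def support(candidate):
        line = set(walk_down(candidate))
        return sum(1 for ca in ca_numbers if ca in line)

    return max(set(ca_numbers), key=lambda ca: (support(ca), generations_back(ca)))


# Made-up Cluster: four CAs on one line (13, 53, 214, 856) and one stray (36).
cluster_cas = [13, 53, 214, 856, 36]
best = pick_cluster_ca(cluster_cas)
print(best, generations_back(best))  # 856 9
```

A stray CA such as 36 in this example is outvoted by the line that Walks Back, which mirrors the rule of preferring CAs that are repeated or lie on the same line.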

Figure 1. Summary of 86 Clusters


A few notes on this data:

  1. I am sure that, eventually, the Clusters at the top of this table will be found to link to more distant Ancestors – I just haven’t found them yet.
  2. I am sure that, eventually, the two Clusters in Gens 10 and 11 will wind up with different, closer CAs – I just haven’t found them yet (there are relatively few Matches in each of these Clusters).
  3. For the bottom 9 Clusters, I do have TGs, so I can use Matches from other companies (already included in these TGs in my Master Spreadsheet), to find likely (or at least possible) CAs. It’s just that no CAs have been determined yet at AncestryDNA for the Matches in these Clusters.
  4. In Gen 9, CA 856 is my prolific and well documented HIGGINBOTHAM Ancestor; and I’ve Walked this Ancestor Back in at least two TGs. There are several lines from this Ancestor who intermarried.
  5. In Gen 8, Cluster 61, over 100 Matches – this was a brick wall at Gen 5, until I found several dozen Matches in Gen 6-8 with CUMMINS/CUMMINGS Ancestry, which I have subsequently researched into one Tree – also a prolific line. And a new branch of my Tree!
  6. I’m sure there will be unfolding stories about other of these Clusters – I’m excited to see the way this is trending.

 

[22AC] Segment-ology: Match Cluster Report 1 – by Jim Bartlett 20190214

Walking the Match Clusters Back

A Segment-ology TIDBIT

It appears to me that the next step for Clusters is “Walking the Clusters Back.”

By this I mean, start with the original Leeds Method, 2nd cousins (2C) and 3C, which tends to result in 4 Clusters – one for each grandparent. Often, particularly with known 2C and 3C, you will be able to determine the grandparent for each Cluster.

Then adjust the shared segment cM threshold to focus on 3C and 4C and try to get 8 Clusters. This may take some fine tuning in the threshold, but if you get plus or minus one or two Clusters, that’s OK – just work around it. Now, if you can see which Matches from the 4 Cluster Matrix repeat in this nominal 8 Cluster Matrix, you know which two Clusters belong to each of the 4 grandparents. Then, if you can figure out the great grandparent in one of the two Clusters for each grandparent, the other Cluster should be for the other great grandparent.

Once you do what you can with the 8 great grandparent Clusters, adjust the cM thresholds, and rerun a Cluster Matrix to shoot for 16 Clusters and repeat the process.

This would be Walking the Clusters Back. And, in the long run, it might be more efficient and accurate than trying to start with a small cM threshold and getting a large number of Clusters – 128 to 512 Clusters. As the number of Clusters grows, more and more Matches will be conflicting; and more distant Matches may well share more than one Common Ancestor with you. It just gets more complicated to sort out at the larger Matrix levels. Walking the Clusters Back will make this process easier.
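The linking step between two Cluster runs can be sketched as follows. This is a minimal Python illustration, not a feature of any clustering tool: it assumes each run has been reduced to a dict of Cluster label to set of Match names, that each Match sits in only one Cluster per run, and the function name is hypothetical.

```python
from collections import Counter


def link_to_coarser_clusters(coarse, fine):
    """Map each finer Cluster to the coarser Cluster holding most of its repeat Matches.

    coarse and fine are dicts of Cluster label -> set of Match names.
    A finer Cluster with no repeat Matches maps to None.
    """
    match_to_coarse = {m: label for label, members in coarse.items() for m in members}
    links = {}
    for label, members in fine.items():
        votes = Counter(match_to_coarse[m] for m in members if m in match_to_coarse)
        links[label] = votes.most_common(1)[0][0] if votes else None
    return links


# Made-up example: two grandparent Clusters and four finer Clusters.
four = {"G1": {"Ann", "Bob"}, "G2": {"Cal", "Dee"}}
eight = {
    "C1": {"Ann", "Eve"},
    "C2": {"Bob", "Fay"},
    "C3": {"Cal", "Dee", "Gus"},
    "C4": {"Hal"},
}
print(link_to_coarser_clusters(four, eight))
# {'C1': 'G1', 'C2': 'G1', 'C3': 'G2', 'C4': None}
```

Finer Clusters that come back as None have no repeat Matches and would need to be placed by other evidence.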

And the absolutely great news – a huge benefit of Clusters – is that Shared Matches will cluster when they are Private, or have little or no Tree, or even when they have a robust Tree, but you cannot find any Common Ancestor. In other words, no genealogy, nor TGs for that matter, are required to place a Match in a Cluster. Also, AncestryDNA Matches who share less than 20cM can be manually added to a Cluster, based on their Shared Matches. This is bringing “into the fold” Matches which normally would not be grouped. And putting these Matches into Clusters at any level really helps when it comes to building parts of their Tree out to meet yours.
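Manually placing a Match under 20cM comes down to asking which Cluster most of their Shared Matches already belong to. Here is a hedged Python sketch of that idea; the function name and the `min_votes` cutoff are assumptions for illustration, not a recommended standard.

```python
from collections import Counter


def suggest_cluster(shared_matches, cluster_of, min_votes=2):
    """Suggest a Cluster for a Match from the Clusters of their Shared Matches.

    cluster_of maps already-clustered Match names to a Cluster number.
    Returns None when fewer than min_votes Shared Matches agree.
    """
    votes = Counter(cluster_of[m] for m in shared_matches if m in cluster_of)
    if not votes:
        return None
    cluster, count = votes.most_common(1)[0]
    return cluster if count >= min_votes else None


# Made-up data: three clustered Matches and two small Matches to place.
cluster_of = {"Ann": 12, "Bob": 12, "Cal": 40}
print(suggest_cluster(["Ann", "Bob", "Zed"], cluster_of))  # 12
print(suggest_cluster(["Cal", "Zed"], cluster_of))  # None
```

A suggestion like this is only a starting point; a Match with Shared Matches spread over several Clusters may share more than one Ancestor with you.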

Match Clusters really fine tune our data. Happy dance… [HT: Dana Leeds]

 

[22AB] Segment-ology: Walking the Clusters Back TIDBIT by Jim Bartlett 20190214

Confessions of a Match Clusterer

A Segment-ology TIDBIT

I’ve been explaining and discussing and arguing about Match Clusters recently. One debate concerns whether a Cluster is formed around an Ancestor (CA) or a Triangulated Group (TG). I argue that Clustering tends to result in 4 or 8 or 16 or 32 Clusters (or some other number of Ancestors in a given generation), depending on the shared segment cM threshold used. It might seem like I know the Ancestors and/or the TGs for each of my Clusters.

Confession time – I do not!

I’m working hard to determine as many as I can, but the current status is still spotty. I’m having a fairly good experience with TGs (98% of my DNA is covered by TGs); and know some CAs (over 80% of my TGs are known to the grandparent level). But I still have a long way to go on Chromosome Mapping. “Walking the Ancestors Back” on each TG is the name of that game.

I’m fairly new to Match Clustering, and as I look over that data (from my recent Cluster Matrix of AncestryDNA Matches over 20cM), I see lots of bare spots. I do see some trends, but in no way have I determined distant Ancestors (the CAs) for each of my Clusters. Nor have I determined the TGs for each of my Clusters – some Clusters have multiple TGs, and many have no TGs (after all, this Matrix is based on AncestryDNA data). It will take a while to analyze and weigh the information I’m collecting.

I’m working on a better analysis and a report of the one Cluster Matrix I’ve tried so far – stay tuned!

 

[22AA] Segment-ology: Confessions of a Match Clusterer TIDBIT by Jim Bartlett 20190214

Match Clusters – Chicken or Egg?

A Segment-ology TIDBIT

Are Match Clusters based on the genealogy or the DNA? Our Matches share both with us.  In other words, do Match Clusters tend to focus around a Common Ancestor (CA) with most of the Matches or on a Triangulated Group (TG) with most of the Matches? Do we have an Ancestor first (who points to TGs) or do we have a Cluster TG first (which points to a CA)?

Some have opined that a Match Cluster is the same as, or similar to, a TG. I think Match Clusters form around Ancestors, and that each of our Ancestors can be associated with only certain TGs. In other words, the Ancestors come from Clusters, and Clusters may have multiple TGs.

Let’s look at what we “see” with clustering. With the Leeds Method the focus is on the 4 grandparents. I can assure you that each of our grandparents is associated with multiple TGs. I currently have 380 TGs covering 98% of my 45 chromosomes. Let’s round that to 400 TGs. This means that each grandparent would have roughly 100 TGs (or alternatively, each of those TGs would come from a distant ancestor, down to me through the one grandparent). If we have a Clustering Matrix with 4 large clusters, each one would almost certainly be from a different grandparent and maybe 100 TGs.

When we look at various clustering results, we usually see 4 clusters, or 8 clusters or 16 clusters, etc. This is a tip off that each cluster represents an Ancestor at some generation. NB: Clustering is not as precise as Triangulation, and not every Match in a cluster will be from the same Ancestor. Some Matches will share multiple Ancestors with us – sometimes from both sides (Paternal and Maternal) of our ancestry. The Clustering process has a hard time dealing with that – and it is an imperfect system. And all of the Clusters may not be from the same generation – but most will be. However, taken as a whole, the Clustering process does a good job of grouping Matches by Ancestors. Each Cluster will represent an Ancestor, even if every Match in it does not have that particular Ancestor.

Another way to look at this is to remember that each TG comes from one specific ancestral line (from a distant ancestor down to you). But turning this around is not true – we cannot say that each Ancestor passes down one TG. Clearly each parent, grandparent, great grandparent, etc. passes down multiple TGs.

With the above in mind, let’s look at Clusters and Match thresholds. The Leeds Method uses second cousins (2C) and above. The 2C are from a Great grandparent couple and represent their child, who is your grandparent. This is why the Leeds Method tends to result in 4 Clusters of Matches – one for each grandparent. The “threshold” here is about 200cM (to accommodate most 2C, which average about 230cM).

If we drop the threshold to about 60cM, we’d pick up mostly 3C (average 74cM) and closer, and we’d wind up with roughly 8 main Clusters in a Matrix – one for each of eight Great grandparents.  At AncestryDNA, they use 20cM as a threshold for 4C, but we’ve experienced many actual 5C-8C in this mix. From the Shared cM Project we have the average for a 6C at 21cM, so I’m pretty sure the 20cM threshold will pick up some 7C and 8C, too.

In any case, when I ran the Genetic Affairs Match Clustering program (using a 20cM threshold download from DNAGedcom Client), I got 3571 Matches and 158 clusters. That’s pretty close to 128 clusters, with one from each 5xG grandparent. This means to me that there were a number of 7C in the mix from 6xG grandparent couples, with a few more 8C Matches which split some of the clusters and brought the total up to 158. This seems to be a reasonable outcome as some of the clusters are only 3 Matches.

So my conclusion is that Match Clustering results in clusters around your Ancestors. And each cluster may include more than one TG. By selecting a threshold, you can roughly target the generation you want – 200cM for 4 grandparents; 60cM for 8 Great grandparents; 30cM for 16 2xG grandparents; and 20cM will get you roughly 128 5xG grandparents. It gets fuzzier with lower thresholds, because these lower thresholds can be from a range of cousinships.
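The arithmetic behind these targets is just powers of two: the number of Ancestors doubles with each generation back. The short Python sketch below restates the thresholds from the paragraph above and checks the 158-cluster result against the nearest generation; the function and variable names are hypothetical.

```python
import math

# Thresholds and target generations as given in the text above.
# Generations back: 2 = grandparents, 3 = Great grandparents, 4 = 2xG, 7 = 5xG.
THRESHOLD_CM_TO_GENERATIONS = {200: 2, 60: 3, 30: 4, 20: 7}


def expected_clusters(generations_back):
    """Number of Ancestors (and so nominal Clusters) at a given generation."""
    return 2 ** generations_back


def nearest_generation(cluster_count):
    """Generation whose Ancestor count is closest, on a log scale, to a Cluster count."""
    return round(math.log2(cluster_count))


for cm, gens in THRESHOLD_CM_TO_GENERATIONS.items():
    print(f"{cm}cM -> about {expected_clusters(gens)} Clusters")

gens = nearest_generation(158)
print(gens, expected_clusters(gens))  # 7 128
```

The mapping from cM threshold to generation is a rough guide only; as noted above, lower thresholds pull in a range of cousinships.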

The next, very important, step is to tag each cluster with the most logical CA (most from one generation). Over 230 of my AncestryDNA Matches are in known TGs (from uploads to GEDmatch or tests at other companies). In these cases some TGs might also be tagged to each cluster. This is exciting, because each Cluster can then point to a CA and selected TGs that the other Matches in the Cluster will likely have. I’ve already used the CA clues to find CAs for other Matches, and noticed that some Clusters tend to have only one or two TGs… Very important, and useful, clues!

[edited info on the Leeds Method 2/1/19]

[22Z] Segment-ology: Match Clusters – Chicken or Egg? TIDBIT (11 Feb 2019)

The Fundamental Building Blocks of Genetic Genealogy

A Segment-ology TIDBIT

In genetic genealogy, there are two fundamental building blocks: Ancestors and DNA Segments.

As genetic genealogists, virtually everything we do revolves around these two key elements. The Ancestors are really Common Ancestors (CAs) with a Match; and the DNA Segments can be grouped into Triangulated Groups (TGs). See How To Triangulate here. Each of your TGs is really a DNA segment (on one of your Chromosomes) that came from an Ancestor.

In Segment-ology, the Two Fundamental Building Blocks are:

  1. Common Ancestors (CAs) – see my Shorthand ID for a CA here.
  2. Triangulated Groups (TGs) – see my Shorthand ID for a TG here.

These two fundamental building blocks, and their shorthand IDs, are valuable tools. Here are some examples:

Reasonable. Suppose I have a cousin/Match: 36P/4C on 01S24, with a 38.7cM shared segment. This looks reasonable. CA 36P has an ancestral line down to me as: 36-18-9-4-2-1, so it agrees with the 2-4 in 01S24 and both are on the P-side. And 38.7cM is in the range for a 4C.

Unreasonable. Suppose I have a Match on TG 01S24, with a 38.7cM shared segment, and then find a Common Ancestor 856M – I quickly know there are issues. 856M is a Maternal Ancestor and 01S24 is a Paternal TG. Also if the Match shares 38.7cM, the CA is not likely to be as far out as CA 856 [8th cousin range].

Impossible. Similarly, suppose I have determined a 256P CA with a Match, [paternal side]. The Match subsequently uploads to GEDmatch, and I find that our shared DNA segment is in TG 08B36 [maternal side]. We may still be a genealogy cousin on our CA 256P, but we have another CA on my maternal side who is linked to 08B36. Side note: this actually happened to me when I started with autosomal DNA. I worked hard to find CAs with 100 Matches before I really understood how to use the DNA. Later I determined that 25 of these CAs were impossible for the DNA segments which the Matches and I shared. 25% of the CAs were not linked to the DNA. Every time I find a CA without segment information, I think about this 25% error rate…

Very helpful. These two fundamental building blocks, and their shorthand IDs, are very valuable in analyzing various CAs we may find in a TG; or in reviewing a list of InCommonWith or Shared Matches, or Match Clusters. It takes some work to type them into the Notes boxes (at AncestryDNA, MyHeritage, FTDNA), but they sure are handy and helpful with analysis of groups. I’ll blog more about how to use these building blocks.
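The side and grandparent checks in the Reasonable, Unreasonable and Impossible examples can be automated once the IDs are typed consistently. The Python sketch below assumes the TG ID format used in this blog (two-digit chromosome, a letter, then 2, 3, 24, 25, 36 or 37); the function names are hypothetical. It tests only whether the CA's ancestral line passes through the parent or grandparent the TG is mapped to, so the CA must be at or beyond that generation, and it does not judge whether the cM fits the cousinship.

```python
import re

# Assumed TG ID format: chromosome, start letter, then side or grandparent.
TG_RE = re.compile(r"\d{2}[A-Z](24|25|36|37|2|3)")


def line_down(ahnentafel):
    """Ahnentafel numbers from an ancestor down to the test taker (1)."""
    line = [ahnentafel]
    while line[-1] > 1:
        line.append(line[-1] // 2)
    return line


def tg_anchor(tg_id):
    """Return the parent (2 or 3) or grandparent (4 to 7) a TG ID is mapped to."""
    m = TG_RE.fullmatch(tg_id)
    if m is None:
        raise ValueError(f"not a TG ID in the assumed format: {tg_id!r}")
    return int(m.group(1)[-1])


def ca_fits_tg(ca, tg_id):
    """True if the CA's line down to 1 passes through the TG's parent or grandparent."""
    return tg_anchor(tg_id) in line_down(ca)


print(ca_fits_tg(36, "01S24"))   # True  (36-18-9-4-2-1 passes through 4)
print(ca_fits_tg(856, "01S24"))  # False (856 is on the Maternal side)
print(ca_fits_tg(256, "08B36"))  # False (256 is on the Paternal side)
```

A False result does not rule out the genealogy; it only says this CA cannot be the source of the shared segment in that TG.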

IMPORTANT BOTTOM LINES

  1. Finding CAs is genealogy work! We have to do this work – by reviewing a Match’s posted information or by contacting them (and sometimes by building their Trees).
  2. Forming TGs is a mechanical process – also work! I recommend trying to get as many shared DNA segments as you can into the appropriate TGs. Grouping your segments into TGs will save you time in the long run. See The Benefits of Triangulation here.
  3. Your TGs and CAs have certain specific links. Each TG will be linked to a specific ancestral line – often including several CAs at different generations with different cousin/Matches (aka Walking the Ancestor Back). Each CA will be linked to only certain TGs. Distant CAs may have only one TG; Intermediate CAs may have a few TGs and Close CAs will be linked to several TGs. See Figure 3 in this blogpost for an idea of how many segments (TGs) ancestors at different generations are likely to have. The point is that each of your Ancestors will link to only certain TGs, or none.

 

[22Y] Segment-ology: The Fundamental Building Blocks of Genetic Genealogy by Jim Bartlett 20190203

Standard ID for Triangulated Groups

A Segment-ology TIDBIT

In my spreadsheets and notes and analyses, I refer to Triangulated Groups (TGs) by a special ID name for each one.  For example: TG 01S24 breaks down as follows:

01 means Chromosome 01 – this TG is on that Chromosome

S indicates, roughly, how far out on that Chromosome the TG starts. Each letter is roughly 10Mbp wide. “A” means the TG starts between base pair 1 and base pair 10,000,000 (or 10Mbp); “B” means the TG starts between 10 and 20Mbp; “S” means the TG starts between 180 and 190Mbp. In fact, my TG 01S24 covers 182-229Mbp; and the next TG along Chromosome 01 is 01X24. I’m not a slave to this “rule,” and adjust where it makes sense. NB: Everyone will have a uniquely different chromosome map, and their TGs will have different locations.

24 indicates the grandparent in Ahnentafel. When I can determine a TG is on my Paternal or Maternal side, I use 2 or 3 respectively. When I can determine the TG is on a particular grandparent, I use 24, 25, 36 or 37. I only carry it out two generations (so far). NB: some people use P or M (for Paternal or Maternal), instead of Ahnentafel numbers – take your pick.

If I’m referencing a Match, I might add the cM to show how significant the Match is in the TG. For example, Match A with 01S24, 38.7cM is much more significant than Match B with 01S24, 9.3cM. Clearly Match A is more likely to be a closer cousin (maybe a 4th or 5th cousin) than Match B (maybe well beyond my genealogy Tree).
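For anyone keeping TG IDs in a spreadsheet, the three parts can be split apart mechanically. The Python sketch below assumes the format described above (two-digit chromosome, one letter per 10Mbp, then 2, 3, 24, 25, 36 or 37); the class and function names are hypothetical, and the start range is nominal, since the letter rule is applied loosely.

```python
import re
from typing import NamedTuple


class TGId(NamedTuple):
    chromosome: int
    start_mbp_low: int   # nominal lower bound of the TG start, in Mbp
    start_mbp_high: int  # nominal upper bound of the TG start, in Mbp
    ahnentafel: int      # 2 or 3 (side), or 24, 25, 36, 37 (grandparent)


# Assumed format: chromosome, start letter, then side or grandparent.
TG_RE = re.compile(r"(\d{2})([A-Z])(24|25|36|37|2|3)")


def parse_tg_id(tg_id):
    """Split a TG ID such as 01S24 into chromosome, start range and Ahnentafel part."""
    m = TG_RE.fullmatch(tg_id)
    if m is None:
        raise ValueError(f"not a TG ID in the assumed format: {tg_id!r}")
    chrom, letter, anc = m.groups()
    index = ord(letter) - ord("A")  # A = 0, B = 1, ... S = 18
    return TGId(int(chrom), index * 10, (index + 1) * 10, int(anc))


print(parse_tg_id("01S24"))
# TGId(chromosome=1, start_mbp_low=180, start_mbp_high=190, ahnentafel=24)
```

Sorting on these fields puts TGs in chromosome order, which is handy when reviewing a chromosome map.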

BOTTOM LINE

Give each TG an ID

01S24 = A TG on Chromosome 01; starts 180-190Mbp; mapped to father’s father’s line

01S24, 38.7cM = A Match segment in TG 01S24 which is 38.7cM.

 

[22X] Segment-ology: Shorthand ID for Triangulated Groups TIDBIT; by Jim Bartlett 20190202

Shorthand ID for Common Ancestors

A Segment-ology TIDBIT

In my spreadsheets, Notes and analyses, I refer to Common Ancestors (CAs), or Most Recent Common Ancestors (MRCAs), by their Ahnentafel numbers.

Most of the time the MRCA with a Match is a couple, and I use the Ahnentafel number of the husband. For example: 36P is my father’s father’s mother’s father’s father (or 1-2-4-9-18-36 with the Ahnentafel number for each generation). This 36P shorthand actually refers to the 36/37 couple (Thomas NEWLON and wife Susan in my case). I add on a P or M to indicate the Paternal or Maternal side, as this is not obvious with larger Ahnentafel numbers after several generations.

Just to keep my bearings, I also usually indicate the cousinship of a Match – for example: 4C (4th cousin) or 4C1R (4th cousin once removed), or 3Cx2 (double 3rd cousin), or 2C/2 (half 2nd cousin). So the shorthand ID is usually something like 36P/4C1R – a lot of information packed into a compact ID. And, given this shorthand ID, I can always repeatedly divide the Ahnentafel number by two to get back down to me. For example: 856M breaks down to 428-214-107-53-26-13-6-3-1 (me); which is on my mother’s father’s side. I can easily tell that other Matches with 214M and 53M and 13M MRCAs are all on this same ancestral line.
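The repeated halving is easy to script. Here is a minimal Python sketch; the function names are hypothetical, and the only rule used is the standard Ahnentafel one (the child of ancestor n is n divided by two, rounded down).

```python
def walk_down(ahnentafel):
    """Return the line from an ancestor down to the test taker (1) by repeated halving."""
    if ahnentafel < 1:
        raise ValueError("Ahnentafel numbers start at 1")
    line = [ahnentafel]
    while line[-1] > 1:
        line.append(line[-1] // 2)
    return line


def side(ahnentafel):
    """Return 'P' if the line passes through 2 (father) or 'M' if through 3 (mother)."""
    line = walk_down(ahnentafel)
    if len(line) < 2:
        raise ValueError("Ahnentafel 1 is the test taker and has no side")
    return "P" if line[-2] == 2 else "M"


def same_line(distant, closer):
    """True if the closer ancestor lies on the distant ancestor's line down to 1."""
    return closer in walk_down(distant)


print("-".join(map(str, walk_down(856))))  # 856-428-214-107-53-26-13-6-3-1
print(side(856), side(36))  # M P
print(same_line(856, 214), same_line(856, 36))  # True False
```

The `same_line` check is what lets Matches with 214M, 53M and 13M MRCAs be recognized as belonging to the same ancestral line as 856M.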

BOTTOM LINE

Use a Shorthand ID for CAs and MRCAs

36P/4C1R = the CA is Ahnentafel 36, Paternal side; the Match is a 4th cousin once removed

 

[22W] Segment-ology: Shorthand ID for Common Ancestors TIDBIT by Jim Bartlett 20190202