How Many TGs From Distant Ancestors?

I was recently asked if I’d thought about this question. The quick answer is YES – the answer to this question is at the core of my belief that genetic genealogy is valid out to 9 generations back. And I think this question is really two questions: one about the Triangulated Groups (TGs) themselves; and one about the Matches with shared DNA segments within each TG.

How far back do our TGs go?

Using a 7cM threshold for shared DNA segments, I’ve documented 372 TGs, covering over 98% of my DNA. These TGs have natural breaks [recombination crossover points] between them. These TGs represent actual DNA segments, on my chromosomes, which are from my Ancestors down to a parent to me.  So how far back do they probably go?

The number of segments we have at each generation of our ancestors is fairly easy to estimate. Using a female to make it easier, she gets 46 segments from her two parents – in the form of 46 chromosomes. Pretty big segments…  Using the average recombination rate of 34 crossovers per genome (per parent), she would get 68 additional segments one generation back. In other words she would have a total of 46+68=114 segments from her grandparents. And she would get 114+68=182 segments from her Great grandparents.  Here is a handy table I made up for my reference:

This table starts with me at the bottom and shows the generations back, the number of Ancestors at each generation back, the generic name of those Ancestors, the relationship of my cousins who share a Common Ancestor with me at that level, the calculated percentage and cM amount of DNA I got from each of those Ancestors (at any given number of generations back), the calculated average number of segments in my DNA from all the Ancestors in any given generation, the average cMs per TG; and in the last two columns the average and range of cMs collected in Blaine’s cM study. The first column is just for a very rough estimate of the birth year of my Ancestors at any given generation (it helps me).
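The segment arithmetic described above (46 segments from the two parents, plus roughly 68 more for each further round of crossovers) can be sketched in a few lines of Python. This is only an illustration of that rule of thumb. The function name and the `extra_rounds` parameter are assumptions made for the example, and how each count lines up with a generation label follows the table, which is not reproduced in the code.

```python
CHROMOSOMES = 46            # segments received from the two parents
CROSSOVERS_PER_PARENT = 34  # average recombination rate quoted in the text


def expected_segments(extra_rounds: int) -> int:
    """Rough expected segment count after a number of extra crossover rounds.

    extra_rounds=0 is the 46 chromosomes from the parents; every further
    round adds 2 * 34 = 68 segments (hypothetical helper for illustration).
    """
    if extra_rounds < 0:
        raise ValueError("extra_rounds must be 0 or more")
    return CHROMOSOMES + 2 * CROSSOVERS_PER_PARENT * extra_rounds


for rounds in range(6):
    print(rounds, expected_segments(rounds))
```

The first three values are the 46, 114 and 182 worked out above, and the 386 figure corresponds to 46 + 5 x 68.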

Highlighted in yellow is the 386 segments expected (roughly) from my 3xG grandparents. That’s roughly the same as my 372 TGs. So I expect some kind of distribution curve around that point. Matches who share the full DNA segment represented by a TG would probably be 4th cousins (4C). Due to the random nature of DNA, I expect a range from 2C to 7C or 8C. My TGs range in size from a few just over 7cM to some around 50cM – it all depends on several variables.

Another aspect of this discussion has to do with what I call “sticky” segments. Per the Table above at 5 generations back we would see 386 segments – or 386 TGs – of about 18cM each. But going back one more generation – one more round of 68 crossover points would result in 454 segments. This means that at most 68 of the 386 segments were subdivided, and at least 318 segments (TGs) were passed down intact (no recombination). The effect of this is that many TGs will persist, at the same size, for several generations. We could well see the same size TG from a 6xG grandparent to a 5xG to a 4xG to a 3xG grandparent. So it would be possible for a 7C, 6C, 5C and 4C to all share the full size DNA segment represented by the TG. Clearly the probabilities of that decrease as the cousinship increases.
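A minimal sketch of the “sticky” segment arithmetic, assuming each crossover cuts at most one segment (so the number of subdivided segments can never exceed the number of crossovers). The function name is hypothetical, chosen for the example.

```python
def sticky_split(segments: int, crossovers: int = 68) -> tuple[int, int]:
    """Return (max segments subdivided, min segments passed down intact).

    Assumption: one crossover cuts one segment, so at most `crossovers`
    segments are subdivided in a single generation.
    """
    subdivided = min(crossovers, segments)
    intact = segments - subdivided
    return subdivided, intact


subdivided, intact = sticky_split(386)
print(subdivided, intact, round(intact / 386, 2))
```

Under that assumption, roughly four out of five of the 386 segments come through the extra generation unchanged, which is why the same size TG can persist across several generations.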

Bottom line from my experience: I think we’ll find most of our TGs to be within a genealogical time frame of, say, 9 or 10 generations. And there is always the opportunity for closer cousins to share a DNA segment within any of our TGs.

How far back do the Matches go?

This is a different, but related, question. The above discussion was all about the full DNA segment represented by a TG. Most of our Matches in a TG will not share the full DNA segment. They overlap us or are wholly included within the TG segment. For example, the Matches in a 20cM TG can range from sharing 7cM up to 20cM. And, in fact, some of our closer cousins may share 35cM and span across more than one TG. It’s very random. However, to the point of the question – many of our Matches who share, say, 7 to 15cM may well be cousins beyond the Ancestors who passed down the full TG. To be sure, the Common Ancestors in this case would be ancestral to the TG Ancestor, but it could be 10, 20, or more generations back.

Bottom line: Matches in a TG are limited to a narrow range of your Ancestors, but they are not limited by how close or how distant they could be. And Matches who share small segments may well be beyond a genealogical timeframe; but some will be within a genealogical timeframe. Witness the Ancestry ThruLines Common Ancestors down to 6cM.

Summary: I think most TGs will be within a genealogical timeframe (using a 7cM threshold for shared DNA segments). The Matches in a TG will range from close Matches, out to Matches on the fringes of our genealogy and on out to Matches who will be beyond our genealogy.

 

[19H] Segment-ology: How Many TGs From Distant Ancestors? By Jim Bartlett 20191217

A Unified Theory of Genetic Genealogy

Bottom Line Up Front (BLUF):

Triangulated Groups = Clusters = Common Ancestors

Brief overview: Each of us has a specific genealogy Tree of Ancestors; and a fixed arrangement of our DNA segments from those Ancestors. I believe our DNA segments are reflected in our Triangulated Groups (TGs) of shared DNA segments, which are from specific Common Ancestors (CAs), and that each CA is represented by a specific Cluster of Shared Matches. I believe there is alignment between the TG CA and the Cluster CA, which can be very helpful. Put another way, each of our Ancestors will have a specific TG/Cluster combination, and at some point in our Tree there will be one TG and a corresponding Cluster for each Ancestor.

TRIANGULATED GROUPS

In these blog posts I’ve often stated that each segment Triangulated Group (TG) is from a specific Common Ancestor (CA) – in other words the DNA segment identified by a TG came from a specific Ancestor down the line of descent of your Ancestors to you. The Matches in a TG will be relatives (usually cousins) along one of your ancestral lines.  For example, if a TG is from a 6xG grandparent (7th cousin (7C) level), some of the Matches may be cousins from 1C to 7C; and some may be from Ancestors beyond the 6xG grandparent – perhaps (usually with shared segments below 15cM) somewhat beyond the 6xG grandparent.

Because of the random nature of DNA, and the wide range of cMs for cousins beyond 3C, there is no set of parameters (short of complete chromosome mapping) that will get you only TGs at one generation. For instance, I know of no cM parameter that will get you only TGs at, say, the 6C level – or any other level. So we usually wind up with a mix of TGs at different cousinship levels.

TG Outliers

Like with most things DNA, there may be some outliers, and not every Match in a TG will be found to share an IBD segment (in other words, some Matches with small shared DNA segments  – under 15cM – may be false Matches). But the important take-away is that the TG will represent a CA, even if a few Matches are false.

TG Bottom Line

Your DNA has fixed crossover points. Depending on the cM threshold you use for comparing shared DNA segments, your data will have natural break points between TGs. I used a 7cM threshold and got 372 TGs covering over 98% of my 45 Chromosomes. It was hard work doing all the comparisons and culling out the false shared segments. I have Matches who are 2C to 9C for about 80% of these TGs.

SHARED MATCH CLUSTERS

Recently, I’ve been blogging about Clustering. Clusters appear to come from a specific CA down the line of descent of your Ancestors to you (just like the description of a TG).

When I did a Clustering run of all my 5732 Matches at Family Finder, I got 352 Clusters which had a very high correlation to my 372 TGs.

Well… duh! When we consider that each of us has fixed segments in our DNA and fixed ancestors in our Tree, we understand that each of us, in our own unique ways, has a specific “solution” (Ancestors linked to DNA segments). So if we look at grouping by Clusters, it should reflect that “solution”. And when we form segment TGs, they should reflect that “solution”. And in combination, the Clusters and TGs should reflect the same “solution”.  In other words the Clusters and TGs should align.

In my opinion, Clustering with Shared Matches is a sophisticated way of grouping Matches based on the probability that a number of Shared Matches who mostly match each other, will be from the same CA.

Clustering Outliers

Like with most things DNA, there may be some outliers, and not every Match in a Cluster will be found to share the same CA. But the important take-away is that most do share the same CA, and the Cluster will represent that CA, even if a few Matches don’t.

Cluster Bottom Line

Your Shared Matches will tend to Cluster on CAs. Depending on the cM threshold you use for comparing shared DNA segments, your data will divide into different numbers of Clusters. See my experience here; and the process here. I used a 6cM threshold and got 350 to 382 Clusters, covering at least all of my 4xG grandparents and some out to 8xG grandparents. It was relatively easy to run the Cluster programs to get the Match/SharedMatch data, and relatively little work to determine a consensus of a CA for each Cluster, for each run at different cM levels (smaller thresholds result in more Matches and Clusters, and more work). I can see CAs out to 8C for some Clusters. [NB: Clustering does not find the CAs – this is homework you have to do before Clustering: find as many CAs as possible and put that information in the Notes, so it’s available for analysis at each Cluster run].

ACTION – USE CLUSTERS to form TRIANGULATED GROUPS

I’ve spent a lot of work over the past 8 years determining my 372 TGs (your number of TGs may vary, but I believe that using a 7cM threshold for Shared Segments, it will come out at this order of magnitude). Triangulation, even with the tools at 23andMe, MyHeritage and GEDmatch, takes time and work. In contrast, Clustering is relatively simple – pretty close to a “click” process. If Clusters are the same as TGs, we should be able to run a Cluster report on all of our Matches (at a company which also provides segment data), and then easily sort on the DNA segment data (sort by Chr and Start), and then relatively easily scroll down the several thousand Matches and group them into TGs. Yes, this scrolling will take some work, but it’s a whole lot easier than comparing each shared DNA segment pair in a browser. I believe the combination of Cluster numbers and segment data will easily define the TGs – maybe just a little “quality control” at the end, depending on how the data looks.
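The sort-and-scroll idea could be roughed out as below. This is a sketch under stated assumptions, not a Triangulation tool: the rows, names and positions are made up for illustration, the field layout is hypothetical, and overlapping segments that share a Cluster number are treated only as candidate TGs that still need the “quality control” step.

```python
# Hypothetical rows: (match name, cluster number, chromosome, start, end)
rows = [
    ("Match A", 12, 1, 1_000_000, 9_000_000),
    ("Match B", 12, 1, 2_500_000, 8_000_000),
    ("Match C", 40, 1, 20_000_000, 31_000_000),
    ("Match D", 12, 1, 7_500_000, 12_000_000),
    ("Match E", 40, 1, 22_000_000, 30_000_000),
]


def candidate_tgs(rows):
    """Group overlapping segments that share a chromosome and Cluster number."""
    groups = []
    open_group = {}  # (chromosome, cluster) -> most recent group for that key
    # Sort by Chr and Start, as described in the text.
    for name, cluster, chrom, start, end in sorted(rows, key=lambda r: (r[2], r[3])):
        key = (chrom, cluster)
        current = open_group.get(key)
        if current is not None and start < current["end"]:
            # Overlaps the open group for this Cluster: extend it.
            current["matches"].append(name)
            current["end"] = max(current["end"], end)
        else:
            # No overlap: start a new candidate group.
            current = {"chrom": chrom, "cluster": cluster,
                       "start": start, "end": end, "matches": [name]}
            groups.append(current)
            open_group[key] = current
    return groups


for group in candidate_tgs(rows):
    print(group["chrom"], group["cluster"], group["start"], group["end"], group["matches"])
```

Keeping one open group per Cluster number means overlapping segments from the other side of the family (a different Cluster) do not get merged into the same candidate TG.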

I have my brother’s DNA at FTDNA and 23andMe – I’m going to try this process on his results, and will report back.

The Bottom Line

Once you determine your TGs and the CAs that go with them, you have a Chromosome Map!

My Bottom Line

I’m trying to demonstrate:

  1. TG=CL=CA
  2. The CA will be in the 7C-9C range*

 

*I recognize that my belief that our DNA tests can accurately determine our CAs out to 8C, or so, is not held by most genetic genealogists. But based on my experience, particularly using Walking The Clusters Back, I believe this is a realistic range – easily and accurately obtained – and confirmed by both TGs and Clustering.

With our fixed Ancestry and DNA crossover points, each process should give us the same “solution” – whether we use DNA Painter, Kitty Cooper’s Chromosome Mapping, GenomeMatePro, Visual Phasing, Double Match Triangulator, etc., etc. We are just using different tools to “see” the chromosome map.

 

[19G] Segment-ology: A Unified Theory of Genetic Genealogy by Jim Bartlett 20191216

Walking The Clusters Back III

Progress Report – Observations…

Main benefits, so far:

  1. Impute Cluster Common Ancestor (CA) to other Matches in the Cluster – this lets us focus on individual Matches – look at their Tree with a CA in mind, and/or communicate with the Matches and ask about a specific Surname or Ancestral line.
  2. Compare Cluster CA to ThruLines CA – if the same, we have reinforcing evidence; if different, the ThruLines CA may be wrong, or it may be correct genealogy, but the Match has another CA linked to the DNA (and the Cluster).
  3. Link some Clusters (and the CA) to a Triangulated Group (TG) – this will strengthen the evidence of the Ancestral line of a TG. Often the Cluster CA is more distant than the CA found in TGs at 23andMe, FTDNA, MyHeritage or GEDmatch.
  4. As the threshold decreases, there are more Matches included in the Clustering process, and those Matches tend to have more distant CAs with us. Clusters will start with only 2C and 3C; and grow to include 4C and 5C, etc. We can see the Walking The Cluster Back happening within each Cluster. Eventually each Cluster will begin to show several generations of CAs – they should all be on the same Ancestral line [if not, check with the correlated Clusters]
  5. Clustering reduces the range of possibilities. If a Cluster has a CA of A18 [Ahnentafel number for a specific 2xGreat grandparent = father’s father’s mother’s father], there are only two possibilities for the next generation: A36 and A37 (although a Match may share a CA another generation back: A72, A73, A74, or A75). If a new Match in the next (lower threshold) Cluster run has CA = A74 – this is reinforcing evidence. If the new Match has CA = A88 – something is amiss [check for another CA, check for a correlated Cluster which is A44, or A176, etc.]
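The Ahnentafel check in item 5 is simple arithmetic: the father of Ancestor n is 2n and the mother is 2n + 1. A small sketch (the function names are hypothetical) shows why A74 reinforces an A18 Cluster while A88 does not:

```python
def parents(ahnentafel: int) -> tuple[int, int]:
    """Father is 2n and mother is 2n + 1 in Ahnentafel numbering."""
    return 2 * ahnentafel, 2 * ahnentafel + 1


def same_line(a: int, b: int) -> bool:
    """True if one Ahnentafel number is the other, or a direct ancestor of it."""
    low, high = sorted((a, b))
    while high > low:
        high //= 2  # step from an Ancestor down to their child
    return high == low


print(parents(18))        # the two options for the next generation
print(same_line(18, 74))  # A74 -> A37 -> A18: same Ancestral line
print(same_line(18, 88))  # A88 -> A44 -> A22 -> A11: a different line
```

The same helper confirms that A44 and A176 sit on the A88 line, which is where a correlated Cluster would be expected.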

Main issues, so far:

  1. It’s been hard to specifically find Clusters which split into two Clusters a generation further out. Many Clusters have included CAs which span several generations on the same line. I’m inclined to “go with the flow” [accept the Clusters with CAs on the same line]; and not try to force Clusters into a “genealogy Tree” structure. The data is just too variable. Maybe when I get down to a 6cM threshold, it may play out that way – but, I have a feeling the vagaries of random DNA will make that a wild goose chase.
  2. Several BIG variables combine to give us trends, rather than a uniform picture/pattern:
    1. Our Ancestry (size of families, documentation, probable NPEs at some level, etc.)
    2. Our random DNA from different Ancestors
    3. Which of our cousins have DNA tested.
  3. Homework is needed – Clusters can be formed, but some genealogy is needed to identify the CAs. I recommend building a Tree of Ancestors out 7 generations, wherever possible – with that, AncestryDNA will find ThruLines CAs for you. Enter those CAs (or their Ahnentafel) into the Match’s Notes, so that information will be available in the different Clustering runs.

My status so far is summarized in this Table of different Cluster runs:

The first column shows the decreasing thresholds I used (basically every 5cM) – the top line is the original download: 6cM threshold, 119,068 Matches (and all their Shared Matches) which took 9 hours and is in a .txt file.

The # Matches and # Clusters are for the various cluster runs – which take negligible time to produce an Excel Cluster report.

The 3C, 4C, 5C, etc. columns show how many Clusters I got with CAs at those levels. (There were some 2C, but they were in Clusters that also had 3C – I counted each Cluster with the most distant cousinship which had a consensus.)

The larger threshold Clusters had multiple TGs in them. Beginning about at the 35cM threshold, some of the Clusters started showing a single, or consensus, TG – so I counted them.

Starting after the 45cM threshold, the number of included Matches about doubled with each decrease of 5cM in the threshold, and the number of Clusters began increasing dramatically. This means the amount of work for scrolling down the entire report, analyzing the data in each Cluster and determining the consensus, also increased a lot. Sometimes the CA and/or TG of a Cluster is very clear; sometimes a Match’s correlated Clusters must be reviewed and the Match assigned to another Cluster. And all new Matches need to be “Tagged”; and often other Matches need to have their “Tag” adjusted [what I alluded to in the Iterative WTCB Process], as new Matches and their new information are added to the Clusters.

The good news is in the TG column – where about 1/4 of the Clusters (and CAs) can be linked to TGs [I have TGs for over 300 Matches at AncestryDNA].

More good news: The Shared Clustering program, below 20cM, will first Cluster on the 4,515 Matches, basically retaining the 382 Clusters constant, and then go back and add in the new Matches. Therefore, all the Matches below 20cM (including about 300 with TGs, and over 1,500 with CAs) will be added to the existing Clusters, and very probably push the Cluster CA out even farther and add a TG to many of them.

Note the trend in the 35, 30 and 25cM Cluster runs to more distant cousinships. As I find the time to analyze the 20cM Cluster run, and then runs at 15cM and 10cM, I expect this trend to continue, giving me many more CAs in the 6C, 7C and 8C range. Of course these are all clues, but I believe they are very strong clues. Time will tell as I investigate each Cluster/CA/TG more deeply.

 

[19F] Segment-ology: Walking The Clusters Back III by Jim Bartlett 20191214

Walking The Clusters Back II

I’ve found it difficult to hit the 8-16-32-64 Cluster targets. Even when I come close, they don’t wind up all in one generation. The DNA is just too random. Just look at Blaine Bettinger’s cM charts to see that there are wide cM ranges for most of the cousinships. Therefore, there is not a magic threshold for any given generation.

A better plan for WTCB is to lower the cM threshold a little at a time, say 5cM, and examine the new array of Clusters. Where did the Matches from the previous Clusters go? Look for cases where the Matches from one Cluster are now split between two Clusters – almost certainly these two Clusters will represent the parents of the Cluster that held all of the Matches before. As you lower the threshold incrementally, expect the Clusters formed on close Ancestors to disappear, and new Clusters to form on more distant Ancestors. Trace the Matches from Clusters that disappear to their new Clusters for very strong clues of the ancestral line. Once you are confident of the Common Ancestor (CA) of a Cluster, there are only two options for the next generation – the two parents of the Common Ancestor previously determined. And, with the lowering of the Clustering cM threshold, new Matches will be added to the mix. These new Matches have (incrementally) smaller shared segments with you, and, in general, will tend to be more distant cousins. To be sure, each new batch of Matches (as you lower the threshold each time) will probably include a range of cousinships with you. Each Cluster CA is a hypothesis, and as new evidence (new Matches) is added to the mix, everything needs to be reviewed for consistency, and discrepancies resolved. Sometimes a discrepancy is resolved by moving the Match to a correlated Cluster.
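Answering “where did the Matches from the previous Clusters go?” can be sketched as a simple lookup between two Cluster runs. The Match names and Cluster numbers below are made up, and representing each run as a name-to-Cluster dictionary is an assumption for the example rather than the format of any particular Clustering program.

```python
from collections import defaultdict


def trace_clusters(previous: dict[str, int], current: dict[str, int]) -> dict[int, set[int]]:
    """Map each previous Cluster to the current Clusters that now hold its Matches."""
    destinations = defaultdict(set)
    for match, old_cluster in previous.items():
        if match in current:
            destinations[old_cluster].add(current[match])
    return dict(destinations)


# Hypothetical runs: Match name -> Cluster number
previous = {"Ann": 1, "Bob": 1, "Cy": 1, "Dee": 2}
current = {"Ann": 5, "Bob": 5, "Cy": 6, "Dee": 7, "Eve": 6}

moves = trace_clusters(previous, current)
# Previous Clusters whose Matches now sit in exactly two Clusters are the
# candidates for a split into the two parents.
splits = {old: new for old, new in moves.items() if len(new) == 2}
print(moves)
print(splits)
```

A previous Cluster that lands in three or four new Clusters is not flagged here; those cases still need to be reviewed by hand.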

Those who try this process are encouraged to provide feedback in the Comments to this blog.

 

[19E] Segment-ology: Walking The Clusters Back II by Jim Bartlett 20191205

Walking The Clusters Back

A Segment-ology Concept

Overview

Walking The Clusters Back (WTCB) can be a fairly complex process, so let’s start with an overview of the concept.

Pick a Clustering threshold high enough to give us 4 Clusters – one for each grandparent. Tag each Match in these Clusters to the appropriate Ancestor (grandparent). Then adjust the threshold to get (roughly) 8 Clusters (one for each Great grandparent). These Clusters would include all the Tagged Matches who would indicate which grandparent line each new Cluster was in; as well as new, generally more distant Matches, who would then separate these Clusters into Great grandparents. Tag, or re-Tag, each Match in these Clusters to the appropriate Ancestor (Great Grandparent). Then lower the threshold to get 16 Clusters and repeat the process.

Up Front DISCLAIMER: This is not a simple click-and-done process. There is homework to be done before Clustering: documenting the Common Ancestors in our Matches’ Notes. Although the time it takes to run various Cluster reports is only a few seconds for each one, it takes some time to analyze the Matches in each Cluster and come to a consensus, then transfer those clues to the next Cluster, and then analyze those Clusters. It turns out this WTCB process is iterative – two steps forward, then one back. Each new set of Clusters brings in new Matches with clues that need to be reconciled. WTCB is somewhat easier than Triangulating all your Matches, but it is still time-consuming work. Nevertheless, WTCB is a great opportunity to get the most out of your Matches at AncestryDNA.

Background

Clustering is a way of grouping your Matches. Each Cluster tends to group Matches who descend from the same Ancestor. The Leeds Method groups close-cousin Matches into four Clusters which are usually our four grandparents. On the one hand this depends on knowing the Common Ancestor with some of the Matches; and on the other hand it provides a strong clue about the Ancestor of other Matches in a known Cluster. And if some Clusters are known, the others may be determined by logic. Clustering provides a powerful grouping tool. I posted about Grouping Matches here; and about several Clustering programs here.

The Leeds Method uses a high threshold (90-400cM) for the Matches to be included in the analysis, so that only 2nd or 3rd cousins are used. Each cousin in this range would usually be from only one of our four grandparents. What would happen if we lowered the threshold just enough to only have 3rd or 4th cousins most of the time? Generally they would tend to form eight Clusters – one for each of our eight great grandparents. However, as with most things “DNA”, as we decrease the cM threshold, we get a wider range of relationships – it’s not very probable that we could find a cM threshold that would produce exactly eight Clusters, or even succeeding in that, that there would be a 1-to-1 relationship to our eight Great grandparents.

And if we decided to jump to the ultimate and Cluster on 6cM, I can tell you that we’d get hundreds of Clusters. Some of them we might be able to identify, but most will look like “mush”. And, like finding a DNA Match who is a 9th cousin, we wouldn’t really have much in the way of corroborating evidence.

However, we often do have a lot of data to work with, and a good tool like Clustering that makes it fairly simple to group our Matches…

It struck me that maybe we could Walk The Clusters Back (WTCB). See the concept in the Overview above. In theory the 8 Clusters would be husband/wife pairs – the parents of the grandparents in the original 4 Clusters. The 4-Cluster Matches would carry a “tell-tale” Tag of whence they came to the 8 new Clusters. Then, hopefully, with clues from the new Matches in the 8 Clusters, we could determine which of the great grandparents each of the 8 Clusters represented. We are down to only two options for each Cluster, and if we can figure out one, the other would be determined by logic: i.e. the other parent.

In general this worked! But it didn’t always work per the theory (the DNA is random), and the process was arduous.

Problems

– Even by adjusting the Cluster threshold 1cM at a time, the Clusters rarely came out to 8, or 16, or 32, or 64, or 128. And even when it came close, all of the new Clusters were not necessarily just the parents of the previous Clusters. Sometimes one Cluster would split into 3 or 4 Clusters (a parent and 2 grandparents, or 4 grandparents). And from one iteration to the next, some Clusters didn’t split at all. As the threshold dropped and the number of Clusters increased, the deviation from the theory increased. Not wholesale, not all of them – but enough to notice.

– In my case (one grandparent who had very few Matches, and thus very few Clusters), I knew I would get only 2 Clusters max for that one grandparent. If you have a special case, you need to make some adjustments (I used 4-8-13-26-50-98 as my target Cluster numbers and took whatever was closest).

– Sometimes in a Cluster, I didn’t get any/many Matches who had a CA. Each new Cluster depends on Matches with CAs to inform us about the probable CA for the Cluster. I could usually make up for this in the next iteration, but that meant I had to backtrack – which I did more and more as the number of Clusters increased.

– I had to find a way to transfer the Ancestor tell-tale Tags from Matches in one set of Clusters to the Matches in the next Cluster run. More on this later.
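Picking “whatever was closest” to a list of target Cluster numbers can be sketched as below. The thresholds and Cluster counts are invented purely for illustration (they are not my results), and the function name is hypothetical.

```python
def closest_runs(cluster_counts: dict[int, int], targets: list[int]) -> dict[int, int]:
    """For each target Cluster count, pick the cM threshold whose run came closest."""
    chosen = {}
    for target in targets:
        chosen[target] = min(
            cluster_counts,
            key=lambda cm: abs(cluster_counts[cm] - target),
        )
    return chosen


# Hypothetical runs: cM threshold -> number of Clusters produced
cluster_counts = {90: 4, 70: 7, 60: 10, 50: 14, 45: 24}

print(closest_runs(cluster_counts, [4, 8, 13, 26]))
```

If two runs are equally close to a target, `min` keeps the first one it sees, so the order of the runs decides the tie.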

Developing the process

In step 1 of the above figure, all the Matches above a 90cM threshold are Clustered into 4 groups – one for each grandparent. In step 2, I determined the 4 grandparent Ancestors based on the known genealogy of the close Matches in each Cluster. In step 3 I adjusted the threshold to create about 8 Clusters, and noted which of the Matches from steps 1 and 2 are now in the new Clusters. Generally the Matches from each of the 4 Ancestors (grandparents) in step 2 will be found in only 2 Clusters at the step 3 level. There are only two options for these two Great grandparents: the father and the mother of the Ancestor in step 2. Often at this point, we don’t know which is the father and which is the mother, but we can be confident it’s one of each. However, during step 3 we also get additional Matches (not shown) in the Great grandparent Clusters. Some of these new Matches may be 4th cousins who would provide insights on the identity of the Great grandparents. The other Great grandparents will become known after step 4 as shown below.

In step 4 above, we lower the threshold again and get roughly 16 Clusters. I’m just using a single green arrow to show all the new Matches who were 4th, 5th and 6th cousins who formed the 16 Clusters (along with the closer red arrow Matches from steps 2 and 3). At this point, all we’d need is the CA of some of the more distant Matches, in order to determine the correct Ancestors at level 3. Note that two of the green arrows don’t point to one of the 8 level 3 Ancestors – don’t worry about it. These “errant” Matches will often fall into Clusters in the next iteration. Or maybe they will continue to be strange. Again, don’t worry about it – focus on the Ancestors you can determine. Some of the data will get a little messy as the threshold is dropped. Focus on the positive outcomes – the identity of the Cluster Ancestors at each generation.

Step 4 above shows that this is an iterative process: we sometimes need the CA couple information from one generation to resolve the individual CA of a closer generation. Note that this changes the Tags, which needs to be reconciled in a prior generation Cluster.

The Iterative WTCB Process

This led me to modify the theoretical process at the beginning of this post to the following iterative, or zig-zag, process.

  1. Set the threshold to get 4 Clusters; Assign each Cluster to a grandparent Ancestor; Tag each Match in 4-Cluster with the appropriate Ancestor.
  2. Reduce threshold to get 8 Clusters [with Tagged 4-Cluster Matches plus additional new Matches (some with CAs in Notes)]; as best you can, assign Clusters to Great grandparents; Tag all Matches with what you know so far.
  3. Re-run 4-Clusters, to ensure the new Tags are OK, adjust as necessary.
  4. Re-run 8-Clusters, and adjust as necessary.
  5. Reduce threshold to get 16 Clusters [with previously tagged Matches plus additional new Matches (some with CAs in Notes)]; as best you can assign Clusters to 2xG grandparents; Tag all Matches.
  6. Re-run 8-Clusters; usually some new Tags will clarify any Great grandparents that were unclear in Step 2. Re-Tag as appropriate.
  7. Re-run 16-Clusters and adjust as necessary.
  8. Repeat steps 5, 6 and 7 (adjusted for the next round): Reduce threshold/assign/Tag; revisit the prior Clusters/Re-Tag as appropriate; re-run the current Clusters/adjust as necessary.

This 2-steps forward then 1-step back process is necessary because it’s not always clear when one Cluster splits into two new Clusters which is the father and which is the mother – this usually becomes clear in the next round which includes more distant cousins. If something doesn’t work out in one round, it will probably get resolved in a subsequent round. Until, of course, you run out of sufficient data. Theoretically, each of your Ancestors with ThruLines Matches will be incorporated into a Cluster. Generally that would result in a lot of known Clusters. And if some of your Matches uploaded to GEDmatch or tested at one of the other companies, you’d also have TG information included in the Clusters. Happy days of DNA Painting or Chromosome Mapping!

Some Additional Items

Homework before WTCB

Create Notes in AncestryDNA for as many of your Matches as you can (hopefully you’ve been doing this all along). See my previous blog posts about AncestryDNA Notes: Format; Using Notes; ID for CAs; ID for TGs. These Notes are then handy and invaluable in the WTCB process – they provide the clues that let you determine a consensus in a Cluster. They remind you of the CA and TG information you’ve already gathered.

Work with generations

It would be possible to start with a large cM threshold to get 4 grandparent Clusters, and Tag the Matches. Then decrease the threshold by 1cM and run a new Cluster report. Has any new Cluster been added? If no, then decrease the threshold by another 1cM and repeat. If yes, it’s probably the result of a split of a previous Cluster into two parents. Identify the parents, and Tag the Matches appropriately. Then decrease the Cluster threshold by 1cM and repeat. This would take a lot of work.

The process I chose was to lower the Cluster threshold by enough to create roughly twice the number of Clusters. This would basically be creating the next generation of Ancestors. The random DNA doesn’t allow this to work perfectly, but it does tend to subdivide previous Clusters into new Clusters. Focus on identifying (through Tagged Matches) which Clusters came from previous Clusters; and then identifying the next generation of Ancestors in these new Clusters. Again, this doesn’t work perfectly each time. But don’t worry about it, a subsequent set of Clusters (with new Matches and new information), will usually provide resolution (through the iterative process – see below).

Tagging Matches

The key is to Tag Matches and carry over this information to the next set of Clusters. Now in this new set of Clusters you have Matches with information carried forward, plus new Matches, some of which have known information of their own (available in the Notes). Again, your focus is to achieve consensus in each new Cluster, and re-Tag the Matches.

I use the Ahnentafel number as the Tag. Other options include: Ancestor name, Ancestor initials, or Ms and Ps (e.g. MP for maternal grandfather). The Cluster number changes with each different iteration, so don’t use that.
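The Ms and Ps option converts directly to an Ahnentafel number, so either style of Tag can be carried forward. A minimal sketch, assuming the letters are read starting from you and moving outward (M = mother’s side, P = father’s side at each step); the function name is hypothetical.

```python
def path_to_ahnentafel(path: str) -> int:
    """Convert a string of Ms and Ps to an Ahnentafel number.

    Assumption: letters are read from the test taker outward, so "MP" is
    the mother's father. Each P doubles the number; each M doubles it and
    adds one.
    """
    number = 1  # Ahnentafel 1 is the test taker
    for step in path.upper():
        if step == "P":
            number = 2 * number
        elif step == "M":
            number = 2 * number + 1
        else:
            raise ValueError(f"unexpected letter {step!r}; use only M and P")
    return number


print(path_to_ahnentafel("MP"))    # maternal grandfather
print(path_to_ahnentafel("PPMP"))  # father's father's mother's father
```

With this reading, MP comes out as Ahnentafel 6, and PPMP as 18.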

One way would be to add the Tag into the Shared Clustering spreadsheet Note field for each Match; then use the Shared Clustering program to upload the Notes back to AncestryDNA and also to the Download file (where it would be available for the next Cluster run – remember the Cluster runs on the Download file only take a few seconds).

I chose to use the Shared Clustering spreadsheets (after stripping out the colorful Clusters – keeping just the data). I combine two Cluster files, sort on Match name, and copy the Tag from one Match to the other. This is relatively easy for small Clusters, but it gets more time consuming with each iteration.
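The copy-the-Tag step could also be done with a short script instead of by hand. This is a sketch only: it assumes each Cluster file has already been read into a list of rows with hypothetical `name` and `tag` fields, and that Match names are unique (which is not guaranteed in real data).

```python
def carry_tags(previous_run: list[dict], new_run: list[dict]) -> list[dict]:
    """Copy Tags from the previous Cluster run onto the same Matches in the new run."""
    # Look-up of Match name -> Tag from the previous run (untagged rows skipped).
    tags = {row["name"]: row["tag"] for row in previous_run if row.get("tag")}
    merged = []
    for row in sorted(new_run, key=lambda r: r["name"]):  # sort on Match name
        # Keep a Tag already present in the new run; otherwise carry the old one.
        tag = row.get("tag") or tags.get(row["name"], "")
        merged.append({**row, "tag": tag})
    return merged


# Hypothetical rows
previous_run = [{"name": "Ann", "tag": "A4"}, {"name": "Bob", "tag": "A5"}]
new_run = [
    {"name": "Cy", "tag": ""},
    {"name": "Ann", "tag": ""},
    {"name": "Bob", "tag": "A10"},
]

for row in carry_tags(previous_run, new_run):
    print(row["name"], row["tag"])
```

New Matches with no earlier Tag are left blank, so they stand out as the ones that still need a consensus from their new Cluster.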

An iterative process

WTCB is definitely an iterative process. When we add new generations of Clusters and find new Matches with clues, we need to then backtrack to previous Cluster runs with this new information. Why backtrack? Because with each succeeding generation of Clusters we are (roughly) adding two “parent” Clusters for each one we had before. Sometimes we don’t have enough information to distinguish which of these two “parent” Clusters is the father and which is the mother. But in the next Cluster iteration, we find new information among the new Matches who are added in each round. When we backtrack with that information, and designate one of the two “parent” Clusters as, say, the mother, then we can impute the other one to the father – and then Tag all the Matches in those two Clusters, and in all subsequent Cluster runs.

All the Clustered Matches are valuable

At first I intended to cull out the Matches for which I had no Notes – they didn’t appear to add any value. They didn’t reveal a ThruLines CA or a TG or anything else – nada. But then I realized they did add value – they were part of the heat in the heatmap. They added the value of being assigned to a Cluster (because of their Shared Matches), notwithstanding the fact that I didn’t know anything specific about them. In the next iteration of Clustering (with a lower cM threshold) they would also be included. If I Tagged them per the current Cluster, this information would carry over to the new Clusters. In theory (and borne out in practice), they tended to divide between two Clusters in the next iteration. Of course they couldn’t help me figure out the CA of those two Clusters, but the fact that they helped form the Clusters gave weight to the new Clusters. Sometimes they were the Matches needed to actually form a new Cluster, and without them I wouldn’t get that Cluster. And their Tag told me I had only two possibilities for these new Clusters – the two parents of the Tag. And in these new Clusters there were new Matches – sometimes ones who were ThruLines Matches with a CA, or a Match (with no genealogy) who had uploaded to GEDmatch, and thus had a known TG. And sometimes I got nothing from these new Matches, but then did find new clues in the next generation of Clusters.

The value of Common Ancestors with low cM shares

Some of my Matches with CAs (from ThruLines out to 6C or Circles out to 8C) have smaller shared segment cMs – all the way down to 6cM. I treat these as clues – they can be helpful in developing a consensus with other evidence. With this WTCB process, we only have two Ancestor options at each generation, so even a 6cM Match may be a valuable clue.

Two kinds of Imputation possible

Cluster Imputation: When one of two “Parent” Clusters can be determined (usually by a Match who has a known CA), the other “Parent” Cluster can be imputed to be the other Parent. Then all of the Matches in both “Parent” Clusters can have their Tags adjusted appropriately.

Match CA Imputation: Cluster CAs (or the ancestral line) can be imputed to all the Matches in a Cluster. On many occasions, with Clusters with a strong Match consensus for a CA, I’ve gone to other Matches in the Cluster looking for that CA and found it.
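When Tags are Ahnentafel numbers, Cluster Imputation reduces to simple arithmetic, because the father of person n is 2n and the mother is 2n+1. A minimal sketch with hypothetical function names:

```python
def parent_tags(child_tag: int):
    """Return the (father, mother) Ahnentafel numbers for a Tag."""
    return 2 * child_tag, 2 * child_tag + 1


def impute_other_parent(child_tag: int, identified_tag: int) -> int:
    """Given the Tag of a Cluster that split in two, and the Tag identified for
    one of the two "Parent" Clusters, return the Tag to impute to the other."""
    father, mother = parent_tags(child_tag)
    if identified_tag == father:
        return mother
    if identified_tag == mother:
        return father
    raise ValueError(f"{identified_tag} is not a parent of {child_tag}")


# A Cluster Tagged 6 (maternal grandfather) splits; one new Cluster is identified
# as 13 (his mother), so the other is imputed to 12 (his father).
print(impute_other_parent(6, 13))
```

The error case matters in practice: an identified CA who is not a parent of the previous Tag suggests the Cluster did not split along the expected line, or that the Match is related in more than one way.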

How Far Can We Go?

Continue this out as far as you want? Well, it’s not quite that easy. There are several things at play here:

  1. As noted above, the lower the cM threshold used, the wider the range of relationships we’ll get. As the threshold drops, we’ll see a wider range of Ancestors for each Cluster.
  2. Some of our Match-cousins share multiple relationships with us. Just look at your ThruLines Matches to see the number of them who share more than one ancestral couple with you. At a 20cM threshold, 65 of my 296 Matches (22%) share more than one pair of Common Ancestors with me.
  3. Some of our Match-cousins share multiple DNA segments with us. This means those Matches could share multiple Ancestors with us. Which one should we use? At a 20cM threshold, 1744 of my 4506 Matches (almost 40%) share more than one DNA segment with me.
  4. However, if a Cluster with Matches at one level splits into two Clusters, each with some of those same Matches at the next level, it’s fairly safe to use that information as a strong clue (or hypothesis).
  5. We now have ThruLines at AncestryDNA. I have over 1,800 Matches in my ThruLines, and they “cover” all of my known Ancestors out to my 5xG grandparents (6th cousin level). Those that wind up in Clusters provide valuable clues about the Ancestor for those Clusters. Couple that with the Matches with “Tags” from previous Clusters, and you have reinforcing (or conflicting) clues.

 

Here is a Table of my Cluster iterations – I used the highlighted ones in this study.

SC = Shared Clustering program by Jonathan Brecher (used for this WTCB analysis)

Conclusions:

  1. It is possible to Walk The Clusters Back. I think the trickiest part is assigning Tags to Matches that stay with the Match in succeeding Cluster runs. I plan to try using the Shared Clustering Upload program for that (upload to Ancestry and the Download file).
  2. WTCB is not a simple “click” process – it involves homework (CAs in AncestryDNA Notes), and judgment, logic and time working with the Cluster iterations.
  3. It gets harder (both in logic and the number of Clusters and Matches involved) with each iteration.
  4. Some of my Matches have uploaded to GEDmatch or tested elsewhere and I have TGs for them. WTCB will provide a strong clue for the CA of these TGs.
  5. I think the realistic limit will be around 7xG grandparents (8th cousin level).
  6. WTCB helps us impute CAs to Matches.

 

[19D] Segment-ology: Walking The Clusters Back by Jim Bartlett (20191201)

Shared Clustering – A Great Tool!

A summary of some different Clustering programs is here. I’ve used, and liked, most of these programs, and I want to highlight one of them here.

Shared Clustering by Jonathan Brecher is a good, flexible tool – it does what I want, quickly. It doesn’t have the glitz of Genetic Affairs or other features offered by DNAGedcom Client. But it gets the job done for me, efficiently, and it’s free. Some detailed steps at the bottom of this post.

Some comments on Shared Clustering:

– I used a 6cM threshold and downloaded all my 118,853 Matches (and Shared Matches) at AncestryDNA in 2 hr 34min.

– I then ran a Cluster report with a 90cM threshold in 2 seconds (that’s not a typo): 34 Matches in 8 Clusters.

– Each Cluster is assigned a number.

– Each Match is shown in one Cluster – the one with the most matches to other Shared Matches – the most “heat” in a heat-map program.

– AND all of the Correlated Cluster numbers are also shown for each Match. These are Clusters where the Match also has an affinity – the Match shares some Shared Matches with the rest; just not as much as in the Cluster it’s assigned to. This is very handy, because sometimes our known relationship with the Match would be a better “fit” in one of the other Clusters – feel free to use judgment and assign a Match to any Correlated Cluster you want. OR, if a Match shares two segments with you, assign it to two Clusters. Omygosh – that violates the Cluster “rules”! But this is your data now, use your own judgment and bend the rules a little – just don’t get too wild…

– I ran multiple other Cluster reports, each one took only a very few seconds.

– With a threshold of 28cM, I get 1105 Matches in 94 Clusters – in 4 sec. For me, that’s about one Cluster for each of my 5xG grandparents. Of course, it won’t fall out exactly this way, but that’s the general area of my Tree I’d be working in with these Clusters. Remember: Clusters tend to form on individual Ancestors.

– Each report includes a one-click link to each Match’s DNA page with me – very handy.

– All of the ThruLines Common Ancestors (CAs) are also included for each Match – a convenient check, if you haven’t already summarized each of them in the Notes. Or if you are just checking for new Matches among your ThruLines.

– Each report includes all of my Notes (into which I’ve already summarized ThruLines, other CAs and TGs).

– VERY IMPORTANT: I can modify as many of my Notes as I want in the spreadsheet, and then easily click to upload that info back to AncestryDNA (it overwrites the Notes that I’ve changed – WOW, what a time saver). This uploads in under a minute. Use this feature to summarize ThruLines CAs into your Notes (if you haven’t already), and upload that back to AncestryDNA. Use the “Upload Notes” TAB.

– ALSO IMPORTANT: I can use the “Export” TAB to download my AncestryDNA data, including Notes, to an Excel file, giving me an inventory of all my Ancestry Matches (without the Clusters or Shared Matches). This is my go-to file whenever I’m searching for an Ancestry Match (like from a name or email at GEDmatch). It’s much better than using the AncestryDNA search system. And the hyperlink means I am just one click away from my Match’s DNA page with me.

Some steps to get started:

Go to this page to download the program to your PC: https://github.com/jonathanbrecher/sharedclustering/wiki

Read the Home page, and then click on the download link on the right side (if you get a popup warning, tell your PC it’s OK)

Read the Introduction TAB, then select the Download TAB

You are now working from your own PC – enter your Ancestry username and password and select your test.

I click on “Slow and Complete”, but feel free to try each of the radio buttons. I set “Lowest centimorgans” to retrieve to 6cM and get all my 118,000 Matches in about 2.5 hours. Note where your file is stored. If you set “Lowest centimorgans” to 20cM, you’ll get all your “fourth cousins” and closer in less than 10 minutes – this includes all the Matches who are used as Shared Matches.

After the Download is complete, select the Cluster TAB – the Saved Data File (from the Download) is usually shown by default, but you can also use files downloaded from other companies, if you want. The Cluster output file usually shows by default too – it’s the same name as the Saved Data File with “-clusters.xlsx” appended instead of “.txt”. You can change the name of this file if you want – I usually append the cM threshold I’m using (e.g. 28) after “clusters” so I can save them all with different, recognizable, names. Just make sure both files (the Download “txt” file and the new clusters “xlsx” file) are in the same folder. I’ve also set up a Clustering folder, and a sub-folder for the Shared Clustering program, and separate sub-sub folders with a date of the initial download file (e.g. 20191123) – so the Download and each of the Cluster runs would go in that (20191123) folder. A little work on organizing a file system really helps me remember what I’m doing….
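For anyone who likes to script their file housekeeping, the naming habit described above can be written down as a small helper. This is only a sketch: the folder names and the "jim.txt" file name are made up for illustration, and the program itself does not require any of this.

```python
from pathlib import Path


def cluster_output_path(download_file: Path, threshold_cm: int) -> Path:
    """Build a Cluster output name next to the Download file.

    Follows the convention of replacing ".txt" with "-clusters<cM>.xlsx",
    e.g. jim.txt at a 28cM threshold -> jim-clusters28.xlsx in the same folder.
    """
    return download_file.with_name(f"{download_file.stem}-clusters{threshold_cm}.xlsx")


# Hypothetical folder layout: Clustering / program sub-folder / download date.
download = Path("Clustering") / "SharedClustering" / "20191123" / "jim.txt"
for cm in (90, 28, 20):
    print(cluster_output_path(download, cm))
```

Keeping every run beside the dated Download file makes it easy to see which Cluster reports came from which download.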

Click on the “Cluster completeness” button of your choice; and type in the “Lowest centimorgans” box. Then hit Process Saved and wait about 2 seconds.

This Chart shows the relationship between the cM Threshold selected and the number of Clusters that result (for the Download of my data). Your results may vary, but the shape of the curve will be the same. The curve flattens below a 20cM threshold, because Shared Clustering uses the Clusters at the 20cM threshold as a base and adds the other, smaller cM, Matches to the Clusters formed at 20cM. The smaller Matches (below 20cM) often have Shared Matches (all of whom are 20cM or higher), but there are no additional Shared Matches below 20cM. Experiment with your Download – it only takes a few seconds to change the Threshold cMs and get a new set of Clusters. NB: The Cluster numbers are uniquely formed during each Cluster report. They do NOT follow the Matches to other Cluster reports – they shouldn’t, because the new Clusters (formed on Ancestors) are different at different generations.

Jonathan monitors the Shared Clustering Facebook page, and he’s always been very responsive. It’s good to visit that page and follow the conversations. And ask questions. And request improvement features.

https://www.facebook.com/groups/sharedclustering/

I will try to post soon on my Walk The Clusters Back project, using Clusters that should be focused on different generations in my Tree – very successful.

If I’ve messed up anything in this review of Shared Clustering, I hope Jonathan Brecher and/or other readers will provide feedback in the comments.

 

[19C] Segment-ology: Shared Clustering – A Great Tool! by Jim Bartlett (20191129)

Grouping Matches – Try It!

A Segment-ology TIDBIT

We can group Matches several ways:

  1. Each Triangulated Group (TG) includes Matches who share the same Common Ancestor (CA). This is based on your DNA segment from an Ancestor, which other Matches also share. 23andMe, MyHeritage and GEDmatch all have tools for Triangulation.
  2. Clustering includes Matches who share multiple Shared Matches with each other – they tend to be based on the same Ancestor. The Leeds Method focuses on 4 groups representing our 4 grandparents. This is based on the likelihood that groups of Shared Matches will have the same Ancestor. When the lowest threshold is used (6cM), all of the company Matches are included and the Clusters tend to approximate a one-to-one relationship with TGs. This is a good tool to group our Matches at AncestryDNA and FamilyTreeDNA. I blogged about some Clustering programs here.
  3. We can also form Clusters based on ethnicity, geography, Haplogroups, etc., but, in general, these will not be as precise as TGs and Shared Match Clustering. These Clusters are, however, often very helpful in homing in on a CA.

Groups can help us in several ways:

  1. Everyone in a group should have the same objective: finding the CA. There is synergy in a group; and working together often results in a better outcome. One person’s Brick Wall or bio-Ancestor (vs. an NPE) may be in the Trees of other Matches in the Group.
  2. Close Cousins and their CAs with you provide a beacon toward the more distant CA, and limit the possibilities that would otherwise need to be explored.
  3. Once several Matches in a group agree on a CA, that CA line can be imputed to the other Matches. Many times I have searched a Match’s Tree for a specific Ancestor (highlighted in the Cluster), and found it! I’ve also communicated with Matches with no/small Trees and asked specifically about a surname and gotten positive/helpful responses.
  4. Use Clustering to form groups at FTDNA, MyHeritage and 23andMe, and use them as a basis for TGs – Triangulation goes much more quickly when you only compare segments that will probably Triangulate.

We can form Triangulated Groups at 23andMe, MyHeritage, GEDmatch, and, with a Clustering pre-start, at FamilyTreeDNA – but those companies, generally, do not offer much in the way of genealogy tools, and only a few of the Matches have robust Trees. On the other hand, AncestryDNA has a lot of good Trees, and great tools like ThruLines, but no DNA segment data – however, we can do Clustering. DNA and No Trees; OR Trees and NO DNA – it’s frustrating… So how can we merge the TGs and AncestryDNA’s Clusters?? More on this later…

BOTTOM LINE: We need both Triangulated Segments and Triangulated Genealogy to be in sync (reinforcing each other) before we can have confidence in our conclusions. One without the other is incomplete research.

 

[AQ] Segment-ology: Grouping Matches – Try It!  TIDBIT by Jim Bartlett 20191128