AncestryDNA ThruLines Missing Out

A Segment-ology TIDBIT

ThruLines is based on genealogy – it finds Common Ancestors based on your Tree and the Trees of others. However, it only reports Common Ancestors with your DNA Matches. So, in a sense it has a DNA component. But the connections TL finds are not based on shared DNA cMs, Chromosome location, segment Triangulation, Clustering or Shared Matching – they are based only on connections found through Trees (only on genealogy). And, again, ThruLines only reports Common Ancestors with your DNA Matches.

This is a two-edged sword:

  1. If you only want to work with DNA Matches, it’s a good thing.
  2. However, if you are a genealogist looking for cousins who might share records, pictures, stories, analysis, new branches, etc., it leaves something out. Remember that roughly half of our 4th cousins (4C) don’t share DNA with us, and roughly 90% of our true 5C don’t share DNA with us, and the vast majority of our more distant true cousins don’t share any DNA with us. This means that, although a program like ThruLines could find those non-DNA-sharing cousins for us, it doesn’t. Think of all that we are missing – think of all the lost opportunities.

Well… looking back on the #1 cutting edge of the sword – I’ve got to be a happy camper. I’m finding more ThruLines Matches than I can keep up with. By adding children and grandchildren of my Ancestors in my Tree, ThruLines is finding more Matches with Common Ancestors. And these Matches and their Trees are reinforcing my Tree (and pointing out a few soft spots…)

Back to work… Stay safe!

 

[AQ] Segment-ology: AncestryDNA ThruLines Missing Out – TIDBIT by Jim Bartlett 20200326

 

In Defense of Small Segments

Do you remember genealogy before atDNA? Pre-2010?

There was a time when we didn’t know about atDNA segments. We researched records, and looked at other people’s Trees/records. We developed our Trees and found cousins, without any knowledge of whether we shared any DNA or not.

So what’s changed?

We got a great new tool called atDNA that told us who we “matched” based on one or more shared atDNA segments. Each company developed an algorithm and reported Matches based on at least 6-8cM of matching DNA. The concept was that a person who shared a DNA segment of at least the minimum “threshold” size was probably related. Early on we learned that a shared DNA segment of at least 15cM was “always” a true match – it was Identical By Descent (IBD); and those IBD shared segments came from a Common Ancestor (CA) to us and our Matches. We also learned that from the company threshold (6-8cM) up to 15cM, some of the shared segments were false – the lower the cM, the more likely that the shared segment was false. Generally, about half of the 7cM shared segments were true and half were false; 6cM shared segments were false most of the time, and 8-15cM shared segments were true most of the time – we just couldn’t tell which were true and which were false. Some of the companies had other ways to improve the probabilities, but many of the experts admonished us to generally avoid using the segments below 15cM. A huge debate grew up about the use of 6-15cM shared DNA segments.

To get some data on shared DNA segments, Blaine Bettinger developed the Shared cM Project which showed our collective experience in finding cousins with various amounts of shared cM. His chart is in this article. The Shared cM Project showed that many had found 3rd cousins (3C) to 6C with ranges of cMs down to the threshold amounts. And at the testing companies and GEDmatch, we were finding 3C to 8C with shared segments in the 6-15cM range. AncestryDNA reported Circles (with CAs) out to 8C. The genetic genealogy community was finding cousins with these small shared segments – we just didn’t know if the DNA segments were true or false.

We also heard about scientific studies that showed that most of the IBD (true) shared segments in the 5 to 20cM range were from ancestors greater than 10 generations back – at least 8xG grandparents (or 9C level). This is usually beyond a genealogy time frame for many of us. For instance, see the Speed and Balding chart in this article. But even this data showed that within the 5 to 20cM range there were some 3C to 8C.

However, we continue to be admonished to avoid, or discard, Matches in this 6-15cM range. Such small segments were branded as “suspicious”, “dangerous”, “poison”, “a fool’s errand”, etc.

I don’t deny that some of the 6-15cM shared segments are false, and that many of them are beyond a genealogical time frame. But on the other hand, some of them are true and within a genealogical time frame. I’m unwilling to discard all of them just because some of them are false or too distant. As I will show below, many of my Matches with these small segments are very useful.

What’s at stake?

So, before we adopt a hard rule one way or the other, let’s look at small segments from a different viewpoint. At AncestryDNA, I have 120,000 Matches. Their ThruLines (TL) program has identified over 2,000 Matches who share a specific CA with me. The shared DNA segments range from 208cM down to 6cM; and from 2C to 6C. In fact about 2/3 of these TL CAs are with Matches who share 6-15cM segments with me. Based on my 45 years of true genealogy research, I’ve determined that only about 5% of these TL Matches are incorrect (the Matches and I may still be cousins somehow, but not on the CA identified by TL). So… over 1,900 of these TL Match cousins and CAs are “keepers”. I don’t want to throw away 2/3 of these easily identified CAs.

This genealogy analysis had nothing to do with the size of shared DNA segments. I believe these 1,900 people (Matches) are my true cousins – even if we didn’t share any DNA! As a genealogist, I’m a happy camper.  I very much want to share records, stories, pictures, research, and other descendants, or maybe test a Y-DNA or mtDNA line, with each of these new-found cousins. Even if I could eventually determine that our shared DNA segment was false, this person is still a cousin.

Most of our true Cousins won’t be DNA Matches

Over half of our 4C wouldn’t show up as a DNA Match; only about 10% of our 5C would show up as a Match; and only a very small fraction of our deeper cousins will show up as Matches. So when someone does show up as a DNA Match (at any level), and there is a valid paper trail showing they are an 8C – why not accept that? At least accept the 8C part, if not the DNA link. Later, in Triangulated Groups or Clusters, we’ll see if that person “groups” with others on the same line. This would indicate to me that the genealogy was true.

Between 1974 (when I started researching genealogy in earnest) and 2010 (when atDNA testing became available), I found many cousins with no knowledge of any shared DNA. Some of them probably shared DNA with me, but most did not. But they were all my cousins.

I hope I’ve made two key points so far: 1. atDNA is just a tool we’ve used over the past 10 years – it’s not our master; and 2. atDNA does not find everything in genealogy – we have many cousins, and indeed many Ancestors, we will never find with shared atDNA.  Ponder these points for a moment….

So back to small shared segment (6-15cM) Matches – are they worth it? Well as discussed above, of course they are!

Are they useful in Genetic Genealogy – beyond just as cousins? I think the answer is that they often are… Let’s look at a few situations.

ThruLines and Clusters

Suppose, using ThruLines at AncestryDNA, you found 20 Matches in the 6-15cM range, who were all cousins (3C to 6C) on a line back to a 5xG grandparent couple. [NB: I have 64 5xG grandparent couples, and over 2,000 TL Matches – an average of 30 TL Matches (with a CA) per couple, so 20 TL Matches is a reasonable number]. At AncestryDNA we don’t have shared segment info for Triangulation, but we can do Clustering. Let’s Cluster on a 6cM threshold (all my 120,000 Matches, including the 1,900 good TL Matches). If the above 20 6-15cM Matches were sprinkled all over the Matrix (in different Clusters) – then nothing special. But if 11 Matches (of the 20) are in one Cluster, and 6 are in another Cluster, I’d sit up and take notice! There is nothing “random” about that. Clusters are formed on Common Ancestors, so we’d expect to see most of these 20 TL Matches in a Cluster, or two, or three. I have mostly Colonial Virginia ancestry, and some of my Matches have multiple CAs – so some of the 20 TL Matches may well wind up in a different Cluster. But, whenever your Matches form a strong* group (Cluster, Triangulated Group, DNA Painting, etc), they are very likely to have the same CA and share IBD segments. At least this is a good hypothesis.  At this point I am not claiming a “proof”, but I am claiming a lot of evidence that points in one direction. [*strong group does not mean 2 or 3 Matches in a Cluster; nor 10 Matches in a Cluster, but each one only matches 2 or 3 others. A strong group would be 10 Matches in a Cluster with each one matching about 8 of the others. Use judgment here.]
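The "strong group" test described above can be put into rough numbers. The following is only a sketch: the function names, the 0.8 cutoff and the sample data are illustrative assumptions, not part of any Clustering program. It measures, for each Match in a Cluster, what fraction of the other members that Match also matches, and then averages those fractions.

```python
from typing import Dict, Set


def cluster_strength(shared: Dict[str, Set[str]]) -> float:
    """Average fraction of the other Cluster members that each member matches.

    `shared` maps each Match in one Cluster to the set of Shared Matches
    reported for that Match (hypothetical input format).
    """
    members = set(shared)
    if len(members) < 2:
        return 0.0
    fractions = []
    for match in members:
        others = members - {match}
        fractions.append(len(shared[match] & others) / len(others))
    return sum(fractions) / len(fractions)


def is_strong_group(shared: Dict[str, Set[str]],
                    min_size: int = 10,
                    min_strength: float = 0.8) -> bool:
    """Assumed cutoffs: at least 10 Matches, each matching most of the others."""
    return len(shared) >= min_size and cluster_strength(shared) >= min_strength


names = [f"M{i}" for i in range(10)]

# Dense Cluster: each Match matches 8 of the 9 others (misses one partner).
dense = {
    n: {o for o in names if o != n and o != names[i ^ 1]}
    for i, n in enumerate(names)
}

# Sparse Cluster: 10 Matches, but each one matches only 2 others.
sparse = {
    n: {names[(i - 1) % 10], names[(i + 1) % 10]}
    for i, n in enumerate(names)
}

print(round(cluster_strength(dense), 3), is_strong_group(dense))
print(round(cluster_strength(sparse), 3), is_strong_group(sparse))
```

The dense example is the "10 Matches, each matching about 8 of the others" case; the sparse one is the "10 Matches, but each one only matches 2 or 3 others" case. Judgment is still needed for anything in between.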

Finding more CAs in Clusters

In the big picture, all of our Matches can be divided into two groups – those with true shared DNA segments, and those with false DNA segments. I believe most of my 120,000 Matches at AncestryDNA have true shared DNA segments with me (although as outlined above, I don’t really care if some are not DNA cousins: since AncestryDNA doesn’t show shared DNA segment info, I cannot Triangulate them anyway). Therefore, if Clustering groups these Matches, I have every reason to believe they are valid when they point to the same ancestral line. And if some of those Clustered Matches (with a CA on the same ancestral line) have small (6-15cM) shared DNA segments with me – so what? It’s close enough for a second look. Recently I’ve been looking through my Clusters for Matches who have over 1,000 people in a Public Tree. Most of my Clusters have a hypothetical Ancestor in them, so I look for surnames in that line.  Sometimes, I find a clue and I’m able to build the Match’s Tree out to connect with my line. This adds even further evidence that this Cluster is based on that line.

Genealogy vs Genetic Genealogy

Another aspect of this whole discussion is genealogy vs. genetic genealogy. If you are just interested in genealogy, it doesn’t matter what the size of the shared DNA segment is. In fact, while looking at Hints, I run across a lot of helpful Trees, where the owners are not DNA Matches at all. Only in certain circumstances (Chromosome Mapping; bio parents/Ancestry; “proof” where genealogy records are insufficient; etc.), do you need to ensure a shared DNA segment is true (IBD) and cannot be from a different Ancestor. So, unless you need to “prove” a genetic link, don’t worry about the size of the shared DNA segment. There is a lot to learn from many of your DNA Matches, even those with small segments, and even from other people (with no DNA match) at Ancestry.

Breaking through a Brick Wall

Even breaking through a brick wall is primarily a genealogy exercise. To be clear, this process is often aided by starting with a group of DNA Matches (Painted, Clustered and/or Triangulated), and looking for Matches with Trees that have Common Ancestors among themselves – beyond where your brick wall is. In these cases you are using DNA Matches who are probably related to you and who group with other Matches. You use this cadre of Matches to find a CA among them. This is basically a genealogy exercise – and, again, it doesn’t make a lot of difference how much shared DNA you have. In fact, to find a CA beyond your brick wall, you are probably looking for a distant CA – often found with smaller DNA segments. So don’t discard those Matches with small segments who have a Common Ancestor with you – use them.

Use caution with isolated small segments

My discussion above about using small segments is in the context of clues and grouping (Painting, Clustering, Triangulating, etc). IMO, it is reckless, and wrong, to find a 9C Match, sharing 10cM, and declare that “proves” the Ancestral line by itself. Such a “find” is one clue (and by itself, a very shaky one), and much more corroborating evidence is needed even to form a hypothesis. The “rule-of-thumb” I’ve been using is to have at least G independent Matches (at least cousins to each other) who all agree on the same Common Ancestor – where G is the number of Gs of the Common Ancestor. At the 7xG grandparent level (9 generations back – 8th cousin level) this means 8 Matches in agreement. It’s relatively easy to get that many Matches in a Cluster or Triangulated Group – it’s much harder to find Common Ancestors with each of them. So be sure to include those CAs from Matches who share small DNA segments with you!
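As a small sketch of that rule-of-thumb: the code below assumes "the number of Gs" counts the Greats plus the Grand in "grandparent", which is the reading that reproduces the one worked case given above (7xG grandparent, 8 Matches). The function names are made up for illustration.

```python
def matches_needed(n_greats: int) -> int:
    """Independent Matches wanted for an N x Great grandparent.

    Assumption: G = number of Greats + 1 (for the Grand), so a
    7xG grandparent needs 8 Matches in agreement.
    """
    return n_greats + 1


def cousin_level(n_greats: int) -> int:
    """Cousinship sharing that Ancestor: 7xG grandparent -> 8th cousin."""
    return n_greats + 1


def generations_back(n_greats: int) -> int:
    """Generations back to that Ancestor: 7xG grandparent -> 9 back."""
    return n_greats + 2


def enough_agreement(n_greats: int, agreeing_independent_matches: int) -> bool:
    """True when the rule-of-thumb is met for this Common Ancestor."""
    return agreeing_independent_matches >= matches_needed(n_greats)


print(matches_needed(7), cousin_level(7), generations_back(7))
print(enough_agreement(7, 5), enough_agreement(7, 8))
```

The count only makes sense if the Matches really are independent of each other; a parent and child who both tested should count once.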

Bottom Line

Use as many of your DNA Matches as you can, to learn more about your own genealogy. IMO, Matches with small shared DNA segments often provide the clues and evidence you are looking for. But use extreme caution with small shared DNA segments in isolation – they are much more credible when they are part of a group. Small segments in context and groups can be very helpful.

 

[06C] Segment-ology: In Defense of Small Segments by Jim Bartlett 20200131

20200202: Edited 10th paragraph to change “DNA segments” to “genealogy”

How Many TGs From Distant Ancestors?

I was recently asked if I’d thought about this question. The quick answer is YES – the answer to this question is at the core of my belief that genetic genealogy is valid out to 9 generations back. And I think this question is really two questions: one about the Triangulated Groups (TGs) themselves; and one about the Matches with shared DNA segments within each TG.

How far back do our TGs go?

Using a 7cM threshold for shared DNA segments, I’ve documented 372 TGs, covering over 98% of my DNA. These TGs have natural breaks [recombination crossover points] between them. These TGs represent actual DNA segments, on my chromosomes, which are from my Ancestors down to a parent to me.  So how far back do they probably go?

The number of segments we have at each generation of our ancestors is fairly easy to estimate. Using a female to make it easier, she gets 46 segments from her two parents – in the form of 46 chromosomes. Pretty big segments…  Using the average recombination rate of 34 crossovers per genome (per parent), she would get 68 additional segments one generation back. In other words she would have a total of 46+68=114 segments from her grandparents. And she would get 114+68=182 segments from her Great grandparents.  Here is a handy table I made up for my reference:

This table starts with me at the bottom and shows the generations back, the number of Ancestors at each generation back, the generic name of those Ancestors, the relationship of my cousins who share a Common Ancestor with me at that level, the calculated percentage and cM amount of DNA I got from each of those Ancestors (at any given number of generations back), the calculated average number of segments in my DNA from all the Ancestors in any given generation, the average cMs per TG; and in the last two columns the average and range of cMs collected in Blaine’s cM study. The first column is just for a very rough estimate of the birth year of my Ancestors at any given generation (it helps me).
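The segment arithmetic described just before the table can be sketched in a few lines. This is only an illustration of the stated averages (46 chromosomes, 34 crossovers per parent); the constant and function names are made up here, and the sketch does not try to reproduce the table itself.

```python
CHROMOSOMES = 46            # segments received directly from the two parents
CROSSOVERS_PER_PARENT = 34  # average recombination rate used in the text
CROSSOVERS_PER_ROUND = 2 * CROSSOVERS_PER_PARENT  # 68 new segments per round


def segments_after(rounds: int) -> int:
    """Average number of segments after `rounds` rounds of recombination.

    Round 0 is the 46 whole chromosomes from the parents; each further
    round of crossovers adds 68 segments on average.
    """
    if rounds < 0:
        raise ValueError("rounds must be zero or more")
    return CHROMOSOMES + CROSSOVERS_PER_ROUND * rounds


labels = ["parents", "grandparents", "Great grandparents"]
for rounds, label in enumerate(labels):
    print(f"{label}: {segments_after(rounds)} segments")
```

These are averages only; real crossover counts vary from person to person and from one generation to the next.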

Highlighted in yellow is the 386 segments expected (roughly) from my 3xG grandparents. That’s roughly the same as my 372 TGs. So I expect some kind of distribution curve around that point. Matches who share the full DNA segment represented by a TG would probably be 4th cousins (4C). Due to the random nature of DNA, I expect a range from 2C to 7C or 8C. My TGs range in size from a few just over 7cM to some around 50cM – it all depends on several variables.

Another aspect of this discussion has to do with what I call “sticky” segments. Per the Table above at 5 generations back we would see 386 segments – or 386 TGs – of about 18cM each. But going back one more generation – one more round of 68 crossover points would result in 454 segments. This means that, at most, 68 of the 386 segments were subdivided, and at least 318 segments were not! This means that at least 318 segments (TGs) were passed down intact (no recombination). The effect of this is that many TGs will persist, at the same size, for several generations. We could well see the same size TG from a 6xG grandparent to a 5xG to a 4xG to a 3xG grandparent. So it would be possible for a 7C, 6C, 5C and 4C to all share the full size DNA segment represented by the TG. Clearly the probabilities of that decrease as the cousinship increases.

Bottom line from my experience: I think we’ll find most of our TGs to be within a genealogical time frame of, say, 9 or 10 generations. And there is always the opportunity for closer cousins to share a DNA segment within any of our TGs.

How far back do the Matches go?

This is a different, but related, question. The above discussion was all about the full DNA segment represented by a TG. Most of our Matches in a TG will not share the full DNA segment. They overlap us or are wholly included within the TG segment. For example, the Matches in a 20cM TG can range from sharing 7cM up to 20cM. And, in fact, some of our closer cousins may share 35cM and span across more than one TG. It’s very random. However, to the point of the question – many of our Matches who share, say, 7 to 15cM may well be cousins beyond the Ancestors who passed down the full TG. To be sure, the Common Ancestors in this case would be ancestral to the TG Ancestor, but it could be 10, 20, or more generations back.

Bottom line: Matches in a TG are limited to a narrow range of your Ancestors, but they are not limited by how close or how distant they could be. And Matches who share small segments may well be beyond a genealogical timeframe; but some will be within a genealogical timeframe. Witness the Ancestry ThruLines Common Ancestors down to 6cM.

Summary: I think most TGs will be within a genealogical timeframe (using a 7cM threshold for shared DNA segments). The Matches in a TG will range from close Matches, out to Matches on the fringes of our genealogy and on out to Matches who will be beyond our genealogy.

 

[19H] Segment-ology: How Many TGs From Distant Ancestors? By Jim Bartlett 20191217

A Unified Theory of Genetic Genealogy

Bottom Line Up Front (BLUF):

Triangulated Groups = Clusters = Common Ancestors

Brief overview: Each of us has a specific genealogy Tree of Ancestors; and a fixed arrangement of our DNA segments from those Ancestors. I believe our DNA segments are reflected in our Triangulated Groups (TGs) of shared DNA segments, which are from specific Common Ancestors (CAs), and that each CA is represented by a specific Cluster of Shared Matches. I believe there is alignment between the TG CA and the Cluster CA, which can be very helpful. Put another way, each of our Ancestors will have a specific TG/Cluster combination, and at some point in our Tree there will be one TG and a corresponding Cluster for each Ancestor.

TRIANGULATED GROUPS

In these blog posts I’ve often stated that each segment Triangulated Group (TG) is from a specific Common Ancestor (CA) – in other words the DNA segment identified by a TG came from a specific Ancestor down the line of descent of your Ancestors to you. The Matches in a TG will be relatives (usually cousins) along one of your ancestral lines.  For example, if a TG is from a 6xG grandparent (7th cousin (7C) level), some of the Matches may be cousins from 1C to 7C; and some may be from Ancestors beyond the 6xG grandparent – perhaps (usually with shared segments below 15cM) somewhat beyond the 6xG grandparent.

Because of the random nature of DNA, and the wide range of cMs for cousins beyond 3C, there is no set of parameters (short of complete chromosome mapping) that will get you only TGs at one generation. For instance, I know of no cM parameter that will get you only TGs at, say, the 6C level – or any other level. So we usually wind up with a mix of TGs at different cousinship levels.

TG Outliers

Like with most things DNA, there may be some outliers, and not every Match in a TG will be found to share an IBD segment (in other words, some Matches with small shared DNA segments – under 15cM – may be false Matches). But the important take-away is that the TG will represent a CA, even if a few Matches are false.

TG Bottom Line

Your DNA has fixed crossover points. Depending on the cM threshold you use for comparing shared DNA segments, your data will have natural break points between TGs. I used a 7cM threshold and got 372 TGs covering over 98% of my 45 Chromosomes. It was hard work doing all the comparisons and culling out the false shared segments. I have Matches who are 2C to 9C for about 80% of these TGs.

SHARED MATCH CLUSTERS

Recently, I’ve been blogging about Clustering. Clusters appear to come from a specific CA down the line of descent of your Ancestors to you (just like the description of a TG).

When I did a Clustering run of all my 5732 Matches at Family Finder, I got 352 Clusters which had a very high correlation to my 372 TGs.

Well… duh! When we consider that each of us has fixed segments in our DNA and fixed ancestors in our Tree, we understand that each of us, in our own unique ways, has a specific “solution” (Ancestors linked to DNA segments). So if we look at grouping by Clusters, it should reflect that “solution”. And when we form segment TGs, they should reflect that “solution”. And in combination, the Clusters and TGs should reflect the same “solution”.  In other words the Clusters and TGs should align.

In my opinion, Clustering with Shared Matches is a sophisticated way of grouping Matches based on the probability that a number of Shared Matches who mostly match each other, will be from the same CA.

Clustering Outliers

Like with most things DNA, there may be some outliers, and not every Match in a Cluster will be found to share the same CA. But the important take-away is that most do share the same CA, and the Cluster will represent that CA, even if a few Matches don’t.

Cluster Bottom Line

Your Shared Matches will tend to Cluster on CAs. Depending on the cM threshold you use for comparing shared DNA segments, your data will divide into different numbers of Clusters. See my experience here; and the process here. I used a 6cM threshold and got 350 to 382 Clusters, covering at least all of my 4xG grandparents and some out to 8xG grandparents. It was relatively easy to run the Cluster programs to get the Match/SharedMatch data, and relatively little work to determine a consensus of a CA for each Cluster, for each run at different cM levels (smaller thresholds result in more Matches and Clusters, and more work). I can see CAs out to 8C for some Clusters. [NB: Clustering does not find the CAs – this is homework you have to do before Clustering: find as many CAs as possible and put that information in the Notes, so it’s available for analysis at each Cluster run].

ACTION – USE CLUSTERS to form TRIANGULATED GROUPS

I’ve spent a lot of work over the past 8 years determining my 372 TGs (your number of TGs may vary, but I believe using a 7cM threshold for Shared Segments, it will come out at this order of magnitude). Triangulation, even with the tools at 23andMe, MyHeritage and GEDmatch takes time and work. In contrast, Clustering is relatively simple – pretty close to a “click” process. If Clusters are the same as TGs, we should be able to run a Cluster report on all of our Matches (at a company which also provides segment data), and then easily sort on the DNA segment data (sort by Chr and Start), and then relatively easily scroll down the several thousand Matches and group them into TGs. Yes, this scrolling will take some work, but it’s a whole lot easier than comparing each shared DNA segment pair in a browser. I believe the combination of Cluster numbers and segment data will easily define the TGs – maybe just a little “quality control” at the end, depending on how the data looks.
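The sort-and-scroll step proposed above could be sketched roughly as follows. Everything here is an assumption for illustration: the record layout, the field names and the sample rows are hypothetical, and the result is only a list of candidate TGs, with the Cluster number standing in for the check that the Matches also match each other.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SegmentRow:
    match: str    # Match name
    cluster: int  # Cluster number from the Cluster report
    chrom: int    # chromosome number
    start: int    # segment start position
    end: int      # segment end position


def candidate_tgs(rows: List[SegmentRow]) -> List[List[SegmentRow]]:
    """Sort by Chr and Start, then group overlapping rows in the same Cluster."""
    groups: List[List[SegmentRow]] = []
    open_group: Dict[Tuple[int, int], int] = {}  # (chrom, cluster) -> group index
    group_end: Dict[int, int] = {}               # group index -> furthest end so far

    for row in sorted(rows, key=lambda r: (r.chrom, r.start)):
        key = (row.chrom, row.cluster)
        idx = open_group.get(key)
        if idx is not None and row.start < group_end[idx]:
            # Overlaps the current group for this Cluster: same candidate TG.
            groups[idx].append(row)
            group_end[idx] = max(group_end[idx], row.end)
        else:
            # No overlap (or first row for this Cluster): start a new candidate TG.
            groups.append([row])
            idx = len(groups) - 1
            open_group[key] = idx
            group_end[idx] = row.end
    return groups


rows = [
    SegmentRow("A", 1, 1, 10, 30),
    SegmentRow("B", 1, 1, 12, 28),
    SegmentRow("C", 2, 1, 15, 40),
    SegmentRow("D", 2, 1, 20, 35),
    SegmentRow("E", 1, 1, 50, 70),
    SegmentRow("F", 1, 2, 10, 30),
]

for group in candidate_tgs(rows):
    print([r.match for r in group])
```

In the sample, Matches A/B and C/D overlap at the same place on Chr 1 but sit in different Clusters, so they land in separate candidate TGs, which is the situation we would expect when one group is on the maternal side and the other on the paternal side. The “quality control” pass is still needed afterwards.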

I have my brother’s DNA at FTDNA and 23andMe – I’m going to try this process on his results, and will report back.

The Bottom Line

Once you determine your TGs and the CAs that go with them, you have a Chromosome Map!

My Bottom Line

I’m trying to demonstrate:

  1. TG=CL=CA
  2. The CA will be in the 7C-9C range*

 

*I recognize that my belief that our DNA tests can accurately determine our CAs out to 8C, or so, is not held by most genetic genealogists. But based on my experience, particularly using Walking The Clusters Back, I believe this is a realistic range – easily and accurately obtained – and confirmed by both TGs and Clustering.

With our fixed Ancestry and DNA crossover points, each process should give us the same “solution” – whether we use DNA Painter, Kitty Cooper’s Chromosome Mapping, GenomeMatePro, Visual Phasing, Double Match Triangulator, etc., etc. We are just using different tools to “see” the chromosome map.

 

[19G] Segment-ology: A Unified Theory of Genetic Genealogy by Jim Bartlett 20191216

Walking The Clusters Back III

Progress Report – Observations…

Main benefits, so far:

  1. Impute Cluster Common Ancestor (CA) to other Matches in the Cluster – this lets us focus on individual Matches – look at their Tree with a CA in mind, and/or communicate with the Matches and ask about a specific Surname or Ancestral line.
  2. Compare Cluster CA to ThruLines CA – if the same, we have reinforcing evidence; if different, the ThruLines CA may be wrong, or it may be correct genealogy, but the Match has another CA linked to the DNA (and the Cluster).
  3. Link some Clusters (and the CA) to a Triangulated Group (TG) – this will strengthen the evidence of the Ancestral line of a TG. Often the Cluster CA is more distant than the CA found in TGs at 23andMe, FTDNA, MyHeritage or GEDmatch.
  4. As the threshold decreases, there are more Matches included in the Clustering process, and those Matches tend to have more distant CAs with us. Clusters will start with only 2C and 3C; and grow to include 4C and 5C, etc. We can see the Walking The Cluster Back happening within each Cluster. Eventually each Cluster will begin to show several generations of CAs – they should all be on the same Ancestral line [if not, check with the correlated Clusters]
  5. Clustering reduces the range of possibilities. If a Cluster has a CA of A18 [Ahnentafel number for a specific 2xGreat grandparent = father’s father’s mother’s father], there are only two possibilities for the next generation: A36 and A37 (although a Match may share a CA another generation back: A72, A73, A74, or A75). If a new Match in the next (lower threshold) Cluster run has CA = A74 – this is reinforcing evidence. If the new Match has CA = A88 – something is amiss [check for another CA, check for a correlated Cluster which is A44, or A176, etc.]
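The Ahnentafel check in item 5 is simple arithmetic: the parents of Ancestor N are 2N (father) and 2N+1 (mother), so halving a number repeatedly walks back down toward you. A minimal sketch, with illustrative function names:

```python
from typing import Tuple


def parents(ahnentafel: int) -> Tuple[int, int]:
    """Father and mother of the Ancestor with this Ahnentafel number."""
    return 2 * ahnentafel, 2 * ahnentafel + 1


def is_on_line(candidate: int, cluster_ca: int) -> bool:
    """True if `candidate` is the Cluster CA or one of the CA's own Ancestors."""
    while candidate > cluster_ca:
        candidate //= 2  # step down one generation toward the base person
    return candidate == cluster_ca


def describe_line(ahnentafel: int) -> str:
    """Spell out the path, e.g. 18 -> father's father's mother's father."""
    bits = bin(ahnentafel)[3:]  # drop '0b' and the leading 1 (the base person)
    steps = ["father" if bit == "0" else "mother" for bit in bits]
    return "'s ".join(steps)


print(parents(18))         # next generation for a Cluster with CA = A18
print(describe_line(18))   # the line back to A18
print(is_on_line(74, 18))  # A74: reinforcing evidence
print(is_on_line(88, 18))  # A88: something is amiss
print(is_on_line(88, 44), is_on_line(176, 44))  # A88 and A176 both sit on the A44 line
```

This only says whether two numbers are on the same line of descent; it says nothing about whether the genealogy behind either number is right.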

Main issues, so far:

  1. It’s been hard to specifically find Clusters which split into two Clusters a generation further out. Many Clusters have included CAs which span several generations on the same line. I’m inclined to “go with the flow” [accept the Clusters with CAs on the same line]; and not try to force Clusters into a “genealogy Tree” structure. The data is just too variable. Maybe when I get down to a 6cM threshold, it may play out that way – but, I have a feeling the vagaries of random DNA will make that a wild goose chase.
  2. Several BIG variables combine to give us trends, rather than a uniform picture/pattern:
    1. Our Ancestry (size of families, documentation, probable NPEs at some level, etc.)
    2. Our random DNA from different Ancestors
    3. Which of our cousins have DNA tested.
  3. Homework is needed – Clusters can be formed, but some genealogy is needed to identify the CAs. I recommend building a Tree of Ancestors out 7 generations, wherever possible – with that, AncestryDNA will find ThruLines CAs for you. Enter those CAs (or their Ahnentafel) into the Match’s Notes, so that information will be available in the different Clustering runs.

My status so far is summarized in this Table of different Cluster runs:

The first column shows the decreasing thresholds I used (basically every 5cM) – the top line is the original download: 6cM threshold, 119,068 Matches (and all their Shared Matches) which took 9 hours and is in a .txt file.

The # Matches and # Clusters are for the various cluster runs – which take negligible time to produce an Excel Cluster report.

The 3C, 4C, 5C, etc columns show how many Clusters I got with CAs at those levels. (There were some 2C, but they were in Clusters that also had 3C – I counted each Cluster with the most distant cousinship which had a consensus.)

The larger threshold Clusters had multiple TGs in them. Beginning about at the 35cM threshold, some of the Clusters started showing a single, or consensus, TG – so I counted them.

Starting after the 45cM threshold, the number of included Matches about doubled with each decrease of 5cM in the threshold, and the number of Clusters began increasing dramatically. This means the amount of work for scrolling down the entire report, analyzing the data in each Cluster and determining the consensus, also increased a lot. Sometimes the CA and/or TG of a Cluster is very clear; sometimes a Match’s correlated Clusters must be reviewed and the Match assigned to another Cluster. And all new Matches need to be “Tagged”; and often other Matches need to have their “Tag” adjusted [what I alluded to in the Iterative WTCM Process], as new Matches and their new information are added to the Clusters.

The good news is in the TG column – where about 1/4 of the Clusters (and CAs) can be linked to TGs [I have TGs for over 300 Matches at AncestryDNA].

More good news: The Shared Clustering program, below 20cM, will first Cluster on the 4,515 Matches, basically retaining the 382 Clusters constant, and then go back and add in the new Matches. Therefore, all the Matches below 20cM (including about 300 with TGs, and over 1,500 with CAs) will be added to the existing Clusters, and very probably push the Cluster CA out even farther and add a TG to many of them.

Note the trend in the 35, 30 and 25cM Cluster runs to more distant cousinships. As I find the time to analyze the 20cM Cluster run, and then runs at 15cM and 10cM, I expect this trend to continue, giving me many more CAs in the 6C, 7C and 8C range. Of course these are all clues, but I believe they are very strong clues. Time will tell as I investigate each Cluster/CA/TG more deeply.

 

[19F] Segment-ology: Walking The Clusters Back III by Jim Bartlett 20191214

Walking The Clusters Back II

I’ve found it difficult to hit the 8-16-32-64 Cluster targets. Even when I come close, they don’t wind up all in one generation. The DNA is just too random. Just look at Blaine Bettinger’s cM charts to see that there are wide cM ranges for most of the cousinships. Therefore, there is not a magic threshold for any given generation.

A better plan for WTCB is to lower the cM threshold a little at a time, say 5cM, and examine the new array of Clusters. Where did the Matches from the previous Clusters go? Look for cases where the Matches from one Cluster are now split between two Clusters – almost certainly these two Clusters will represent the parents of the Cluster that held all of the Matches before. As you lower the threshold incrementally, expect the Clusters formed on close Ancestors to disappear, and new Clusters to form on more distant Ancestors. Trace the Matches from Clusters that disappear to their new Clusters for very strong clues of the ancestral line. Once you are confident of the Common Ancestor (CA) of a Cluster, there are only two options for the next generation – the two parents of the Common Ancestor previously determined. And, with the lowering of the Clustering cM threshold, new Matches will be added to the mix. These new Matches have (incrementally) smaller shared segments with you, and, in general, will tend to be more distant cousins. To be sure, each new batch of Matches (as you lower the threshold each time) will probably include a range of cousinships with you. Each Cluster CA is a hypothesis, and as new evidence (new Matches) is added to the mix, everything needs to be reviewed for consistency, and discrepancies resolved. Sometimes a discrepancy is resolved by moving the Match to a correlated Cluster.
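The "where did the Matches go?" question can be answered mechanically once two Cluster runs are in hand. The sketch below assumes each run has been reduced to a simple Match-to-Cluster-number lookup; that format, the function names and the sample data are hypothetical, not the output of any particular Clustering program.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def trace_clusters(prev: Dict[str, int], new: Dict[str, int]) -> Dict[int, Counter]:
    """For each previous Cluster, count how many of its Matches land in each new Cluster."""
    moved: Dict[int, Counter] = defaultdict(Counter)
    for match, old_cluster in prev.items():
        if match in new:
            moved[old_cluster][new[match]] += 1
    return dict(moved)


def split_clusters(prev: Dict[str, int], new: Dict[str, int],
                   min_members: int = 2) -> Dict[int, List[int]]:
    """Previous Clusters whose Matches now sit in two or more new Clusters.

    `min_members` (an assumed cutoff) ignores a new Cluster that picked up
    only a single stray Match.
    """
    result = {}
    for old_cluster, counts in trace_clusters(prev, new).items():
        targets = sorted(c for c, n in counts.items() if n >= min_members)
        if len(targets) >= 2:
            result[old_cluster] = targets
    return result


# Run at the higher threshold, then at a threshold 5cM lower (made-up data).
prev_run = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 2}
new_run = {"A": 10, "B": 10, "C": 11, "D": 11, "E": 12, "F": 12, "G": 11}

print(trace_clusters(prev_run, new_run))
print(split_clusters(prev_run, new_run))
```

Here old Cluster 1 splits into new Clusters 10 and 11, which makes those two the candidates for the parents of Cluster 1's CA, while old Cluster 2 simply carries over. Match G is new in the lower-threshold run and is ignored by the trace.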

Those who try this process are encouraged to provide feedback in the Comments to this blog.

 

[19E] Segment-ology: Walking The Clusters Back II by Jim Bartlett 20191205

Walking The Clusters Back

A Segment-ology Concept

Overview

Walking The Clusters Back (WTCB) can be a fairly complex process, so let’s start with an overview of the concept.

Pick a Clustering threshold high enough to give us 4 Clusters – one for each grandparent. Tag each Match in these Clusters to the appropriate Ancestor (grandparent). Then adjust the threshold to get (roughly) 8 Clusters (one for each Great grandparent). These Clusters would include all the Tagged Matches who would indicate which grandparent line each new Cluster was in; as well as new, generally more distant Matches, who would then separate these Clusters into Great grandparents. Tag, or re-Tag, each Match in these Clusters to the appropriate Ancestor (Great Grandparent). Then lower the threshold to get 16 Clusters and repeat the process.

Up Front DISCLAIMER: This is not a simple click-and-done process. There is homework to be done before Clustering: documenting the Common Ancestors in our Matches’ Notes. Although the time it takes to run various Cluster reports is only a few seconds for each one, it takes some time to analyze the Matches in each Cluster and come to a consensus, then transfer those clues to the next Cluster, and then analyze those Clusters. It turns out this WTCB process is iterative – two steps forward, then one back. Each new set of Clusters brings in new Matches with clues that need to be reconciled. WTCB is somewhat easier than Triangulating all your Matches, but it is still time-consuming work. Nevertheless, WTCB is a great opportunity to get the most out of your Matches at AncestryDNA.

Background

Clustering is a way of grouping your Matches. Each Cluster tends to group Matches who descend from the same Ancestor. The Leeds Method groups close-cousin Matches into four Clusters which are usually our four grandparents. On the one hand this depends on knowing the Common Ancestor with some of the Matches; and on the other hand it provides a strong clue about the Ancestor of other Matches in a known Cluster. And if some Clusters are known, the others may be determined by logic. Clustering provides a powerful grouping tool. I posted about Grouping Matches here; and about several Clustering programs here.

The Leeds Method uses a high threshold (90-400cM) for the Matches to be included in the analysis, so that only 2nd or 3rd cousins are used. Each cousin in this range would usually be from only one of our four grandparents. What would happen if we lowered the threshold just enough to only have 3rd or 4th cousins most of the time? Generally they would tend to form eight Clusters – one for each of our eight great grandparents. However, as with most things “DNA”, as we decrease the cM threshold, we get a wider range of relationships – it’s not very probable that we could find a cM threshold that would produce exactly eight Clusters, or even succeeding in that, that there would be a 1-to-1 relationship to our eight Great grandparents.

And if we decided to jump to the ultimate and Cluster on 6cM, I can tell you that we’d get hundreds of Clusters. Some of them we might be able to identify, but most will look like “mush”. And, like finding a DNA Match who is a 9th cousin, we wouldn’t really have much in the way of corroborating evidence.

However, we often do have a lot of data to work with, and a good tool like Clustering that makes it fairly simple to group our Matches…

It struck me that maybe we could Walk The Clusters Back (WTCB). See the concept in the Overview above. In theory the 8 Clusters would be husband/wife pairs – the parents of the grandparents in the original 4 Clusters. The 4-Cluster Matches would carry a “tell-tale” Tag of whence they came to the 8 new Clusters. Then, hopefully, with clues from the new Matches in the 8 Clusters, we could determine which of the great grandparents each of the 8 Clusters represented. We are down to only two options for each Cluster, and if we can figure out one, the other would be determined by logic: i.e. the other parent.

In general this worked! But it didn’t always work per the theory (the DNA is random), and the process was arduous.

Problems

– Even by adjusting the Cluster threshold 1cM at a time, the Clusters rarely came out to 8, or 16, or 32, or 64, or 128. And even when it came close, all of the new Clusters were not necessarily just the parents of the previous Clusters. Sometimes one Cluster would split into 3 or 4 Clusters (a parent and 2 grandparents, or 4 grandparents). And from one iteration to the next, some Clusters didn’t split at all. As the threshold dropped and the number of Clusters increased, the deviation from the theory increased. Not wholesale, not all of them – but enough to notice.

– In my case (one grandparent who had very few Matches, and thus very few Clusters), I knew I would get only 2 Clusters max for that one grandparent. If you have a special case, you need to make some adjustments (I used 4-8-13-26-50-98 as my target Cluster numbers and took whatever was closest).

– Sometimes in a Cluster, I didn’t get any/many Matches who had a CA. Each new Cluster depends on Matches with CAs to inform us about the probable CA for the Cluster. I could usually make up for this in the next iteration, but that meant I had to backtrack – which I did more and more as the number of Clusters increased.

– I had to find a way to transfer the Ancestor tell-tale Tags from Matches in one set of Clusters to the Matches in the next Cluster run. More on this later.

Developing the process

In step 1 of the above figure, all the Matches above a 90cM threshold are Clustered into 4 groups – one for each grandparent. In step 2, I determined the 4 grandparent Ancestors based on the known genealogy of the close Matches in each Cluster. In step 3 I adjusted the threshold to create about 8 Clusters, and noted which of the Matches from steps 1 and 2 are now in the new Clusters. Generally the Matches from each of the 4 Ancestors (grandparents) in step 2 will be found in only 2 Clusters at the step 3 level. There are only two options for these two Great grandparents, the husband and the wife of the Ancestor in step 2. At this point, we often don’t know which is the husband and which is the wife, but we can be confident it’s one of each. However, during step 3 we also get additional Matches (not shown) in the Great grandparent Clusters. Some of these new Matches may be 4th cousins who would provide insights on the identity of the Great grandparents. The other Great grandparents will become known after step 4 as shown below.

In step 4 above, we lower the threshold again and get roughly 16 Clusters. I’m just using a single green arrow to show all the new Matches who were 4th, 5th and 6th cousins who formed the 16 Clusters (along with the closer red arrow Matches from steps 2 and 3). At this point, all we’d need is the CA of some of the more distant Matches, in order to determine the correct Ancestors at level 3. Note that two of the green arrows don’t point to one of the 8 level 3 Ancestors – don’t worry about it. These “errant” Matches will often fall into Clusters in the next iteration. Or maybe they will continue to be strange. Again, don’t worry about it – focus on the Ancestors you can determine. Some of the data will get a little messy as the threshold is dropped. Focus on the positive outcomes – the identity of the Cluster Ancestors at each generation.

Step 4 above shows that this is an iterative process: we sometimes need the CA couple information from one generation to resolve the individual CA of a closer generation. Note that this changes the Tags, which needs to be reconciled in a prior generation Cluster.

The Iterative WTCB Process

This led me to modify the theoretical process at the beginning of this post to the following iterative, or zig-zag, process.

  1. Set the threshold to get 4 Clusters; Assign each Cluster to a grandparent Ancestor; Tag each Match in the 4-Cluster run with the appropriate Ancestor.
  2. Reduce threshold to get 8 Clusters [with Tagged 4-Cluster Matches plus additional new Matches (some with CAs in Notes)]; as best you can, assign Clusters to Great grandparents; Tag all Matches with what you know so far.
  3. Re-run 4-Clusters to ensure the new Tags are OK; adjust as necessary.
  4. Re-run 8-Clusters, and adjust as necessary.
  5. Reduce threshold to get 16 Clusters [with previously tagged Matches plus additional new Matches (some with CAs in Notes)]; as best you can assign Clusters to 2xG grandparents; Tag all Matches.
  6. Re-run 8-Clusters; usually some new Tags will clarify any Great grandparents that were unclear in Step 2. Re-Tag as appropriate.
  7. Re-run 16-Clusters and adjust as necessary.
  8. Repeat steps 5, 6 and 7 (adjusted for the next round): Reduce threshold/assign/Tag; revisit the prior Clusters/Re-Tag as appropriate; re-run the current Clusters/adjust as necessary.
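
The order of Cluster runs implied by the eight steps above can be written out as a short Python sketch. The function name and the list-of-targets input are assumptions made for illustration; the analysis and Tagging at each run still has to be done by hand.

```python
def zigzag_schedule(targets):
    """Return the order of Cluster runs for the 2-steps-forward, 1-step-back process.

    targets is the list of target Cluster counts, one per generation,
    e.g. [4, 8, 16, 32].
    """
    if len(targets) < 2:
        return list(targets)
    # Steps 1-4: run 4, run 8, re-run 4, re-run 8
    runs = [targets[0], targets[1], targets[0], targets[1]]
    # Steps 5-7, repeated per step 8: new run, re-run the prior, re-run the new
    for prior, new in zip(targets[1:], targets[2:]):
        runs += [new, prior, new]
    return runs

print(zigzag_schedule([4, 8, 16, 32]))
# [4, 8, 4, 8, 16, 8, 16, 32, 16, 32]
```

The same function accepts adjusted targets such as 4-8-13-26-50-98 for a special case like the one described under Problems.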

This 2-steps forward then 1-step back process is necessary because it’s not always clear when one Cluster splits into two new Clusters which is the father and which is the mother – this usually becomes clear in the next round which includes more distant cousins. If something doesn’t work out in one round, it will probably get resolved in a subsequent round. Until, of course, you run out of sufficient data. Theoretically, each of your Ancestors with ThruLines Matches will be incorporated into a Cluster. Generally that would result in a lot of known Clusters. And if some of your Matches uploaded to GEDmatch or tested at one of the other companies, you’d also have TG information included in the Clusters. Happy days of DNA Painting or Chromosome Mapping!

Some Additional Items

Homework before WTCB

Create Notes in AncestryDNA for as many of your Matches as you can (hopefully you’ve been doing this all along). See my previous blog posts about AncestryDNA Notes: Format; Using Notes; ID for CAs; ID for TGs. These Notes are then handy and invaluable in the WTCB process – they provide the clues that let you determine a consensus in a Cluster. They remind you of the CA and TG information you’ve already gathered.

Work with generations

It would be possible to start with a large cM threshold to get 4 grandparent Clusters, and Tag the Matches. Then decrease the threshold by 1cM and run a new Cluster report. Has any new Cluster been added? If no, then decrease the threshold by another 1cM and repeat. If yes, it’s probably the result of a split of a previous Cluster into two parents. Identify the parents, and Tag the Matches appropriately. Then decrease the Cluster threshold by 1cM and repeat. This would take a lot of work.
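
As a rough sketch of that 1cM-at-a-time loop, the Python below lowers the threshold until the number of Clusters goes up. The `count_clusters` argument is a hypothetical stand-in for "run a Cluster report at this threshold and count the Clusters"; it is not a function of any real Clustering program, and the counts in the example are invented. The 6cM default floor is also just an assumption.

```python
def next_split_threshold(count_clusters, start_cm, floor_cm=6):
    """Lower the threshold 1 cM at a time until the Cluster count rises.

    count_clusters is a caller-supplied function (hypothetical here) that
    returns the number of Clusters produced at a given cM threshold.
    Returns (threshold, count) at the first increase, or None if none is found.
    """
    baseline = count_clusters(start_cm)
    for cm in range(start_cm - 1, floor_cm - 1, -1):
        count = count_clusters(cm)
        if count > baseline:
            return cm, count
    return None

# Made-up Cluster counts for illustration only
fake_counts = {90: 4, 89: 4, 88: 4, 87: 5, 86: 5}
print(next_split_threshold(lambda cm: fake_counts[cm], 90, floor_cm=86))  # (87, 5)
```

Each call to `count_clusters` stands for a full Cluster run plus a look at the result, which is why this approach adds up to a lot of work.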

The process I chose was to lower the Cluster threshold by enough to create roughly twice the number of Clusters. This would basically be creating the next generation of Ancestors. The random DNA doesn’t allow this to work perfectly, but it does tend to subdivide previous Clusters into new Clusters. Focus on identifying (through Tagged Matches) which Clusters came from previous Clusters; and then identifying the next generation of Ancestors in these new Clusters. Again, this doesn’t work perfectly each time. But don’t worry about it, a subsequent set of Clusters (with new Matches and new information), will usually provide resolution (through the iterative process – see below).
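
Picking the run that comes closest to "roughly twice the number of Clusters" can be sketched as below. It assumes you have already noted the Cluster count for a handful of thresholds; the dictionary of counts is invented, and the tie-break toward the higher threshold is an arbitrary choice for the sketch, not a rule from this post.

```python
def closest_threshold(counts_by_cm, target):
    """Pick the cM threshold whose Cluster count is nearest the target.

    counts_by_cm maps a cM threshold to the number of Clusters seen at it.
    Ties go to the higher threshold (an assumption made for this sketch).
    """
    return min(counts_by_cm, key=lambda cm: (abs(counts_by_cm[cm] - target), -cm))

# Made-up counts: threshold in cM -> number of Clusters
counts_by_cm = {60: 6, 55: 7, 50: 9, 45: 11}
print(closest_threshold(counts_by_cm, 8))  # 55
```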

Tagging Matches

The key is to Tag Matches and carry over this information to the next set of Clusters. Now in this new set of Clusters you have Matches with information carried forward, plus new Matches, some of which have known information of their own (available in the Notes). Again, your focus is to achieve consensus in each new Cluster, and re-Tag the Matches.

I use the Ahnentafel number as the Tag. Other options include: Ancestor name, Ancestor initials, or Ms and Ps (e.g. MP for maternal grandfather). The Cluster number changes with each different iteration, so don’t use that.
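
One reason the Ahnentafel number works well as a Tag is that the arithmetic is built in: the father of Ancestor n is 2n and the mother is 2n+1. A small Python sketch of that arithmetic follows. The function names are made up for illustration, and the M/P conversion assumes the convention in the example above (first letter for your parent, next letter for that parent's parent).

```python
def parents(ahnentafel):
    """Return the (father, mother) Ahnentafel numbers for an Ancestor."""
    return 2 * ahnentafel, 2 * ahnentafel + 1

def generation(ahnentafel):
    """Generations back from the test taker: 1 = parents, 2 = grandparents."""
    return ahnentafel.bit_length() - 1

def to_mp(ahnentafel):
    """Convert an Ahnentafel number to an M/P path, e.g. 6 -> 'MP'."""
    bits = bin(ahnentafel)[3:]  # drop '0b' and the leading 1 (the test taker)
    return "".join("P" if b == "0" else "M" for b in bits)

print(parents(6))     # (12, 13)
print(generation(6))  # 2
print(to_mp(6))       # MP
```

So a Match Tagged 6 (maternal grandfather) can only move to a Cluster for 12 or 13 in the next generation.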

One way would be to add the Tag into the Shared Clustering spreadsheet Note field for each Match; then use the Shared Clustering program to upload the Notes back to AncestryDNA and also to the Download file (where it would be available for the next Cluster run – remember the Cluster runs on the Download file only take a few seconds).

I choose to use the Shared Clustering spreadsheets (after stripping out the colorful Clusters – keeping just the data). I combine two Cluster files, sort on Match name, and copy the Tag from one Match to the other. This is relatively easy for small Clusters, but it gets more time consuming with each iteration.
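
The combine-sort-copy step can also be sketched in Python. This is a minimal illustration, assuming each Cluster run has been reduced to rows with three hypothetical fields ('name', 'cluster', 'tag'); real spreadsheet columns will differ. Matching on name alone is also an assumption, and two Matches with the same name would need a better key.

```python
def carry_tags(previous_run, new_run):
    """Copy Tags from the previous run to Matches of the same name in the new run.

    Each run is a list of dicts with hypothetical keys 'name', 'cluster', 'tag'.
    Returns new rows sorted by Match name; brand-new Matches keep an empty Tag.
    """
    old_tags = {row["name"]: row["tag"] for row in previous_run if row.get("tag")}
    merged = []
    for row in new_run:
        tag = row.get("tag") or old_tags.get(row["name"], "")
        merged.append({**row, "tag": tag})
    return sorted(merged, key=lambda row: row["name"])

# Made-up rows for illustration
previous_run = [
    {"name": "Ann", "cluster": 1, "tag": "4"},
    {"name": "Bob", "cluster": 2, "tag": "6"},
]
new_run = [
    {"name": "Cal", "cluster": 3, "tag": ""},
    {"name": "Ann", "cluster": 1, "tag": ""},
    {"name": "Bob", "cluster": 4, "tag": ""},
]

for row in carry_tags(previous_run, new_run):
    print(row["name"], row["cluster"], row["tag"] or "(new)")
```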

An iterative process

WTCB is definitely an iterative process. When we add new generations of Clusters and find new Matches with clues, we need to then backtrack to previous Cluster runs with this new information. Why backtrack? Because with each succeeding generation of Clusters we are (roughly) adding two “parent” Clusters for each one we had before. Sometimes we don’t have enough information to distinguish which of these two “parent” Clusters is the father and which is the mother. But in the next Cluster iteration, we find new information among the new Matches who are added in each round. When we backtrack with that information, and designate one of the two “parent” Clusters as, say, the mother, then we can impute the other one to the father – and then Tag all the Matches in those two Clusters, and in all subsequent Cluster runs, accordingly.

All the Clustered Matches are valuable

At first I intended to cull out the Matches for which I had no Notes – they didn’t appear to add any value. They didn’t reveal a ThruLines CA or a TG or anything else – nada. But then I realized they did add value – they were part of the heat in the heatmap. They added the value of being assigned to a Cluster (because of their Shared Matches), notwithstanding the fact that I didn’t know anything specific about them. In the next iteration of Clustering (with a lower cM threshold) they would also be included. If I Tagged them per the current Cluster, this information would carry over to the new Clusters. In theory (and borne out in practice), they tended to divide between two Clusters in the next iteration. Of course they couldn’t help me figure out the CA of those two clusters, but the fact that they helped form the Clusters gave weight to the new Clusters. Sometimes they were the Matches needed to actually form a new Cluster, and without them I wouldn’t get that Cluster. And their Tag told me I had only two possibilities for these new Clusters – the two parents of the Tag. And in these new Clusters there were new Matches – sometimes ones who were ThruLines Matches with a CA, or a Match (with no genealogy) who had uploaded to GEDmatch, and thus had a known TG. And sometimes I got nothing from these new Matches, but then did find new clues in the next generation of Clusters.

The value of Common Ancestors with low cM shares

Some of my Matches with CAs (from ThruLines out to 6C or Circles out to 8C) have smaller shared segment cMs – all the way down to 6cM. I treat these as clues – they can be helpful in developing a consensus with other evidence. With this WTCB process, we only have two Ancestor options at each generation, so even a 6cM Match may be a valuable clue.

Two kinds of Imputation possible

Cluster Imputation: When one of two “Parent” Clusters can be determined (usually by a Match who has a known CA); the other “Parent” Cluster can be imputed to be the other Parent. Then all of the Matches in both “Parent” Clusters can have their Tags adjusted appropriately.

Match CA Imputation: Cluster CAs (or the ancestral line) can be imputed to all the Matches in a Cluster. On many occasions, with Clusters with a strong Match consensus for a CA, I’ve gone to other Matches in the Cluster looking for that CA and found it.
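
Both kinds of imputation can be sketched in Python when the Tags are Ahnentafel numbers. The function names and the 60% consensus cut-off are assumptions chosen for illustration; what counts as a "strong Match consensus" remains a judgment call.

```python
from collections import Counter

def other_parent(ahnentafel):
    """Cluster Imputation: the spouse of an Ancestor in a couple (8 <-> 9, 12 <-> 13)."""
    if ahnentafel < 2:
        raise ValueError("the test taker (1) has no spouse in the Ahnentafel scheme")
    return ahnentafel ^ 1  # flip the last bit: even = husband, odd = wife

def consensus_tag(tags, min_share=0.6):
    """Match CA Imputation: the most common Tag in a Cluster, if it is common enough.

    tags holds one entry per Match, with None for Matches that have no Tag yet.
    min_share is an arbitrary cut-off assumed for this sketch.
    """
    tagged = [t for t in tags if t is not None]
    if not tagged:
        return None
    tag, count = Counter(tagged).most_common(1)[0]
    return tag if count / len(tagged) >= min_share else None

# Made-up Cluster: three Matches point to Ancestor 8, one to 9, one is unknown
cluster_tags = [8, 8, None, 8, 9]
print(consensus_tag(cluster_tags))                # 8
print(other_parent(consensus_tag(cluster_tags)))  # 9
```

In this made-up Cluster the consensus is Ancestor 8, so the untagged Match gets 8 as a working hypothesis, and the sibling "Parent" Cluster would be imputed to Ancestor 9.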

How Far Can We Go?

Continue this out as far as you want? Well, it’s not quite that easy. There are several things at play here:

  1. As noted above, the lower the cM threshold used, the wider the range of relationships we’ll get. As the threshold drops, we’ll see a wider range of Ancestors for each Cluster.
  2. Some of our Match-cousins share multiple relationships with us. Just look at your ThruLines Matches to see the number of them who share more than one ancestral couple with you. At a 20cM threshold, 65 of my 296 Matches (22%) share more than one pair of Common Ancestors with me.
  3. Some of our Match-cousins share multiple DNA segments with us. This means those Matches could share multiple Ancestors with us. Which one should we use? At a 20cM threshold, 1744 of my 4506 Matches (almost 40%) share more than one DNA segment with me.
  4. However, if a Cluster with Matches at one level splits into two Clusters, each with some of those same Matches at the next level, it’s fairly safe to use that information as a strong clue (or hypothesis).
  5. We now have ThruLines at AncestryDNA. I have over 1,800 Matches in my ThruLines, and they “cover” all of my known Ancestors out to my 5xG grandparents (6th cousin level). Those that wind up in Clusters provide valuable clues about the Ancestor for those Clusters. Couple that with the Matches with “Tags” from previous Clusters, and you have reinforcing (or conflicting) clues.

 

Here is a Table of my Cluster iterations – I used the highlighted ones in this study.

SC = Shared Clustering program by Jonathan Brecher (used for this WTCB analysis)

Conclusions:

  1. It is possible to Walk The Clusters Back. I think the trickiest part is assigning Tags to Matches that stay with the Match in succeeding Cluster runs. I plan to try using the Shared Clustering Upload program for that (upload to Ancestry and the Download file).
  2. WTCB is not a simple “click” process – it involves homework (CAs in AncestryDNA Notes), and judgment, logic and time working with the Cluster iterations.
  3. It gets harder (both in logic and the number of Clusters and Matches involved) with each iteration.
  4. Some of my Matches have uploaded to GEDmatch or tested elsewhere and I have TGs for them. WTCB will provide a strong clue for the CA of these TGs.
  5. I think the realistic limit will be around 7xG grandparents (8th cousin level).
  6. WTCB helps us impute CAs to Matches.

 

[19D] Segment-ology: Walking The Clusters Back by Jim Bartlett (20191201)