About jim4bartletts

I've been a genealogist since 1974, and I started my first Y-DNA surname project in 2002. Autosomal DNA is a powerful tool, and I encourage all genealogists to take a DNA test.

Walking The Clusters Back III

Progress Report – Observations…

Main benefits, so far:

  1. Impute Cluster Common Ancestor (CA) to other Matches in the Cluster – this lets us focus on individual Matches – look at their Tree with a CA in mind, and/or communicate with the Matches and ask about a specific Surname or Ancestral line.
  2. Compare Cluster CA to ThruLines CA – if the same, we have reinforcing evidence; if different, the ThruLines CA may be wrong, or it may be correct genealogy, but the Match has another CA linked to the DNA (and the Cluster).
  3. Link some Clusters (and the CA) to a Triangulated Group (TG) – this will strengthen the evidence of the Ancestral line of a TG. Often the Cluster CA is more distant than the CA found in TGs at 23andMe, FTDNA, MyHeritage or GEDmatch.
  4. As the threshold decreases, there are more Matches included in the Clustering process, and those Matches tend to have more distant CAs with us. Clusters will start with only 2C and 3C; and grow to include 4C and 5C, etc. We can see the Walking The Cluster Back happening within each Cluster. Eventually each Cluster will begin to show several generations of CAs – they should all be on the same Ancestral line [if not, check with the correlated Clusters]
  5. Clustering reduces the range of possibilities. If a Cluster has a CA of A18 [Ahnentafel number for a specific 2xGreat grandparent = father’s father’s mother’s father], there are only two possibilities for the next generation: A36 and A37 (although a Match may share a CA another generation back: A72, A73, A74, or A75). If a new Match in the next (lower threshold) Cluster run has CA = A74 – this is reinforcing evidence. If the new Match has CA = A88 – something is amiss [check for another CA, check for a correlated Cluster which is A44, or A176, etc.]
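
The Ahnentafel arithmetic in item 5 is easy to check mechanically. Here is a minimal Python sketch (the function names are my own, for illustration only): an ancestor's father is 2n and mother is 2n+1, so repeatedly halving a Match's CA number shows whether it sits on the same line as the Cluster CA.

```python
def parents(ahnentafel: int) -> tuple[int, int]:
    """Return the (father, mother) Ahnentafel numbers for an ancestor."""
    return 2 * ahnentafel, 2 * ahnentafel + 1


def is_on_line(candidate: int, cluster_ca: int) -> bool:
    """True if `candidate` is `cluster_ca` itself or one of its ancestors.

    Integer-halving an Ahnentafel number steps one generation toward the
    tester, so we halve until we reach or pass the Cluster CA.
    """
    while candidate > cluster_ca:
        candidate //= 2
    return candidate == cluster_ca


print(parents(18))         # (36, 37)
print(is_on_line(74, 18))  # True:  74 -> 37 -> 18
print(is_on_line(88, 18))  # False: 88 -> 44 -> 22 -> 11
```

A Match whose CA fails this check is not necessarily wrong; as noted above, it is a prompt to look for another CA or a correlated Cluster.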

Main issues, so far:

  1. It’s been hard to specifically find Clusters which split into two Clusters a generation further out. Many Clusters have included CAs which span several generations on the same line. I’m inclined to “go with the flow” [accept the Clusters with CAs on the same line]; and not try to force Clusters into a “genealogy Tree” structure. The data is just too variable. Maybe when I get down to a 6cM threshold, it may play out that way – but, I have a feeling the vagaries of random DNA will make that a wild goose chase.
  2. Several BIG variables combine to give us trends, rather than a uniform picture/pattern:
    1. Our Ancestry (size of families, documentation, probable NPEs at some level, etc.)
    2. Our random DNA from different Ancestors
    3. Which of our cousins have DNA tested.
  3. Homework is needed – Clusters can be formed, but some genealogy is needed to identify the CAs. I recommend building a Tree of Ancestors out 7 generations, wherever possible – with that, AncestryDNA will find ThruLines CAs for you. Enter those CAs (or their Ahnentafel) into the Match’s Notes, so that information will be available in the different Clustering runs.

My status so far is summarized in this Table of different Cluster runs:

The first column shows the decreasing thresholds I used (basically every 5cM) – the top line is the original download: 6cM threshold, 119,068 Matches (and all their Shared Matches) which took 9 hours and is in a .txt file.

The # Matches and # Clusters are for the various cluster runs – which take negligible time to produce an Excel Cluster report.

The 3C, 4C, 5C, etc. columns show how many Clusters I got with CAs at those levels. (There were some 2C, but they were in Clusters that also had 3C – I counted each Cluster with the most distant cousinship which had a consensus.)

The larger threshold Clusters had multiple TGs in them. Beginning about at the 35cM threshold, some of the Clusters started showing a single, or consensus, TG – so I counted them.

Starting after the 45cM threshold, the number of included Matches about doubled with each decrease of 5cM in the threshold, and the number of Clusters began increasing dramatically. This means the amount of work for scrolling down the entire report, analyzing the data in each Cluster and determining the consensus, also increased a lot. Sometimes the CA and/or TG of a Cluster is very clear; sometimes a Match’s correlated Clusters must be reviewed and the Match assigned to another Cluster. And all new Matches need to be “Tagged”; and often other Matches need to have their “Tag” adjusted [what I alluded to in the Iterative WTCB Process], as new Matches and their new information are added to the Clusters.

The good news is in the TG column – where about 1/4 of the Clusters (and CAs) can be linked to TGs [I have TGs for over 300 Matches at AncestryDNA].

More good news: The Shared Clustering program, below 20cM, will first Cluster on the 4,515 Matches, basically retaining the 382 Clusters constant, and then go back and add in the new Matches. Therefore, all the Matches below 20cM (including about 300 with TGs, and over 1,500 with CAs) will be added to the existing Clusters, and very probably push the Cluster CA out even farther and add a TG to many of them.

Note the trend in the 35, 30 and 25cM Cluster runs to more distant cousinships. As I find the time to analyze the 20cM Cluster run, and then runs at 15cM and 10cM, I expect this trend to continue, giving me many more CAs in the 6C, 7C and 8C range. Of course these are all clues, but I believe they are very strong clues. Time will tell as I investigate each Cluster/CA/TG more deeply.

 

[19F] Segment-ology: Walking The Clusters Back III by Jim Bartlett 20191214

Walking The Clusters Back II

I’ve found it difficult to hit the 8-16-32-64 Cluster targets. Even when I come close, they don’t wind up all in one generation. The DNA is just too random. Just look at Blaine Bettinger’s cM charts to see that there are wide cM ranges for most of the cousinships. Therefore, there is not a magic threshold for any given generation.

A better plan for WTCB is to increment the cM range a little, say 5cM, and examine the new array of Clusters. Where did the Matches from the previous Clusters go? Look for cases where the Matches from one Cluster are now split between two Clusters – almost certainly these two Clusters will represent the parents of the Cluster that held all of the Matches before. As you lower the threshold incrementally, expect the Clusters formed on close Ancestors to disappear, and new Clusters to form on more distant Ancestors. Trace the Matches from Clusters that disappear to their new Clusters for very strong clues of the ancestral line. Once you are confident of the Common Ancestor (CA) of a Cluster, there are only two options for the next generation – the two parents of the Common Ancestor previously determined. And, with the lowering of the Clustering cM threshold, new Matches will be added to the mix. These new Matches have (incrementally) smaller shared segments with you, and, in general, will tend to be more distant cousins. To be sure, each new batch of Matches (as you lower the threshold each time) will probably include a range of cousinships with you. Each Cluster CA is a hypothesis, and as new evidence (new Matches) is added to the mix, everything needs to be reviewed for consistency, and discrepancies resolved. Sometimes a discrepancy is resolved by moving the Match to a correlated Cluster.
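
For anyone who wants to automate the "where did the Matches go?" question, here is a minimal Python sketch. It assumes you have already pulled a Match-name-to-Cluster-number mapping out of each Cluster report; the function name, the names and the numbers are all hypothetical. Because Cluster numbers change from run to run, the comparison is made on Match names, not on Cluster numbers.

```python
from collections import defaultdict


def trace_clusters(previous: dict[str, int], current: dict[str, int]) -> dict[int, dict[int, list[str]]]:
    """Map each previous Cluster to the new Clusters its Matches landed in."""
    moves: dict[int, dict[int, list[str]]] = defaultdict(lambda: defaultdict(list))
    for match, old_cluster in previous.items():
        if match in current:  # Matches missing from the new run are skipped
            moves[old_cluster][current[match]].append(match)
    return {old: dict(new) for old, new in moves.items()}


# Hypothetical data: Match name -> Cluster number in two successive runs.
run_40cm = {"Ann": 1, "Bob": 1, "Cal": 1, "Dee": 2, "Eve": 2}
run_35cm = {"Ann": 4, "Bob": 4, "Cal": 7, "Dee": 9, "Eve": 9, "Fay": 7}

for old, destinations in trace_clusters(run_40cm, run_35cm).items():
    label = "split" if len(destinations) > 1 else "intact"
    print(old, label, destinations)
```

A previous Cluster reported as "split" across exactly two new Clusters is the case to look at first; each one is still a hypothesis to be checked against the Matches' CAs.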

Those who try this process are encouraged to provide feedback in the Comments to this blog.

 

[19E] Segment-ology: Walking The Clusters Back II by Jim Bartlett 20191205

Walking The Clusters Back

A Segment-ology Concept

Overview

Walking The Clusters Back (WTCB) can be a fairly complex process, so let’s start with an overview of the concept.

Pick a Clustering threshold high enough to give us 4 Clusters – one for each grandparent. Tag each Match in these Clusters to the appropriate Ancestor (grandparent). Then adjust the threshold to get (roughly) 8 Clusters (one for each Great grandparent). These Clusters would include all the Tagged Matches who would indicate which grandparent line each new Cluster was in; as well as new, generally more distant Matches, who would then separate these Clusters into Great grandparents. Tag, or re-Tag, each Match in these Clusters to the appropriate Ancestor (Great Grandparent). Then lower the threshold to get 16 Clusters and repeat the process.

Up Front DISCLAIMER: This is not a simple click-and-done process. There is homework to be done before Clustering: documenting the Common Ancestors in our Matches’ Notes. Although the time it takes to run various Cluster reports is only a few seconds for each one, it takes some time to analyze the Matches in each Cluster and come to a consensus, then transfer those clues to the next Cluster, and then analyze those Clusters. It turns out this WTCB process is iterative – two steps forward, then one back. Each new set of Clusters brings in new Matches with clues that need to be reconciled. WTCB is somewhat easier than Triangulating all your Matches, but it is still time-consuming work. Nevertheless, WTCB is a great opportunity to get the most out of your Matches at AncestryDNA.

Background

Clustering is a way of grouping your Matches. Each Cluster tends to group Matches who descend from the same Ancestor. The Leeds Method groups close-cousin Matches into four Clusters which are usually our four grandparents. On the one hand this depends on knowing the Common Ancestor with some of the Matches; and on the other hand it provides a strong clue about the Ancestor of other Matches in a known Cluster. And if some Clusters are known, the others may be determined by logic. Clustering provides a powerful grouping tool. I posted about Grouping Matches here; and about several Clustering programs here.

The Leeds Method uses a high threshold (90-400cM) for the Matches to be included in the analysis, so that only 2nd or 3rd cousins are used. Each cousin in this range would usually be from only one of our four grandparents. What would happen if we lowered the threshold just enough to only have 3rd or 4th cousins most of the time? Generally they would tend to form eight Clusters – one for each of our eight great grandparents. However, as with most things “DNA”, as we decrease the cM threshold, we get a wider range of relationships – it’s not very probable that we could find a cM threshold that would produce exactly eight Clusters, or even succeeding in that, that there would be a 1-to-1 relationship to our eight Great grandparents.

And if we decided to jump to the ultimate and Cluster on 6cM, I can tell you that we’d get hundreds of Clusters. Some of them we might be able to identify, but most will look like “mush”. And, like finding a DNA Match who is a 9th cousin, we wouldn’t really have much in the way of corroborating evidence.

However, we often do have a lot of data to work with, and a good tool like Clustering that makes it fairly simple to group our Matches…

It struck me that maybe we could Walk The Clusters Back (WTCB). See the concept in the Overview above. In theory the 8 Clusters would be husband/wife pairs – the parents of the grandparents in the original 4 Clusters. The 4-Cluster Matches would carry a “tell-tale” Tag of whence they came to the 8 new Clusters. Then, hopefully, with clues from the new Matches in the 8 Clusters, we could determine which of the great grandparents each of the 8 Clusters represented. We are down to only two options for each Cluster, and if we can figure out one, the other would be determined by logic: i.e. the other parent.

In general this worked! But it didn’t always work per the theory (the DNA is random), and the process was arduous.

Problems

– Even by adjusting the Cluster threshold 1cM at a time, the Clusters rarely came out to 8, or 16, or 32, or 64, or 128. And even when it came close, all of the new Clusters were not necessarily just the parents of the previous Clusters. Sometimes one Cluster would split into 3 or 4 Clusters (a parent and 2 grandparents, or 4 grandparents). And from one iteration to the next, some Clusters didn’t split at all. As the threshold dropped and the number of Clusters increased, the deviation from the theory increased. Not wholesale, not all of them – but enough to notice.

– In my case (one grandparent who had very few Matches, and thus very few Clusters), I knew I would get only 2 Clusters max for that one grandparent. If you have a special case, you need to make some adjustments (I used 4-8-13-26-50-98 as my target Cluster numbers and took whatever was closest).

– Sometimes in a Cluster, I didn’t get any/many Matches who had a CA. Each new Cluster depends on Matches with CAs to inform us about the probable CA for the Cluster. I could usually make up for this in the next iteration, but that meant I had to backtrack – which I did more and more as the number of Clusters increased.

– I had to find a way to transfer the Ancestor tell-tale Tags from Matches in one set of Clusters to the Matches in the next Cluster run. More on this later.

Developing the process

In step 1 of the above figure, all the Matches above a 90cM threshold are Clustered into 4 groups – one for each grandparent. In step 2, based on the known genealogy of some of the Matches, I determined the 4 grandparent Ancestors based on information I know about the close Matches in each Cluster. In step 3 I adjusted the threshold to create about 8 Clusters, and noted which of the Matches from steps 1 and 2 are now in the new Clusters. Generally the Matches from each of the 4 Ancestors (grandparents) in step 2 will be found in only 2 Clusters at the step 3 level. There are only two options for these two Great grandparents, the husband and the wife of the Ancestor in step 2. At this point, we often don’t know which is the husband and which is the wife, but we can be confident it’s one of each. However, during step 3 we also get additional Matches (not shown) in the Great grandparent Clusters. Some of these new Matches may be 4th cousins who would provide insights on the identity of the Great grandparents. The other Great grandparents will become known after step 4 as shown below.

In step 4 above, we lower the threshold again and get roughly 16 Clusters. I’m just using a single green arrow to show all the new Matches who were 4th, 5th and 6th cousins who formed the 16 Clusters (along with the closer red arrow Matches from steps 2 and 3). At this point, all we’d need is the CA of some of the more distant Matches, in order to determine the correct Ancestors at level 3. Note that two of the green arrows don’t point to one of the 8 level 3 Ancestors – don’t worry about it. These “errant” Matches will often fall into Clusters in the next iteration. Or maybe they will continue to be strange. Again, don’t worry about it – focus on the Ancestors you can determine. Some of the data will get a little messy as the threshold is dropped. Focus on the positive outcomes – the identity of the Cluster Ancestors at each generation.

Step 4 above shows that this is an iterative process: we sometimes need the CA couple information from one generation to resolve the individual CA of a closer generation. Note that this changes the Tags, which need to be reconciled in a prior generation Cluster.

The Iterative WTCB Process

This led me to modify the theoretical process at the beginning of this post to the following iterative, or zig-zag, process.

  1. Set the threshold to get 4 Clusters; Assign each Cluster to a grandparent Ancestor; Tag each Match in 4-Cluster with the appropriate Ancestor.
  2. Reduce threshold to get 8 Clusters [with Tagged 4-Cluster Matches plus additional new Matches (some with CAs in Notes)]; as best you can, assign Clusters to Great grandparents; Tag all Matches with what you know so far.
  3. Re-run 4-Clusters, to ensure the new Tags are OK, adjust as necessary.
  4. Re-run 8-Clusters, and adjust as necessary.
  5. Reduce threshold to get 16 Clusters [with previously tagged Matches plus additional new Matches (some with CAs in Notes)]; as best you can assign Clusters to 2xG grandparents; Tag all Matches.
  6. Re-run 8-Clusters; usually some new Tags will clarify any Great grandparents that were unclear in Step 2. Re-Tag as appropriate.
  7. Re-run 16-Clusters and adjust as necessary.
  8. Repeat steps 5, 6 and 7 (adjusted for the next round): Reduce threshold/assign/Tag; revisit the prior Clusters/Re-Tag as appropriate; re-run the current Clusters/adjust as necessary.

This 2-steps forward then 1-step back process is necessary because it’s not always clear when one Cluster splits into two new Clusters which is the father and which is the mother – this usually becomes clear in the next round which includes more distant cousins. If something doesn’t work out in one round, it will probably get resolved in a subsequent round. Until, of course, you run out of sufficient data. Theoretically, each of your Ancestors with ThruLines Matches will be incorporated into a Cluster. Generally that would result in a lot of known Clusters. And if some of your Matches uploaded to GEDmatch or tested at one of the other companies, you’d also have TG information included in the Clusters. Happy days of DNA Painting or Chromosome Mapping!

Some Additional Items

Homework before WTCB

Create Notes in AncestryDNA for as many of your Matches as you can (hopefully you’ve been doing this all along). See my previous blog posts about AncestryDNA Notes: Format; Using Notes; ID for CAs; ID for TGs. These Notes are then handy and invaluable in the WTCB process – they provide the clues that let you determine a consensus in a Cluster. They remind you of the CA and TG information you’ve already gathered.

Work with generations

It would be possible to start with a large cM threshold to get 4 grandparent Clusters, and Tag the Matches. Then decrease the threshold by 1cM and run a new Cluster report. Has any new Cluster been added? If no, then decrease the threshold by another 1cM and repeat. If yes, it’s probably the result of a split of a previous Cluster into two parents. Identify the parents, and Tag the Matches appropriately. Then decrease the Cluster threshold by 1cM and repeat. This would take a lot of work.

The process I chose was to lower the Cluster threshold by enough to create roughly twice the number of Clusters. This would basically be creating the next generation of Ancestors. The random DNA doesn’t allow this to work perfectly, but it does tend to subdivide previous Clusters into new Clusters. Focus on identifying (through Tagged Matches) which Clusters came from previous Clusters; and then identifying the next generation of Ancestors in these new Clusters. Again, this doesn’t work perfectly each time. But don’t worry about it, a subsequent set of Clusters (with new Matches and new information), will usually provide resolution (through the iterative process – see below).
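
Since each Cluster run takes only seconds, one way to pick the next threshold is to try a handful of values, note the Cluster count for each, and take whichever comes closest to the target. Here is a minimal Python sketch of that selection step. The function name is my own, and the counts are hypothetical apart from the 90cM run with 8 Clusters; Shared Clustering itself is run by hand, not from this code.

```python
def closest_threshold(counts: dict[int, int], target: int) -> int:
    """Return the cM threshold whose Cluster count is nearest the target.

    Ties go to the higher threshold, which involves fewer Matches to review.
    """
    return min(counts, key=lambda cm: (abs(counts[cm] - target), -cm))


# cM threshold -> number of Clusters observed in a trial run (hypothetical).
trial_runs = {90: 8, 60: 14, 50: 19, 45: 27, 40: 41}

print(closest_threshold(trial_runs, 16))  # 60 (14 Clusters is nearest to 16)
print(closest_threshold(trial_runs, 26))  # 45 (27 Clusters is nearest to 26)
```

The target need not be a power of two; adjusted targets such as 13 or 26 for a tree with a thinly matched grandparent work the same way.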

Tagging Matches

The key is to Tag Matches and carry over this information to the next set of Clusters. Now in this new set of Clusters you have Matches with information carried forward, plus new Matches, some of which have known information of their own (available in the Notes). Again, your focus is to achieve consensus in each new Cluster, and re-Tag the Matches.

I use the Ahnentafel number as the Tag. Other options include: Ancestor name, Ancestor initials, or Ms and Ps (e.g. MP for maternal grandfather). The Cluster number changes with each different iteration, so don’t use that.
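
The Tag styles are interchangeable. Here is a minimal Python sketch converting between M/P strings and Ahnentafel numbers. It assumes the letters are read from the tester outward (P = father, M = mother), which is consistent with MP = maternal grandfather; the function names are my own.

```python
def path_to_ahnentafel(path: str) -> int:
    """Convert an M/P path such as 'MP' (mother's father) to an Ahnentafel number."""
    n = 1  # Ahnentafel 1 is the tester
    for step in path.upper():
        if step == "P":
            n = 2 * n        # father of n
        elif step == "M":
            n = 2 * n + 1    # mother of n
        else:
            raise ValueError(f"unexpected character {step!r} in path {path!r}")
    return n


def ahnentafel_to_path(n: int) -> str:
    """Convert an Ahnentafel number back to its M/P path ('' for the tester)."""
    steps = []
    while n > 1:
        steps.append("P" if n % 2 == 0 else "M")
        n //= 2
    return "".join(reversed(steps))


print(path_to_ahnentafel("MP"))    # 6
print(path_to_ahnentafel("PPMP"))  # 18
print(ahnentafel_to_path(18))      # PPMP
```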

One way would be to add the Tag into the Shared Clustering spreadsheet Note field for each Match; then use the Shared Clustering program to upload the Notes back to AncestryDNA and also to the Download file (where it would be available for the next Cluster run – remember the Cluster runs on the Download file only take a few seconds).

I chose to use the Shared Clustering spreadsheets (after stripping out the colorful Clusters – keeping just the data). I combine two Cluster files, sort on Match name, and copy the Tag from one Match to the other. This is relatively easy for small Clusters, but it gets more time consuming with each iteration.
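
The same combine-sort-copy routine can be scripted. Here is a minimal Python sketch, assuming each Cluster file has been read into a list of row dictionaries (for example with `csv.DictReader` after saving the sheet as CSV). The column names "Name" and "Tag" are placeholders of mine, not the actual Shared Clustering headers.

```python
def carry_tags(previous_rows, new_rows, key="Name", tag="Tag"):
    """Copy Tags from a previous Cluster run onto the rows of a new run.

    Rows are matched on Match name. A Tag already present in the new run is
    kept; new Matches with no earlier Tag are left blank for manual review.
    """
    known = {row[key]: row[tag] for row in previous_rows if row.get(tag)}
    merged = []
    for row in sorted(new_rows, key=lambda r: r[key]):
        updated = dict(row)  # leave the input rows untouched
        if not updated.get(tag):
            updated[tag] = known.get(updated[key], "")
        merged.append(updated)
    return merged


# Hypothetical rows from two Cluster runs.
previous = [{"Name": "Ann", "Tag": "A18"}, {"Name": "Bob", "Tag": ""}]
new = [{"Name": "Cal", "Tag": ""}, {"Name": "Ann", "Tag": ""}, {"Name": "Bob", "Tag": "A36"}]

for row in carry_tags(previous, new):
    print(row["Name"], row["Tag"] or "(untagged)")
```

One caution: Match names are not guaranteed to be unique, so if your files carry a stable identifier for each Match, that is a safer key than the name.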

An iterative process

WTCB is definitely an iterative process. When we add new generations of Clusters and find new Matches with clues, we need to then backtrack to previous Cluster runs with this new information. Why backtrack? Because with each succeeding generation of Clusters we are (roughly) adding two “parent” Clusters for each one we had before. Sometimes we don’t have enough information to distinguish which of these two “parent” Clusters is the father and which is the mother. But in the next Cluster iteration, we find new information among the new Matches who are added in each round. When we backtrack with that information, and designate one of the two “parent” Clusters as, say, the mother, then we can impute the other one to the father – and then further Tag all the Matches in those two Clusters, and in all subsequent Cluster runs.

All the Clustered Matches are valuable

At first I intended to cull out the Matches for which I had no Notes – they didn’t appear to add any value. They didn’t reveal a ThruLines CA or a TG or anything else – nada. But then I realized they did add value – they were part of the heat in the heatmap. They added the value of being assigned to a Cluster (because of their Shared Matches), notwithstanding the fact that I didn’t know anything specific about them. In the next iteration of Clustering (with a lower cM threshold) they would also be included. If I Tagged them per the current Cluster, this information would carry over to the new Clusters. In theory (and borne out in practice), they tended to divide between two Clusters in the next iteration. Of course they couldn’t help me figure out the CA of those two clusters, but the fact that they helped form the Clusters gave weight to the new Clusters. Sometimes they were the Matches needed to actually form a new Cluster, and without them I wouldn’t get that Cluster. And their Tag told me I had only two possibilities for these new Clusters – the two parents of the Tag. And in these new Clusters there were new Matches – sometimes ones who were ThruLines Matches with a CA, or a Match (with no genealogy) who had uploaded to GEDmatch, and thus had a known TG. And sometimes I got nothing from these new Matches, but then did find new clues in the next generation of Clusters.

The value of Common Ancestors with low cM shares

Some of my Matches with CAs (from ThruLines out to 6C or Circles out to 8C) have smaller shared segment cMs – all the way down to 6cM. I treat these as clues – they can be helpful in developing a consensus with other evidence. With this WTCB process, we only have two Ancestor options at each generation, so even a 6cM Match may be a valuable clue.

Two kinds of Imputation possible

Cluster Imputation: When one of two “Parent” Clusters can be determined (usually by a Match who has a known CA), the other “Parent” Cluster can be imputed to be the other Parent. Then all of the Matches in both “Parent” Clusters can have their Tags adjusted appropriately.

Match CA Imputation: Cluster CAs (or the ancestral line) can be imputed to all the Matches in a Cluster. On many occasions, with Clusters with a strong Match consensus for a CA, I’ve gone to other Matches in the Cluster looking for that CA and found it.
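
With Ahnentafel Tags, Cluster Imputation is simple arithmetic: the two parents of ancestor n are 2n and 2n+1, so once one “Parent” Cluster is identified the other follows. Here is a minimal Python sketch; the function names and Cluster labels are hypothetical.

```python
def other_parent(ahnentafel: int) -> int:
    """Return the other member of the couple: father 2n <-> mother 2n+1."""
    if ahnentafel < 2:
        raise ValueError("Ahnentafel 1 is the tester and has no partner in the chart")
    return ahnentafel ^ 1  # flips the last bit: 36 <-> 37


def impute_pair(child_tag: int, known_cluster: str, known_ca: int, other_cluster: str) -> dict[str, int]:
    """Tag two 'Parent' Clusters of `child_tag`, given the CA of one of them."""
    if known_ca // 2 != child_tag:
        raise ValueError(f"A{known_ca} is not a parent of A{child_tag}")
    return {known_cluster: known_ca, other_cluster: other_parent(known_ca)}


# A Cluster previously Tagged A18 has split in two; a Match with a known CA
# places one of the new Clusters (labelled "X" here) on A37.
print(impute_pair(18, "X", 37, "Y"))  # {'X': 37, 'Y': 36}
```

The imputed Tag is only as good as the assumption that the split really is one generation; a split spanning two generations needs the manual review described above.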

How Far Can We Go?

Continue this out as far as you want? Well, it’s not quite that easy. There are several things at play here:

  1. As noted above, the lower the cM threshold used, the wider the range of relationships we’ll get. As the threshold drops, we’ll see a wider range of Ancestors for each Cluster.
  2. Some of our Match-cousins share multiple relationships with us. Just look at your ThruLines Matches to see the number of them who share more than one ancestral couple with you. At a 20cM threshold, 65 of my 296 Matches (22%) share more than one pair of Common Ancestors with me.
  3. Some of our Match-cousins share multiple DNA segments with us. This means those Matches could share multiple Ancestors with us. Which one should we use? At a 20cM threshold, 1744 of my 4506 Matches (almost 40%) share more than one DNA segment with me.
  4. However, if a Cluster with Matches at one level, splits into two Clusters, each with some of those same Matches at the next level, it’s fairly safe to use that information as a strong clue (or hypothesis).
  5. We now have ThruLines at AncestryDNA. I have over 1,800 Matches in my ThruLines, and they “cover” all of my known Ancestors out to my 5xG grandparents (6th cousin level). Those that wind up in Clusters, provide valuable clues about the Ancestor for those Clusters. Couple that with the Matches with “Tags” from previous Clusters, and you have reinforcing (or conflicting) clues.

 

Here is a Table of my Cluster iterations – I used the highlighted ones in this study.

SC = Shared Clustering program by Jonathan Brecher (used for this WTCB analysis)

Conclusions:

  1. It is possible to Walk The Clusters Back. I think the trickiest part is assigning Tags to Matches that stay with the Match in succeeding Cluster runs. I plan to try using the Shared Clustering Upload program for that (upload to Ancestry and the Download file).
  2. WTCB is not a simple “click” process – it involves homework (CAs in AncestryDNA Notes), and judgment, logic and time working with the Cluster iterations.
  3. It gets harder (both in logic and the number of Clusters and Matches involved) with each iteration.
  4. Some of my Matches have uploaded to GEDmatch or tested elsewhere and I have TGs for them. WTCB will provide a strong clue for the CA of these TGs.
  5. I think the realistic limit will be around 7xG grandparents (8th cousin level).
  6. WTCB helps us impute CAs to Matches.

 

[19D] Segment-ology: Walking The Clusters Back by Jim Bartlett (20191201)

Shared Clustering – A Great Tool!

A summary of some different Clustering programs is here. I’ve used, and liked, most of these programs, and I want to highlight one of them here.

Shared Clustering by Jonathan Brecher is a good, flexible tool – it does what I want, quickly. It doesn’t have the glitz of Genetic Affairs or other features offered by DNAGedcom Client. But it gets the job done for me, efficiently, and it’s free. Some detailed steps at the bottom of this post.

Some comments on Shared Clustering:

– I used a 6cM threshold and downloaded all my 118,853 Matches (and Shared Matches) at AncestryDNA in 2 hr 34min.

– I then ran a Cluster report with a 90cM threshold in 2 seconds (that’s not a typo): 34 Matches in 8 Clusters.

– Each Cluster is assigned a number.

– Each Match is shown in one Cluster – the one with the most matches to other Shared Matches – the most “heat” in a heat-map program.

– AND all of the Correlated Cluster numbers are also shown for each Match. These are Clusters where the Match also has an affinity – the Match shares some Shared Matches with the rest; just not as much as in the Cluster it’s assigned to. This is very handy, because sometimes our known relationship with the Match would be a better “fit” in one of the other Clusters – feel free to use judgment and assign a Match to any Correlated Cluster you want. OR, if a Match shares two segments with you, assign it to two Clusters. Omygosh – that violates the Cluster “rules”! But this is your data now, use your own judgment and bend the rules a little – just don’t get too wild…

– I ran multiple other Cluster reports, each one took only a very few seconds.

– With a threshold of 28cM, I get 1105 Matches in 94 Clusters – in 4 sec. For me, that’s about one Cluster for each of my 5xG grandparents. Of course, it won’t fall out exactly this way, but that’s the general area of my Tree I’d be working in with these Clusters. Remember: Clusters tend to form on individual Ancestors.

– Each report includes a one-click link to each Match’s DNA page with me – very handy.

– All of the ThruLines Common Ancestors (CAs) are also included for each Match – a convenient check, if you haven’t already summarized each of them in the Notes. Or if you are just checking for new Matches among your ThruLines.

– Each report includes all of my Notes (into which I’ve already summarized ThruLines, other CAs and TGs).

– VERY IMPORTANT: I can modify as many of my Notes as I want in the spreadsheet, and then easily click to upload that info back to AncestryDNA (it overwrites the Notes that I’ve changed – WOW, what a time saver). This uploads in under a minute. Use this feature to summarize ThruLines CAs into your Notes (if you haven’t already), and upload that back to AncestryDNA. Use the “Upload Notes” TAB.

– ALSO IMPORTANT: I can use the “Export” TAB to download my AncestryDNA data, including Notes, to an Excel file, giving me an inventory of all my Ancestry Matches (without the Clusters or Shared Matches). This is my go-to file whenever I’m searching for an Ancestry Match (like from a name or email at GEDmatch). It’s much better than using the AncestryDNA search system. And the hyper-link means I am just one click away from my Match’s DNA page with me.

Some steps to get started:

Go to this page to download the program to your PC: https://github.com/jonathanbrecher/sharedclustering/wiki

Read the Home page, and then click on the download link on the right side (if you get a popup warning, tell your PC it’s OK)

Read the Introduction TAB, then select the Download TAB

You are now working from your own PC – enter your Ancestry username and password and select your test.

I click on “Slow and Complete”, but feel free to try each of the radio buttons. I set “Lowest centimorgans” to retrieve to 6cM and get all my 118,000 Matches in about 2.5 hours. Note where your file is stored. If you set “Lowest centimorgans” to 20cM, you’ll get all your “fourth cousins” and closer in less than 10 minutes – this includes all the Matches who are used as Shared Matches.

After the Download is complete, select the Cluster TAB – the Saved Data File (from the Download) is usually shown by default, but you can also use files downloaded from other companies, if you want. The Cluster output file usually shows by default too – it’s the same name as the Saved Data File with “-clusters.xlsx” appended instead of “.txt”. You can change the name of this file if you want – I usually append the default cM I’m using (e.g. 28) after “clusters” so I can save them all with different, recognizable, names. Just make sure both files (the Download “txt” file and the title of the new clusters “xlsx” file) are in the same folder. I’ve also set up a Clustering folder, and a sub-folder for the Shared Clustering program, and separate sub-sub folders with a date of the initial download file (e.g. 20191123) – so the Download and each of the Cluster runs would go in that (20191123) folder. A little work on organizing a file system really helps me remember what I’m doing….
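
If you keep a list of your runs outside the program, the naming convention above is easy to reproduce. Here is a minimal Python sketch. The function, folder and file names are made up for illustration; Shared Clustering sets the actual output name in its own Cluster TAB.

```python
from pathlib import Path


def cluster_report_path(download_file: Path, threshold_cm: int) -> Path:
    """Build '<download name>-clusters<threshold>.xlsx' beside the Download file."""
    return download_file.with_name(f"{download_file.stem}-clusters{threshold_cm}.xlsx")


# Hypothetical layout: Clustering / program sub-folder / download-date sub-folder.
download = Path("Clustering") / "SharedClustering" / "20191123" / "matches.txt"

for cm in (90, 28, 20):
    print(cluster_report_path(download, cm))
```

Keeping every report in the same dated folder as its Download file makes it obvious which runs can be compared with each other.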

Click on the “Cluster completeness” button of your choice; and type in the “Lowest centimorgans” box. Then hit Process Saved and wait about 2 seconds.

This Chart shows the relationship between the cM Threshold selected and the number of Clusters that result (for the Download of my data). Your results may vary, but the shape of the curve will be the same. The curve flattens below a 20cM threshold, because Shared Clustering uses the Clusters at the 20cM threshold as a base and adds the other, smaller cM, Matches to the Clusters formed at 20cM. The smaller Matches (below 20cM) often have Shared Matches (all of whom are 20cM or higher), but there are no additional Shared Matches below 20cM. Experiment with your Download – it only takes a few seconds to change the Threshold cMs and get a new set of Clusters. NB: The Cluster numbers are uniquely formed during each Cluster report. They do NOT follow the Matches to other Cluster reports – they shouldn’t, because the new Clusters (formed on Ancestors) are different at different generations.

Jonathan monitors the Shared Clustering Facebook page, and he’s always been very responsive. It’s good to visit that page and follow the conversations. And ask questions. And request improvement features.

https://www.facebook.com/groups/sharedclustering/

I will try to post soon on my Walk The Clusters Back project, using Clusters that should be focused on different generations in my Tree – very successful.

If I’ve messed up anything in this review of Shared Clustering, I hope Jonathan Brecher and/or other readers will provide feedback in the comments.

 

[19C] Segment-ology: Shared Clustering – A Great Tool! by Jim Bartlett (20191129)

Grouping Matches – Try It!

A Segment-ology TIDBIT

We can group Matches several ways:

  1. Each Triangulated Group (TG) includes Matches who share the same Common Ancestor (CA). This is based on your DNA segment from an Ancestor, which other Matches also share. 23andMe, MyHeritage and GEDmatch all have tools for Triangulation.
  2. Clustering includes Matches who share multiple Shared Matches with each other – they tend to be based on the same Ancestor. The Leeds Method focuses on 4 groups representing our 4 grandparents. This is based on the probability that groups of Shared Matches will have the same Ancestor. When the lowest threshold is used (6cM), all of the company Matches are included and the Clusters tend to approximate a one-to-one relationship with TGs. This is a good tool to group our Matches at AncestryDNA and FamilyTreeDNA. I blogged about some Clustering programs here.
  3. We can also form Clusters based on ethnicity, geography, Haplogroups, etc., but, in general, these will not be as precise as TGs and Shared Match Clustering. These Clusters are, however, often very helpful in homing in on a CA.

Groups can help us in several ways:

  1. Everyone in a group should have the same objective: finding the CA. There is synergy in a group; and working together often results in a better outcome. One person’s Brick Wall or bio-Ancestor (vs. an NPE) may be in the Trees of other Matches in the Group.
  2. Close Cousins and their CAs with you provide a beacon toward the more distant CA, and limit the possibilities that would otherwise need to be explored.
  3. Once several Matches in a group agree on a CA, that CA line can be imputed to the other Matches. Many times I have searched a Match’s Tree for a specific Ancestor (highlighted in the Cluster), and found it! I’ve also communicated with Matches with no/small Trees and asked specifically about a surname and gotten positive/helpful responses.
  4. Use Clustering to form groups at FTDNA, MyHeritage and 23andMe, and use them as a basis for TGs – Triangulation goes much more quickly when you only compare segments that will probably Triangulate.

We can form Triangulated Groups at 23andMe, MyHeritage, GEDmatch, and, with a Clustering pre-start, at FamilyTreeDNA – but those companies, generally, do not offer much in the way of genealogy tools, and only a few of the Matches have robust Trees. On the other hand, AncestryDNA has a lot of good Trees, and great tools like ThruLines, but no DNA segment data – however, we can do Clustering. DNA and No Trees; OR Trees and NO DNA – it’s frustrating… So how can we merge the TGs and AncestryDNA’s Clusters?? More on this later…

BOTTOM LINE: We need both Triangulated Segments and Triangulated Genealogy to be in sync (reinforcing each other) before we can have confidence in our conclusions. One without the other is incomplete research.

 

[AQ] Segment-ology: Grouping Matches – Try It!  TIDBIT by Jim Bartlett 20191128

Extending the MRCA of a TG through Clusters

A Segment-ology TIDBIT

Triangulated Groups (TGs) are one way to group your Matches – grouping Matches who share overlapping DNA segments with each other. The DNA segment represented by a TG is passed down to you from one of your parents, and from more distant Ancestors on that side. As we find Most Recent Common Ancestors (MRCAs) with Matches in a TG, we begin to learn which ancestral line passed down the TG segment – a Common Ancestor.

Clustering is another way to group your Matches – grouping Matches who share other Shared Matches with each other. The Matches in a Cluster also tend toward a Common Ancestor.

The companies where Triangulation is possible generally do not have many robust Trees. And so the TGs do not have many MRCAs. AncestryDNA has many robust Trees and a ThruLines tool that determines many Common Ancestors (CAs), but does not provide the DNA data needed for Triangulation.

Is there a way to combine the best of both worlds? I think there is. TGs and Clusters should be grouping on the same thing – an ancestral line – a Common Ancestor. When a TG and a Cluster clearly have some of the same Matches, I think the deeper MRCAs in the Cluster can be imputed to the TG.
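One way to check whether a TG and a Cluster "clearly have some of the same Matches" is a plain set overlap. The sketch below is only an illustration under stated assumptions: the match names and cluster labels are invented placeholders, `best_cluster_for_tg` is a hypothetical helper, and the cutoff of 3 shared Matches is an arbitrary choice, not a rule from any tool.

```python
def best_cluster_for_tg(tg_matches, clusters, min_shared=3):
    """Return (cluster_id, shared_matches) for the Cluster sharing the most
    Matches with the TG, or None if no Cluster shares at least min_shared."""
    best_id, best_shared = None, set()
    for cluster_id, members in clusters.items():
        shared = tg_matches & members  # Matches found in both groups
        if len(shared) > len(best_shared):
            best_id, best_shared = cluster_id, shared
    if len(best_shared) < min_shared:
        return None  # overlap too thin to impute anything
    return best_id, best_shared


# Invented example data: names are placeholders, not real testers.
tg = {"Match A", "Match B", "Match C", "Match D"}
clusters = {
    "Cluster 7": {"Match A", "Match B", "Match C", "Match X"},
    "Cluster 12": {"Match D", "Match Y"},
}

result = best_cluster_for_tg(tg, clusters)
if result is not None:
    cluster_id, shared = result
    print(cluster_id, sorted(shared))
```

A Cluster that passes the cutoff is only a candidate; the MRCAs it suggests still need to be checked against the Trees.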

This is another method of Walking The Ancestor Back (WTAB) in a TG. You already have a TG with some MRCAs along the same line. The Most Distant Common Ancestor (MDCA) in a TG usually represents a couple, one of whom passed down the TG segment. The next generation back has only two possibilities: the paternal or maternal side of the MDCA.

The MRCAs in a Cluster aligned with the TG provide a strong clue in reinforcing and extending the CA for a TG.

 

[22AP] Segment-ology: Extending the MRCA of a TG through Clusters TIDBIT by Jim Bartlett 20191118

Ahnentafel 37P – Breaking Through a Brick Wall

This is the first in what may be a series of Ancestor Stories that have been made possible by DNA.

Background on Thomas NEWLON, Ahnentafel 36

This story starts on firm ground with my ancestor, Thomas NEWLON (my Ahnentafel 36). I have solid evidence of Thomas NEWLON. We have 3 matching Y-DNA kits from men who descend from him and his father which prove his NEWLON line, at least back to his father, James NEWLON. The Y-DNA Haplogroup is R1b1a2.

Per the Personal Property Tax Lists (PPTL) of Loudoun Co, VA, Thomas NEWLON is listed 1788-1802 (adjacent to his father James). If we assume that he was, say, 19 in 1788 (many fathers cheated a year or two on their son’s age to avoid paying taxes from the 16th birthday), his birth year would be 1769. This is a good “fit” as his parents, James NEWLIN and Catherine BENNETT were married 7 Apr 1768 in Chester, PA. The NEWLONs in this part of PA were Quakers. At a Warrington Monthly Meeting on 11 Jun 1768 James NEWLAND was disowned for marrying out. I don’t have any records for the next 20 years (until the 1788 Loudoun Co, VA PPTL). Many say some of Thomas’s siblings were born in Culpeper Co, VA, but I’ve not seen any such records. In any case, I have the records showing Thomas NEWLON was living in Loudoun Co, VA from 1788 to 1802.

Thomas NEWLON’s eldest child was Cecelia, who was born 3 Aug 1793 per her obituary. This means that Thomas married someone probably in 1792, and almost certainly in Loudoun Co, VA where he was living. Let’s say his wife was born in 1774, and married at age 18 – not uncommon for that time period. We have several pieces of later evidence that her family was also living in Loudoun Co, VA at that time and at least up until about 1810.

From his 1813 Will, Thomas NEWLON’s first four children were Cele [Cecelia] 1793, William 1795, John 1798 [my ancestor] and Sarah 1800 – the birth years from other evidence. In 1802 Thomas NEWLON is listed on the PPTL of both Loudoun Co, VA and Harrison Co, VA, so it’s safe to assume this family of six moved to Harrison Co, VA in 1802.

Thomas NEWLON is in the Harrison Co, VA PPTL from 1802 to 1813. Thomas wrote his Will on 7 Jul 1813, and the 1814 PPTL listed: Thomas NEWLON (heirs). His will had specific instructions for his first four children, and named three more children and wife Sarah. All seems in order… Except for the Harrison Co, VA 20 Jul 1805 marriage record for Thomas NEWLON and Sarah POWELL. And it turns out Sarah was the widow Sarah POWELL – her maiden name was Sarah STROTHER (daughter of Reuben STROTHER and Susannah BARTLETT) and she had married 17 Apr 1787 in Loudoun Co, VA to Henry POWELL who died c1804. Sarah POWELL is in the Harrison Co 1804 PPTL and “Henry POWELL heirs” are listed in the 1805 PPTL. Sarah brought 5 to 7 POWELL children to the NEWLON household when they married in 1805. What a packed house…

Thomas NEWLON’s wife

But back to the story – who, then, was Thomas NEWLON’s first wife? She would be the mother of son John NEWLON, my ancestor, and therefore whoever she is, she’s also my Ancestor. John NEWLON is my Ahnentafel 18; Thomas NEWLON is my Ahnentafel 36; and his first wife is my Ahnentafel 37.
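The Ahnentafel numbers used here follow simple arithmetic: the father of person n is 2n and the mother is 2n+1. A minimal Python sketch of that arithmetic (the function names are my own, chosen for illustration) shows how 18, 36 and 37 relate, and how to test whether a more distant number sits on the same ancestral line:

```python
def parents(n):
    """Ahnentafel numbers of the father and mother of person n."""
    return 2 * n, 2 * n + 1


def generations_back(n):
    """Generations between person 1 and ancestor n (parents = 1)."""
    return n.bit_length() - 1


def path_from_root(n):
    """Steps ('father' / 'mother') from person 1 up to ancestor n."""
    steps = []
    while n > 1:
        steps.append("father" if n % 2 == 0 else "mother")
        n //= 2  # step down to the child of this ancestor
    return list(reversed(steps))


def is_on_line(ancestor, descendant):
    """True if 'ancestor' is 'descendant' or one of their ancestors."""
    while ancestor > descendant:
        ancestor //= 2
    return ancestor == descendant


print(parents(18))           # John NEWLON's parents
print(parents(37))           # the first wife's parents
print(path_from_root(37))
print(generations_back(36))
print(is_on_line(74, 18))
```

Here `parents(18)` gives 36 and 37 (Thomas NEWLON and his first wife), and `parents(37)` gives 74 and 75, the next two Brick Wall positions on this line.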

I have searched for any clues since the 1980s, and others had been looking long before then… nothing. One researcher claimed he had proof she was Martha JANNEY, but went to his grave refusing to show the evidence. [Many people with online Trees show Martha JANNEY as Thomas’s first wife.] I spent a day at the history library at WVU in Morgantown, WV (my alma mater), where some said the JANNEY proof had been preserved… nothing. I searched the JANNEYs in the Loudoun Co, VA courthouse and several libraries… and found nothing. Well, I did find that the JANNEYs were Quakers and most lived in one area of Loudoun Co; and almost none had Slaves (Slavery was against their religion). The only clue I ever found was in the death record of her son, William NEWLON, who died 21 Sep 1881 Simpson, Taylor Co, WV. It listed his parents as Thomas and Susan. Informant – son, C L NEWLON [Chapman L]. Susan! Well, a small thread to hang onto.

I also note that Thomas’s first child with second wife Sarah was a girl, whom they named Susannah. And all but one of Thomas’ first four children named their first daughter Susannah. So, I’m convinced Thomas NEWLON’s first wife was named Susan, or Susannah.

So, Thomas NEWLON’s first wife was Susan…

During all these decades of research, most of us kept running into the same family story: A year or two after settling in Harrison Co in the western part of VA in 1802, Thomas NEWLON’s wife decided to return to her parents’ home in Loudoun Co, VA to get a Slave to help her “on the frontier”. She and son William (then maybe about 8 years old) rode horseback to Loudoun Co. While at her parents in Loudoun Co, she was poisoned and died (the Slave family did not want to be torn apart). What a tragedy that was! This story would explain the marriage of Thomas NEWLON, with 4 children, to Sarah STROTHER POWELL in 1805.

An 1878 Newspaper article in the Leesburg, VA Mirror contained a brief notice from “The Clipper” [in MO] of the death of Mrs. Cecelia McPHERSON, which occurred in Ralls Co: The deceased was … born in Loudoun Co, VA 3 Aug 1793, the oldest of a family of Wm [sic] NEWLON and her childhood was spent in the wilderness of the western portion of that state. In 1808 she returned to the place of her birth and was married there 1 Apr 1810 to Stephen McPHERSON, whose faithful consort she was until his death in 1847.

This article explains a lot. Cecelia’s birthdate; she was the eldest; she returned to Loudoun Co, VA in 1808 [when she was 15 years old – almost certainly to live with her grandparents – probably on Susan’s side, as Thomas’s parents, James and Catherine, were near the end of their lives]. NB: Wm [sic] NEWLON is clearly wrong – wrong in the original MO newspaper, or wrong in the VA newspaper, or wrong in a subsequent transcription. Cecelia’s father was Thomas NEWLON.

My Ancestral Brick Wall: Susan LNU c1774-c1802

So we are looking for a Susan [Last Name Unknown]; born c1774; married c1792 in Loudoun Co, VA (age 18); had 4 children: 1793, 1795, 1798, 1800; moved from Loudoun to Harrison Co, VA c1802; died c1804 (age 30); and her family was in Loudoun Co, VA at least from 1792 to 1810 and had Slaves.

I was stuck on this Brick Wall until 2017, when I turned to autosomal DNA for more clues.

Triangulated Group [01S24]

I’ve been Triangulating shared segments since about 2011, and had already formed about 370 Triangulated Groups (TGs) which covered basically all of my DNA – all 45 chromosomes. Thomas NEWLON and Susan are my 3xGreat grandparents – at the 4th cousin (4C) level. So I looked at all the TGs with closer cousin-Matches with known Common Ancestors (CAs) pointing to my NEWLON ancestry. Several of these TGs already had more distant cousins on the NEWLON side, so I set those aside. I finally decided to start with a large TG that I called [01S24].

TG [01S24] already had four Matches who were 4C from Thomas NEWLON. The TG included over 100 Matches, and none had been found to go back up the NEWLON ancestry. In addition, there were over 25 Matches from AncestryDNA who had uploaded to GEDmatch or tested at another company and I knew their Ancestry name. I had the AncestryDNA Helper installed in my Chrome browser, so I was able to visit each of these Matches and, in the lower left of their page, I could download all of their Ancestors to a spreadsheet. I did this, and then combined all the spreadsheets into one and sorted on the Ancestors.
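The "combine and sort" step amounts to counting how many Matches carry each surname. Here is a minimal sketch of that tally in Python, assuming the ancestor lists have already been read out of the spreadsheets into a dictionary. The match names and surnames below are invented placeholders, and the spelling-variant table is an assumption for illustration, not an export format of any tool.

```python
from collections import Counter

# Assumed variant table: fold spelling variants into one label.
VARIANTS = {
    "CUMMINS": "CUMMIN/GS",
    "CUMMINGS": "CUMMIN/GS",
    "CUMMING": "CUMMIN/GS",
}


def tally_surnames(ancestors_by_match):
    """Count how many Matches have each surname in their Tree.
    A surname counts once per Match, however often it appears."""
    counts = Counter()
    for surnames in ancestors_by_match.values():
        normalized = {VARIANTS.get(s.upper(), s.upper()) for s in surnames}
        counts.update(normalized)
    return counts


# Invented example data, not real Matches or Trees.
ancestors_by_match = {
    "Match A": ["Cummins", "Jones"],
    "Match B": ["CUMMINGS", "Cummins"],
    "Match C": ["Smith"],
    "Match D": ["Cumming", "Smith"],
}

for surname, n in tally_surnames(ancestors_by_match).most_common():
    print(surname, n)
```

Counting each surname once per Match keeps one large Tree from swamping the tally.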

I descend from a CUMMIN/GS

The clear Surname “winner” was CUMMINS/CUMMINGS – 9 of my AncestryDNA Matches had CUMMIN/GS ancestry. Bingo! This was a new surname for me. I then searched my FamilyTreeDNA Matches for this surname. In [01S24] 6 of them had CUMMIN/GS. At MyHeritage, I have 12 Matches who Triangulate in [01S24] and have CUMMIN/GS ancestry. I messaged my 23andMe Matches in [01S24] and 4 of them reported CUMMIN/GS ancestry. Yes, some of the Matches had tested at multiple companies, but some at each company were new – additional evidence that, somehow, CUMMIN/GS was in my Ancestry, and on TG [01S24].

Next was the process of creating a CUMMIN/GS Tree. A number of my Matches had already traced their line back to Alexander CUMMINS b 1677 Northumberland Co, VA, d 1738 Prince William Co, VA; m 1694 Northumberland Co, VA Sarah MUTTONE/MATTONE b 1677 Northumberland Co, VA, d about 1738 too. Several of their children died in Fauquier Co, VA. Two things soon became clear: 1) many of their descendants went to Fauquier Co, VA, and some went to adjacent Loudoun Co, VA; and 2) there is a lot of conflicting data about this family (particularly with people who can trace back to Fauquier and Loudoun and then accept other people’s Trees who say those CUMMIN/GS were from Scotland or MD). The records are few and, it appears to me, a lot of guesswork had taken place. But the DNA tells me most, if not all, of the CUMMIN/GS in Loudoun and Fauquier Co are related to each other – at least on segment [01S24]. Within [01S24] most of the Matches shared a DNA segment with most of the others. And I think, as I share this story with all of the Matches in TG [01S24], and they confirm that they match each other (and possibly others), they will come to the same conclusion that they probably, somehow, descend from Alexander CUMMINS and Sarah MUTTONE. The weight of the evidence was that my Ancestor Susan was a CUMMINGS. Other, far less likely, alternatives are discussed below.

My Ancestor, Ahnentafel 37, was Susan CUMMINGS c1774-c1802 (hypothesis)

Next, I focused on the CUMMINGS in Loudoun Co, VA. In this effort, Pat Duncan was a big help. She has transcribed many of the Loudoun Co, VA early records and published a series of indexed books. She graciously emailed me the early Tax Lists for CUMMIN/GS, and pointed out there was only one man who had Slaves in the time period I was looking at: John CUMMINGS.

John CUMMINGS

There were two John CUMMINGS in the Tax Lists – one had stud horses and race horses and Slaves, and the other did not. In working through all the records I came up with a John CUMMINGS in the Loudoun Co, VA Personal Property Tax Lists from 1787 to about 1811, almost always with horses and Slaves. On 25 Mar 1811 John CUMMINGS and wife Jane of Loudoun sold land. On 12 Apr 1813 there are two records in Loudoun Co, VA:

  1. John CUMMINGS married Margaret EMERSON
  2. John CUMMINGS of Culpeper Co, VA to Margaret EMMISON of Loudoun – a marriage contract for Margaret to receive a child’s portion in lieu of dower for sake of John’s children by former wife.

Searching back through the records we find John CUMINGS married Jane JOPSON 23 Jun 1780 in Newtown, Bucks Co, PA. There are other Bucks Co, PA records from 1781 to 1785 with John CUMMINGS, including the 1785 Will of Richard JOPSON which mentions daughter Jane CUMMINGS.

John CUMMINGS b 1746 VA; d 1826 Culpeper Co, VA

Other records for John CUMMINGS, to trace his life, have been hard to find. Most researchers, including LDS FamilySearch record 29QJ-C42, have John CUMMING born 1746 Ireland; died 10 Oct 1826 Culpeper Co, VA. And John CUMMINGS is in the 1820 Culpeper Co, VA Census (born before 1775, wife born before 1775, 7 Slaves). Given the many DNA Matches to the CUMMIN/GS in Loudoun and Fauquier Co, VA, I’m pretty sure this John CUMMINGS was born in VA, not in Ireland. However, I have not found a record, yet, that indicates a birth year of 1746. So to summarize so far:

Susan CUMMINGS born c1774; married c1792 in Loudoun Co, VA (age 18); had 4 children: 1793, 1795, 1798, 1800; moved from Loudoun to Harrison Co, VA c1802; died c1804 (age 30); and her family was in Loudoun Co, VA at least from 1792 to 1810 and had Slaves.

John CUMMINGS b 1746 VA; d 10 Oct 1826 Culpeper Co, VA; m 23 Jun 1780 Bucks Co, PA Jane JOPSON (b 1753); moved to Loudoun Co, VA about 1787, where Jane died in 1811; John m 12 Apr 1813 Loudoun Co, VA Margaret EMERSON, and they then lived in Culpeper Co, VA.

But who was Susan’s mother?

Susan was born about 6 years before John CUMMINGS married Jane JOPSON in 1780. And if John CUMMINGS was really born in 1746, he would have been 34 years old in 1780. That’s not usual for this time and place. I believe John CUMMINGS had an earlier wife – someone he married before 1774 and who probably died c1778 – who was the mother of Susan. I still don’t have a clue as to who that first wife might be, but I’m still getting Matches who are cousins on the CUMMINGS line. I’m pretty sure John CUMMINGS did have an early wife and that Susan CUMMINGS was his daughter. That’s my hypothesis.

My Ancestor, Ahnentafel 74, was John CUMMINGS b 1746 VA; d 1826 Culpeper Co, VA

I’ve now built a tentative Tree connecting John CUMMINGS back to Alexander CUMMINS and Sarah MATTONE. And I’ve connected most of the 17 Matches in [01S24] into this Tree. Based on the Triangulated Group, I’m convinced that all of them tie back to Alexander and Sarah somehow. And I’m sure that other Matches in [01S24] will be found to have this ancestry, too. I’m also sure, based on the number of overall Matches, and the fact that they tie to the CUMMINS lines at different generations (from 5th to 8th cousins), that the DNA came down this CUMMIN/GS line to segment [01S24]. In [01S24] the DNA does not go back on any of the wives’ lines; it goes all the way back to Alexander CUMMINS. The fact that this DNA comes down the all-male line for 3 generations is why I’m seeing so many Matches with CUMMIN/GS ancestry in this segment. Other TG segments that go back to Thomas NEWLON and Susan CUMMINGS may well go further back through Susan’s mother. Then I can repeat this process all over and search for the surname and Ancestor for that Brick Wall. As the old genealogy saying goes: you solve one Ancestor and it generates two more to solve.

NEXT: Search for Ahnentafel 75 – Susan’s mother.

I hope this story shows the integration of Y-DNA and atDNA tools with traditional genealogy researching tools. This story could not be told without a good mix of both.

Notes:

  1. Does Susan have to be a CUMMINGS? No, her mother could be a CUMMINGS and her father could be some other surname… However, almost all of my Matches in [01S24] share 20 to 46cM with me. That’s a lot for a 5th cousin, much less a more distant one. So I’m pretty sure Susan is a CUMMINGS.
  2. I estimate that about 24 of my 371 TGs will be Ancestral to Thomas NEWLON and Susan CUMMINGS – say 12 TGs for the NEWLON side and 12 TGs for the CUMMINGS side. 6 of them will go back on John CUMMINGS’ side (including [01S24]); and 6 of them will go back through the first wife of John CUMMINGS. Those are the 6 I need to identify and start working on. NB: Each of the other 15 3xGreat grandparent couples will also have about 24 TGs. Of course, DNA is random, so our actual experience may vary a little.
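The estimate in Note 2 is simple division, sketched below in Python. It assumes the 371 TGs are spread evenly across the ancestors of a generation, which real, random DNA inheritance will not follow exactly; the function names are my own.

```python
TOTAL_TGS = 371


def couples_at(generations_back):
    """Number of ancestral couples at a given generation
    (3xGreat grandparents are 5 generations back)."""
    return 2 ** generations_back // 2


def expected_tgs_per_couple(total_tgs, generations_back):
    """Average TGs per ancestral couple, assuming an even split."""
    return total_tgs / couples_at(generations_back)


per_couple = expected_tgs_per_couple(TOTAL_TGS, 5)
print(couples_at(5))              # 3xGreat grandparent couples
print(round(per_couple, 1))       # TGs per couple, e.g. Thomas and Susan
print(round(per_couple / 2, 1))   # per side: NEWLON vs CUMMINGS
print(round(per_couple / 4, 1))   # per parent of Susan
```

The even split comes out a little under the rounded figures of 24, 12 and 6 used above, which is well within the "may vary a little" caveat.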

 

[23-37P] Segment-ology: Ahnentafel 37P – Breaking Through a Brick Wall by Jim Bartlett 20190804