Shared Clustering – A Great Tool!

A summary of some different Clustering programs is here. I’ve used, and liked, most of these programs, and I want to highlight one of them here.

Shared Clustering by Jonathan Brecher is a good, flexible tool – it does what I want, quickly. It doesn’t have the glitz of Genetic Affairs or the other features offered by DNAGedcom Client. But it gets the job done for me, efficiently, and it’s free. Detailed steps are at the bottom of this post.

Some comments on Shared Clustering:

– I used a 6cM threshold and downloaded all 118,853 of my Matches (and Shared Matches) at AncestryDNA in 2 hr 34 min.

– I then ran a Cluster report with a 90cM threshold in 2 seconds (that’s not a typo): 34 Matches in 8 Clusters.

– Each Cluster is assigned a number.

– Each Match is shown in one Cluster – the one with the most matches to other Shared Matches – the most “heat” in a heat-map program.

– AND all of the Correlated Cluster numbers are also shown for each Match. These are Clusters where the Match also has an affinity – the Match shares some Shared Matches with the rest; just not as much as in the Cluster it’s assigned to. This is very handy, because sometimes our known relationship with the Match would be a better “fit” in one of the other Clusters – feel free to use judgment and assign a Match to any Correlated Cluster you want. OR, if a Match shares two segments with you, assign it to two Clusters. Omygosh – that violates the Cluster “rules”! But this is your data now, use your own judgment and bend the rules a little – just don’t get too wild…

– I ran multiple other Cluster reports; each one took only a few seconds.

– With a threshold of 28cM, I get 1105 Matches in 94 Clusters – in 4 sec. For me, that’s about one Cluster for each of my 5xG grandparents. Of course, it won’t fall out exactly this way, but that’s the general area of my Tree I’d be working in with these Clusters. Remember: Clusters tend to form on individual Ancestors.

– Each report includes a one-click link to each Match’s DNA page with me – very handy.

– All of the ThruLines Common Ancestors (CAs) are also included for each Match – a convenient check, if you haven’t already summarized each of them in the Notes, or if you are just checking for new Matches among your ThruLines.

– Each report includes all of my Notes (into which I’ve already summarized ThruLines, other CAs and TGs).

– VERY IMPORTANT: I can modify as many of my Notes as I want in the spreadsheet, and then easily click to upload that info back to AncestryDNA (it overwrites the Notes that I’ve changed – WOW, what a time saver). This uploads in under a minute. Use this feature to summarize ThruLines CAs into your Notes (if you haven’t already), and upload that back to AncestryDNA. Use the “Upload Notes” TAB.

– ALSO IMPORTANT: I can use the “Export” TAB to download my AncestryDNA data, including Notes, to an Excel file, giving me an inventory of all my Ancestry Matches (without the Clusters or Shared Matches). This is my go-to file whenever I’m searching for an Ancestry Match (like from a name or email at GEDmatch). It’s much better than using the AncestryDNA search system. And the hyperlink means I am just one click away from my Match’s DNA page with me.
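
The primary and Correlated Cluster idea described in the comments above can be sketched in a few lines of Python. This is only an illustration, not Shared Clustering’s actual algorithm: the function name `assign_clusters`, the Match names, and the simple scoring (one point for each Shared Match found in a Cluster) are all assumptions made for the example.

```python
from collections import Counter

def assign_clusters(shared_matches, cluster_members):
    """Return {match: (primary_cluster, [correlated_clusters])}.

    shared_matches:  {match: set of that match's Shared Matches}
    cluster_members: {cluster_number: set of matches in that Cluster}
    Hypothetical scoring: one point per Shared Match found in a Cluster.
    """
    result = {}
    for match, shared in shared_matches.items():
        scores = Counter()
        for number, members in cluster_members.items():
            overlap = len(shared & members)
            if overlap:
                scores[number] = overlap
        if not scores:
            # No Shared Matches in any Cluster: leave unassigned.
            result[match] = (None, [])
            continue
        # Highest score first; ties go to the lowest Cluster number.
        ranked = sorted(scores, key=lambda n: (-scores[n], n))
        result[match] = (ranked[0], ranked[1:])
    return result

# Made-up example data (not real Matches)
clusters = {1: {"Ann", "Bob", "Cal"}, 2: {"Dee", "Eve"}, 3: {"Fay"}}
shared = {
    "Gus": {"Ann", "Bob", "Dee"},  # mostly Cluster 1, some affinity to 2
    "Hal": {"Fay"},
    "Ida": set(),
}
assignments = assign_clusters(shared, clusters)
print(assignments["Gus"])
```

Here “Gus” lands in Cluster 1 with Cluster 2 listed as correlated, which is the kind of case where you might use your own judgment and move the Match.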

Some steps to get started:

Go to this page to download the program to your PC: https://github.com/jonathanbrecher/sharedclustering/wiki

Read the Home page, and then click on the download link on the right side (if you get a popup warning, tell your PC it’s OK).

Read the Introduction TAB, then select the Download TAB.

You are now working from your own PC – enter your Ancestry username and password and select your test.

I click on “Slow and Complete”, but feel free to try each of the radio buttons. I set “Lowest centimorgans” to 6cM to retrieve all my 118,000 Matches in about 2.5 hours. Note where your file is stored. If you set “Lowest centimorgans” to 20cM, you’ll get all your “fourth cousins” and closer in less than 10 minutes – this includes all the Matches who are used as Shared Matches.

After the Download is complete, select the Cluster TAB – the Saved Data File (from the Download) is usually shown by default, but you can also use files downloaded from other companies, if you want. The Cluster output file usually shows by default too – it’s the same name as the Saved Data File with “-clusters.xlsx” appended instead of “.txt”. You can change the name of this file if you want – I usually append the cM threshold I’m using (e.g. 28) after “clusters” so I can save them all with different, recognizable, names. Just make sure both files (the Download “txt” file and the new clusters “xlsx” file) are in the same folder. I’ve also set up a Clustering folder, and a sub-folder for the Shared Clustering program, and separate sub-sub folders with the date of the initial download file (e.g. 20191123) – so the Download and each of the Cluster runs would go in that (20191123) folder. A little work on organizing a file system really helps me remember what I’m doing….
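
If you like to plan your file names ahead of time, the naming habit described above can be sketched in Python. This is just my convention written as code, not anything the program requires: the function name `cluster_output_path`, the file name “mydata.txt” and the exact folder names are hypothetical.

```python
from pathlib import Path

def cluster_output_path(saved_data_file, threshold_cm):
    """Build e.g. 'mydata-clusters28.xlsx' in the same folder as 'mydata.txt'."""
    saved = Path(saved_data_file)
    # Drop the ".txt" ending, then add "-clusters", the cM threshold, ".xlsx".
    return saved.with_name(f"{saved.stem}-clusters{threshold_cm}.xlsx")

# Hypothetical folder layout: Clustering / program sub-folder / download date
example = cluster_output_path(
    "Clustering/SharedClustering/20191123/mydata.txt", 28
)
print(example.name)  # mydata-clusters28.xlsx
```

Because the output path is built from the Saved Data File’s own path, both files end up in the same dated folder.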

Click on the “Cluster completeness” button of your choice, and type your threshold in the “Lowest centimorgans” box. Then hit Process Saved and wait about 2 seconds.

This Chart shows the relationship between the cM Threshold selected and the number of Clusters that result (for the Download of my data). Your results may vary, but the shape of the curve will be the same. The curve flattens below a 20cM threshold, because Shared Clustering uses the Clusters at the 20cM threshold as a base and adds the other, smaller cM, Matches to the Clusters formed at 20cM. The smaller Matches (below 20cM) often have Shared Matches (all of whom are 20cM or higher), but there are no additional Shared Matches below 20cM. Experiment with your Download – it only takes a few seconds to change the Threshold cMs and get a new set of Clusters. NB: The Cluster numbers are uniquely formed during each Cluster report. They do NOT follow the Matches to other Cluster reports – they shouldn’t, because the new Clusters (formed on Ancestors) are different at different generations.
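
To judge which generation a Cluster count points to (like my 94 Clusters at 28cM, which I compared to my 5xG grandparents), a little arithmetic helps. The sketch below simply doubles the number of Ancestors at each generation; it assumes every Ancestor in a generation is a different person, and the function name `great_grandparent_count` is made up for the example.

```python
def great_grandparent_count(n):
    """Ancestors at the 'n x great-grandparent' level (n = 0 means grandparents).

    Parents = 2, grandparents = 4, and each generation back doubles again.
    """
    return 2 ** (n + 2)

for n in range(1, 8):
    print(f"{n}xG grandparents: {great_grandparent_count(n)}")
```

With these numbers, 94 Clusters falls between the 64 Ancestors at 4xG and the 128 at 5xG, which is the general area of the Tree I described for that threshold.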

Jonathan monitors the Shared Clustering Facebook page, and he’s always been very responsive. It’s good to visit that page and follow the conversations. And ask questions. And request improvement features.

https://www.facebook.com/groups/sharedclustering/

I will try to post soon on my Walk The Clusters Back project, using Clusters that should be focused on different generations in my Tree – very successful.

If I’ve messed up anything in this review of Shared Clustering, I hope Jonathan Brecher and/or other readers will provide feedback in the comments.

 

[19C] Segment-ology: Shared Clustering – A Great Tool! by Jim Bartlett (20191129)

Grouping Matches – Try It!

A Segment-ology TIDBIT

We can group Matches several ways:

  1. Each Triangulated Group (TG) includes Matches who share the same Common Ancestor (CA). This is based on your DNA segment from an Ancestor, which other Matches also share. 23andMe, MyHeritage and GEDmatch all have tools for Triangulation.
  2. Clustering includes Matches who share multiple Shared Matches with each other – they tend to be based on the same Ancestor. The Leeds Method focuses on 4 groups representing our 4 grandparents. This is based on the probability that groups of Shared Matches have the same Ancestor. When the lowest threshold is used (6cM), all of the company Matches are included and the Clusters tend to approximate a one-to-one relationship with TGs. This is a good tool to group our Matches at AncestryDNA and FamilyTreeDNA. I blogged about some Clustering programs here.
  3. We can also form Clusters based on ethnicity, geography, Haplogroups, etc., but, in general, these will not be as precise as TGs and Shared Match Clustering. These Clusters are, however, often very helpful in homing in on a CA.

Groups can help us in several ways:

  1. Everyone in a group should have the same objective: finding the CA. There is synergy in a group; and working together often results in a better outcome. One person’s Brick Wall or bio-Ancestor (vs. an NPE) may be in the Trees of other Matches in the Group.
  2. Close Cousins and their CAs with you provide a beacon toward the more distant CA, and limit the possibilities that would otherwise need to be explored.
  3. Once several Matches in a group agree on a CA, that CA line can be imputed to the other Matches. Many times I have searched a Match’s Tree for a specific Ancestor (highlighted in the Cluster), and found it! I’ve also communicated with Matches with no/small Trees and asked specifically about a surname and gotten positive/helpful responses.
  4. Use Clustering to form groups at FTDNA, MyHeritage and 23andMe, and use them as a basis for TGs – Triangulation goes much more quickly when you only compare segments that will probably Triangulate.

We can form Triangulated Groups at 23andMe, MyHeritage, GEDmatch, and, with a Clustering pre-start, at FamilyTreeDNA – but those companies, generally, do not offer much in the way of genealogy tools, and only a few of the Matches have robust Trees. On the other hand, AncestryDNA has a lot of good Trees, and great tools like ThruLines, but no DNA segment data – however, we can do Clustering. DNA and No Trees; OR Trees and NO DNA – it’s frustrating… So how can we merge the TGs and AncestryDNA’s Clusters?? More on this later…

BOTTOM LINE: We need both Triangulated Segments and Triangulated Genealogy to be in sync (reinforcing each other) before we can have confidence in our conclusions. One without the other is incomplete research.

 

[AQ] Segment-ology: Grouping Matches – Try It!  TIDBIT by Jim Bartlett 20191128

Extending the MRCA of a TG through Clusters

A Segment-ology TIDBIT

Triangulated Groups (TGs) are one way to group your Matches – grouping Matches who share overlapping DNA segments with each other. The DNA segment represented by a TG is passed down to you from one of your parents, and from more distant Ancestors on that side. As we find Most Recent Common Ancestors (MRCAs) with Matches in a TG, we begin to learn which ancestral line passed down the TG segment – a Common Ancestor.

Clustering is another way to group your Matches – grouping Matches who share other Shared Matches with each other. The Matches in a Cluster also tend toward a Common Ancestor.

The companies where Triangulation is possible generally do not have many robust Trees. And so the TGs do not have many MRCAs. AncestryDNA has many robust Trees and a ThruLines tool that determines many Common Ancestors (CAs), but does not provide the DNA data needed for Triangulation.

Is there a way to combine the best of both worlds? I think there is. TGs and Clusters should be grouping on the same thing – an ancestral line – a Common Ancestor. When a TG and a Cluster clearly have some of the same Matches, I think the deeper MRCAs in the Cluster can be imputed to the TG.
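
As a rough illustration of lining up a TG with a Cluster, here is a small Python sketch. It is not a feature of any company’s tools: the function name `best_matching_cluster`, the Match names, the MRCA note, and the rule that “clearly” means at least two Matches in common (`min_shared=2`) are all assumptions for the example.

```python
def best_matching_cluster(tg_matches, clusters, min_shared=2):
    """Return (cluster_number, common_matches) for the Cluster that
    overlaps the TG the most, or None if no Cluster shares enough Matches."""
    best = None
    for number in sorted(clusters):
        common = tg_matches & clusters[number]
        if len(common) >= min_shared and (
            best is None or len(common) > len(best[1])
        ):
            best = (number, common)
    return best

# Made-up data: one TG, two Clusters, and MRCA notes for Cluster Matches
tg = {"Ann", "Bob", "Cal"}
clusters = {4: {"Ann", "Bob", "Zed"}, 7: {"Cal", "Yan"}}
mrca_notes = {"Zed": "hypothetical 4xG couple"}

candidates = {}
hit = best_matching_cluster(tg, clusters)
if hit:
    number, common = hit
    # MRCAs from Cluster Matches who are NOT in the TG are the clues
    # that might be imputed to the TG.
    candidates = {
        m: mrca_notes[m] for m in clusters[number] - tg if m in mrca_notes
    }
print(hit, candidates)
```

In this made-up case Cluster 4 shares two Matches with the TG, so the MRCA noted for “Zed” becomes a candidate to check against the TG’s known line.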

This is another method of Walking The Ancestor Back (WTAB) in a TG. You already have a TG with some MRCAs along the same line. The Most Distant Common Ancestor (MDCA) in a TG usually represents a couple, one of whom passed down the TG segment. The next generation back has only two possibilities: the paternal or maternal side of the MDCA.

The MRCAs in a Cluster aligned with the TG provide a strong clue in reinforcing and extending the CA for a TG.

 

[22AP] Segment-ology: Extending the MRCA of a TG through Clusters TIDBIT by Jim Bartlett 20191118