D I Y Clustering

Posted on February 18, 2019 by Jim Bartlett

A Segment-ology TIDBIT

Automated Match Clustering involves large spreadsheets; selecting max and min thresholds; downloading data; using third-party tools; and then analyzing the clusters. Is there a different way to Cluster AncestryDNA Matches? I think there is… Do-It-Yourself Clustering.

I think we can select a Match and then look at our Shared Matches and then, often, see a trend or pattern among them. If we’ve used the Note boxes liberally (see below), we might see known Common Ancestors (CA) among the Shared Matches and/or a known Triangulated Group (TG) among them. Note that we sometimes know one of these Building Blocks (CA, TG) without knowing the other – that’s OK, they are both important clues that are “pointers” to a Cluster.

So… in the Notes for each AncestryDNA Match, we select some notation to indicate what this trend or pattern is. This notation would be the tentative “Cluster ID”. We could use a Surname [PLUNKETT]; or a Couple [PLUNKETT/HAM]; or the Ahnentafel for this couple [104/105] (or just the shorthand version: 104 – see a CA ID method here). Or, for Matches who have uploaded elsewhere, and we know the DNA segment(s), we could use that data (see one method, the TG ID, here). Feel free to use whatever system works for you to identify which Cluster you feel pretty sure about for this Match. If it’s not clear, just skip this Match and come back later (we’d do this a lot for Matches with Private or No or skimpy Trees). Note: I believe each Cluster is based on an Ancestral line. Clusters around a closer CA will probably have multiple TGs; a more distant CA will tend to have one TG.
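For anyone who chooses the Ahnentafel option as a Cluster ID, the arithmetic behind a label like [104/105] can be sketched in a few lines. This is only an illustration of standard Ahnentafel numbering (the father of person n is 2n, the mother is 2n+1); the function names are made up for this sketch and are not part of any tool mentioned in this post.

```python
def couple_id(n: int) -> str:
    """Return the 'father/mother' Ahnentafel pair that ancestor n belongs to."""
    if n < 2:
        raise ValueError("Ahnentafel 1 is the tested person, not an ancestor")
    father = n - (n % 2)  # fathers are even; the mother is the next number up
    return f"{father}/{father + 1}"


def generations_back(n: int) -> int:
    """Generations between the tested person (1) and ancestor n."""
    return n.bit_length() - 1  # 2-3 -> 1, 4-7 -> 2, 8-15 -> 3, ...


def line_of_descent(n: int) -> list[int]:
    """Ahnentafel numbers from ancestor n down to the tested person."""
    path = []
    while n >= 1:
        path.append(n)
        n //= 2  # the child of ancestor n is n // 2
    return path


print(couple_id(105))        # 104/105
print(generations_back(104)) # 6
print(line_of_descent(104))  # [104, 52, 26, 13, 6, 3, 1]
```

The descent list passes through 3 (the mother), so a Cluster labeled 104 sits on the maternal side. A couple six generations back would usually be shared with cousins at about the 5C level.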

A real aid in this process is MEDBetterDNA. It’s a Chrome extension, so you must use the Chrome browser (free). It has several features but the critical one here is that you’ll see all your Match Notes all the time (no need to click on the little “page” icon). Google MEDBetterDNA and use checkbox: “always show Notes”. It REALLY helps in looking down a long list of Shared Matches. [BTW: it would be very nice for AncestryDNA to make this standard…].

To use this process, we also need to use the Note box – we need to enter any CA or TG we find for a Match. I started with all my Hints – each one had at least one CA. And, as I looked over all my closest Matches, I found more CAs. Sometimes I found Matches at GEDmatch, which I could Triangulate and link to AncestryDNA Matches, giving me a TG in the Note box. Whatever system you’ve used to find cousins with CAs or TGs, enter what you’ve found in the Note box. Then, for all Matches over 20cM, you’ll see those Notes when they are in a Shared Match list. The homework assignment here is to enter Notes for as many of your 4th cousins (4C), or closer, as possible. Note that you’d need this same data in order to get anything out of a Match Clustering Matrix spreadsheet.

Then, start with 4C (saving closer cousins for later) and look at each Match. See if you can tell from their Notes and the Shared Matches’ Notes what the Cluster would be. Maybe there will be multiple choices. Whatever it is, enter your Cluster ID in the beginning of the Note box. Go to the next 4C Match and repeat. Skip any Match you want – this is an iterative process, and you may need to go through your list several times – I believe the Cluster IDs will “tighten up” – become more solid – with each iteration. At some point, even the Matches with Private/No/Skimpy Trees will have lots of Shared Matches with the same Cluster ID. Give that Match a Cluster ID, too!
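The process is done by eye, but the iteration can be sketched as a simple "vote among Shared Matches" loop. This is a rough sketch under stated assumptions, not a description of any existing program: the `shared` and `cluster` dictionaries, the match names, and the `min_votes` and `min_share` thresholds are all hypothetical, and it gives each Match only one Cluster ID, whereas in practice a Match may belong to several.

```python
from collections import Counter


def suggest_cluster(match, shared, cluster, min_votes=2, min_share=0.6):
    """Suggest a Cluster ID for `match` from its Shared Matches' IDs, or None."""
    votes = Counter(cluster[m] for m in shared.get(match, []) if cluster.get(m))
    if not votes:
        return None  # no Shared Match has a Cluster ID yet: skip for now
    label, count = votes.most_common(1)[0]
    if count >= min_votes and count / sum(votes.values()) >= min_share:
        return label
    return None  # not clear enough: come back on a later pass


def iterate_clusters(shared, cluster, max_passes=10):
    """Repeat passes over the list until no new Cluster IDs are assigned."""
    cluster = dict(cluster)  # work on a copy of the starting Notes
    for _ in range(max_passes):
        changed = False
        for match in shared:
            if cluster.get(match):
                continue  # already has a Cluster ID
            label = suggest_cluster(match, shared, cluster)
            if label:
                cluster[match] = label  # visible to later Matches in this pass
                changed = True
        if not changed:
            break
    return cluster


# Made-up example: Bob and Cat already carry a Cluster ID in their Notes.
shared = {
    "Ann": ["Bob", "Cat", "Dan"],
    "Bob": ["Ann", "Cat"],
    "Cat": ["Ann", "Bob"],
    "Dan": ["Ann", "Eve"],
    "Eve": ["Dan"],
}
result = iterate_clusters(shared, {"Bob": "#C0856", "Cat": "#C0856"})
print(result)  # Ann picks up #C0856; Dan and Eve stay unassigned
```

Dan has only one Shared Match with a Cluster ID, so the sketch leaves him for a later pass, just as a Match with too few clues would be skipped by hand.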

After you’re satisfied with the 4C list, you can cycle back to the 3C list, and confirm that they are compatible with the trend of their Shared Matches. Each 3C may be associated with several Clusters. In fact some of your 4C Matches may have a few Clusters. This is OK – but multiple Clusters should be for adjacent ancestral lines which eventually converge (marry) at some level.

At this point, you can look at Matches beyond the 4C level. Many of my Hints with CAs are beyond 4C. Many of them will have Shared Matches (4C or Closer), and the Notes will point toward a Cluster ID. Although these distant Matches won’t show up in a Shared Match list, I’d still enter the Cluster ID in the Note box, just to keep track. You’d also need to list these separately – in a spreadsheet or on paper. However, if you put a hashtag, like #Cluster in your Notes, you can search on different Clusters. I just searched my AncestryDNA Results for #A0856 [my hashtagged CA ID] and 10 Matches popped up, including Matches with 6.3cM, 7.4cM and 13.2cM.

If I decided the above distant CA, #A0856, was a good Cluster ID, I’d enter #C0856 as the first entry in the Notes for all the Matches I thought were in that Cluster. Later, I could make a download and sort on the Note field to group all the Matches by Clusters. Or I could easily check my work against an Automated Match Clustering Program. Hopefully there wouldn’t be many differences.
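As a rough illustration of the "download and sort on the Note field" step, here is a minimal sketch that groups Matches by the #C hashtags in their Notes. The column names (`name`, `cM`, `note`) and the sample rows are invented for the example; a real download will have its own layout, so treat those names as assumptions.

```python
import csv
import io
import re
from collections import defaultdict

# A Cluster tag is assumed to be "#C" followed by digits, e.g. #C0856.
CLUSTER_TAG = re.compile(r"#C\d+")


def group_by_cluster(rows):
    """Map each Cluster tag to the names of the Matches whose Notes carry it."""
    groups = defaultdict(list)
    for row in rows:
        for tag in CLUSTER_TAG.findall(row["note"]):
            groups[tag].append(row["name"])  # one Match may land in several Clusters
    return dict(groups)


# Made-up sample standing in for a downloaded Match list.
SAMPLE = """name,cM,note
Match One,13.2,#C0856 #A0856
Match Two,7.4,#C0856
Match Three,35.0,#C0104 #C0856 two lines
Match Four,22.1,no tree yet
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
groups = group_by_cluster(rows)
for tag in sorted(groups):
    print(tag, groups[tag])
```

Because the grouping reads every tag in a Note, a Match with two Cluster IDs shows up under both, and a Match with no tag is simply left out until it gets one.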

The beauty – and benefit – of DIY Clustering:

You can put a Match into more than one Cluster! Clustering programs have trouble with close cousins and multiple CAs/TGs – they don’t fit into just one Cluster. But what’s wrong with putting a Match into two or three Clusters if they really fit? Nothing – you are in charge with DIY Clustering.
With Automated Match Clustering, you must have all your clues in place, up front. With DIY Clustering you can select which Clusters to work on first, and then get to the others later. Work at your own pace.
DIY Clustering is primarily for AncestryDNA Matches, but you can also compare these Clusters with Match CAs and TGs from other companies. They should align and reinforce each other.

So, if you’d rather not use a Match Clustering program/spreadsheet, Do-It-Yourself. It involves entering Notes in a lot of Matches, but that is a good practice anyway. And the good news is you can adjust your Notes, and Cluster designations, as you go along. I actually believe we’ll get a better result with this DIY method, which we can easily tweak. I’m going to try it.

[22AD] Segment-ology: DIY Clustering TIDBIT by Jim Bartlett 20190218

14 thoughts on “D I Y Clustering”

JD4x4 on April 4, 2019 at 4:31 pm said:

Don’t know how I missed this post! I’ve been “clustering” on Ancestry since the end of 2015, and it was pivotal in finding my biological parents (with nothing other than Ancestry DNA matches).
It is essentially a spreadsheet pivot table where (to be manageable) a range of matches in a small range of estimated relationship (such as 1st-2nd cousins, etc) are shown in columns across the top, and the same range plus more distant are plotted down the left rows. The intersecting rows/columns contain a mark (count) when the column person is ICW the row person. Once plotted, groups of rows are sorted against the columns, resulting in a “matrix” of matches. Generally the paternal/maternal “split” (and beyond) within the matches is quite obvious.

My current method uses a data gathering process (the DNAGedcom Client) to gather the ICW data, and I query the Client’s SQLite db to generate a custom csv file that eliminates needing to merge & clean up the default Client csv output. If anyone really wants to get “their Geek on” and have a go, here is a rough tutorial of my (almost current, I’ve tweaked some queries) method.
https://drive.google.com/file/d/0B1s_CBMR_Ew3TmhxMk5xUmc1d3M/view

I have to give credit to Jon Masterson from the DNA Adoption Yahoo group for first suggesting this in mid 2015. (https://scruffyduck.screenstepslive.com/s/help_docs/d/qqfj4p)

- jim4bartletts on April 4, 2019 at 5:29 pm said:
  
  JD4x4 – I still use this D I Y Clustering process – sort of. I first list any known TGs or CAs in the Note box of each Match. I then summarize the Notes from the Shared Matches (easily visible using MEDBetterDNA). This is tedious work (just like Triangulating is tedious work). Now, we have several automated Clustering Programs that basically take just a few clicks (no spreadsheet skills) – click at DNAGedcom to download the Match & ICW files, then click at DNAGedcom or Shared Clustering to generate a Cluster Matrix – Shared Clustering created a Cluster Matrix in less than 60 seconds, which included all of my 5,731 Family Finder Matches – generating 352 Clusters – each one around an Ancestor (whether or not I knew the Ancestor). I also Clustered all 3,571 of my AncestryDNA “4th cousins” (20cM threshold) and got 158 Clusters – the Shared Clustering program included all of my Notes, so I could look at each Cluster and quickly see if I had consensus on the TG and/or CA. All without any spreadsheet manipulation beyond scrolling down. It’s almost too easy. Your goals are toward a few clusters with close cousins to find bio parents; my goal is chromosome mapping to the edges of my genealogy – roughly 8th cousin level – requiring more clusters on more distant Ancestors. Jim
  
  - JD4x4 on April 4, 2019 at 6:13 pm said:
    
    Yes, I messaged with both Rob W and Jonathan B when they were working on theirs, and it’s quick & easy but I like being able to pull out groups of ICWs in distinct ancestor clusters, and I’ve been doing it for so long now that it’s almost as fast for me as the generation of the block charts (once the data is gathered). I can see distinct clusters back to 5 gens and with Ancestry records I’ve “tree-triangulated” lots of matches in-between so I haven’t spent much time on segments. And as a bonus, I reunited with my 84 year old birth Mom. 😛
    
  - jim4bartletts on April 4, 2019 at 8:44 pm said:
    
    Congratulations on reuniting. Jonathan and I did a lot of back and forth, also. He liked my data set, because I had every Match at FTDNA in a TG (or labeled false) and we could then easily verify the correlation between TGs and CGs (Cluster Groups). I spent some time with Rob at the FTDNA conference in Houston TX, and he has modified his program to allow the lower thresholds I want. And EJ and I have been communicating about the Genetic Affairs Clustering Program – it didn’t go deep enough for me; and there is some talk of incorporating TG info from MyHeritage and 23andMe. I also sat with the GEDmatch team in Houston, and they too are going to roll out a Clustering Program which will help us combine data from different companies.
    
familyhistorian78 on February 21, 2019 at 1:40 pm said:

I put who the common ancestral couple are and which child of the couple that they are descended from. i do the same for half cousin relationships. with new matches that have no trees, or have private trees, if they cluster with a particular line, i will make a note of it. (ex: Smith line, etc) i also make a note if the match is also on 23&Me, FTDNA, My Heritage, or on multiple dna sites. for GedMatch, i put their kit# down.

gengenaus on February 19, 2019 at 2:06 am said:

Thanks for this Jim – it is very helpful indeed. I have a Mac and so many of the new genealogy third-party tools are not Mac-compatible and so I’m always looking for useful DIY tips. This is a very straightforward and well-reasoned system that makes the most of the very limited tools available on Ancestry.

- jim4bartletts on February 19, 2019 at 6:41 am said:
  
  Thanks – that was the goal – something most folks could do – at their own pace – and in any direction they want – with the “tools” we are given at AncestryDNA. Jim
  
Dana Leeds on February 19, 2019 at 1:06 am said:

I think this would work well for those who have already done a lot of work with their matches. However, part of the beauty of both manual and automated Match Clustering is that you don’t need to know anything about your matches – the DNA is really creating the clusters.

Here’s a suggestion – with the DIY method you could possibly use #1, #2, #3, etc, for groups you don’t know anything about. And, if you don’t know anything about any of the groups – perhaps because you’re an adoptee – you can use something generic like these numbers and then try to find the common surnames, ancestors, and places.

- jim4bartletts on February 19, 2019 at 1:23 am said:
  
  Dana,
  
  I agree on both counts. However, if someone does Match Clustering without any idea of who is in each Cluster, it’s just 4 or 8 or 16 or more glumps of Matches – somehow we need to get a toe hold on an Ancestral line for each one. So I think there is some amount of “set up” everyone needs to do to get started with any kind of Clustering. I think D I Y Clustering lets you start with only one or two toe holds and, through Shared Matches, build a Cluster.
  And yes, #C1 or #C2 would be good. In my Summary Report of Match Clustering, the table shows 65 (of my 151 Clusters) had no Notes. Part of that is because I have not been through all 3,000 of my “4C” Matches yet and tried to divine a #Cluster based on their Shared Matches. Also, each of our Clusters usually has a lot of Matches which are “gray” and not in the Cluster – so the D I Y method may bring more Matches into #Clusters.
  I’m glad I have an automated Cluster Matrix to bounce off any D I Y Clustering. I’m also interested in a back-check on the automated algorithm. Jim
  
Ernest Kapphahn on February 19, 2019 at 12:32 am said:

I’ve been using this DIY cluster matching for some time. Every day I look at the new AncestryDNA matches (usually about 3 pages worth) that are over 14 cM and try to determine based on my notes which group they belong in. I also look at smaller matches that have trees or unattached trees. The more matches that have notes, the easier this process will be. Looking at the “new” matches in size order will put all those matches on top, and the ones “good” or better will be most likely to have shared matches. Having the family line surnames in the notes also allows you to search the notes with the Ancestry Chrome extension and get a list of a given surname. Hopefully that will eventually be helpful with the “Tree Leap” function that Ancestry is beta testing.

- jim4bartletts on February 19, 2019 at 12:42 am said:
  
  Ernest,
  I, too, have been filling the Notes boxes as fast as I can. And I’ve been summarizing the starred Shared Matches (those with a CA or TG). We can do that for any Match, even Pvt ones. I’ve been able to message many Matches and correctly guess our Common Ancestor. Now I’ll be adding a #Cluster designation, which I can then search for and all the Matches come up at once. It’s great!
  I’m a long time user and fan of Ancestry’s tools, but I’ve learned to be careful – I’m now getting a flood of UK baptisms and marriages, etc. based only on same name – not even same century, sometimes. It’ll be interesting to see how they “Leap”…
  Jim
  
Jenny Franklin on February 19, 2019 at 12:15 am said:

Jim, I am an avid DIY cluster maker and have spent some time capturing the process as a set of rules and procedures to share with friends, but I will say few people seem willing to commit to the systematic, evolving effort required to see it pay off. One approach that I take is to start by demonstrating the process for just one or very few chromosomes where well known cousins with well understood lines of descent reside. Once you work through that, both the productivity AND the commitment become much clearer. It also can give a clearer view of which tools work best. This can help end run the burn out that happens when people encounter the scope of information and ambiguity of solving segments and feel overwhelmed.
