Manual Clustering to Find Ancestors

Below I outline a process to find a Target Ancestor (TA) – often a Bio-Ancestor or a brick-wall Ancestor, or maybe an “iffy” Ancestor that needs confirming. This is a follow-on to Manual Clustering From the Bottom Up. But first, here is a little background.

DNA – We all get exactly 1/2 of our autosomal DNA (atDNA) from each parent; pretty close to 1/4 from each grandparent; 1/8 from each Great grandparent; etc. Yes, beyond parents, these fractions are not exact, but for genealogy they are pretty close. The point is that for several generations going back, we get a lot of DNA from each Ancestor – and roughly the same amount from each one in any generation.
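The averages above are just powers of two, which a few lines of Python make explicit. This is a minimal sketch of the arithmetic only; the function names are made up for illustration, and real shares beyond the parents vary around these averages.

```python
def ancestors_in_generation(generation: int) -> int:
    """Number of Ancestors in a generation (1 = parents, 2 = grandparents, ...)."""
    return 2 ** generation


def expected_share(generation: int) -> float:
    """Average fraction of atDNA from ONE Ancestor in that generation.

    Exact for parents; only an average for earlier generations.
    """
    return 0.5 ** generation


for g in range(1, 6):
    print(f"generation {g}: {ancestors_in_generation(g)} Ancestors, "
          f"about {expected_share(g):.4f} of atDNA from each")
```

The same counts explain the group targets used later: 4 grandparents, 8 Great grandparents, 16 2xG grandparents.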

Matches – All things being equal, we would get roughly 1/2 of our Matches through each parent; 1/4 through each grandparent; etc. But all things are often not equal:

            1. We tend to get fewer Matches from Ancestors who are recent immigrants (say in the last 4-5 generations). This is because most test takers are Americans, and relatives on a recent immigrant line are more likely to live elsewhere.

            2. We will get fewer Matches from “skinny” families – an Ancestor who had 2 children will have far fewer descendants (and Matches) than an Ancestor who had 15 children.

            3. Endogamy results in many more Matches than usual (think Jewish, Mennonite, Polynesian, etc.)

Each of these factors will unbalance, or skew, the number of Matches we get for each Ancestor.

For the purposes of this process, I’m going to assume that our Matches are generally spread fairly evenly over our bio-Tree. To the extent this isn’t true in your case, this process may be more complex, or it may not even work.

When we are looking for a TA, the concept is that a good chunk of our DNA came from that Ancestor (depending on the number of generations back), and a good chunk of our Matches will be cousins – either to or through that Ancestor.  Although the Ancestor is not known to us, the TA did have 2 parents, 4 grandparents, etc., and those more distant Ancestors may be well known to our Matches. NB: an immigrant Ancestor may throw us a curve ball here.

The process overview is:

1. Group Matches;

2. Find the Common Ancestor (CA) in each group;

3. Build down from the CA to find links between groups (usually, but not always, a marriage);

4. Build down from those couples;

5. Repeat as necessary (usually down to parents or grandparents of the TA);

6. The end-game may involve date and location issues and further targeted DNA testing to isolate and identify the TA.

More step-by-step details:

Step 1: Group Matches – this is basically Manual Clustering at AncestryDNA. Start with the Match list from 400cM down. How far down depends on the generation of the bio/brick-wall Target Ancestor (TA). You want to go back 2 more generations. So, if the TA is one of 4 grandparents (1C level), you’ll want Matches who are 3C level, say a 50cM lower threshold. You want 16 groups. NB: if you can filter out some Matches – say you know one “side” and can identify those Matches – you can cut the groups to 8. And you may be able to quickly identify 4 of those groups to a known grandparent. This would leave you with the 4 groups that represent the 4 grandparents of the TA.

1A. List these Matches in a spreadsheet or write them on a piece of paper

1B. Select a Match about 3/4 of the way down the list [avoid starting at the top!]

1C. Open that Match’s Shared Match (SM) list

1D. Put an A next to that Match and each Match on your List who is on the SM list. This forms a Manual Cluster A, which tends to have a Common Ancestor (CA).

1E. Start over at Step 1B, selecting a Match who is not A, and use B.

1F. Repeat as often as necessary, using new letters, until all Matches on your list have at least one letter. NB: The Matches at the top of your list may wind up with multiple letters.

1G. If lower cM Matches have multiple letters, review their SM list – usually one of the letters is a one-of-a-kind and that letter can be deleted. If there is a lot of overlap between two letters, they can be combined, using one letter. Use judgment.

Usually this Step can be done in a few hours.
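For anyone who would rather script the lettering than do it by hand, Steps 1B to 1F can be sketched in Python. This is only a sketch under stated assumptions: the Match list and Shared Match lists have already been typed or copied into Python structures, and the function name, the `start_fraction` parameter and the sample Matches are all invented for illustration. It does not do the judgment calls of Step 1G.

```python
from string import ascii_uppercase


def manual_cluster(matches, shared_matches, start_fraction=0.75):
    """Assign group letters to Matches, seeding from low on the list.

    matches: Match names ordered from highest to lowest cM.
    shared_matches: Match name -> set of names on that Match's SM list.
    Returns: Match name -> list of letters (top Matches may get several).
    """
    letters = {name: [] for name in matches}
    start = int(len(matches) * start_fraction)
    # Seed order: from about 3/4 down to the bottom, then back up the
    # list, so the largest Matches are only used as seeds last.
    order = list(range(start, len(matches))) + list(range(start - 1, -1, -1))
    labels = iter(ascii_uppercase)  # this sketch stops at 26 groups
    for i in order:
        seed = matches[i]
        if letters[seed]:
            continue  # already has a letter, so it is not a new seed
        letter = next(labels)
        letters[seed].append(letter)
        for other in shared_matches.get(seed, set()):
            if other in letters and other != seed:
                letters[other].append(letter)
    return letters


# Hypothetical Match list (highest cM first) and SM lists for the seeds.
matches = ["Top", "Ann", "Bob", "Cy", "Dee", "Eve", "Fay", "Gus"]
shared = {
    "Fay": {"Ann", "Bob", "Top"},
    "Gus": {"Cy", "Dee", "Top"},
}
result = manual_cluster(matches, shared)
for name in matches:
    print(name, "".join(result[name]))
```

In this made-up example the top Match ends up with two letters, which is the expected pattern noted in Step 1F.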

Step 2: Find CA for each Group – this takes some poking around…

2A. Select a Group, and open any available Trees (including Unlinked Trees)

2B. Type/write next to the Match the closest 10-15 surnames

2C. Repeat for as many Matches in the Group as possible

2D. Look/search among the surnames for common surnames

2E. Open the Match Trees and select Ancestor information with the common surnames

2F. Analyze and record the probable Common Ancestor for the Group [if necessary, look at more SMs for the Group Matches for confirmation of the CA]

2G. Repeat for each Group

2H. Note the place/date-range for each CA [these may be a clue to links between Groups]

This Step will take a little longer, depending on whether you want a quick result, or if you want to document the CA with records for the longer term.

If you recognize some Groups as being from known Ancestors in your Tree, they can be set aside. Ideally you want to end up with 4 Groups who would represent the 4 grandparents of the TA.
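The surname search in Steps 2B to 2D can also be done with a simple tally once the surnames are typed in. The sketch below assumes the surnames for each Match in a Group have been collected by hand; the function name, the `min_matches` cutoff and the sample names are invented for illustration.

```python
from collections import Counter


def common_surnames(group_surnames, min_matches=2):
    """Tally surnames that appear in the Trees of several Matches.

    group_surnames: Match name -> iterable of surnames from that Match's Tree.
    Returns (surname, number_of_matches) pairs, most frequent first,
    keeping only surnames found for at least min_matches Matches.
    """
    counts = Counter()
    for surnames in group_surnames.values():
        # Count each surname once per Match, ignoring case and stray spaces.
        counts.update({s.strip().upper() for s in surnames})
    return [(s, n) for s, n in counts.most_common() if n >= min_matches]


# Hypothetical Group A with made-up surnames.
group_a = {
    "Ann": ["Hart", "Doyle", "Pike"],
    "Bob": ["doyle", "Reed"],
    "Fay": ["Doyle ", "Hart"],
}
print(common_surnames(group_a))
```

A tally like this only flags candidates; the Trees still have to be opened to see whether the shared surname is the same family (Steps 2E and 2F). Where surnames are patronymic, a tally of places may be more useful than a tally of names.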

Step 3: Build down from the CA to find links between Groups – a genealogy exercise…

3A. Use genealogy tools to list the children, spouses, and grandchildren of the CA

3B. Pay particular attention to dates and places

3C. Sometimes a marriage between Groups will pop right up; but sometimes it takes a process of elimination (dates/places help here). It’s possible the bio-parents were not married; or other scenarios. You’ve narrowed the possibilities down a lot, but sometimes, there just isn’t a record of what really happened.

3D. Repeat for other groups

3E. Once you have linked some Groups (by marriage or by place/date or by ethnicity, etc), this helps link the remaining Groups.

If records exist, this Step may follow relatively easily; if not, follow-on DNA testing may be necessary.

Bottom line: This Step will provide some family lines that are ancestral to the TA. The top DNA Matches have led you to specific CAs. These may, or may not, mesh with information you already had.
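One more clue for linking Groups comes from Step 1F: the Matches at the top of the list often carry more than one letter, and two Groups that share such Matches are a good place to look for a marriage. A minimal sketch of that count follows, assuming the letters from Step 1 are held in a dictionary; the function name and sample data are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def linked_group_pairs(letters):
    """Count Matches that fall in both Groups of each pair of Groups.

    letters: Match name -> list of Group letters assigned in Step 1.
    Returns ((letter, letter), count) pairs, most shared first.
    """
    pairs = Counter()
    for groups in letters.values():
        for a, b in combinations(sorted(set(groups)), 2):
            pairs[(a, b)] += 1
    return pairs.most_common()


# Hypothetical letters: two top Matches sit in both A and B.
letters_by_match = {
    "Top": ["A", "B"],
    "Second": ["B", "A"],
    "Third": ["C", "D"],
    "Ann": ["A"],
}
print(linked_group_pairs(letters_by_match))
```

A high count is a pointer, not proof: dates and places from Step 3 still have to support the link.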

Steps 4 & 5: Build down from the linked couples, and repeat as necessary – the approach is the same as Step 3.

Step 6: The end-game – This may involve date and location issues and further targeted DNA testing to isolate and identify the TA. The best solution is that the TA is obvious. However, sometimes the TA is still buried, but you are somewhat closer.

Sidebar: I do this manually in Excel and Word. It is possible to use one of the auto-Cluster programs to group the Matches. However, I prefer “getting to know” the Matches and their Trees, and this process is fairly straightforward. It also lets me see any overlap between groups. I prefer to manually Cluster for a targeted case. I use the auto-Cluster programs when I’m grouping my entire Match list.

In one recent case I did, the marriage between two Groups popped right up – no secrets. The children included 5 sons, each a potential bio-father. All five were from PA and served in WWII; 4 came back to PA, and one settled in another state, a few blocks from the bio-mother! We’d never have sorted this out without the process above. And luck is sometimes the key factor.

In another case I’ve worked on for years, I used SMs and records to group many Matches on 8 Great grandparents, and 4 grandparents of the TA. Places and dates all work out, and all the top Matches are in agreement. WATO points to the same place. It now appears the father and mother were not married, and both of them apparently died without any other issue, or any records. It’s frustrating to have basically 100% of the Matches all pointing at the same TA, but without revealing the parents’ names. No luck on this one – just a lot of work. Maybe someday a Newspaper article from the 1880s will shed some light. The DNA can only do so much…

SUMMARY – The process above is my current best practice to squeeze out what I can from Matches and Shared Matches at AncestryDNA. This whole process can be done on notebook paper, in a relatively short time, but I still prefer Excel. Note that the process does not depend on knowing any genealogy of the TA, it relies totally on information from Matches and Shared Matches. Hopefully the TA, the last puzzle piece, “fits”.

[19M] Segment-ology: Manual Clustering to Find Ancestors by Jim Bartlett 20220226

25 thoughts on “Manual Clustering to Find Ancestors”

  1. Hi Jim
    Thank you so much for a really helpful post.
    Trying to find a TA who’s my dad’s ‘missing’ 2 x great grandfather, William Evans (we know he died between 1841 and 1851 and was born around 1770 but that’s it!) So following instructions I need to start 2 generations back from the TA, which means I have to list my dad’s 5th cousins.
    A question; if you would be so kind?
    Does that mean matches in the 4th to 6th cousin range or do I also need to include the matches in the 5th to 8th cousin range? I ask as my dad already has 442 matches in the 4th to 6th cousin range (22-64cM) and I’ve not even counted the 5th to 8th cousin matches! So I don’t want to start at the wrong place!

    As additional info: In the 4th to 6th cousin range, there are 16 matches with a ‘common ancestor’ which pan out, and these are spread across my dad’s paternal and maternal lines. But in none of these are his 5th cousins – they are mostly 2nd cousins twice or three times removed or 3rd cousins once or twice removed.

    And with the ‘common ancestors’ in the 5th to 8th cousin range, there are around 60. But with the ones I’ve verified, most are 3rd cousins three times removed or 4th cousins once or twice removed; just two 5th cousins, both once removed.
    (Also, as my dad’s family are mostly Welsh, the patronymic naming system was in place so there’s little chance of clustering around any common surnames.)
    Thank you.

    Liz

    • Liz, Thanks for your post. This is an emerging area, and we’re developing the process as we go. Are you using your Dad’s Matches or your own? In any case the point is to use Matches whose Common Ancestors will be one or two generations back from the TA. You don’t know the TA, and thus wouldn’t recognize him if you saw him. So the point is to find the TA’s parents and/or grandparents. At this distance, the DNA is widely spread, and the cMs and cousinships overlap each other a lot – it’s difficult to specify a definite range. You need known Matches who share the TA’s child as their Common Ancestor. Then work with their Shared Matches (who are likely to be on the same line). Try to find Common Ancestors among those Shared Matches. This helps establish a group of Matches that are pointed in the right direction. Please let us know if this helps. Jim

      • Hi Jim

        I’m using my dad’s DNA.

        Unsure if this is fatal but there are no known matches with TA’s son (b. abt 1820 & 1x great grandfather) alone; just a group of matches with both TA’s son and his wife – all descendants of the siblings of my dad’s grandfather, b 1841.

        I was able to ‘build trees’ for some matches, recognised common ancestors – TA’s son and wife – built a few more trees then created a core group of 5. (They are all my dad’s 2nd cousins twice removed). Then I created a larger group of all their shared matches – 16 in all – but no one else has a fledgling tree for me to build upon.

        Moreover, as I’ve only just realised that Ancestry won’t show shared matches below 20cM, those 16 shared matches are where I’m at. (Of course no one is on GEDmatch but I find the GEDmatch matches are low cMs usually with US gedcoms, if any, and my lot have never left Wales!)

        Suspect I may need to do a rethink in case I’ve already fast forwarded to the ‘end-game’ you refer to in your article.
        But very much appreciate what you wrote there and for taking the time to reply to me.

        Liz

      • Liz, a few more thoughts.
        1. Your father has 16 2xG grandparents – the TA is 1/16 of this – so, roughly, 1/16 of all of your father’s Matches should have shared DNA that came through the TA. The problem is finding them. Sometimes ethnicity can help, if the TA line had a special ethnicity. Sometimes other Matches (and their Shared Matches) can be identified with a different line than the TA – you can give them all a non-TA “dot”.
        2. Look at the Shared Matches of the Shared Matches – some will be repeats, but some new ones may pop up. Give them all a TA “dot color”.
        3. Now go through the remaining Matches (which don’t have a dot for TA or non-TA) and see if you can form groups with them. As you analyze those groups you might find they are TA or non-TA.
        4. At GEDmatch – pay $10 for one month of Tier 1 – then run an AutoSegment Cluster on the top 500 or 1000 or 2000 Matches, and see if you can pick out one or more TA Clusters to work with. Not many Trees at GEDmatch, so you may have to look for Ancestry kits in the Clusters…
        These are all ways to “fish” – and like real fishing, we are not guaranteed to catch a fish. Another tactic is to upload your father’s raw DNA to FTDNA and MyHeritage – for a different mix of Matches…

        It’s interesting, and concerning, that you cannot find any Matches who go back to other children of TA. I presume you’ve searched all of the Matches at Ancestry for that surname… Jim

  2. Jim, I appreciate your insight and step by step instructions, including the various caveats. For match grouping I believe there is one more important consideration that is all too often overlooked. This is the AGE DIFFERENCES between the tested person and their matches. Your grouping idea, to start with lower matches or 3/4 down the list, is a different, or arbitrary, approach to correct for the age problem. Basically, the approaches of grouping by cM’s to identify common ancestors (yours here or the Leeds Method) ASSUME the test taker and the matches are in the same generation. Years ago this was a pretty good assumption since most test takers were of similar ages. Today they are not. A great many much younger and much older testers are in the databases. What this means is the old range of 90-400 cM designed to identify someone’s 2C’s sharing a common grandparent, now, quite often, includes 1C once, twice or three times removed. Or great nieces/nephews or assorted half relationships, all of which share multiple grandparents (2, 3 or all 4). To help alleviate this problem I now advise looking at age differences first when trying to group your matches and removing anyone that is (or appears to be) 35 years or more younger than the test taker. Certainly this is far from perfect, but it has provided much better grouping success. These younger matches can then be added back as a second step, where they often fall into two or more of the groups that have been separated.

    • James – Thanks for your post – I agree with your insight about the relative ages of Matches. Clearly when the ages of the test taker (I am 78), and their Matches differ by a generation or more, the cousinship will likely involve “removes”. Also each person needs to look at their own tree for generational differences, too.

      I acknowledge that grouping processes (Triangulated Groups, Clustering (manual or auto), DNA Painter, etc) do not take these generational differences into account. They are not designed to. All of these grouping processes are designed to group Matches along an ancestral line. In general, most of the Matches in a group will have the same Common Ancestor.

      However, my MAIN POINT, is to FORM the Groups, and then WORK with individual groups (like the family they are) to pinpoint the Common Ancestor. At this point, this is a genealogy task, and relative ages, when known, should be taken into account. IMO, the problems we have are not with the DNA, it’s with the dearth of good/full Trees.

      Having worked forming hundreds of groups, my recommendation for finding bio-Ancestors, is still to start with a list of your closest Matches, and work from the bottom third up. IMO, this forms the different groups more clearly and quickly, with the smaller cM Matches in your list. This then lets us analyze the largest Matches and the several groups they may align with.

      I think this is very similar to your method – taking the closer Matches into account at the end of the process.

      Again, thanks for spotlighting this as a special area to take into account. Jim

  3. Thanks for a very interesting post! I immediately tried this for a person whose grandfather is unknown. I listed his matches down to 50 cM – 59 people – the best match shares 156 cM. I got 18 groups, but there is VERY much overlap even in the lower cM range. Only 24 of the matches are in only one group, the rest are in 2-6 groups. It seems difficult to know which groups can be combined, four of them seem possible to combine but the rest give me a headache… I think I’m stuck here. Oh well, it was worth a try.

      • I did as you suggested – started about 3/4 down the list, at match number 44 of 69. After grouping that first group, I continued with the first match from the top that didn’t already have a letter. Was that the right way to do it? I have also tried MyHeritage’s auto clustering, and found that the people that I now got in the four groups that may be able to be combined, are put in one big cluster by the auto clustering. I also know their MRCA couple, but not how my test taker is related to them.

      • Christina, I think you already guessed the answer. Stay at the bottom part of the list and work up. The top Matches should be last. I’m going to paste here an “epiphany” that I still need to clean up, add a diagram, and post, but I hope you get the idea: Work the Clustering from the smaller cMs up – DO NOT start at the top (largest cM) and work down. The Matches at the top of the list will be closer cousins, and probably include several of our 1xG or 2xG or 3xG grandparents. The Matches at the bottom of the list are much more likely to be related from only one of these 1, 2 or 3xG grandparents – that’s what we want: groups of Matches around more distant Ancestors. Ideally 8 Groups (one for each 1xG grandparent – or two for each 1xG grandparent couple) or 16 Groups, similarly spread across our ancestry. By working from the bottom up, we tend to isolate those different Groups. And we tend to see the topmost Matches also sharing in more than one Group – this is OK – it’s expected – actually it’s powerful! The DNA is talking to us. When the topmost Matches share in 2 groups, this means that those two groups are very likely to be related by marriage. Think about it. When we can find 8 Groups with the smaller Matches, they probably represent 8 1xG grandparents – we want to find out how they are interrelated to be the parents of our 4 grandparents; who are in turn interrelated to be our 2 parents! The topmost Matches then shout to us how these groups are interrelated! They give us a focus of where to look for that intermarriage (or mating). All of a sudden, we have a somewhat clearer picture of the Ancestry of the bio-Ancestor – our Target Ancestor (TA).

      • Oh, so I should have gone from the bottom up all the way? Then I don’t understand why I should start 3/4 down the list? Why not from the very bottom? And what did you mean by “Start over at A2, selecting a Match who is not A, and use B.”? I thought by A2 you meant cell A2 in the spreadsheet, so I did group B with the first person from the top who wasn’t in group A.

        Maybe this is a case of endogamy or pedigree collapse, I don’t know. Would it be better to group people who actually triangulate with my test taker, and not include everyone who matches?

      • Christina, In my testing of this process, it took longer to start at the bottom. For some reason, some of those Matches didn’t reach very far up, meaning I had to do more iterations. When I started a little further up the list, it just seemed to go faster. But you are free to process the Matches in any order you want, just save the topmost Matches for last. It’s the same reasoning why the LEEDS method does not include Matches over 300cM – they cloud the grouping.

        In the process steps, I meant to start over at Step 1B, not cell A2. [this was my garble, sorry, fixed now, thanks for pointing it out]

        This process is primarily aimed at Ancestry – we cannot Triangulate there, and they have more/better Trees. If you are working from a company with a browser, I’d always use segment Triangulation (in fact all of my Match-segments are already in TGs) – the problem is finding Tree information at those companies. I don’t like to blame endogamy or pedigree collapse – I want to find a way, in spite of those issues (segment Triangulation is one of those ways).

        I’m glad to see you working this issue – I’m trying to squeeze out as much as I can from the huge amount of data we’re given. I think you, too, are on the “front lines” of this effort. Thanks, Jim

    • Thanks! I have now tried grouping from the bottom up and got 15 groups. There is still quite a lot of overlap, but not as much as before, now no one is in more than 4 groups, which was 6 before. And I also noticed that with these new groups, more people in the group share higher amounts of DNA with the group’s “start person”. That suggests to me that the groups are now more accurate than before. I haven’t looked more closely at possible combinations of groups yet, I’ll do that later. I’ll also try one version with only triangulation and see if that reduces the overlap even more.

      • Christina – Great. I didn’t go a lot into combining groups. If there is a lot of overlap, they should be combined. Just like in the Clustering diagrams, not every Match will share with every other Match in a Cluster (the Clusters are rarely solid blocks) – so there is some judgment here. In most groups, most Matches will share with most of the others, but not all.
        This also becomes more clear in Step 2 when you find Common surnames (and in some cases common locations) – in one case I had two groups from Ashe Co, NC and descendants of both went to Floyd Co, KY – pretty certain they intermarried; and two other groups were in Roane Co, WV for several generations – pretty certain they are linked.

    • Well, common surnames is a whole different challenge when you’re Swedish since we didn’t have a lot of inherited surnames until around 1900. But nothing is impossible, it just takes longer to solve. 🙂

  4. Jim,

    Your timing on this post couldn’t be better!

    I’m working on a project right now where my target ancestor is a 3rd GGFather. I’m in the process of laying out my research plan for this project right now, so it was like this post fell out of the sky right when I needed it.

    I’ve got the help of a three ascendant 3C2R, which will help immensely. Although, the genealogy Gods have balanced this gift with the fact that I’m looking for the parent of an Irish Immigrant.

    I’ll let you know how this unfolds!

    Mark.

  8. Do you think this will work when the TA is a 4th Grt-GP? I am looking at matches with around 25cM in common with me. I’ve gone a generation, or two, back on those with trees, but the problem is Ancestry does not show shared matches below 20cM. I am thinking I will need to ask some of the matches to upload at GEDmatch, or share their id’s with me if they already have. What do you think? any advice? Thanks for sharing your techniques.

    • Barb – Yes I do! It gets a little harder with each generation back. A lot depends on the factors I outlined. I blogged about my 38P Ancestor: Thomas NEWLON and his mysterious wife (the TA) – they are at the 4C level. I had several 2C and 3C Matches pointing to him, and a lot of Matches with no surnames I recognized – this was all in one TG, [01S24]. I found several of them with CUMMINGS Ancestry, found a bunch at Ancestry, back to 8C, and when I enquired of Matches at 23andMe and MyHeritage in that TG, I was drinking through a fire hose. I now have over 50 Matches with CUMMINGS Ancestors. Many of them are all over the place, with flimsy ties back to VA and then to Scotland, but I found other Matches with good Trees back to Westmoreland Co, VA in the 1600s and I could build a Tree that includes almost all of my Matches. I was lucky to have that TG go back on 5 generations of CUMMINGS surname. It would have been much harder if that TG had gone back on the spouse of the first CUMMINGS (fewer CUMMINGS Matches in that case). The other thing I would add is that folks have spent years looking for their brick wall – this process can be done in a day or two! I did one with a friend of mine in one bottle of wine/afternoon – I called out the names and she wrote them down and noted the duplicates. We quickly got the two CAs that married and a Tree with all their children. It’s not always that easy, but it seems to me after lots of effort over years, this process would be worth a few days.
      Two other tips:
      1. 10cM and 15cM Matches can have 20cM+ SMs! So be sure to check them.
      2. Once you get a surname clue or possibility, search in your DNA Matches for that surname – the search will pick it up in Unlisted Trees (and Private Trees) which is VERY helpful if they are among the Shared Matches.
      Hope this helps – please provide some feedback if you try this. Jim
