Walking the Clusters Back (WTCB) 2022

An Advanced Segment-ology Topic

Introduction.

This will be a longer and more detailed post than usual. The process I’ll outline takes a lot of precise and detailed work. And preparation work. You have to decide if it’s worth it for your objectives.

I’ve tried several blog posts about Walking The Clusters Back. In my opinion, they all failed. I was trying to find a sweet spot that would give us groups of Matches at each generation. That generally works at the grandparent level (the Leeds Method works in most cases to provide 4 columns for 4 grandparents), but the Clusters quickly get jumbled up as the cM Threshold decreases. I should have known better. Each Cluster still tends to focus on an Ancestor, but the different Clusters have Ancestors of different generations. The Clusters sort of mirror the Shared cM Project – as the cM value decreases, the Shared DNA Segments come from a wider and wider range of generations. The overlap of possible relationships grows. The Shared cM Segment pattern gets more and more jumbled – just like the Clusters.

So if we can’t use brute force on the data, lets go with the flow, and develop a process that tracks Clusters – by tracking the Matches in them. As the cM Threshold is decreased, the number of Matches being Clustered increases. This results, generally, in more Clusters with more Matches in them.

Overview

Overall WTCB Process:

1. Run a Cluster report with a high cM Threshold (say 80 or 90cM) to get at least four Clusters that you can identify a consensus Most Recent Common Ancestor (MRCA) in each Cluster.

2. From information in the Match Notes, determine the consensus Root Ancestors (RA) in each Cluster. The RAs start with your parent, and include your Ancestors out to, but not including, the consensus MRCA for each Cluster. These RAs should “fit” all the known Matches in the Cluster.

3. Impute (copy) these RAs to all the other Matches in the Cluster.

4. Repeat for all Clusters

5. Run a new Cluster report with a lower Threshold.

6. Combine the Matches from the previous and the new Cluster reports into one spreadsheet.

7. Sort on Match names.

8. Merge the duplicate Matches into one Match (much more on this step later)

9. Return to step 2, and continue…

10. Gradually reduce the upper cM Threshold, to cull out the closest Matches – this fine tunes the MRCA of the Cluster.

As part of this overview, I must provide a warning:  there is a lot of homework required before you can start this process – see the Homework section below. The Cluster runs include the Notes for each Match. These Notes should include  “known” Match MRCAs and cousinships [including multiple MRCAs], and any TG IDs, that  you have determined. This is the information you need to populate your Master WTCB Spreadsheet. This is your source for RAs.

In the paragraphs to follow, I’ll offer a spreadsheet template, and specific steps to accomplish the steps in each Cluster run. It’s a repetitive process that I have tweaked to make it as standard and efficient as I can. The number of Matches about doubles with each Cluster run – so the work gets harder. I’ve also incorporated my short cuts and tips into the steps…

I started with an 80cM Threshold and found 8 Clusters – I was confident I knew the consensus MRCA for each Cluster. More importantly, I knew the Root Ancestors for each Cluster (the RA being the parent and grandparent and sometimes more) back to, but not including, the MRCA). As I lowered the cM Threshold (usually by 10cM at first) and ran a new Cluster run, I found the number of Matches about doubled and the number of Clusters increased. The increases were not in a predictable way, but the Clusters grew in size (more Matches) and slowly, but surely, pushed the RAs out to more distant MRCAs.  

I’m now confident this process works. By that I mean for each new WTCB Cluster, we get some RAs which point to the MRCA of the Cluster; and this MRCA is very close to the MRCA we’ll find with each of our Matches in the Cluster. A strong, helpful, clue…

Homework

Some *essential* homework is required before you try this:

ANCESTRYDNA TREE & MATCH NOTES:

1. Test at Ancestry and build your Tree out (as much as you can to 7xG grandparents where possible [you only need Ancestors (use standard names); birth/death dates/places]. AncestryDNA needs this information for ThruLines to work. See some of my posts on ThruLines starting here.

2. Link yourself to your Tree – this let’s AncestryDNA do it’s magic with ThruLines and other hints.

3. Find as many MRCAs as you can – some are close low hanging fruit; many will be via ThruLines (which will find MRCAs in Private, but searchable, Trees); some you’ll find in Unlinked Trees (which ThruLines does not review).

4. Add what you find to the Notes of your Matches – see my blogpost here.

5. Review: It Is Iterative here – the goal is to get info into the Notes of your Matches.

6. It is very important that you have information in the Notes of as many Matches over 20cM as possible.

SET UP DNAGEDcom CLIENT:

1. Subscribe to DNAGEDcom Client (DGC) (you can subscribe for one month to try it out). See links in this blogpost.

2. Click on the DCG Icon and Log in.

3. Set up your folder (you’ll access this folder regularly in the WTCB process)

IMPORTANTDo not go beyond this point until you have completed the ANCESTRY TREE & MATCH NOTES Homework – we need the data in the Match Notes before we gather it in the next step!

4. Gather Matches and ICW from 20cM to 400cM (ignore Trees and Ethnicity for now – they are not needed for this WTCB process, and they can always be gathered later). This gathering process may take a day or more (depending on the number of Matches you have). I think the % Complete indicator is based on gathering all of your Match, so it may be misleading, and the gathering process will finish somewhat sooner.

5. This process will store several files on your computer:

        a. m_yourname CSV file of your Matches with lots of information about each one, including your Notes, URLs to the Match and their Tree, Shared cMs, etc. This file is a gold mine by itself – I highly recommend you save a Working Copy of this file in Excel – it’s very useful.

        b. icw_yourname CSV file – this is a large file used by the Clustering program

        c. DNAGedcom Data Base File – where all the data is stored

6. The Clustering reports are run separately. Each run takes about a minute (not a typo), and produces 3 reports:

        a. clm3d_yourname_[date,time,threshold string]_clusters CSV file – a list

        b.  clm3d_yourname_[date,time,threshold string] Excel file – includes a TAB you’ll use. [I make a copy of this file – appending the word “Working” – to use in WTCB.

        c. clm3d_yourname_ date,time,threshold string]  HTML doc – the colorful display.

SET UP YOUR MASTER WTCB SPREADSHEET

The last part of our homework assignment is to set up a Master WTCB spreadsheet template.

There are 3 features about this spreadsheet template:

1. It is a tool, to incrementally follow and interpret the data.

2. It is the culmination of many variations I have tried. It is fairly easy to set up, and it offers a lot of flexibility.

2. A standard spreadsheet will help me explain the various steps later in this post. Of course, you are free to use any format you want. In fact, I encourage feedback on improvements to this Master spreadsheet, or the whole WTCB process.

Here is a sample of my Master WTCB Spreadsheet with some data:

Notes:

1. This is from the initial CL 80cM Cluster run, and there are columns to the right for the 70cM Cluster results from the next run.

2. I have Notes for all of these close Matches – they were in the AncestryDNA Match Notes and then captured by the DGC Cluster program.

3. The data in the known columns was from the Notes

4. The data in the ROOT ANCESTORS columns was derived from the known data, and then imputed (copied) to the other Matches in the Cluster.

Teaser:  these 5 Clusters have Root Ancestors from three of the four grandparents (4, 6 and 7) and CL 3 and CL 4 appear to be splitting to Great-grandparents 8 and 9. The Walk has started!

Here is a list of the 10 columns from the DGC Cluster run spreadsheet and where 9 of them go in the Master spreadsheet:

Note: the CL [B] and Super CL [C] columns copy to different columns in the Master spreadsheet, depending on the Cluster cM Threshold.

Here is a list of the 49 columns from the Master spreadsheet – with a brief description of each:

This covers the Homework section. Get ready to Walk The Clusters Back…

WTCB Master Spreadsheet overview:

Let’s divide the process into several stages for each Cluster run – details later:

1. Run a Cluster report at DGC; copy the data to the Master Spreadsheet; do some additional housekeeping chores to get the Master Spreadsheet Ready.

2. Merge duplicate Match rows (not with the initial run, but needed with subsequent runs, after the previous Matches are added to the spreadsheet)

3. Sort the Master Spreadsheet to show the Cluster Groups.

4. Type information into columns L, M and N from the Match Notes (when available).

5. Analyze each cluster, and fill in Root Ancestors (RAs) for all (in columns F-K as needed)

6. Save Master WTCB Spreadsheet for this run.

There are some other “details” I’ll explain as I expand on each of these stages below.

WTCB Master Spreadsheet details:

Here are the details for each stage:

1. Run a Cluster report at DGC; copy the data to the Master Spreadsheet; do some additional housekeeping chores to get the Master Spreadsheet Ready.

        a. At DGC, click on the Autosomal TAB and select the Collins Leeds Method (CLM).

        b. Select the Thresholds (start with 80cM and subtract 10cM for each of the next few runs); leave the upper limit at 400cM, and reduce that in later runs. I uncheck Paint Midline & Include Ancestors. Then click on the Run Grouping bar. It takes about a minute to produce the three files.

        c. Open the file:  clm3d_yourname_[date,time,threshold string] Excel file. I make a copy of this file – appending the word “Working” and save it in Excel format. Open the second TAB labeled Data.

        d. Open the Master Spreadsheet and save it with the cM Threshold number (e.g. 80cM) append to the file name.

        e For the next 4 steps – make sure you copy to the same blank row at the bottom of your Master Spreadsheet, so the columns line up properly with the Matches.

        f. Copy columns B and C to the appropriate Master spreadsheet columns (this would be Q and R for the first 80cM run – it shifts with each subsequent run)

        g Copy columns D, E and F to Master columns A, B and C

        h. Copy columns G , H and I to Master columns AU, AV and AW

        i. Copy column J to Master column P

        j. In the appropriate order column [S for the first run], type a 1 for the first Match, then drag this down to the last Match to create a series. [I sometimes want to recreate the original Cluster order]

        k. Use a new row to create a Header. In column O type: CLUSTER 80cM (or whatever the cM Threshold is for that run). Type 0 in the appropriate CL column and 0 in the appropriate order column. Yellow highlight this row.

        l. Use another new row to create another Header. In column P type: CLUSTER RUN 80cM (or whatever the cM Threshold is for that run). Type 1 in the appropriate CL column and 0 in the order column. Highlight this row in light grey. Copy this row so there is one for each Cluster. Drag the 1 in the CL column down to fill in the series – this provides a numbered header for each Cluster.

2. Merge duplicate Match rows. Note: this step is not used for the first (80cM) Cluster run – there is only one set of Matches, so merging is not required. In subsequent runs, the prior Matches are in the Master Spreadsheet (all those above 80cM, in the second run), and all the Matches from the new Cluster run will be added to the Master Spreadsheet (all those above 70cM). This means the prior Matches (above 80cM) will be duplicated, but in the new run they will only have Cluster data from the new (70cM) run in the CL and CL Super columns. This step will merge the duplicated Matches into one row with prior and new CL and Super CL data (and the order numbers); and delete the other row.  Here we go…

        a. Tip – sort the Spreadsheet by C [cM]. This puts all of the new, smaller cM, Matches at the top.

        b. Highlight the rest of the Match rows and sort by Name and cM and CL [for the current run]. This puts all the duplicate Matches together, with the ones from the new run (with a value in the CL column) on top.

        c. a view of the spreadsheet at this point – showing the duplicate Matches on the left, and the CL 70, Super, and Order data that needs to be copied down one row. Notice also that the Matches below 80cM are still in this sort. This is OK, but be careful dragging the data down. By using the Tip above, these can be sorted out, which makes this merging step a little easier.

        c. Copy (or drag) the 3 cells (CL, CL Super and order number data) down one row and paste it into the same columns (to add it to the duplicate Match who already has some data from previous runs). Then delete the Match which you just copied from. [Tip: an alternative to deleting the rows one-by-one, is to type an x in the order cell of the top Match and later sort the spreadsheet by that column and deleting all the rows with an x.]

        d. Continue to the bottom of the Matches in this Cluster run – a boring task .

3. Sort the Master Spreadsheet to show the Cluster Groups.

        a. Highlight all the rows of the spreadsheet (under the main header) and sort on CL and order – columns Q and S in the first run – it shifts to the right with subsequent Cluster runs.

        b. You should now have a nice looking WTCB Master Spreadsheet with a group of Matches under each grey CLUSTER RUN Header – see the sample above.  You’re ready to start working with the data.

4. Type information into columns L, M and N from the Match Notes (when available).

        a. Work down the spreadsheet, looking at the information in the Notes. For Matches with known MRCAs and/or TGs, type in the MRCA Ahnen in column L; cousinship in Column M and TG ID in column N

        b. This is populating the Master Spreadsheet with data from the Notes – this is why the Notes Homework (before running the DGC gathering program) is so important.

5. Analyze each cluster, and fill in Root Ancestors (RAs) for all (in columns F-K as needed)

        a. This is where the Ahnen system shines (using numbers instead of typing out the Ancestor names); and descendants are always half the father’s Ahnentafel, so we can easily work from the MRCA Ahnen back down to a parent)

        b.  Fill in the Root Ancestor Ahnentafel numbers where an MRCA is known. Note: The “Root Ancestor(s)” are the closest ones to you – NOT including the MRCA Couple. The basic RA is a parent – using Ahnentafels, this would be a 2 or 3 (father or mother). The next RA (at Generation 3) must be a grandparent – a 4, 5, 6, or 7 – 4 and 5 are the parents of 2; 6 and 7 are the parents of 3. The line of descent (and most probably the shared DNA segment) comes from the MRCA to you along this path.

        c. Use judgment to determine the consensus RAs that would apply to all the Matches with known MRCAs. Note: if there is a Match who is clearly inconsistent with the rest, ignore or move that Match row (to a different Cluster or a “time out” area at the bottom of the spreadsheet).

        d. Copy these consensus RAs to all the Matches in the Cluster. The concept here is that each Cluster is formed around an Ancestor, and that all the Matches would have these same RAs. The stronger the Cluster consensus is, the stronger the case for the same RAs. There may be some Match anomalies, but by Walking The Clusters Back, I’ve found that the same RAs are almost always consistent.

        e. A very few Matches in a Cluster may be at odds with the consensus. This may be due to an incorrect MRCA (it happens to me and to ThruLines). It may also be due to the Matches having multiple MRCAs with me, and/or multiple segments. Check the DGC HTML file to see if there are grey cells that link the Match to another Cluster(s). When a Match appears to me to be “better” in a linked Cluster, I move their row to that other Cluster in the spreadsheet and change the CL number to match (I leave the order number in case I want relook at the original Cluster list.)

6. Save Master WTCB Spreadsheet for this run. For each Cluster run, I save the Master Spreadsheet with a new descriptor added to the file name – like 70cM.

This is a repetitive process – go back to #1, and run a new Cluster report – keep going….

Objectives of WTCB

Identify root ancestors for Clusters, and, by inference for all the Matches in them. This provides a pointer when investigating any Clustered Match. It gives direction (names, dates, places) when building a Match’s Tree back; to finding an MRCA with any Match; to researching Brick Walls.

Notes/Observations:

1. Start with a high cM Threshold, say 80cM or 90cM. I have found that reducing the cM threshold by 10cM about doubles the number of Matches in the next run – to a point. The shift from a 50cM threshold to a 40cM threshold added much more than double – so I back tracked and started using a 5cM reduction to get a 45cM run. Similarly when I got to the 30cM range, I then reduced by 4cM, then 3cM, then 2cM (for a final run with a 20cM threshold.

2. A very few Matches turned out to be anomalies – they did not “fit” in the Cluster they were assigned by DGC, based on the MRCA we had. If they had a grey cell link to another Cluster with a good fit, I moved them to that Cluster. If they didn’t appear to fit any grey cell Cluster, I moved them to a “time out” section at the bottom of the spreadsheet, with an X in the CL column. These very few Matches probably had an issue with the MRCA, that I needed to investigate. They were in “time out” so they didn’t “taint” the Cluster analysis – I could look at them later. The Cluster is talking to you – try to understand what the message is.

3. The Clusters *tend* toward a single MRCA, as the upper cM Threshold is decreased.

4. Do not be afraid to move a Match from one Cluster to another. Review alternate “grey” cells in the the HTML Cluster diagram. If a Match has, say, 5 squares in a Cluster, and several grey links to another Cluster (which other Cluster is a much better “fit”), I would not hesitate to move that Match. Usually this will resolve itself in subsequent Cluster runs.

5. Excel Macro – for the task of copying 3 cells from a Match from a new Cluster run and pasting it into that Match from the previous Cluster run, and then deleting the first Match. Here are the steps:

        a. Go to File > Options > Customize the Ribbon > add “Developer” to the Main Tabs

        b. In the spreadsheet, insure “Use Relative References” is ON [highlighted]

        c. Position cursor on the CL cell of the top Match;

        c. Click Record Macro [fill in the popup – the only critical thing is a letter for the Macro]

        a. Highlight the three numbers [in CL, Super CL, and order columns]

        b. Control-C to copy that data

        c. Click on next cell to the right

        d. Type: x [this will let you easily delete all these rows later)

        e. Click on the CL cell in the next row (this should be the same Match from previous run)

        f. Control-V to paste the data into three cells

        g. Click on the CL cell in the next row [to preposition the curser for the next Match]

        h. Click Stop Macro.

        i. Save Spreadsheet with Macro Enabled

        j. Good luck – it took me several tries to get it right. Practice on a spreadsheet copy.

6. Special Note: Some close Matches have multiple MRCAs with me. They may well be related though multiple Clusters. I make duplicate copy of that Match and add it to other Clusters per the gray cells. Once moved I adjust the CL and super columns per their new Clusters. Use judgment, but I think after about two cycles with the multiple copies of close Matches (closer than the Cluster Root Ancestors indicate), they can be eliminated from future Clusters. They have done their job of solidifying the root ancestors in other Matches.

7. I also think the maximum/upper cM Threshold needs to be reduced as the Clusters evolve. We don’t need the higher cM/closer Matches – they have already passed on their Root Ancestors to the Clusters in the Master Spreadsheet. They should be dropped from the Spreadsheet. I put an X in the CL column to remind me they are no longer needed.

8. Some Matches wind up in singleton Clusters – this is silly, a Match doesn’t form a Cluster with itself. And most of the time these Matches show a grey link to another Cluster. I move (Ctrl-X; Ctrl-V) the Match row to the other Cluster and change the CL cell to match that Cluster (so they will sort with that Cluster in the future). I sometimes also move Matches out of very small Clusters when that seems appropriate. Most of the time subsequent Cluster runs resolve these issues.

9, If a Cluster goes through several iterations without any indication of a more distant RAs, there may be an MPE or brick wall involved –a strong potential clue from the data.

Manual WTCB Process

If all of this is overwhelming, you can try a few iterations using manual Clustering. Start with the Leeds Method that results in 4 Clusters, one for each grandparent. So in these 4 Clusters you already have two Root Ancestors for each [2-4, 2-5, 3-6 and 3-7, using Ahnentafels]. Find your Matches who are in the 80 to 90cM range and manually Cluster them. Start by seeing which ones are Shared Matches with the ones in the 4 Clusters – that automatically gives each one the same two Root Ancestors as the Cluster they share {actually the Matches they share). Now, from the information you know about these new Matches, do any have an MRCA at the 2xG grandparent level – this would give you the next Root Ancestor – for that Match, and that Matches shared Matches. Keep dropping the cM Threshold, checking Shared Matches for Cluster affinity, and using the Matches with MRCAs to tease out the next Root Ancestor for each Cluster. This is workable with a small number of Matches, but when you have 500 or 1,000 Matches to work with, you will yearn for automated Clustering…

Tracking RAs

Some results so far:

At 60cM run: 11 Clusters: with generation 5 (G5) RAs:

        Paternal RAs: 8, 8, 9, 9, 10, and 11; Maternal RAs: 12,, 12, 12, 13 and 14/15

            -The last Cluster, 14/15, is my maternal grandmother whose immigrant parents had two brothers married two sisters resulting in few Matches, those are hard to separate until I can get more distant Matches.

            -I’m happy with this spread – it includes Clusters for all 8 of my Great grandparents. The WTCB is working…

            – The 70cM run had 47 Matches in 8 Clusters; 60cM run had 75 Matches in 11 Clusters. Roughly double the number of Matches (and analytical review work) in 3 additional Clusters. My experience is that the doubling of Matches with each 10cM decrease in Threshold continues…

At 50cM run: 128 Matches in 24 Clusters (net, after moving several singleton Matches to Clusters they shared with other Matches).

        Paternal RAs: 8, 8, 17; 9, 9, 9, 9, 9, 18; 10, 10, 11, 22; Maternal RAs: 12, 24, 24, 24, 24; 26, 26, 26, 27, 27 and one 14/15.

            -These are broken apart quite nicely, I think. And the uneven nature of the splits (not cleanly by generation like the 4 grandparents often do); illustrates the folly of trying to find a sweet spot in the Thresholds to result in one specific generation (like we get with grandparents). I should have expected this – beyond the grandparent level the Shared cM Project shows growing overlap of cM values for a growing range of cousinships. So, this WTCB process just lives with that, and tracks the Matches as the Clusters grow in size and split apart – Walking The Clusters Back!.

Abbreviations

Ahnen – abbreviation for Ahnentafel number – a system of numbers to represent our Ancestors [e.g. 2 for father; 13 for mother’s father’s mother] – see also this blogpost.

CL – Cluster, or Cluster Run [usually combined with a number representing the lower cM Threshold]

Czn – Cousinship – how we are related to a Match. Second Cousin is abbreviated 2C; 5th cousin once removed: 5C1R.

DGC – DNAGEDcom Client – an automated Clustering program – runs from your computer.

MRCA – Most Recent Common Ancestor – this is usually a couple that you and a Match have in common. Usually represented by the Ahnentafel of the husband, but we really don’t know which parent (husband or wife in the MRCA couple) the shared DNA came from.

RA – Root Ancestor – the Ancestors you have leading up to the MRCA. This should always include your parent and grandparent (each is a RA). During the WTCB process, the number of RAs will generally increase (adding generations) and increasing the ancestral “focus” for each Cluster.

TG ID – Triangulate Group Identification Code – see this blogpost.

WTCB – Walking The Clusters Back – the process discussed in the post which helps determine the MRCA of most Clusters – sort of a Leeds method on steroids.

Final Thoughts

This WTCB process uses the power of Clustering to link large groups of Matches to specific areas of your Ancestry. As the process develops, the Clusters become more and more precise on the path back to an MRCA. There are only two options for each Cluster going back another generation – going back on the paternal side or the maternal side. Larger Clusters with more distant Matches, tease this information out of the data. The Homework is essential – recording what you know in the Notes; and the work is sometimes tedious; but the end result is very powerful.

I’m confident this process will tell us some Root Ancestors for all of our Matches down to 20cM. Just think what we could do with those clues…

Feedback on this process and suggestions for improvement are welcome.

[19N] Segment-ology: Walking The Clusters Back by Jim Bartlett 20220822

32 thoughts on “Walking the Clusters Back (WTCB) 2022

    • eallynm – I use the Dup column to indicate when a Match-segment is duplicated for multiple CAs. With multiple CAs, the closest one has a much higher proability of being “the one”, but not guaranteed. I list all the legitimate CAs, until I’m certain there is a consensus for the TG. For a Match with three CAs, I copy the Match 3 times, and add the cuzn and MRCA data for each, and in the dup cells I put: A1, A2, A3. Jim

      Like

      • So is the “x” column in the Master WTCB Spreadsheet the “dups” column? Do the numbers 2 and 3 refer to the number of times you have duplicated these matches? E.g. Roger is listed three times (but only two rows show)?

        Like

      • eallynm – yes. Roger is also somewhere else in the spreadsheet. He had multiple segments, so technically could be in (and influence/impute) multiple groups.. Jim

        Like

    • Let me ask a more specific question about duplicates and tangled pedigrees. My paternal grandfather’s aunt married my paternal grandmother’s uncle, and they had a large family. I have many matches who are double cousins descended from this couple. The MRCAs for the double cousins are 18 and 22. They sort initially into the same cluster as those who are only MRCA 22. I don’t initially have a cluster that is just MRCA 18. Do I create an additional cluster for MRCA 18 and put the double cousin matches in that group?

      Like

      • Emily – it’s a judgment call. I would do that to cover the possibilities, recognizing that in the end each Cluster should point to only one Ancestor, but we cannot really tell which one it is at this point. It might take some additonal analysis at of the Triangulated Groups. Clustering is not the best tool for endogamy – Triangulation is, IMO. And that’s just my opinion. The segments follow a specific path; Clusters do their best to group around a Common Ancestor, but it’s not as precise as Triangulation. Jim

        Like

  1. Hi Jim
    I’m going to try to work this process through. I’m going to have to fiddle with the thresholds to try to get those first ancestors, though. My uncle (I am actually working on his kit not mine) has matches over 200 cM with no MRCA in sight.

    But you have given us so many organizational tips that I am sure it will be helpful to keep me focused.

    Re 23&Me, someone in your comments somewhere had a tip for expanding the number of matches I can retrieve with a front-end search. I can get 1500 and even though I know by dumping all my matches out to spreadsheet that Leisa could be useful for triangulation, I cannot access her from the 23&Me front end–thus cannot triangulate. What am I doing wrong?

    Sorry for all the catch-up I have to do…

    Kate

    Like

  2. Wonderful, thorough and thought-provoking post. I’ve been trying this approach for several weeks and am at 35 cm. Some questions and thoughts:
    – Is a supercluster necessarily a combination of husband and wife? (See thought process below)
    – I guess each segment points to a single ancestor (closer ancestors may have multiple segments). There are clusters where I can identify the 3rd, 4th and 5th grand couple, but what tips do you have on identifying the individual at that level?
    – One group clusters at the 3rd gg couple at 80 cm, then from 50 cm down the cluster just grows larger to include matches to the 5th gg couple. This cluster now has 75 people in it, but it is one cluster. It is however part of a supercluster, with a separate cluster of two people. My guess is that this small cluster is a spouse, or 6gg.
    – Any suggestions on how to record the 80 cm matches to 3gg, and also record their participation in the 5gg cluster as well? Or do you just record the first cluster match despite the change in ancestor in successive clusters?
    Hope this makes some kind of sense. I’m really enjoying this and am beginning to catch on.
    Amanda Sherwin

    Liked by 1 person

    • Amanda, Thanks for the positive feedback – and all of your questions are good ones – I’m jousting with each one of them, too. Here goes….
      – Clusters and Supercluster *tend* to point to one Ancestor, but… (there’s always a but) that is if all the Matches in the Cluster were from the same generation back. The wider the spread between the upper and lower cM thresholds, the more range of cousinships in the Cluster – it’s why I recommend lowering the upper threshold. But in the “midrange” of cMs there can still be a lot of variation on generations back. So the short answer is the sub-Clusters don’t have to be husband and wife. However each supercluster should have some Ancestor (maybe a grandparent, or G-grandparent for all of the sub-Clusters.
      – Because the DNA is pretty random, there won’t always be a single Ancestor – the important thing (like in Triangulated Groups) is that the MRCAs are all on one line. In most Clusters they tend to group on one or two Ancestors (a generation apart), with some outliniers (but still on the line)
      – Always look at the DGC Cluster spreadsheet – I’ve found some of the small Clusters are built to include one Match – booo – that one Match often has some gray cells to another Cluster which might be a better fit. If a Match only has 1-2 Shared Matches in a Cluster, I tend to cull them out – I put UNK in the cluster cell, and look at them again later to see where their Shared Matches are tugging them. These such Matches are a drag and a distraction – on the loosest of evidence they pull the whole Cluster. IMO, it’s best to cull them out and stay with “solid” Clusters.
      Note: the number of Matches and Clusters about double with each new run – it gets to be a LOT of work (but worth it IMO)
      Note: When in doubt – leave it out! Don’t try to force a Match. Use the ones that build solid Clusters. The culled Matches will be revisited in next Cluster run. I’ve seen some in the 30-35cM range that pile up in a new Cluster – with little good info to go on (at this point we are starting to push the Clusters out several more generations) – but then the next run with Matches in 27-30 range added in some good clues.
      – Three ways to work with large cM Matches 1. Cull them out. 2. In your spreadsheet, duplicate the row (as many times as necessary), and – using your judgment – add them to other Clusters (check Shared Matches and/or follow the gray cells in the DGC spreadsheet showing the clusters (not the HTML); 3. Trust the Root Ancestors – the Larger cM Matches will identify the Parent and Grandparent Root Ancestors – then you inpute those RAs into the next Matches (even if they don’t have a Tree). The “bulk” of these Matches form a foundation of Root Ancestors in each Cluster, that carry over the Matches in the next round. Once you determine that a bunch of Matches in a Cluster had to come down (to you) through one of your 2xG grandparents, that information never changes – those Matches carry that info in the Root Ancestors and pass it on (imputation) to the next round of Clusters. As you determine what you can with the >35cM Matches, record that in their Root Ancestors. I usually leave that alone in the future and add additional RAs to new Matches on top of those as they become apparent.
      Note: a regular jigsaw puzzle will have only one solution and use all the pieces. Clusterng may not use all the Matches, and may not have the finality a jigsaw puzzle has. But it *does* extend your base – a LOT. The next step is to take each Cluster and try to find more MRCAs with Matches, now that you have narrowed down the possibilities.
      Hope this help, Jim

      Like

      • Thank you so much for your response. This gives me a lot more insight to work with the clusters. I’m working on the 30 cm now and can see how one pair of greats is beginning to separate – lots of pedigree collapse there. I really like your idea of copying matches to include in different clusters, as that’s been a problem with another line. One of the questions that has puzzled me over the years has been why certain ancestors moved to a specific location and I’ve tried to find the possible relative that pulled them there. This has the potential to provide new clues in that area by linking my ancestor to a very different descendant of a 5g. Thanks for writing up this approach in such detail!

        Liked by 1 person

      • Do we cull out matches who are descendants of first cousins? I have a situation in which a 1C1R and her son fall into a large solid cluster, but the 1C1R’s daughter ends up in a singleton cluster which makes no sense. Not sure whether to put the 1C1R’s daughter in with her mother and brother or just eliminate all three.

        Like

      • eallynm, Close cousins should help you with the correct grandparent or great grandparent. If they don’t do that, then ignore them. Use the Matches that help, ignore the ones that don’t. The point is NOT to include every Match, the point IS to try to differient each Cluster and Walk It Back. Jim

        Like

  3. Pingback: Segment Data for Ancestry Matches 2 | segment-ology

  4. Thanks Jim. I have both my dna kit and my fathers on Ancestry. I have linked our kits to my tree. However we obviously have two separate sets of matches. Seems like a waste to walk my paternal matches back when I have my dad’s. Also, I have only added notes to my matches, not my dads. Any thoughts on how to (efficiently!) approach walking the clusters back in this situation?

    Like

    • Pat, The Notes are critical – particularly when they include an MRCA or a TG segment. I’d recommend 1. your kit; 2. run the DNAGedcom Client to gather say, 50-400cM – that should go fairly quickly. Then run the Clustering with an 80cM Threshold, and see if you can identify Root Ancestors for all of those Matches. 3. Then drop down to 70cM and Cluster again; and combing duplicates and imputing addional RAs to the new Matches. 4. Then try it at 60cM. You’ll either get very frustrated and want to quit, or things will fall into place and you’ll want to continue. Jim

      Like

  5. Pingback: WTCB SITREP | segment-ology

  6. Pingback: Best of the Genea-Blogs - Week of 7 to 13 August 2022 - Search My Tribe News

  7. Great post, Jim. For now I have taken my segment work about as far as I can. I need to go back to those clusters and revise them and extend earlier work. So this is very timely.
    Record chasing is coming up against too many people using the same names way back, so it’s getting hard to extend the tree. Some cluster revisions should help me work on the right matches to find the right path.

    Liked by 1 person

    • Christopher, Genealogy is always a challenge. But I’m finding that WTCB is doing a pretty good job of find Root Ancestors for my AncestryDNA Matches back 5-9 generations – that’s a real help in finding them in Trees. Jim

      Like

    • Christopher, I now have almost 5,000 Matches with Common Ancestors – I have multiple MRCAs for every documented Ancestor who is covered by ThruLines (6C and closer). And yet, I have about 25 MRCAs which clearly do not fit in the Cluster they are assigned to. I’ve looked at other grey-cell linked Clusters – they don’t fit there either. So either the MRCA is wrong (.5% slipping by me is actually pretty good). Or the Match is related to me some other way. I’ve added a new column to my WTCB Spreadsheet labled “issue” – I don’t want to be distracted right now (untill I can get down to 20cM); and there is always the possibility that DGC will Cluster them somewhere else in a different run. Right now, however, the data is *shouting* to me: issue – something is wrong. This is actually a good outcome of the WTCB process. Jim

      Like

  8. I had started doing something similar, but I am including all the data in my Common Ancestor spreadsheet.

    Instead of using separate columns for the clusters, I did a CONCAT formula and put them together like this: MHnn-Snn-nn or ADnn-Snn-nn. So they might look like MH70-S01-03 or AD45-S03-01. Makes my sorting easier. I have separate columns for CL90, CL80, CL70, etc. where those designations go. And I am trying to match up my MyHeritage and Ancestry Clusters to the same MRCA’s. I have one MyHeritage cluster that I just cannot figure out, and I’m hoping the Ancestry Clusters will give me clues as I have more Colonial American ancestors there than in MyHeritage.

    I also have not done ALL my homework (getting all those thruline matches in is very time consuming), so by using my Common Ancestor spreadsheet I have all my notes (per your recommendation) in there. I have learned that once you do an Ancestry gather on DNAGEDCOM, if you keep adding to it or rerunning it, it will not pick up new notes for people that you have already gathered. If you want those added notes, you have to start over with a new gather in a new database.

    I was excited to see this last post (I look forward to every one of them!). Made it a lot clearer. I’ve read your entire blog 3 times now, and each time it makes more and more sense. I even triangulated all my MyHeritage matches (took me 5 months of solid work, but it is invaluable!!!!)

    With the work I’ve been doing the last few days since your post, this is making things much easier and faster (The G2, G3, G4 … columns FINALLY made perfect sense to me!). And it is really helping to match up the MyHeritage and Ancestry clusters. For now, I am going to add just the cluster names into a separate spreadsheet and document each as to their known MRCA’s.

    One question – should I just drop out the children of my 1st cousins or leave them in? I find they show up in odd ways (however, this was good as I’ve discovered a new 1C1R or 1C2R that I can’t yet figure out who the parent is. May have an NPE involved.)

    Also, would love to hear how you handle people who pop from one set of MRCA clusters to another. I have a 2nd cousin who matches perfectly (and is well documented) on my father’s Holtzclaw line. But when I get down to the 45cM Clusters he pops to my father’s Reese line. This cousins grandmother has an undocumented father, and I’m thinking that this is a strong clue?

    Like

    • Linnea – Wow!! You are a 5-Star Segmentologist. As you point out, it takes work, but it’s worth it!.

      I Clustered all my FTDNA Matches and also had them all in TGS (except the few who were false segments). I then VLOOKUPed the Clusters to the Matches in my Master Segment Spreadseet. A few didn’t line up, so I investigated them, and found virtually every one of miss-fits had a grey-cell Custer which did fit – so I just moved them. And then had a 99% correlation rate. You might try that with your MH TGs and Clusters.

      I have Colonial Virginia ancestry and several Match cousins who Match more than one way – including a few that have MRCAs on both parent sides. From segment Triangulation I also have some Matches with multiple shared DNA segments from different sides – so it does happen.

      I, too, have some Match cousins which want to be in two Clusters – they have grey cells to other Clusters. Where it looks advantageous, I just copy that Match and add the copy to the other Cluster.

      I also fiddle with the upper cM Threshold. The close Matches are helpful, initially, in establishing the Root Ancestors for Matches without MRCAs – although I don’t know how we are related, at least I know which grandparent or grandparent is in the path – this eliminated a big part of my Tree from investigation. Affter the initial Cluster runs, I drop the upper Theshold down to eliminate the very close Matches – 1. They don’t add any value any more (they’ve done their job of highlighting the close RAs); and 2. The tend to jump around, and are confusing.

      I’m down to 35cM now with 61 Clusters, almost all of them have great grandparent RAs and over half have RAs out several more generations.

      Thanks for your great, encouraging, feedback,

      Jim

      Like

  9. Jim, fantastic post!

    I’ve used a much less rigorous version of this approach using the CLM charts to visually fit clusters based on lower cM threshold gathers into clusters generated from higher cM gathers. I do this by finding matches that I can use to correlate the clusters at the different gather sizes. Kind of like a Russian doll with CLM charts and lots of connecting lines connecting printouts.

    While this visual approach is very fast (I can usually get down into the 30 cM range in a few hours) I’ve acknowledged to myself that it can be a victim of confirmation bias, cuz when I find a few matches that l’m looking for in the next lower gather, I declare success and move on to the next gather.

    Your rigorous approach seems much more reliable, especially with multiple relationships or pedigree collapse…. Which is where I have to be very careful with my quick and dirty visual method. I will definitely try your method and let you know how it goes.

    Mark

    Liked by 1 person

    • Mark, I, too, have used several Q&D versions over the years. I developed this template and process to get *all* my Matches involved – to get them each to join with their cousins and point up a different one of my ancestral lines. I’ve already found two cases where the path is truncated and I found a new Ancestor on the other side… This WTCB process using all Matches, should “cover” all of my Ancestry – back 8 or more generations (which is beyond my current Tree in several places). Jim

      Like

  10. Another great post Jim!
    One feature I really like on DNAGEDcom is that you can run the Collins-Leeds Method from the database OR from the matches spreadsheet. My maternal side (French Canadian) has many many times more matches than my paternal (Italian) side (thousands vs hundreds). So what I do is mark all of my paternal matches in Ancestry with the starred matches group. This is the only group info that is carried over to DNAGEDcom matches spreadsheet (column K). I then sort the spreadsheet on Column K and then delete all the matches that have STARRED=FALSE. I save that spreadsheet with “paternal” in the name. I then run CLM off of that spreadsheet to work with just my paternal clusters. It’s not necessary to edit the ICW spreadsheet (it’s too large to load anyway) since the process is already limited by the matches edits.

    Like

    • John, Thanks for your feedback. I must have missed a setting as I don’t get a column K. I do however get my Notes in column J, which tells me a lot about almost all of my Matches over 20cM. Jim

      Like

      • Sorry I was not clear. I was not referring to the CLM3D spreadsheet column K but rather the m_ csv file for matches Column K when viewed as a spreadsheet.

        Liked by 1 person

      • John, Thanks for the clarification – you are correct. The m (Matches) file includes the Star info in column K and a lot of other info, but it does not include the Cluster data. The Cluster data is the backbone of the WTCB process. In order to incorporate the Star info, we’d need an Excel VLOOKUP program to compare the MatchID in each spreadsheet and fetch the Star data from the m spreadsheet. Jim

        Like

    • Caith – Thanks for your feedback. I hope many will try a few Cluster runs – the first few are relatively easy. The work builds as we reduce the Threshold, but by then I think many will see that it’s working, and worth the effort. Jim

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.