Spreadsheets are an important tool in Genetic Genealogy. Here are some of mine…
ANCESTORS – Names, dates, locations, and Ahnentafel # are the key foundational data in this spreadsheet. Over time I’ve added columns for: Immigration year; age at death, age at marriage, number of children, Religion, Profession, Military, War, Y-haplogroup, mt-haplogroup, Find-A-Grave hyperlink, remarks. Add any other column of interest to you – you just need to fill it in… I add in Potential (or Alternate) Ancestors (highlighted) to keep track of those possibilities. I have a “dup” column to indicate which Ancestors are duplicates. This is a very handy reference for me. Two main sorts – 1) by Ahnentafel #; and 2)by surname & birth date. [Initially I took a GEDcom of an Ancestors only Tree and put it into a spreadsheet – then massaged the columns]
COMMON ANCESTOR MATCHES – Names of all Matches who have a Common Ancestor(s) with me. Key data: Name, Admin, cM, #Segs, Company, CA Ahnentafel #, Cousinship, columns for name and birth year of child, grandchild and great grandchild of CA (for Match’s line of descent); hyperlink to Tree. I also have columns for TG or Cluster, GEDmatch #; Remarks. I also have columns to indicate (to me) if I’ve filled in the Notes box of the Match, entered the line of descent to the Match in my Tree… Main sort is by Ahnentafel # and dates of Child, Grchild – this sort looks like a Family Group Sheet – and is very helpful in tracking TGs and Clusters. It also lets me focus on family groups which often group together in TGs or Clusters.
TG MASTER – Match name, company, Admin, email, Segment info (Chr, Start, End, cM, SNPs), TG ID, Side (M or P), CA info (Ahnentafel #, Surnames of Couple, Cousinship, Tree hyperlink); GEDmatch #; date. If you plan this type of spreadsheet for other people, add a column for initials of test taker (you can then, briefly, combine spreadsheets, do analyses, then separate them again – one spreadsheet per person). Advanced: I have columns for each of 10 generations, and enter my Ahnentafel #s from my parent (2 or 3) out to the CA – this helps me analyze multiple CAs in a TG. Headers: 46 Chromosome bars (rows to separate data); TG bars (that summarize the Chr, Start and End of each TG and the CA) – sometimes there are multiple bars – I highlight the most likely. Main sort: by Side, Chr, Start (this will arrange all Shared Segments into their respective TGs – which should have one Ancestral line)
TG SUMMARY – Chr, Start, End, TGID, Side, columns for 8 generation of Ahnentafel #; Cousinship, CA Surnames; NO Matches. This is a summary subset of the TG MASTER – except I fill in only the Ahnentafel # for known CAs out to most likely distant CA. In italics, add in Ahnentafel #s from Cluster analysis compared to known TGs – this often extends the evidence in the TG (this is an experimental spreadsheet at this point. Sort: Side, Chr, Start .For me, this is a handy 2-page crib sheet.
WALK THE CLUSTERS BACK – a fairly technical tool. Start with a download of Excel data for Clusters based on a 50cM threshold (from DNAGEDcom Client). This will include Match name, cM and any Notes you’ve entered for each Match – this Note info is very valuable to have in this WTCB spreadsheet. Add columns for Ahnen (or CA Surnames) and a TG or Cluster (CL) code and Remarks. Finally add a column for serial # of each row (1 to how many rows you have at that threshold – so after a sort, you can reconstruct the original Cluster groups – just type a 1 at the top and drag it down in a series) – call this column CL50. And add another column – called CL# – and add in the Cluster # for each Match (I wish DGC would include this in the download spreadsheet. Then the work is to determine the Ahnentafal/CA and/or TG CL ID for as many Clusters as possible (hopefully your Notes will show you all the info you’ve collected about each Match). Add a summary row for each Cluster (using 0.4 in the serial number column – now when you sort on CL# and CL50, all the Clusters will be grouped with a header. IF there appears to be a consensus for the Ahnentafel/Surname and/or the TG/CL columns, enter that in this header row. Next, rerun the Cluster report with 45cM threshold – add two new columns for serial # in a new CL45 column and the Cluster # in the new CL# column. As before, add in a header row for each Cluster with 0.4 in the CL45 column and the Cluster # in the CL# column. Now add this spreadsheet to the CL50 spreadsheet, sort on Match name, and combine duplicate Matches onto one row (Matches in both CL50 and CL45 runs will have two Cluster #s and two serial #s – don’t worry. Resort on CL# and CL45 to get all the Matches in Cluster order again, with the added info from CL 50. Again, analyze each Cluster with a goal of finding the CA and/or TG/CL for most Clusters (for the Cluster header rows. If advantageous to see where a new Cluster is going, make a duplicate copy of Match rows which have strong affinity for other Clusters (and code it with the other Cluster number) – use to help identify CA for new Clusters. This is very much a judgment call, which will be confirmed or refuted in follow-on Cluster runs. Drop the cM threshold by 5cM and repeat. The number of Matches begins to increase dramatically – it’s a lot of work. But the benefit is that you are imputing Ancestral lines to many Matches who are Private or have little/no Tree. If you add the imputed into to the Notes of these Matches, they will show up in Shared Match lists and “flavor” them with an Ancestral line. Again – this process is experimental and requires us to use judgment. Future investigation – the Clusters from different companies should be roughly the same – it would be great to be able to link Ancestry Clusters (with many MRCAs) with Clusters from 23andMe, FTDNA, MyHeritage and GEDmatch (with TGs)…
Apologies – this WTCB “short” summary got a lot longer than I originally intended. I’ll have to post a more complete version later. The takeaway is: gradually reduce the cM threshold of Cluster runs and trace the Ancestry from the initial grandparents on out the ancestral lines, using new, smaller Matches which are often more distant cousins with more distant MRCAs. Think of a time-lapse movie of a plant growing, with new limbs branching out over time. If (when) we see a Match with an MRCA which is clearly out of whack with the “history” of the Cluster, it’s time to see if that Match would better be relocated to a different Cluster (per the gray cells) or a different MRCA needs to be found. It is OK to move a Match to a more probable Cluster if it has a lot of Shared Matches with that Cluster, too. We can use our judgment…
[35B] Segment-ology: Genetic Genealogy Spreadsheets; by Jim Bartlett 20211207
Jim, I understand how you code triangulation groups, but I am uncertain about how you assign codes to clusters, i.e. the CL codes you mention. Clarification, please?
Emily, I number the Clusters from 1 to whatever. If I am Walking the Clusters back, I preface the number with the cM threshold; eg: 30-62 for 30cM threshold and 62nd Cluster. The Cluster threshold is needed because #62 in the 30cM Cluster run may be different than #62 in the 25cM Cluster run – usually is.
Pingback: How Do I Level Up the DNA Evidence for My Ancestors? - 51posts
Pingback: How Do I Level Up the DNA Evidence for My Ancestors? – Genes & History
If one gets more than four clusters with the 50cM cutoff for the DNAGedCom Collins-Leeds Clustering, should one still start with 50 cM or increase the cutoff until one gets clusters that correspond to one’s grandparents? I am looking forward to seeing your more complete description of this spreadsheet.
Emily, Either way is OK. In my case I knew the Common Ancestor for each of the Clusters I got at 50cM – so I started with that. You made a good observation. Technically, I should say start with 90cM (per the Leeds Method). But the overarching principle is the same – identify the Clusters you can, and then lower the threshold to add more Matches and Clusters to the mix. Some of the new Matches will have a CA in their Notes. So another important point is that you should spend some time with ThruLines and your closest Matches to identify as many CAs as you can (and put that information in the Notes), BEFORE starting the Walk The Clusters Back process. That process relies on incorporating all of the CAs you’ve found. Which in turn, helps you find more CAs – so it’s very much an iterative process. Jim
Regarding the “Walking the Clusters Back Spreadsheet,” when I use DNAGedcom Client for Ancestry, I do get the cluster (and supercluster) number in the Excel spreadsheet, but I don’t get my notes. I did not see an option for getting notes. Did I overlook something.
Also, did you mean TG instead of TL? If not, what is TL?
Thanks again for your very helpful information about how you make use of spreadsheets.
Emily – thanks for your post. You get 3 documents from the DNAGedcom Client report: HTML, CSV, and the spreadsheet. Open the spreadsheet and look at the second tab – called Data…
TL was a typo, I fixed it to read TG (I often use TL as an abbreviation for ThruLines, but that was not the context here – good catch, tnx)
Instead of duplicating so much info across multiple spreadsheets using databases like Microsoft Access, genealogy data analysis tool (formerly Genomate Pro), or online tool Airtable would consolidate the data and reduce potential errors. Have you tried these?
Bonnie, I have not. A long time ago… I use MS Access, but cannot remember much. I had GMP for a while, but the learning curve was too much for me at the time (I’m very invested in the Spreadsheets I’ve developed – it’s something I can handle and they work for me – the way I want them too. But Steven Covey’s Habit to “Sharpen the Saw” is haunting me… Jim
What does “TG” stand for? CL?
pierre, TG is Triangulated Group; and CL stands for Cluster, Jim
Can you provide a link to download a sample of these spreadsheets?