This blog post has been superseded by a better analysis by Kurt Allan, based on other analysis by Louis Kessler and on information from Doug Speed that his chart was intended for a different purpose and might not apply to genetic genealogy. The result is a spreadsheet similar to the one below, but with a more normal distribution curve, with 7C-9C at the mean. This is very good news for genetic genealogists – most of our Matches are well within a genealogy horizon. I hope to be able to post or link to Kurt’s final graphs soon.
A recent discussion on the Genetic Genealogy Tips & Techniques Facebook page asked what percent of our DNA Matches we should expect at various genetic distances. I’ve often wondered about this too. As I thought about it, I realized we should be able to apply the “Speed and Balding” analysis to this question. The S&B graph shows the probability of a matching DNA segment at different generations (think cousins), for given ranges of shared DNA. See the graph at the ISOGG wiki here.
I scaled each “bucket” in this chart as best I could and put the bucket percentages in an Excel spreadsheet – see below. In the Speed and Balding chart, cM ranges are along the x-axis; percentages are on the y-axis; and the “generations” are shown as stacked bars (or “buckets”) for each cM range. The numbers in the body of my spreadsheet are the percentages for each cM range and generation.
I had in my files a complete download of my AncestryDNA Matches, made with the DNAGedcom Client several years ago, before the Ancestry purge of 6-7cM Matches. I had 131,824 Matches, and it was easy to sort by cM and determine the total number of Matches for each column (cM range) in the S&B chart. Finally, I applied the S&B percentages to my breakdown of Matches to get the following chart.
I know it’s a “squinter”, but I wanted to show the whole spreadsheet. Here are explanations of the lettered rows:
A. The cousinship which corresponds to the S&B generations back
B. Speed & Balding generations
C-N. The first column is the cM range groups that correspond to the S&B chart.
C-N. The second column is the number of my Matches in each category – 131,824 Total.
C-N. The next columns: multiply the S&B percent by the number of Matches in the cM range.
P. Total number of Matches for each cousinship
Q. Percent of cousins vs the 131,824 Total
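For anyone who would rather script this than build a spreadsheet, the arithmetic behind rows C-N, P and Q can be sketched in a few lines of Python. This is only an illustration: the cM ranges, generation labels, percentages and Match counts below are made-up placeholders, not the values I scaled from the Speed and Balding chart and not my actual counts.

```python
# Hypothetical placeholder inputs: NOT the real S&B values or real Match counts.
# sb_percent[cm_range][generation] = percent of Matches in that cM range
# attributed to that generation (each row should sum to roughly 100).
sb_percent = {
    "6-10 cM": {"G5": 10.0, "G6": 30.0, "G7": 60.0},
    "10-20 cM": {"G5": 40.0, "G6": 40.0, "G7": 20.0},
}
# Number of Matches counted in each cM range (second column of rows C-N).
match_counts = {"6-10 cM": 1000, "10-20 cM": 200}


def distribute_matches(sb_percent, match_counts):
    """Return (estimated Matches per generation, percent of total per generation)."""
    total = sum(match_counts.values())
    per_generation = {}
    for cm_range, row in sb_percent.items():
        n = match_counts[cm_range]
        for generation, pct in row.items():
            # Rows C-N: S&B percent times the number of Matches in this cM range,
            # accumulated into the row P total for that generation.
            per_generation[generation] = per_generation.get(generation, 0.0) + n * pct / 100.0
    # Row Q: each generation's share of all Matches.
    percent_of_total = {g: 100.0 * v / total for g, v in per_generation.items()}
    return per_generation, percent_of_total


totals, shares = distribute_matches(sb_percent, match_counts)
for g in totals:
    print(f"{g}: {totals[g]:.0f} Matches, {shares[g]:.1f}% of total")
```

With real inputs, `sb_percent` would hold one row per cM range from the S&B chart and `match_counts` would hold your own Match totals for the same ranges.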
1. I could be off by a percent or two in my scaling of a printout of the Speed and Balding chart – but the totals are pretty close, and what I am really looking for is trends and order of magnitude.
2. Line Q, percent of total, was a lot flatter than I had expected – less than 4 percent for any cousinship. I had expected something closer to a normal distribution curve – even a long one, but with a “hump” somewhere. The flatness reflects two competing factors: an increasing number of cousins with each generation going back, versus a decreasing probability of a shared DNA match (above 6cM) with each generation going back.
3. There are a lot of Match-cousins to work with. Although only about half of all our Matches would be related to us out to the 19th cousin level, there are still thousands of cousins in every cousin “bucket”.
4. In my own case I need to use judgment and temper some of these results. Both of my parents were only children, so I have no 1st Cousins. And my great grandparents did not have large families, so I also don’t have many 2C. However, I do have about 300 3C Matches and 600 4C Matches identified, so far, and there are plenty more out there (at least per Speed and Balding). And I am finding many 5C-8C Matches (but my known Tree begins to thin out after that.)
1. Autosomal DNA “works” throughout a genealogy horizon for most of us.
2. The limiting factor is NOT the atDNA, it’s the genealogy – the lack of good Trees among our Matches; and the shrinking body of documentation the farther back we go.
3. When Matches Triangulate or group in Clusters, it’s often worth the effort to extend their Trees and find the Common Ancestor.
This blog post is one in a series to try and outline what you can generally expect – to put some generalized boundaries on genetic genealogy.
Anyone is welcome to use my estimate of the S&B data in the first spreadsheet, and apply it to the distribution of your own Matches. Please let me know if you see a glaring error in this process or the results.
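If you want to try this on your own Matches, the first step is counting how many fall in each cM range. Here is a minimal sketch, assuming you already have the shared cM value for each Match in a Python list. The bin edges and sample values below are hypothetical; use the ranges from the S&B chart and your own data instead.

```python
from bisect import bisect_right

# Hypothetical bin edges in cM; replace with the ranges used in the S&B chart.
EDGES = [6, 10, 20, 50]


def count_by_range(cm_values, edges=EDGES):
    """Count Matches per cM range: [6,10), [10,20), [20,50), and 50 or more."""
    labels = [f"{lo}-{hi} cM" for lo, hi in zip(edges, edges[1:])]
    labels.append(f"{edges[-1]}+ cM")
    counts = dict.fromkeys(labels, 0)
    for cm in cm_values:
        # Index of the range whose lower edge is the largest edge <= cm.
        i = bisect_right(edges, cm) - 1
        if i < 0:
            continue  # below the lowest edge, so not counted
        counts[labels[i]] += 1
    return counts


# Made-up shared cM values, one per Match.
sample = [6.2, 7.5, 9.9, 10.0, 15.3, 48.0, 50.0, 212.4, 5.1]
counts = count_by_range(sample)
print(counts)
```

The resulting counts per range are what you would multiply by the S&B percentages to estimate the number of Matches in each cousinship.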
[06D] Segment-ology: Distribution of Cousins by Jim Bartlett 20211209