About jim4bartletts

I've been a genealogist since 1974, and started my first Y-DNA surname project in 2002. Autosomal DNA is a powerful tool, and I encourage all genealogists to take a DNA test.

Advanced Genetic Genealogy Book

I should let you all know, in case you missed it, there is a new book on DNA which just came out. There are several beginner and intermediate books on DNA for genealogy already available. But to my knowledge this is the only one so far focused on advanced topics. There are 14 chapters, each by a different author with lots of genetic genealogy experience. I wrote Chapter 1: Lessons Learned from Triangulating a Genome. The Editor, Debbie Parker Wayne, and I were at the FamilyTreeDNA Genetic Genealogy Conference in Houston, TX on 23 March when we found out the book was available on Amazon. So Debbie made an impromptu announcement with the proof copy she had…

Pictured: me, Debbie (Editor & Chapter 7), and Pattie Hobbs (Chapter 10)

Here is a picture of the front and back of the book:

Edited 4/6/2019 to add list of Chapters and Authors:

  1. Lessons Learned from Triangulating a Genome, Jim Bartlett, PE
  2. Visual Phasing Methodology and Techniques, Blaine T. Bettinger, JD, PhD
  3. X-DNA Techniques and Limitations, Kathryn J. Johnston, MD
  4. Y-DNA Analysis for a Family Study, James M. Owston, EdD
  5. Unknown and Misattributed Parentage Research, Melissa A. Johnson, CG
  6. The Challenge of Endogamy and Pedigree Collapse, Kimberly T. Powell
  7. Parker Study: Combining atDNA & Y-DNA, Debbie Parker Wayne, CG, CGL
  8. Would You Like Your Data Raw or Cooked? Ann Turner, MD
  9. Drowning in DNA? The Genealogical Proof Standard Tosses a Lifeline, Karen Stanbary, CG
  10. Correlating Documentary and DNA Evidence to Identify an Unknown Ancestor, Patricia Lee Hobbs, CG
  11. Writing about, Documenting, and Publishing DNA Test Results, Thomas W. Jones, PhD, CG, CGL, FASG, FUGA, FNGS
  12. Ethical Underpinnings of Genetic Genealogy, Judy G. Russell, JD, CG, CGL
  13. Uncovering Family Secrets: The Human Side of DNA Testing, Michael D. Lacopo, DVM
  14. The Promise and Limitations of Genetic Genealogy, Debbie Kennett, MCG

Glossary

Recommended Reading

Index

 

 

[99B] Segment-ology: Advanced Genetic Genealogy Book by Jim Bartlett 20190406

Clustering Programs

A Segment-ology TIDBIT

A number of folks have asked me about the different Clustering Programs, so I thought I’d post some information to get you started.

Clustering analyzes your InCommonWith (ICW) Matches at a company, and groups the Matches who are ICW with each other the most. Each Match in a Cluster will be ICW with most (but usually not all) of the other Matches in the Cluster. Clusters of 4 or more Matches tend to form around a specific Ancestor, which would impute the same Ancestor to every Match in the Cluster. NB: this is not a guarantee, but it appears to work almost all the time.
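To make the idea concrete, here is a minimal sketch in Python of grouping Matches by how much they are ICW with each other. It is not the algorithm used by any of the programs listed below. The greedy rule, the `min_share` cutoff and the Match names are all assumptions made up for illustration.

```python
from typing import Dict, List, Set


def cluster_icw(icw: Dict[str, Set[str]], min_share: float = 0.5) -> List[List[str]]:
    """Greedy grouping (an assumed rule, for illustration only):
    a Match joins the first Cluster where it is ICW with at least
    `min_share` of the current members; otherwise it starts a new Cluster."""
    clusters: List[List[str]] = []
    # Visit the best-connected Matches first so they seed the Clusters.
    for match in sorted(icw, key=lambda m: (-len(icw[m]), m)):
        for members in clusters:
            shared = sum(1 for other in members if other in icw[match])
            if shared / len(members) >= min_share:
                members.append(match)
                break
        else:
            clusters.append([match])
    return clusters


# Hypothetical ICW lists: two family groups with one cross link (Dee-Eve).
ICW = {
    "Ann": {"Bob", "Cal", "Dee"},
    "Bob": {"Ann", "Cal"},
    "Cal": {"Ann", "Bob", "Dee"},
    "Dee": {"Ann", "Cal", "Eve"},
    "Eve": {"Dee", "Fay", "Gus"},
    "Fay": {"Eve", "Gus"},
    "Gus": {"Eve", "Fay"},
}

clusters = cluster_icw(ICW)
for number, members in enumerate(clusters, start=1):
    print(number, sorted(members))
```

In this made-up data Bob lands in the first Cluster even though he is not ICW with Dee, which is the "most, but usually not all" behavior described above.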

Clustering Programs:

Leeds Method by Dana Leeds (free)

https://www.danaleeds.com/ see the Video and updated methods

This began as a color coding method of grouping close Matches at AncestryDNA into four columns, one for each grandparent. It has been expanded.

Genetic Affairs by Evert-Jon “EJ” Blom (several spreadsheets free, then a small fee)

http://www.geneticaffairs.com/ Register first, then log in

– automates the retrieval of new genetic Matches from 23andMe, FTDNA and AncestryDNA to a periodic email; and the AutoCluster tool will cluster close/large Matches

DNAGedcom Client by Rob Warthen ($5/mo fee; $50/yr)

Register here to start: https://www.dnagedcom.com/

– log onto your DNA company, and download Match and ICW files

– use the Collins Leeds Method 3D to run a cluster report

Shared Clustering by Jonathan Brecher (free)

https://github.com/jonathanbrecher/sharedclustering/wiki/Quick-start

– installs program on your computer

– currently need to download Match and ICW files at DNAGedcom Client

MyHeritage – offers a free report by Genetic Affairs!

GEDmatch – offering a Genetic Affairs type report soon! Under Tier 1 ($10/mo fee)

My recommendations include:

– Use a large threshold (80cM to 200cM) first to get the hang of it. This will only include your closest cousins.

– If offered, use an upper threshold of 1000cM or so, to cull out parents, siblings, children, aunt/uncle – they only appear in one Cluster anyway, and don’t really add any value in most cases.

– Reducing the threshold will increase the number of Clusters, and those Clusters will tend to form on more distant Ancestors.
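As a rough illustration of how the two thresholds act on a Match list, here is a small Python sketch. The Match labels and cM values are invented, and `select_matches` is a hypothetical helper, not a function from any of the programs above.

```python
from typing import Dict


def select_matches(shared_cm: Dict[str, float],
                   lower: float = 80.0,
                   upper: float = 1000.0) -> Dict[str, float]:
    """Keep only Matches whose shared cM falls between the two thresholds."""
    return {name: cm for name, cm in shared_cm.items() if lower <= cm <= upper}


# Hypothetical Match list (total shared cM with me).
MATCHES = {
    "Parent": 3450.0,
    "Aunt": 1700.0,
    "1C": 850.0,
    "2C": 210.0,
    "3C": 95.0,
    "4C-a": 42.0,
    "4C-b": 23.0,
}

first_run = select_matches(MATCHES, lower=80.0, upper=1000.0)
second_run = select_matches(MATCHES, lower=20.0, upper=1000.0)
print(sorted(first_run))   # closest cousins only
print(sorted(second_run))  # lower threshold pulls in more distant Matches
```

The upper threshold drops the Parent and Aunt in both runs. Lowering the bottom threshold adds the two more distant Matches, which is what gives the Clustering program more material to form additional Clusters.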

NB: Some additional Clustering Programs and ideas may show up in the comments below. I’ve used all of the programs above. I have also continued to do D I Y Clustering, outlined in a different Segment-ology blog post.

[22AF] Segment-ology: Clustering Programs TIDBIT by Jim Bartlett 20190404

Clusters Link to TGs and an Apology

A Segment-ology TIDBIT

Are Clusters based on Common Ancestors (CAs) or Triangulated Groups (TGs)? I said CAs, and Jonathan Brecher said TGs. I now think Jonathan has the best answer. My apologies for doubting his conclusions.

My point was that with a few, large, close Clusters, each Cluster must be formed on a CA, and include many TGs. A Match Clustering which results in 4 or 8 or 16 Clusters (which they tend to do) is clearly formed on 4 grandparents, 8 Great grandparents or 16 2xGreat grandparents – and this is borne out by the Leeds Method and other experience. These large Clusters must each include many TGs – and my experience bears this out.

However…  There’s often a “however.” As the Clustering thresholds are decreased, the number of Clusters formed increases. In my recent example I had 156 Clusters using a 20cM threshold. Note that one fourth of my Ancestry is from 1850’s immigrants, with very few Matches (and all of them were close cousins in one Cluster). Without that gap, I should have had about 208 Clusters (156 is three fourths of 208). And 161 Matches did not cluster. This gets us pretty close to 256 6xGreat Grandparents 8 generations back. And a number of my 156 Clusters appear to link with only one TG [Note this is AncestryDNA data, most of it without TGs].
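The arithmetic behind those numbers can be checked in a few lines of Python. This is only my reading of how the 208 and 256 figures fit together. The variable names are made up for the example.

```python
def ancestors_in_generation(generations_back: int) -> int:
    """Number of Ancestor slots in a generation (parents = 1, grandparents = 2, ...)."""
    return 2 ** generations_back


observed_clusters = 156          # Clusters found at the 20cM threshold
fraction_in_one_cluster = 0.25   # Ancestry from 1850's immigrants, all in one Cluster

# Scale the observed count up as if the whole Ancestry clustered like the other 3/4.
expected_clusters = observed_clusters / (1 - fraction_in_one_cluster)

# 6xGreat Grandparents are 8 generations back (6 "Greats" + grandparent + parent).
slots_at_gen_8 = ancestors_in_generation(8)

print(expected_clusters)  # 208.0
print(slots_at_gen_8)     # 256
```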

I am now reviewing my AncestryDNA Matches and trying to assign Cluster IDs to each one, by looking at the info I have in the Notes box for each Match and reviewing all their Shared Matches (and their Notes – easily viewed with MEDBetterDNA). In most cases, where the Match shares a single segment with me, I’m tending to identify a single Cluster. And when I have TG information, it’s tending to be one TG.

So, I’m going to eat some crow, and apologize to Jonathan Brecher. I now think he was on the right track, and that we should try to link Clusters to TGs (specific DNA segments). After all, each TG is from a specific Ancestral line. Of course, at AncestryDNA (without segment data), we’d still Cluster mainly on CAs. However, with a comparison between AncestryDNA Clusters and Clusters with other companies (with segments and TGs), we should be able to find a correlation between our AncestryDNA CA Clusters and TGs. Through this correlation, we could “impute” TGs to AncestryDNA Clusters.

So – thank you, Jonathan Brecher – for Clustering for several of us, for your comprehensive analysis and for your insight!

If anyone has been Clustering around CAs, that is still OK. Think of your Cluster CAs as potentially having multiple TGs – particularly the closer CAs (4C-6C range). And as you run Clustering with smaller thresholds, and find more Clusters, you’ll find your former Clusters subdividing into smaller Clusters, each of which would tend to match up with one TG – Walking the Clusters Back!

 

[22AE] Segment-ology: Clusters Link to TGs and an Apology TIDBIT by Jim Bartlett 20190222

D I Y Clustering

A Segment-ology TIDBIT

Automated Match Clustering involves large spreadsheets; selecting max and min thresholds; downloading data; using third-party tools; and then analyzing the clusters. Is there a different way to Cluster AncestryDNA Matches? I think there is… Do-It-Yourself Clustering.

I think we can select a Match and then look at our Shared Matches and then, often, see a trend or pattern among them. If we’ve used the Note boxes liberally (see below), we might see known Common Ancestors (CA) among the Shared Matches and/or a known Triangulated Group (TG) among them. Note that we sometimes know one of these Building Blocks (CA, TG) without knowing the other – that’s OK, they are both important clues that are “pointers” to a Cluster.

So…  in the Notes for each AncestryDNA Match, we select some notation to indicate what this trend or pattern is. This notation would be the tentative “Cluster ID”. We could use a Surname [PLUNKETT]; or a Couple [PLUNKETT/HAM]; or the Ahnentafel for this couple [104/105] (or just the shorthand version: 104 – see a CA ID method here). Or, for Matches who have uploaded elsewhere, and we know the DNA segment(s), we could use that data (see one method, the TG ID, here). Feel free to use whatever system works for you to identify which Cluster you feel pretty sure about for this Match. If it’s not clear, just skip this Match and come back later (we’d do this a lot for Matches with Private or No or skimpy Trees). Note: I believe each Cluster is based on an Ancestral line. Clusters around a closer CA will probably have multiple TGs; a more distant CA will tend to have one TG.
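For anyone who wants to work with Ahnentafel numbers in a spreadsheet or script, the standard numbering rules are simple: the base person is 1, a person's father is twice their number, and the mother is the father's number plus one. Here is a short Python sketch of those rules. The function names and the choice of the husband's number as the shorthand are just for this example.

```python
from typing import Tuple


def generations_back(n: int) -> int:
    """Generations from the base person (Ahnentafel 1) back to Ancestor n."""
    if n < 1:
        raise ValueError("Ahnentafel numbers start at 1")
    return n.bit_length() - 1


def couple(n: int) -> Tuple[int, int]:
    """Return (husband, wife) for the couple that Ancestor n belongs to."""
    if n < 2:
        raise ValueError("the base person has no couple number")
    husband = n - (n % 2)   # fathers are even, mothers are the next odd number
    return husband, husband + 1


def child_of(n: int) -> int:
    """Ahnentafel number of the child through whom Ancestor n is reached."""
    return n // 2


def cluster_id_for_couple(n: int, shorthand: bool = True) -> str:
    """Build a Cluster ID such as '104/105', or the shorthand '104'."""
    husband, wife = couple(n)
    return str(husband) if shorthand else f"{husband}/{wife}"


print(cluster_id_for_couple(105, shorthand=False))  # 104/105
print(cluster_id_for_couple(104))                   # 104
print(generations_back(104))                        # 6
print(child_of(104))                                # 52
```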

A real aid in this process is MEDBetterDNA. It’s a Chrome extension, so you must use the Chrome browser (free). It has several features but the critical one here is that you’ll see all your Match Notes all the time (no need to click on the little “page” icon). Google MEDBetterDNA and use checkbox: “always show Notes”. It REALLY helps in looking down a long list of Shared Matches. [BTW: it would be very nice for AncestryDNA to make this standard…].

To use this process, we also need to use the Note box – we need to enter any CA or TG we find for a Match. I started with all my Hints – each one had at least one CA. And, as I looked over all my closest Matches, I found more CAs. Sometimes I found Matches at GEDmatch, which I could Triangulate and link to AncestryDNA Matches, giving me a TG in the Note box. Whatever system you’ve used to find cousins with CAs or TGs, enter what you’ve found in the Note box. Then, for all Matches over 20cM, you’ll see those Notes when they are in a Shared Match list. The homework assignment here is to enter Notes for as many of your 4th cousins (4C), or closer, as possible. Note that you’d need this same data in order to get anything out of a Match Clustering Matrix spreadsheet.

Then, starting with 4C (saving closer cousins for later), look at each Match. See if you can tell from their Notes and the Shared Matches’ Notes what the Cluster would be. Maybe there will be multiple choices. Whatever it is, enter your Cluster ID in the beginning of the Note box. Go to the next 4C Match and repeat.  Skip any Match you want – this is an iterative process, and you may need to go through your list several times – I believe the Cluster IDs will “tighten up” – become more solid – with each iteration. At some point, even the Matches with Private/No/Skimpy Trees will have lots of Shared Matches with the same Cluster ID. Give that Match a Cluster ID, too!
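The iterative pass described above can be sketched in Python. This is only an illustration of the idea, not a tool I use: the Match names, the `min_votes` rule (at least two Shared Matches must agree) and the data layout are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


def propose_cluster_id(match: str,
                       shared_matches: Dict[str, List[str]],
                       cluster_ids: Dict[str, str],
                       min_votes: int = 2) -> Optional[str]:
    """Suggest the Cluster ID seen most often among a Match's Shared Matches."""
    votes = Counter(cluster_ids[sm]
                    for sm in shared_matches.get(match, [])
                    if sm in cluster_ids)
    if not votes:
        return None
    best, count = votes.most_common(1)[0]
    return best if count >= min_votes else None


def diy_cluster(shared_matches: Dict[str, List[str]],
                known_ids: Dict[str, str],
                passes: int = 3) -> Dict[str, str]:
    """Go through the list several times; skipped Matches get another chance."""
    ids = dict(known_ids)
    for _ in range(passes):
        changed = False
        for match in shared_matches:
            if match in ids:
                continue
            proposal = propose_cluster_id(match, shared_matches, ids)
            if proposal is not None:
                ids[match] = proposal
                changed = True
        if not changed:
            break
    return ids


# Hypothetical data: two Matches with known CAs, two with no usable Tree.
SHARED = {
    "PrivateTree": ["NoTree", "CousinA"],
    "NoTree": ["CousinA", "CousinB", "PrivateTree"],
    "CousinA": ["CousinB", "NoTree", "PrivateTree"],
    "CousinB": ["CousinA", "NoTree"],
}
KNOWN = {"CousinA": "104", "CousinB": "104"}

ids = diy_cluster(SHARED, KNOWN)
print(ids)
```

In this example the Match called PrivateTree is skipped on the first pass (only one Shared Match has a Cluster ID), and picks up its Cluster ID on the second pass, after NoTree has been assigned.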

After you’re satisfied with the 4C list, you can cycle back to the 3C list, and confirm that they are compatible with the trend of their Shared Matches. Each 3C may be associated with several Clusters. In fact some of your 4C Matches may have a few Clusters. This is OK – but multiple Clusters should be for adjacent ancestral lines which eventually converge (marry) at some level.

At this point, you can look at Matches beyond the 4C level. Many of my Hints with CAs are beyond 4C. Many of them will have Shared Matches (4C or Closer), and the Notes will point toward a Cluster ID. Although these distant Matches won’t show up in a Shared Match list, I’d still enter the Cluster ID in the Note box, just to keep track. You’d also need to list these separately – in a spreadsheet or on paper. However, if you put a hashtag, like #Cluster in your Notes, you can search on different Clusters. I just searched my AncestryDNA Results for #A0856 [my hashtagged CA ID] and 10 Matches popped up, including Matches with 6.3cM, 7.4cM and 13.2cM.

If I decided the above distant CA, #A0856, was a good Cluster ID, I’d enter #C0856 as the first entry in the Notes for all the Matches I thought were in that Cluster. Later, I could make a download and sort on the Note field to group all the Matches by Clusters. Or I could easily check my work against an Automated Match Clustering Program. Hopefully there wouldn’t be many differences.
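Grouping Matches by the hashtag in their Notes is easy to script once the Notes are in a file. Here is a minimal Python sketch, assuming the Notes have already been read into a dictionary. The Match names and Note text are invented, and the layout of an actual download is not assumed here.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Cluster hashtags in the style used above: "#C" followed by four digits.
CLUSTER_TAG = re.compile(r"#C\d{4}\b")


def group_by_cluster(notes: Dict[str, str]) -> Dict[str, List[str]]:
    """Collect Matches under every Cluster hashtag found in their Notes."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for match, note in notes.items():
        for tag in sorted(set(CLUSTER_TAG.findall(note))):
            groups[tag].append(match)
    return dict(groups)


# Hypothetical Notes, keyed by Match name.
NOTES = {
    "Match1": "#C0856 #A0856 HIGGINBOTHAM 13.2cM",
    "Match2": "#C0856 7.4cM",
    "Match3": "#C0104 #C0856 PLUNKETT/HAM",
    "Match4": "no tree yet",
}

groups = group_by_cluster(NOTES)
for tag in sorted(groups):
    print(tag, groups[tag])
```

Note that Match3 shows up under two Clusters, and the #A0856 CA hashtag is ignored because only #C tags are counted as Cluster IDs.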

The beauty – and benefit – of DIY Clustering:

  1. You can put a Match into more than one Cluster! Clustering programs have trouble with close cousins and multiple CAs/TGs – they don’t fit into just one Cluster. But what’s wrong with putting a Match into two or three Clusters if they really fit? Nothing – you are in charge with DIY Clustering.
  2. With Automated Match Clustering, you must have all your clues in place, up front. With DIY Clustering you can select which Clusters to work on first, and then get to the others later. Work at your own pace.
  3. DIY Clustering is primarily for AncestryDNA Matches, but you can also compare these Clusters with Match CAs and TGs from other companies. They should align and reinforce each other.

So, if you’d rather not use a Match Clustering program/spreadsheet, Do-It-Yourself. It involves entering Notes in a lot of Matches, but that is a good practice anyway. And the good news is you can adjust your Notes, and Cluster designations, as you go along. I actually believe we’ll get a better result with this DIY method, which we can easily tweak. I’m going to try it.

 

[22AD] Segment-ology: DIY Clustering TIDBIT by Jim Bartlett 20190218

Match Cluster Report 1

Here is a report on my first Match Clustering effort.

Background info:

  1. I used a download of my AncestryDNA Matches above 20cM (I only had a few real 3rd cousins (3C) and below, and I just left them in).
  2. I have made extensive use of the Notes for as many Matches as I can – all of my almost 1,000 Hints; and maybe 1/4, so far, of all my 4C and closer. NB: AncestryDNA uses 20cM as the threshold for 4C designations, but many Matches in this group are 5C and 6C and I’ve found some who are 7C and 8C, with larger than average shared segments over 20cM.
  3. For every Match I can, I put the Shorthand CA ID and/or Shorthand TG ID in the Note box for that Match. See the Explanation of Header row below for links that explain these IDs.
  4. For each Match, I also put a line in each Note which includes a summary of the CA and TG IDs found in all the Match’s Shared Matches (SM). So even a Match with a Private Tree, or No Tree, or scrawny Tree, or can’t-find-anything-in-it large Tree, will get a line summarizing their SMs. This summary often provides a very specific “pointer” to a CA and/or TG. And this added info is very helpful in analyzing Clusters.

 

When I ran the Cluster Matrix, I developed this summary report:

22ACa Summary of Match Cluster 1

Next is a spreadsheet with the 86 Clusters, re-sorted on the CA.

Explanation of Header row:

Cluster – the Cluster # in the Cluster Spreadsheet presented to me.

First & Last – the Match # range included in this Cluster (Matches go from 1 to 3571)

SMs – the number of Shared Matches in each Cluster – a wide range…

CA – the CA ID (an Ahnentafel # – see this blogpost). When various Matches had CAs from different generations, but all on the same line, I used the most distant CA – Walking the Ancestor Back. A few Clusters had multiple CA lines, but I used CAs that Walked Back or were repeated several times.

Gen – as a convenience, I noted the generations back to the CA

TGs – the TG ID (see this blogpost). In all cases (I think) the last two numbers in each TG ID (being the TG grandparent) are in agreement with the CA ID. A number of Clusters have multiple TGs.

NB: The CAs and TGs come from my typed Notes for some Matches (I just haven’t gotten to all 3,571 of them, yet). The Notes are based on valid data – from the Match or GEDmatch (i.e. not guesses by me), but I’m fully aware that some of it is not conclusive; and another, closer and/or different, CA may be found. The TGs should not change, but often a Match will have multiple TGs, and only one would apply to the specific Cluster or CA.

Figure 1. Summary of 86 Clusters

A few notes on this data:

  1. I am sure that, eventually, the Clusters at the top of this table will be found to link to more distant Ancestors – I just haven’t found them yet.
  2. I am sure that, eventually, the two Clusters in Gens 10 and 11 will wind up with different, closer CAs – I just haven’t found them yet (there are relatively few Matches in each of these Clusters).
  3. For the bottom 9 Clusters, I do have TGs, so I can use Matches from other companies (already included in these TGs in my Master Spreadsheet), to find likely (or at least possible) CAs. It’s just that no CAs have been determined yet at AncestryDNA for the Matches in these Clusters.
  4. In Gen 9, CA 856 is my prolific and well documented HIGGINBOTHAM Ancestor; and I’ve Walked this Ancestor Back in at least two TGs. There are several lines from this Ancestor who intermarried.
  5. In Gen 8, Cluster 61, over 100 Matches – this was a brick wall at Gen 5, until I found several dozen Matches in Gen 6-8 with CUMMINS/CUMMINGS Ancestry, which I have subsequently researched into one Tree – also a prolific line. And a new branch of my Tree!
  6. I’m sure there will be unfolding stories about other of these Clusters – I’m excited to see the way this is trending.

 

[22AC] Segment-ology: Match Cluster Report 1 – by Jim Bartlett 20190214

Walking the Match Clusters Back

A Segment-ology TIDBIT

It appears to me that the next step for Clusters is “Walking the Clusters Back.”

By this I mean, start with the original Leeds Method, 2nd cousins (2C) and 3C, which tends to result in 4 Clusters – one for each grandparent. Often, particularly with known 2C and 3C, you will be able to determine the grandparent for each Cluster.

Then adjust the shared segment cM threshold to focus on 3C and 4C and try to get 8 Clusters. This may take some fine tuning in the threshold, but if you get plus or minus one or two Clusters, that’s OK – just work around it. Now if you can tell from the Matches who were in the 4 Cluster Matrix who repeat in this nominal 8 Cluster Matrix, you know which two Clusters belong to each of the 4 grandparents. Then, if you can figure out the great grandparent in one of the two Clusters for each grandparent, then the other Cluster should be for the other great grandparent.

Once you do what you can with the 8 great grandparent Clusters, adjust the cM thresholds, and rerun a Cluster Matrix to shoot for 16 Clusters and repeat the process.
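The step of matching up a finer run with the previous, coarser run can be sketched in Python. This is an illustration under assumed data: the Cluster labels, Match names and the "largest overlap wins" rule are made up for the example.

```python
from typing import Dict, Hashable, Optional, Set


def link_runs(coarse: Dict[Hashable, Set[str]],
              fine: Dict[Hashable, Set[str]]) -> Dict[Hashable, Optional[Hashable]]:
    """For each Cluster in the finer run, find the coarser Cluster
    that shares the most Matches with it (None if no Match repeats)."""
    links: Dict[Hashable, Optional[Hashable]] = {}
    for fine_id, fine_members in fine.items():
        overlaps = {coarse_id: len(fine_members & members)
                    for coarse_id, members in coarse.items()}
        best = max(overlaps, key=overlaps.get)
        links[fine_id] = best if overlaps[best] > 0 else None
    return links


# Hypothetical runs: grandparent-level Clusters, then a lower cM threshold.
COARSE = {
    "paternal grandfather": {"A", "B", "C"},
    "paternal grandmother": {"D", "E"},
}
FINE = {
    1: {"A", "B", "X"},
    2: {"C", "Y"},
    3: {"D", "E", "Z"},
    4: {"V", "W"},
}

links = link_runs(COARSE, FINE)
for fine_id, parent in links.items():
    print(fine_id, "->", parent)
```

Here Clusters 1 and 2 both trace to the same grandparent, so if one of them is identified with a great grandparent, the other should belong to the spouse. Cluster 4 has no repeating Matches and has to be placed some other way.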

This would be Walking the Clusters Back. And, in the long run, it might be more efficient and accurate than trying to start with a small cM threshold and getting a large number of Clusters – 128 to 512 Clusters. As the number of Clusters grows, more and more Matches will be conflicting; and more distant Matches may well share more than one Common Ancestor with you. It just gets more complicated to sort out at the larger Matrix levels. Walking the Clusters Back will make this process easier.

And the absolutely great news – a huge benefit of Clusters – is that Shared Matches will cluster when they are Private, or have little or no Tree, or even when they have a robust Tree, but you cannot find any Common Ancestor. In other words, no genealogy (nor TGs, for that matter) is required to place a Match in a Cluster. AncestryDNA Matches who share less than 20cM can also be manually added to a Cluster, based on their Shared Matches. This brings “into the fold” Matches which normally would not be grouped. And putting these Matches into Clusters at any level really helps when it comes to building parts of their Tree out to meet yours.

Match Clusters really fine tune our data. Happy dance… [HT: Dana Leeds]

 

[22AB] Segment-ology: Walking the Clusters Back TIDBIT by Jim Bartlett 20190214

Confessions of a Match Clusterer

A Segment-ology TIDBIT

I’ve been explaining and discussing and arguing about Match Clusters recently. One debate concerns whether a Cluster is formed around an Ancestor (CA) or a Triangulated Group (TG). I argue that Clustering tends to result in 4 or 8 or 16 or 32 Clusters (or some other number of Ancestors in a given generation), depending on the shared segment cM threshold used. It might seem like I know the Ancestors and/or the TGs for each of my Clusters.

Confession time – I do not!

I’m working hard to determine as many as I can, but the current status is still spotty. I’m having a fairly good experience with TGs (98% of my DNA is covered by TGs); and know some CAs (over 80% of my TGs are known to the grandparent level). But I still have a long way to go on Chromosome Mapping. “Walking the Ancestors Back” on each TG is the name of that game.

I’m fairly new to Match Clustering, and as I look over that data (from my recent Cluster Matrix of AncestryDNA Matches over 20cM), I see lots of bare spots. I do see some trends, but in no way have I determined distant Ancestors (the CAs) for each of my Clusters. Nor have I determined the TGs for each of my Clusters – some Clusters have multiple TGs, and many have no TGs (after all, this Matrix is based on AncestryDNA data). It will take a while to analyze and weigh the information I’m collecting.

I’m working on a better analysis and a report of the one Cluster Matrix I’ve tried so far – stay tuned!

 

[22AA] Segment-ology: Confessions of a Match Clusterer TIDBIT by Jim Bartlett 20190214