Icicles – Part 2 and Match Clustering

Posted on January 8, 2019 by Jim Bartlett

Let me start by saying this Icicle methodology, here, has not been as useful or accurate as I thought it might be. I don’t want to steer the readers of this blog in the wrong direction.

I’ve used this Icicle method and expanded the number of my columns of icicles a lot. Some of them turn out to be very helpful, with many Matches from the same Ancestral line and/or from the same Triangulated Group (TG). However, many are not so helpful. After all, these Icicles are just In Common With (ICW) lists. ICW lists are found at AncestryDNA (called Shared Matches); at FamilyTreeDNA (called In Common With); at 23andMe (called Shared Relatives); at MyHeritage (called Shared DNA Matches); and at GEDmatch (called People who match both kits).

Just a reminder: to get an ICW list, the program starts with a list of all your Matches (at that company) and compares your list with a list of all the Matches of a “base” Match (which you select) – the ICW list is a list of all Matches which are on both lists. In other words, the comparison is based on Match names, or kit IDs, and nothing more than that. In general, you and your selected base Match will have a shared DNA segment and a Common Ancestor (CA). Your ICW Matches may, or may not, share the same DNA segment and/or the same ancestral line. This provides powerful information when there is such an alignment; but it’s just a list of data – not much help – when there isn’t an alignment.
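As a minimal sketch of that comparison, an ICW list is nothing more than the intersection of two sets of Match identifiers. The kit IDs and the `icw_list` helper below are made up for illustration; no company's actual export format is assumed.

```python
def icw_list(my_matches, base_matches):
    """Return the Matches that appear on both lists, sorted for stable output."""
    return sorted(set(my_matches) & set(base_matches))


# Hypothetical kit IDs, for illustration only.
my_matches = ["A001", "A002", "A003", "A004", "A005"]
base_matches = ["A002", "A004", "B777", "B888"]

icw = icw_list(my_matches, base_matches)
print(icw)  # ['A002', 'A004']
```

Note that nothing about segments or ancestors enters the calculation, which is why an ICW list alone cannot tell you whether the Matches are in alignment.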

The good news is that 23andMe and MyHeritage both tell you when there is shared DNA alignment with a Match. 23andMe puts a “Yes” in their Shared Relatives list; and MyHeritage adds a Triangulation “icon” in their Shared DNA Matches list when the ICW Match aligns (or Triangulates). GEDmatch lets you compare two kits, so you can check for a shared DNA segment; and their Tier1 Triangulation tool will list the top Matches which Triangulate. Since FamilyTreeDNA also provides segment data, we can check Matches in an ICW list to see if they are on the same overlapping DNA segment that you and the base Match share. Matches that are both ICW and on the same segment are in alignment over 95% of the time, and virtually 100% of the time when there are multiple (say at least 4, not closely related) Matches who meet these ICW AND same-segment criteria. See also the DoubleMatchTriangulator*.
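The ICW-plus-same-segment check can be sketched roughly as follows. This is only an illustration: the `Segment` layout, the Match names, and the 5,000,000 base-pair overlap threshold are all assumptions, not any company's format or an established rule. It also only tests each Match's segment against the base segment; it does not compare the ICW Matches with each other, which a full Triangulation would.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: int
    start: int  # start position in base pairs
    end: int    # end position in base pairs


def overlap_bp(a, b):
    """Length in base pairs of the overlap between two segments (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def same_segment_icw(base_seg, icw_segments, min_overlap_bp=5_000_000):
    """Names of ICW Matches with a segment overlapping the base segment.

    min_overlap_bp is an assumed cut-off chosen for this example.
    """
    return sorted(
        name
        for name, segs in icw_segments.items()
        if any(overlap_bp(base_seg, s) >= min_overlap_bp for s in segs)
    )


# The segment you share with the base Match (hypothetical positions).
base = Segment(5, 10_000_000, 30_000_000)

# Segments you share with each Match on the ICW list (hypothetical).
icw_segments = {
    "Match A": [Segment(5, 12_000_000, 28_000_000)],  # overlaps the base segment
    "Match B": [Segment(5, 40_000_000, 55_000_000)],  # same chromosome, no overlap
    "Match C": [Segment(7, 10_000_000, 30_000_000)],  # different chromosome
    "Match D": [Segment(2, 1_000_000, 9_000_000),
                Segment(5, 20_000_000, 45_000_000)],  # second segment overlaps
}

aligned = same_segment_icw(base, icw_segments)
print(aligned)  # ['Match A', 'Match D']
```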

The above segment Triangulations are a much more accurate and reliable way to group (or form Clusters of) Matches. However, this process is not available through AncestryDNA (although segment Triangulation can be accomplished on AncestryDNA kits uploaded to GEDmatch). For AncestryDNA Matches, clustering Matches into groups is a good way to analyze them – a step up.

One method to cluster Matches at AncestryDNA is the Leeds Method* – usually used to form cluster groups at the grandparent level, although some are pushing this a generation or two farther. At those more distant levels, some amount of judgment is needed.

Another method is my Icicle method, here, but this has turned out to be a lot of work, with mixed results – sometimes a good, helpful, thread is found; often one is not easily found, or one may not exist. There is no “rule” or argument that says an ICW list must have an ancestral thread. It’s logical that one may exist, given that you and the “base” Match have a Common Ancestor, and therefore some of the ICW Matches may have a higher probability of having the same CA. One tactic is to group the Icicles by ancestral lines or TGs, by moving such Icicle columns to be adjacent to each other and noting a common thread. However, it’s probably somewhat easier to use one of the tools below.

Several new methods for automatic Clustering have come out recently. GeneticAffairs* (small fee) now has an AutoClustering tool that puts all of your Matches (above a threshold) into a matrix, noting which are ICW each other, and then grouping them into matrix “boxes”. These “boxes” have a high probability of the same Common Ancestor, because there are multiple Matches in alignment with each other. Depending on the threshold, you might get 8 or 16 or 32 matrix “boxes” – representing 1, 2, or 3xG grandparents. NodeXL* also forms AncestryDNA Match clusters. And DNAGedCom Client* (small fee) has recently added a clustering tool.
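To make the matrix idea concrete, here is a rough sketch that groups Matches by treating ICW relationships as links and collecting everyone who is connected. This is a swapped-in simplification (connected components of the ICW graph), not the algorithm used by GeneticAffairs, NodeXL, or DNAGedCom; the Match names and function names are hypothetical. The small helper at the end just reproduces the 8, 16, 32 count of 1xG, 2xG, and 3xG grandparents.

```python
from collections import defaultdict


def cluster_matches(icw_pairs):
    """Group Matches into clusters: connected components of the ICW graph."""
    neighbors = defaultdict(set)
    for a, b in icw_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    clusters, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Walk outward from this Match, collecting everyone reachable via ICW links.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        clusters.append(sorted(group))
    return clusters


def ancestors_at(n_greats):
    """Number of Nx-great grandparents: 8 for 1xG, 16 for 2xG, 32 for 3xG."""
    return 2 ** (n_greats + 2)


# Hypothetical pairs of Matches who are ICW each other.
icw_pairs = [
    ("Ann", "Bob"), ("Bob", "Cal"), ("Ann", "Cal"),
    ("Dee", "Eli"), ("Eli", "Fay"),
]

clusters = cluster_matches(icw_pairs)
print(clusters)  # [['Ann', 'Bob', 'Cal'], ['Dee', 'Eli', 'Fay']]
print([ancestors_at(n) for n in (1, 2, 3)])  # [8, 16, 32]
```

In this simplified version a single Match who is ICW with members of two groups would merge them into one cluster, so real tools need something more selective than plain connectivity.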

Ideally these matrix “boxes”, or clusters, will group many of your Matches under the correct ancestor. They result in high probability outcomes; however, they are not perfect. Close cousins may be from several of these ancestor clusters and thus cause some confusion in clustering. But usually we know where the close cousins go. Also some Match cousins may share multiple CAs with you.

BOTTOM LINE – The best Clustering technique is Segment Triangulation – basically guaranteeing a Common Ancestor on a specific segment, IMO. I have a total of about 380 TGs that cover all of my DNA. Segment Triangulation is available for all the companies, except AncestryDNA. For AncestryDNA, there are several Clustering techniques, noted above, that can be used to group Matches.

* More about these tools, and others, can be found through: https://isogg.org/wiki/Autosomal_DNA_tools

[19B] Segment-ology: Icicles – Part 2 and Match Clustering by Jim Bartlett 20190107

16 thoughts on “Icicles – Part 2 and Match Clustering”

Pingback: Manual Clustering From the Bottom Up | segment-ology
Michael on January 9, 2019 at 8:12 pm said:

Yes, to be more clear, the match on both sides I was referring to matches my mom on 2 segments on 2 chromosomes and also matches my dad on 1 segment on a 3rd chromosome. However, the match to my dad is only 6 cM, so perhaps not IBD anyway.
And yes, also, your process and rationale are very clear: each segment must be treated separately and TGs built for each one until it becomes evident who the MRCA is for that segment. That’s the bottom-up approach and ultimately, the “main thing” for all of us who wish to build the most comprehensive and accurate family tree that we can.
For me, somewhat early in the hunt, I’m very interested to achieve my first demonstrative success: build a TG around a segment with an unknown ancestor and use that group of segment sharers to uncover an ancestor in all of our trees. And having accomplished that once, I can then replicate success and refine my process. So that’s why I refer to the approach I’m taking now as top-down…I want to find one example among my shared segments that reveals an MRCA via a TG! And finding the mother of my ancestor that has so far just been a name on a birth certificate would be an enormous breakthrough, not just for me, but for my family.
-M

Michael on January 9, 2019 at 5:26 pm said:

Thanks Jim, I was afraid of that…
Since I’m taking a top-down approach in this case, following a cousin-match with a specific surname and a paper trail to help me track down a missing ancestor, I was hoping that all 4 trails would lead to the same goal. However, as you point out, that is not necessarily the case. In fact, one of the “Yes” matches to this anchor match has 2 matching segments *and* matches both my parents, just as you said. So presumably, in this case, the 2 segments came down via two different paths.
My preferred approach for building triangulation groups (particularly when doing so with a specific ancestor as a goal vs. bottom-up analysis of many matches and segments) is to use DNA Painter. I haven’t seen that tool mentioned here, though you must have covered it. It’s a really fast and easy way to get all those segments into a graphical representation, and yet, all the numerical and contextual information is still embedded in the image when you need it.
So I placed this anchor cousin-match into DNA Painter – all four segments. Then, I began painting in all the “Yes” matches from the list on 23andMe, just segments over 7 cM. Of course, some lonely segments are scattered across the genome, but for the most part, stacks of segments begin to build on those 4 key locations. In this case, the smallest segment (9 cM) of the anchor match’s 4 segments is the one with the most matches – now well over 30 segments that overlap that first clue. But many of those overlapping segments are much larger, and quite a few seem to have exactly the same begin and end points, with 19 cM.
Then, I go through all those matches that have been triangulated by 23andMe’s algorithms and mapped into DNA Painter, pick one of the largest clustered segments (in this case, there is an overlapping match with 25 cM) and go back to 23andMe for this cousin and look for the shared matches that he brings up (like fishing with larger bait?).
Based on your advice, I will do the same thing for each of the 4 segment locations, and, if possible, repeat the process for segments I can find on GEDMatch, MyHeritage, and Family Tree DNA (each with its own color for segments in DNA Painter). This will likely result in a very long list of matches on some or all of the initial 4 segment areas, but is also fairly manageable using DNA Painter’s excellent user interface.
What are your thoughts on this process relative to the other methods you have been discussing?
-Michael

- jim4bartletts on January 9, 2019 at 6:43 pm said:
  
  Michael,
  Be careful if you are saying one segment matches both parents. It’s OK if one segment matches one parent and another segment matches the other parent.
  ALL of these tools and methods should lead to the same result – like a crossword puzzle, there is only one correct result. Your DNA came from your ancestors, down specific paths, to you only one way. Each segment came to you only one way.
  Some people prefer to use visual methods – Visual Phasing, Leeds Method, DNA Painter, AutoClustering Matrix, Kitty’s Chromosome Mapper, etc. On the other hand, I prefer spreadsheets, and seeing the threads in the Notes of Shared Matches at AncestryDNA. But make no mistake – they are all looking at the same data, the same segments you have in your body (from Ancestors) and the same Ancestors you have – these two elements (DNA segments and your Ancestors) compared to the same two elements of many different Matches. We are looking for agreement, for alignment, for corroborating evidence to determine the correct “solution”.
  It’s relatively easy with close Matches, and it gets harder and harder as the segments get smaller and the Match cousinships get more distant.
  Your process is a good one – it should work well for 16 2xG grandparents. In my spreadsheet, I have over 15,000 rows of Matches and segment data – all in 380 Triangulated Groups. This is a lot to show visually. I do enjoy showing my grandparents and great grandparents in a Chromosome Map (I have usually used Kitty’s Mapper Program, but can do the same “picture” with DNA Painter). And I use that “picture” to quickly determine if I have a reasonable number of crossover points per generation – remember each side *averages* about 34 crossovers per generation – it’s a rough quality checker.
  For me, my “next to the bottom line” is Matches grouped into TGs who will be from a Common Ancestor. I need a lot of such Matches in each TG, to find some with Trees and MRCAs. And multiple MRCAs on the same ancestral line are needed to confirm that the ancestral line is probably correct.
  The bottom line is finding MRCAs for these 380 TGs.
  Remember: The Main Thing, is to keep the Main Thing the Main Thing.
  
Michael on January 8, 2019 at 11:42 pm said:

Hi Jim,
Your blog is simply the best resource on this subject I have found – thanks so much for this!
On the subject of 23andMe and their triangulation function: I have a cousin who matches on 4 segments on 4 chromosomes in amounts of 9, 12, 15, and 24 cM. If I click on “Yes” in the shared matches list, that draws the segments on a chromosome map and shows the overlap. Can I be certain that these 3 people share the same overlap of DNA on the same chromosome (not one on the maternal chromosome and one on the paternal)? Or do I need to run the two matches against each other separately?
And as a more general follow-on: if such a cousin matches me on these 4 segments and it appears that the surname and genealogy back this up (shared 2nd great grandparents), is it safe to assume that all 4 segments come down from the same ancestral couple? Or do I have to build triangulation groups for each of these 4 segments on the chance that they come down 2 (or more) lines?

- jim4bartletts on January 9, 2019 at 8:35 am said:
  
  Michael,
  Thanks for your feedback.
  Unfortunately, each shared segment must be viewed independently. Although closer cousins will usually have multiple shared segments from the same, expected, Common Ancestor, this is not a hard and fast “rule”. I have a number of Matches who share multiple segments with me, with some segments on one parent and some on the other parent. Most are on the same side, but even there I see some of them on different grandparents.
  At 23andMe – when you have a “Yes” in the Shared Relatives, you need to check to see how many segments are involved. If there is only one shared segment, then, of course, the “Yes” refers to that segment. But with multiple shared segments, the “Yes” may refer to just one of them, some of them, or all of them – be sure to view them on the Chromosome Browser to see which segments really overlap. Jim
  
info@geneticaffairs.com on January 8, 2019 at 7:58 pm said:

Thanks Jim, good that we are on the same page. I will look into the FTDNA data as well as the 23andMe triangulation. Now I have to think about how to integrate these data. With respect to the TG data in the notes, I am afraid that unless it becomes a new de facto standard, retrieving the TGs from the notes comes with some difficulties with respect to syntax. But like you said, we do display the notes section in the table, which should provide for a reasonably quick overview.

- jim4bartletts on January 9, 2019 at 3:53 am said:
  
  The devil is always in the details. TGs are very personal. My TG would have different boundaries than every other Match who was in my TG. My 380 TGs have different boundaries than anyone else would have. Each of us is a unique jigsaw puzzle with differently shaped pieces (TGs). I wrote a blogpost about how I name my TGs – Chr# – location – 2 Ahnentafel to indicate the grandparent [24, 25, 36, 37]. An example would be 01S24, which happens to be a brick wall for which I’ve just determined a new surname. If you were in this TG, it could be through any of your grandparents, not necessarily 24. And on your Chr01, the letter code representing the TG start location may be Q or R or T or U. So a universal syntax that would give us both the same TG ID# is very elusive. Even tying it to a CA wouldn’t work as I have several TGs from this same brick wall. Maybe an Ancestor and his/her segment, but you’d have to complete and verify your chromosome map tied to CAs before you could do that.
  I agree with you that our best bet is to look at our Notes in the Auto Clustering boxes.
  
EJ Blom on January 8, 2019 at 12:57 pm said:

Hi Jim, nice blog post and explanation. So what would really enrich the cluster of the AutoCluster analysis from Genetic Affairs would be an additional analysis that looks at the shared segments within the members of a cluster. And if segments triangulate, this would add an additional layer of evidence on top of the clusters. By the way, the first 8 AutoCluster analyses are for free on the Genetic Affairs site 🙂

- jim4bartletts on January 8, 2019 at 2:18 pm said:
  
  EJ, thanks for the compliment. We may have different perspectives here. What may enrich a Triangulated Group (TG), is the information from Match clusters. I estimate that Triangulation creates about 350-400 TGs. Segment Triangulation is a mechanical process – no genealogy required (but some genealogy is helpful to assign the TGs to the proper paternal or maternal side.) This means that for each of say 32 clusters, there are about 12 TGs. So if there are some Matches in a TG that clearly link to a cluster (and we knew the 3xG grandparent for the cluster), we could probably assume that TG had that 3xG grandparent as a Most Recent Common Ancestor (MRCA). This would be very helpful. However, we clearly should not ascribe one TG to a Match cluster, because there should be about 12 TGs in each cluster.
  Also, I think many of the naturally occurring TGs we each have, could be from fairly distant CAs – probably some beyond our genealogy horizons. So we have to “walk the ancestors back” toward an ultimate CA for each TG.
  If we were to choose a large TG threshold, so that we’d only have about 32 TGs (to get a one to one ratio with clusters) we’d wind up with fewer than 1 TG for each of 44 autosomes – that would not be helpful.
  In summary, I think Match clustering can provide some input to solving the ancestry of each TG, but not the other way around.
  BTW, I am making extensive use of the AncestryDNA Notes box, and MEDBetterDNA to always display the notes I enter. I’ve written two blog posts about this. The two key building blocks of genetic genealogy (and chromosome mapping in particular) are genetic TGs, and genealogy MRCAs. When known, I put this info in Notes. I can then scan down the Shared Match list and pretty easily detect a thread. When Notes are available in a Match clustering, the same thread is available to help identify the cluster. The hard part is getting TG info for (from) AncestryDNA Matches.
  
  - EJ Blom on January 8, 2019 at 3:59 pm said:
    
    Hi Jim, thanks for the detailed reply, very informative. I was actually thinking of using the segment data (for instance from FTDNA, not sure if it is available for all matches for 23andMe) to perform these TGs and see how they are enriched within certain clusters. I could imagine that if these TGs are calculated, that certain clusters have more of the members of these TGs as compared to other clusters.
    
  - jim4bartletts on January 8, 2019 at 6:36 pm said:
    
    EJ,
    Adding TGs to Match clusters would definitely be a good thing. 23andMe and MyHeritage have already done the Triangulation – as I noted. When FTDNA Matches are on the same segment and are ICW (or in a Match cluster), there is almost certainty of Triangulation. I look forward to seeing this addition to your AutoClustering tool.
    
  - jim4bartletts on January 8, 2019 at 7:47 pm said:
    
    EJ,
    
    I should also note that I now have segment data for hundreds of my AncestryDNA Matches who have uploaded to GEDmatch (or also tested at other companies). I put that (TG) info in my Notes at AncestryDNA, but I’m not sure how your AutoClustering could pick up that info and include it, automatically (other than just displaying the text in each Match’s Notes box).
    
Elizabeth Wilson Ballard on January 8, 2019 at 5:17 am said:

A very good explanation. Thank you, Jim.

Veronica on January 8, 2019 at 2:50 am said:

Couldn’t agree more Jim. You can’t beat triangulation IMO. All these clustering methods are great clues, particularly if you have some GEDmatch crossovers into your known TGs. BUT that’s all they are, clues.

- jim4bartletts on January 8, 2019 at 2:55 am said:
  
  Veronica, Thanks. I find most clustering clues to be true and helpful. I’ve been able to use both TG and Cluster ancestors in queries to Matches – and often get back a positive/helpful response.
  