Measuring Segments

Posted on May 13, 2015 by Jim Bartlett

For autosomal DNA, segments are measured 3 ways:

base pairs (bp) – these are the individual building blocks (molecules) that form each chromosome. Over the entire set of chromosomes there are 3.2 billion base pairs. Each chromosome has from 48 to 250 million base pairs. So a segment can be defined by the Start Location and End Location. Think of base pairs as a physical picture of a segment – its physical length and location on a chromosome.

Example: Start at 23,500,000; End at 46,300,000

Many of us round to the closest Megabasepair (Mbp). One Mbp is 1,000,000 bp.

Example: The segment is 23.5-46.3Mbp

Rounding makes these numbers much easier to read and to type. And, in my opinion, there is virtually nothing lost in accuracy. Mbp is just fine for genealogy, triangulation and chromosome mapping. If you want to do some analysis of a particular part of a segment for some scientific or medical reason, you may want to use bp. (I’ll discuss “fuzzy data” and “fuzzy segments” in a separate blog post)

centiMorgans (cM) – I think of cM as a “quality” factor, or a “genetic distance” of sorts. The cM is the best measure we have of genetic distance, but it is far from perfect. The cM is empirically derived – that is, scientists have recorded many observations and put them into tables. From these tables the cM value between any two points on a chromosome (as measured by bp) can be determined. In very general terms, more is better: the larger the cM of a shared segment, the closer the Match is likely to be. DNA is very random, and there are wide ranges of cM vs cousinship (including much overlap). See these references for more info [1], [2] and [3].

Example: A segment may be 15.4cM

SNPs – the single molecules (nucleotides), or base pairs, which show some amount of variation in human DNA. Most (99%) of our DNA is the same. For genealogy, we are looking for SNPs (sometimes referred to as markers) which are known to vary. The difference in our SNPs is what sets us apart. Basically each SNP can have one of 4 values: A, C, G or T. Each of the autosomal DNA testing companies uses a slightly different “chip” to determine these values, and they each effectively test a different number of SNPs – usually in the range of 600,000 to 700,000 SNPs. These are spread out over all of your chromosomes – think of them as a sampling of your DNA (a sampling of the most variable parts of your DNA). This might range from about 10,000 SNPs on the smallest chromosomes to over 58,000 on Chromosome 1.

Example: A segment may include 2,451 SNPs

Note that there is no firm correlation between these measurements. We can convert temperature measured in Centigrade to Fahrenheit because they both measure the same thing. All of the above measurements measure different things. However, on average, there are about 100 cM for 100 Mbp.

So, in summary:

A segment may be described: Chr 6: 23,500,000 to 46,300,000; 15.4cM; 2451 SNPs

A short cut description may be 6: 23.5-46.3Mbp; 15.4cM; 2451 SNPs
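The three measurements can be kept together in one small record. The Python sketch below is only an illustration: the class name `Segment` and its field and method names are hypothetical, chosen for this example. It reproduces the two descriptions above and also works out the cM per Mbp for the example segment, which shows how far a single segment can sit from the rough 1 cM per 1 Mbp average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One shared segment described by bp locations, cM and SNP count.

    All names here are hypothetical, chosen only for this illustration.
    """
    chromosome: str
    start_bp: int
    end_bp: int
    cm: float
    snps: int

    def full_description(self) -> str:
        # Full base-pair locations, with thousands separators.
        return (f"Chr {self.chromosome}: {self.start_bp:,} to {self.end_bp:,}; "
                f"{self.cm}cM; {self.snps} SNPs")

    def short_description(self) -> str:
        # Locations rounded to one decimal place of Mbp.
        start_mbp = self.start_bp / 1_000_000
        end_mbp = self.end_bp / 1_000_000
        return (f"{self.chromosome}: {start_mbp:.1f}-{end_mbp:.1f}Mbp; "
                f"{self.cm}cM; {self.snps} SNPs")

    def cm_per_mbp(self) -> float:
        # Genetic length divided by physical length for this one segment.
        return self.cm / ((self.end_bp - self.start_bp) / 1_000_000)


example = Segment("6", 23_500_000, 46_300_000, 15.4, 2451)
print(example.full_description())
print(example.short_description())
print(round(example.cm_per_mbp(), 2))
```

For the example segment the ratio comes out well under 1 cM per Mbp, which is consistent with the point that the measurements do not convert cleanly into one another.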

References:

[1] http://www.isogg.org/wiki/CentiMorgan

[2] http://en.wikipedia.org/wiki/Centimorgan

[3] http://compgen4.rutgers.edu/mapinterpolator

03 Segmentology: Measuring Segments; by Jim Bartlett 20150513

How To Triangulate

Posted on May 11, 2015 by Jim Bartlett


Here is a 3-step process for Triangulation: Collect, Arrange, Compare/Group.

Collect all the Match-segments you can. I recommend testing at all three companies (23andMe, FTDNA, and AncestryDNA), and using GEDmatch. But, wherever you test, get all of your segments into a spreadsheet. If you are using more than one company, you need to download, and then arrange, the data in the same format as your spreadsheet. Downloading/arranging is best when starting a new spreadsheet. Downloading avoids typing errors, but direct typing is sometimes easier for updates. I recommend deleting all segments under 7cM – most of them will be IBC/IBS (false segments) anyway, and even the ones which may be IBD are very difficult to confirm as such. You are much better off doing as much Triangulation as you can with segments over 7cM (or use a 10cM threshold if you wish), and then adding smaller segments back in later, if you want to analyze them. NB: Some of your closer Matches will share multiple segments with you – each segment must be entered as a separate row in your spreadsheet. The minimum requirement for a Triangulation with a spreadsheet includes columns for MatchName, Chromosome, SegmentStartLocation, SegmentEndLocation, cMs and TG. Most of us also have columns for SNPs, company, testee, and any other information of interest to you. Perhaps I need a separate blog post about spreadsheets… ;>j
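For anyone who keeps the data outside a spreadsheet program, the threshold step might look something like the Python sketch below. It is a minimal illustration, assuming each row is a dictionary keyed by the minimum column names listed above; the Match names and values are invented, and the function name `split_by_threshold` is hypothetical. Segments under the threshold are set aside rather than thrown away, so they can be added back in later.

```python
MIN_CM = 7.0  # threshold recommended in the text; 10.0 is the stricter option

# Hypothetical rows using the minimum set of columns.
rows = [
    {"MatchName": "Match A", "Chromosome": "6", "SegmentStartLocation": 23_500_000,
     "SegmentEndLocation": 46_300_000, "cMs": 15.4, "TG": ""},
    {"MatchName": "Match A", "Chromosome": "11", "SegmentStartLocation": 5_100_000,
     "SegmentEndLocation": 9_800_000, "cMs": 5.2, "TG": ""},
    {"MatchName": "Match B", "Chromosome": "6", "SegmentStartLocation": 30_000_000,
     "SegmentEndLocation": 41_000_000, "cMs": 8.1, "TG": ""},
]


def split_by_threshold(rows, min_cm=MIN_CM):
    """Return (kept, set_aside): rows at or above min_cm, and rows below it."""
    kept = [row for row in rows if row["cMs"] >= min_cm]
    set_aside = [row for row in rows if row["cMs"] < min_cm]
    return kept, set_aside


kept, set_aside = split_by_threshold(rows)
print(len(kept), "segments kept;", len(set_aside), "set aside for later")
```

Note that Match A appears twice, once per shared segment, as the NB above requires.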

Arrange the segments by sorting the entire spreadsheet (Ctrl-A) by Chromosome and Segment StartLocation. This is one sort with two levels – the Chromosome column is the first level. This puts all of your segments in order – from the first one on Chromosome 1 to the last one on Chromosome 23 (for sorting purposes I recommend changing Chromosome X to 23 or 23X so it will sort after 22). This serves the purpose of putting overlapping segments close to each other in the spreadsheet where they are easy to compare.
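The same two-level sort can be sketched in Python. This is an illustration only: the rows are invented, and the helper name `chromosome_sort_key` is hypothetical. It treats X (or 23X) as 23 so that it sorts after Chromosome 22.

```python
def chromosome_sort_key(chromosome: str) -> int:
    """Map a chromosome label to a number, with X (or 23X) sorting as 23."""
    label = str(chromosome).strip().upper()
    if label in {"X", "23X"}:
        return 23
    return int(label)


# Hypothetical rows, deliberately out of order.
rows = [
    {"MatchName": "Match C", "Chromosome": "X", "SegmentStartLocation": 12_000_000},
    {"MatchName": "Match B", "Chromosome": "6", "SegmentStartLocation": 30_000_000},
    {"MatchName": "Match D", "Chromosome": "22", "SegmentStartLocation": 18_000_000},
    {"MatchName": "Match A", "Chromosome": "6", "SegmentStartLocation": 23_500_000},
]

# One sort, two levels: chromosome first, then segment start location.
rows.sort(key=lambda row: (chromosome_sort_key(row["Chromosome"]),
                           row["SegmentStartLocation"]))

for row in rows:
    print(row["Chromosome"], row["SegmentStartLocation"], row["MatchName"])
```

Sorting on a numeric key rather than the text label also avoids the common problem of "10" sorting before "2".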

Compare/Group overlapping segments. All of these segments are shared segments with you. So with segments that overlap each other, you want to know if they match each other at this location. If so, this is Triangulation. This comparison is done a little differently at each company, but the goal is the same: two segments either match each other, or they don’t (or there isn’t enough overlapping segment information to determine a match). All the Matches who match each other will form a Triangulated Group, on one chromosome – call this TG A (or any other name you want). Go through the same process with the segments that didn’t match TG A. They will often match each other and will form a second, overlapping TG, on the other chromosome – call this TG B. [Remember you have two of each numbered chromosome.] So to review, and put it all a different way: All of your segments (every row of your spreadsheet) will go into one of 4 categories:

– TG A [the first one with segments which match each other]
– TG B [the other, overlapping, one with segments which match each other]
– IBC/IBS [the segments don’t match either TG A or TG B]
– Undetermined [there are not enough segments to form both TG A and TG B and/or there isn’t enough overlapping data to determine a match.]
NB: None of the segments in TG A should match any of the segments in TG B.
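The sorting into four categories can be sketched in Python for one chromosome area. This is a simplified illustration under stated assumptions, not a definitive method: it assumes you already have the list of Matches with overlapping segments and a set of Match pairs confirmed to match each other there, it groups Matches by chaining those confirmed pairs together (rather than requiring every pair in a group to be confirmed), and it treats leftover singles as IBC/IBS only when both TG A and TG B have formed. The function name `group_overlapping` and the Match names are hypothetical.

```python
def group_overlapping(matches, matched_pairs):
    """Sort overlapping Matches into TG A, TG B, IBC/IBS or Undetermined.

    matches: Match names with overlapping segments in one chromosome area.
    matched_pairs: set of frozenset({name1, name2}) for pairs confirmed to
    match each other on this segment.
    """
    remaining = list(matches)
    components = []
    while remaining:
        seed = remaining.pop(0)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            for other in remaining[:]:
                if frozenset((current, other)) in matched_pairs:
                    remaining.remove(other)
                    group.append(other)
                    frontier.append(other)
        components.append(group)

    # Groups of two or more Matches are candidate TGs, largest first.
    groups = sorted((g for g in components if len(g) >= 2), key=len, reverse=True)
    singles = [g[0] for g in components if len(g) == 1]

    result = {"TG A": [], "TG B": [], "IBC/IBS": [], "Undetermined": []}
    if groups:
        result["TG A"] = groups[0]
    if len(groups) > 1:
        result["TG B"] = groups[1]
    for extra in groups[2:]:
        # A third group should not happen; flag it for a closer look.
        result["Undetermined"].extend(extra)
    # Assumption: singles are IBC/IBS only once both TGs exist.
    leftover_key = "IBC/IBS" if len(groups) >= 2 else "Undetermined"
    result[leftover_key].extend(singles)
    return result


matches = ["Ann", "Bob", "Cal", "Dee", "Eve", "Fay"]
matched_pairs = {frozenset(p) for p in [("Ann", "Bob"), ("Bob", "Cal"), ("Dee", "Eve")]}
for category, names in group_overlapping(matches, matched_pairs).items():
    print(category, names)
```

In this invented example Ann, Bob and Cal form TG A, Dee and Eve form TG B, and Fay, who matches neither group, lands in IBC/IBS.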

At GEDmatch – the comparisons are easy. Just compare two kit numbers using the one-to-one utility to see if they match each other on the appropriate segment. The ones that do are Triangulated. You may also use the Tier1 Triangulation utility or the Segment utility. I prefer using the one-to-one utility and Chrome.

At 23andMe you have several different utilities:

– Family Inheritance: Advanced lets you compare up to 5 Matches at a time. You may also request a spreadsheet of all your shared segments; sort that by chromosome and SegmentStart, and check to see if two of your Matches match each other. The ones that do are Triangulated.
– Countries of Ancestry: Sort a Match’s spreadsheet by chromosome and SegmentStart, search for your own name, and highlight the overlapping segments. The Matches on this highlighted list who are also on overlapping segments in your spreadsheet are Triangulated (the CoA spreadsheet confirms the match between two of your Matches)

At FTDNA it’s a little trickier, because they don’t have a utility to compare two of your Matches. So the most positive method is to contact the Matches and ask them to confirm if they match your overlapping Matches, or not. The ones that do are Triangulated. An almost-as-good alternative is to use the InCommonWith utility. Look for the 2-squiggly-arrows icon next to a Match’s name, click that, and select In Common With to get a list of your Matches who also match the Match you started with. Compare that list of Matches with the list of Matches with overlapping segments in your spreadsheet. Matches on both lists are considered to be Triangulated. Although this is not a foolproof method, it works most of the time. And if you find three or four ICW Matches in the same TG, the odds are much closer to 100%. Remember, every segment in your spreadsheet must go in one TG or the other, or be IBC/IBS, or be undetermined. If a particular Match, in one TG, is critical to your analysis, then try hard to confirm the Triangulation by contacting the Matches.
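The In Common With comparison amounts to finding the names that appear on both lists. A minimal Python sketch, assuming you have typed or pasted both lists yourself (the names are invented, and the function name `icw_candidates` is hypothetical; nothing here calls any FTDNA service):

```python
def icw_candidates(icw_list, overlapping_matches):
    """Return the Matches that appear on both lists, in alphabetical order."""
    return sorted(set(icw_list) & set(overlapping_matches))


# Matches reported as In Common With the Match you started with.
icw_list = ["Dee", "Ann", "Gus", "Bob"]
# Matches with overlapping segments in this area of your spreadsheet.
overlapping_matches = ["Ann", "Bob", "Cal"]

print(icw_candidates(icw_list, overlapping_matches))
```

As the text says, being on both lists is strong evidence rather than proof, so the result is best read as a list of candidates to confirm.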

AncestryDNA has no DNA analysis utilities. You need to convince your Matches to upload their raw data to GEDmatch (for free) or FTDNA (for a fee), and see the paragraphs above.

Comments to improve this blog post are welcomed.

10 Segmentology: How to Triangulate; by Jim Bartlett 20150511

Does Triangulation Always Work?

Posted on May 10, 2015 by Jim Bartlett

I am sometimes asked if Triangulation “always” works. And with that question, there are always the follow-up questions: with 15cM shared segments, 10cM segments, 7cM segments, 5cM segments, any size segments?

There are many things about DNA in genetic genealogy that are based on a distribution curve. And the distribution curve is based on experience. The classic example is IBD vs IBC segments. No one has reported, yet, any shared segment over 15cM that has proved to be not IBD. Very few examples exist for shared segments in the 10-15cM range. As we lower the range to 7cM, experience indicates that the percent IBD drops to about the 50% range, give or take. The point is that there is a distribution curve, and we cannot say for certain that a shared segment below 15cM is IBD or not, just by the cMs.

Well, Triangulation is formed from shared segments. If we knew for a fact that the shared segments in a Triangulation were all IBD, then it would be an easy call. Given three IBD shared segments on the same segment area, from widely separated Matches, Triangulation should always work. But notice the qualifiers in that statement: three, IBD, same segment area, widely separated and should. This would be a very “tight” Triangulated Group, and we still need to say “should” because DNA is random and does not necessarily follow a set of rules like geometry.

In Triangulation we start with the shared segments reported by the various companies (23andMe, FTDNA and GEDmatch). The fact that three widely separated Matches match each other on the same segment significantly increases the probability that the shared segments are IBD. So the IBD distribution curve based on cMs for a single shared segment is shifted somewhat for Triangulation. Of course the question is how much.

In my experience, there have been a very few shared segments in the 10-15cM range that did not Triangulate. There are also some shared segments in the 7-10cM range that do not triangulate – the percentage goes up as you drop down to 7cM. This appears to be roughly in line with the non-IBD rate we see for shared segments. I have used a 7cM threshold for Triangulation for the past two years. I have not found any discrepancy, yet. About the end of 2014, I added shared segments in the 5-7cM range to my spreadsheet. Most of them did not Triangulate and were thus classified as IBS/IBC – this was expected. Some of them could not be categorized as there was no way to compare them (most comparisons at this level need to be made at GEDmatch). Some did Triangulate, and, so far, they have all “fit” in the TGs. A few of these have been very helpful.

I am comfortable saying Triangulation almost always works down to 5cM. The caveats include widely separated Matches, and an overlap of 5cM (estimated) for all segments. The TG “should” have one Common Ancestor. Eventually all TGs will be subdivided into even smaller TGs, and this will split the CAs between husband and wife – but that is another blog post, someday.

However, the question remains – what is the IBD distribution curve for TGs? At some point, as we reduce the cMs for shared segments in a TG, there will be IBC TGs. We still have the issue that algorithms can create IBC shared segments, so it’s reasonable to expect IBC TGs over a distribution range. There is no report I know of that addresses this distribution, yet. We may need to have completed chromosome maps for such an analysis.

11A Segmentology: Does Triangulation Always Work; Jim Bartlett 20150510

Benefits of Triangulation

Posted on May 9, 2015 by Jim Bartlett

Benefits of Triangulating Autosomal DNA Shared Segments

Grouping – Triangulation of segments shared with your Matches will put most of them into Triangulated Groups (TGs). If nothing else, this organizes your list of shared segments.

Common Ancestor (CA) – all the Matches in a TG will have the same CA*. The shared DNA segment was passed down from the CA to each Match (which is why the segments match!). Each Match pair in a TG will have a Most Recent Common Ancestor (MRCA) between them. Between different Match pairs in a TG there may be different MRCAs – but all the MRCAs descend from the same CA.

Team synergy – all the Matches in a TG are cousins to each other; and they all have the same goal: determine the CA. They are automatically a research team. Think of the synergy of such a Team, all focused on the same goal. Each sharing their own insights and info. All working together…

Multiple Common Ancestors – You may have more than one Common Ancestor with a Match (it’s not uncommon with Colonial American or Ashkenazi Jewish or other endogamous populations). TGs let you sort this out. TGs help you determine which one provided the shared segment (and is thus proved by matching DNA.) When several Matches in a TG can agree on the same CA, we have genealogy triangulation. This means several widely separated cousins have the same CA, so it’s highly probable that is the correct one; and all in the TG should also have the same one.

Multiple Shared Segments – You may have some Matches with multiple shared segments with you. Usually, these shared segments will be from the same CA, but not necessarily. You could be related to a Match different ways, on different segments. Again, the Matches in a TG need to agree upon the correct CA, and in this way you can determine the few Matches that are related differently on multiple shared segments.

Eliminate false segments – the shared segments that come from a CA are called IBD (Identical By Descent). Not all of the segments identified by a company as “matching” or “shared” segments are IBD; some are false positive matches. They were made up by the computer algorithm that searches for matching segments; and they typically include pieces from both of your parents which are not from one Ancestor. By forming TGs, there will be a few segments that overlap, but don’t match either of the two TGs for that chromosome area. These are false positives – often called Identical by Chance (IBC) or Identical by State (IBS). So TGs will identify and eliminate these segments – they are not in TGs.

Form “Pointers” – When a TG includes a known close cousin, the cousin provides a “Pointer”. You and the cousin have an MRCA – grandparents, great grandparents or more distant grandparents. Other Matches in the TG are usually more distant, but the CA for all of you in the TG has to be ancestral to the MRCA that you and your close cousin share. So that MRCA forms a Pointer to where the CA has to be.

Brick walls – as an example, let’s say you have 3 Matches in a TG: a first cousin, a fourth cousin and a distant Match. You know the MRCA with the first cousin, and quickly extend that back to the MRCA with the fourth cousin. You have a “Pointer”, but you have a Brick Wall beyond that. Working as a Team, your fourth cousin and the distant Match determine an MRCA that is one or two generations beyond (ancestral to) the MRCA you have with your fourth cousin. Very probably this MRCA (previously unknown to you), will be the parent or grandparent of the above MRCA with your fourth cousin – and you have worked through a Brick Wall.

Adoptee involvement – TGs are mechanically formed. No genealogy required. Adoptees can do this with several advantages. Their TGs validate your TGs (everyone in a TG should share most of the Matches). Their TGs may add additional Matches to the TG. The adoptee can analyze the Trees and genealogy data provided by the other Matches. An adoptee can serve as an arbiter, facilitator, even Team leader of a TG. The adoptee brings diversity and analysis to the TG Team. This is an excellent way for adoptees to “get involved”.

Mapping – TGs define each person’s chromosome map – their personal jigsaw puzzle. Each TG describes a puzzle piece. Once you’ve formed as many TGs as you can with your Match segments, you have a better picture of your map. With sufficient TGs, they will be heel-and-toe on each chromosome, even if you don’t know which side the TGs are on. Assigning TGs to a maternal or paternal side (a maternal or paternal chromosome) involves genealogy. But once TGs are assigned, they lock in the structure of your chromosome map.

Expanding TGs – As additional Matches are added to your Match List each week, they can usually be added to the appropriate TG fairly easily. Each new Match-segment will match (or be In Common With) several of the segments in one TG or the other in this chromosome area. If the new segment does not match either TG (usually with a segment size 7-10cM), then the segment can be categorized as IBC. Also the TG boundaries will firm up.

TG names [extra credit] – now that you have TGs you can give each one a unique name – I recommend starting with the chromosome number, 01-23, adding a letter, A-Z, to indicate about where on the chromosome it is, and then the letter M or P to indicate the Maternal or Paternal chromosome (or A or B if you don’t know, yet). Examples: 07GM or 18BP. This can be used as a filing system, as each one will tie to a particular ancestral line for you, and all the info you collect from Matches or research will apply to this TG. Experience indicates you’ll have 400 TGs or so.
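The naming scheme is simple enough to build by hand, but a small Python helper can keep the names consistent. This is a sketch only; the function name `tg_name` and its checks are hypothetical choices for this illustration.

```python
def tg_name(chromosome: int, position_letter: str, side: str) -> str:
    """Build a TG name such as 07GM: chromosome, position letter, side."""
    if not 1 <= chromosome <= 23:
        raise ValueError("chromosome must be 1-23 (use 23 for X)")
    position_letter = position_letter.upper()
    if len(position_letter) != 1 or not "A" <= position_letter <= "Z":
        raise ValueError("position_letter must be a single letter A-Z")
    side = side.upper()
    if side not in {"M", "P", "A", "B"}:
        # M/P for Maternal/Paternal; A/B when the side is not yet known.
        raise ValueError("side must be M, P, A or B")
    return f"{chromosome:02d}{position_letter}{side}"


print(tg_name(7, "G", "M"))
print(tg_name(18, "B", "P"))
```

Padding the chromosome number to two digits keeps the names sorting in chromosome order when they are used as file or folder names.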

Crossover points [advanced topic] – the TGs define the recombination crossover points unique to your chromosomes. Using the recombination knowledge we have (roughly 1 crossover per 100cM in each generation), we can check our chromosome maps against this knowledge. So if we already see 4 alternating large blocks from grandparents on Chromosome 1, with an unknown block in the middle – we can be pretty confident that that block is from the grandparent that leaves us with 4 blocks, rather than now having 6 alternating blocks.

Phasing not required – No phased data is needed to form Triangulated Groups.

Phasing equivalent – Phasing is separating the values of 700,000 SNPs a person (usually you) got from each parent. For instance, a person’s maternal phased atDNA would match his/her mother’s DNA 100%. When you form a Triangulated Group on your maternal side, you know that the TG segment has exactly the same string of SNPs as your mother’s DNA would have for the same location. You don’t know the values of those SNPs, but you don’t really need to know. A TG is basically a phased segment.

Multiple Ethnicities – TGs will often pretty clearly show different ethnic groups. This is particularly true if parents are from very different admixtures (African, Ashkenazi Jewish, Melungeon, Native American, TimBukTu, etc.). We might also expect ancestries from England or Scandinavia to group differently; or ancestries from Germanna, Pennsylvania Dutch, or other enclaves, to group together. Multiple Ethnicities may even aid in assigning TGs to their respective sides.

* “All the Matches in a TG will have the same CA…” Actually some TGs may span large segments (15cM or more). In these cases a close cousin is often involved. These TGs will actually subdivide into smaller TGs, which will have different CAs. Much more on this in later blog posts. In any case, the segments which tightly overlap each other will have a single CA. But be careful with spread out TGs – they may subdivide and have different CAs.

09 Segmentology: Benefits of Triangulation; Jim Bartlett May 2015

What is a segment?

Posted on May 7, 2015 by Jim Bartlett

A DNA segment is a block, chunk, piece, string of DNA on a chromosome. It is typically determined by a start location and an end location on a chromosome. A segment refers to all the DNA in between and including the start and end locations.

We use the term segment in at least two fundamentally different ways:

An ancestral segment is one which is passed down from an ancestor. Ancestral segments are passed to you from your parents, who got them from their ancestors. Each of your chromosomes is made up of ancestral segments – much more on this later.

A shared segment is one which both you and a match have. Both you and your match have segments which are identical from start to end. Also sometimes called an HIR (Half-Identical Region). Also sometimes referred to as a matching segment. Note: a shared segment is determined by a computer algorithm – it may or may not come from a common ancestor – much more on this later.

IBD: When a shared segment comes from a common ancestor, we say it is IBD (Identical By Descent). Both you and your match have these identical segments on a chromosome because these segments came from the same ancestor.

Note that IBD shared segments are based on ancestral segments. The ancestral segment that you and/or your match received from an ancestor, may be (and often is) larger than the shared segment. What you see in the DNA match lists, reports, tables, or chromosome browsers, is the overlapping portion of your ancestral segment and your match’s ancestral segment. This overlap is the identical part that is reported as a shared segment.
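The idea that a shared segment is the overlapping portion of two ancestral segments can be shown with a little arithmetic. The Python sketch below is an illustration only: the start and end locations are invented, both segments are assumed to be on the same chromosome and from the same ancestor, and the function name `overlap` is hypothetical.

```python
def overlap(segment_a, segment_b):
    """Return the (start, end) overlap of two (start_bp, end_bp) ranges.

    Both ranges are assumed to be on the same chromosome.
    Returns None when the ranges do not overlap at all.
    """
    start = max(segment_a[0], segment_b[0])
    end = min(segment_a[1], segment_b[1])
    if start > end:
        return None
    return (start, end)


# Hypothetical ancestral segments from the same ancestor.
your_ancestral_segment = (20_000_000, 46_300_000)
match_ancestral_segment = (23_500_000, 60_000_000)

print(overlap(your_ancestral_segment, match_ancestral_segment))
```

Here both ancestral segments are larger than the overlap, and only the overlapping part would be reported as the shared segment.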

IBC: Sometimes a “shared segment” does not come from an ancestor. The computer algorithm creates the apparent shared segment from parts of the DNA which are not all from one ancestor. This can happen in your “segment”, your match’s “segment”, or both. Thus, the fact that they appear to be identical to the algorithm is by chance. We refer to these segments as IBC (Identical By Chance) or IBS (Identical By State). In any case these shared segments do not exist on one chromosome for you and/or your match, and they are therefore not IBD – much more on this later. The “shared segment” or “matching” segment is therefore IBC.

a blog about segments and autosomal DNA

Posted on April 14, 2015 by Jim Bartlett

Launching Segment-ology, a blog about segments and autosomal DNA. This will be a series of articles about the various aspects of DNA segments and how they are used in genetic genealogy. My desire is to help genealogists get the most out of their atDNA results.