Fuzzy Data, Fuzzy Segments – No Worry

Posted on May 30, 2015 by Jim Bartlett

The data we get from our atDNA test is not as precise as it may appear to be. But don’t worry, it doesn’t make any difference…

Let’s start with the 3 ways we measure our DNA first (see also my blog post Measuring Segments).

Base Pairs: Our DNA has about 3.2 billion base pairs in each set of chromosomes. The Human Genome Project sequenced about 99 percent of our chromosomes in 2003. Since then, scientists have continued to refine the structure and arrangement of the base pairs. Most of our atDNA tests have been based on “Build 36”, but FTDNA shifted to “Build 37” a year or so ago. Because the atDNA tests only look at about 700,000 of those base pairs, the change from Build 36 to 37 makes little difference. But most of us “lost” a few Matches and “gained” a few Matches as a result. To my knowledge, 23andMe and AncestryDNA still use Build 36, as does GEDmatch. The differences are slight.

cMs: The cM is an empirical measurement. There are differences in the observed cMs for males and females over the same segment (same start and end location). An average is used because the companies just don’t know the male/female ancestry of each shared segment. Besides, an average is much easier to work with… Even the tables of averages differ by company. Here are the totals per the ISOGG/wiki (without Chr X) [1]:

23andMe 3,537 cMs

FTDNA 3,384 cMs

GEDmatch 3,587 cMs

Per the CRC Research Group [2] we have very different totals of the average for males and females in the 22 autosomes:

Male 2,809 cMs

Female 4,782 cMs

Average 3,795 cMs

The reason this last average is greater than the averages used by the three companies is that certain areas of certain chromosomes are blocked out from atDNA testing for genealogy (see the greyed area in your chromosome browser). So the companies only take the average of the areas covered (sampled) by the SNPs.

The important point to understand is that there is a wide variation between males and females. Using an average pretty much guarantees some inaccuracy. So definitely use cMs as a guideline, but don’t split hairs or make hard and fast “rules” with the cM values we get from the testing companies.
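To put numbers on that variation, here is a small arithmetic check using only the CRC totals quoted above. The variable names are mine and purely illustrative.

```python
# Autosomal map totals quoted above from the CRC Research Group [2].
male_cm = 2809
female_cm = 4782

# The sex-averaged total is the plain mean of the two maps.
sex_averaged_cm = (male_cm + female_cm) / 2   # 3795.5, listed above as 3,795

# How much longer the female map is than the male map, as a ratio.
female_to_male_ratio = female_cm / male_cm

print(sex_averaged_cm, round(female_to_male_ratio, 2))
```

On these totals the female map is roughly 1.7 times the length of the male map, so a single averaged cM value for a segment can sit well away from either sex-specific value.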

SNPs: Each of the atDNA companies uses a different number of SNPs per the ISOGG/wiki Comparison Chart [3]:

23andMe 577,382 SNPs

FTDNA 708,092 SNPs

AncestryDNA 682,549 SNPs

Note that 23andMe now uses a somewhat lower number of SNPs, and some of them are different from the other companies. But since the SNPs are basically a sampling technique over all of our DNA, we see little difference in the shared segments now reported by 23andMe.

So the start and end location of shared segments will tend to be different depending on where the terminal SNPs are. You might think with the largest number of SNPs, FTDNA can report a more accurate shared segment, but read on…

Shared segments are determined by a proprietary algorithm at each company. I don’t know them. And even if I did, I shouldn’t report them.

When determining 700,000 or so SNPs it’s hard to get all of them read correctly. There are invariably a few miscalls and no-calls. Each company’s algorithm has an instruction as to how to handle these: ignore one (or a few?) of them and report a longer shared segment; or let a miscall or no-call break up an otherwise long segment. Based on GEDmatch examples of kits from 2 companies for the same person, each company probably does it differently.
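To illustrate how that one choice changes the reported segment, here is a minimal sketch of half-identical matching with a switch for no-calls. It is not any company’s algorithm (those are proprietary). The genotype encoding ("AG", with "--" for a no-call), the function names and the toy data are all assumptions of mine.

```python
def half_identical(geno_a, geno_b):
    """True/False if the two genotypes share (or do not share) an allele; None if either is a no-call."""
    if "-" in geno_a or "-" in geno_b:
        return None
    return bool(set(geno_a) & set(geno_b))


def matching_runs(kit_a, kit_b, tolerate_no_calls=True):
    """Return (first_index, last_index) pairs for each unbroken run of matching SNPs."""
    runs = []
    run_start = None
    n = min(len(kit_a), len(kit_b))
    for i in range(n):
        status = half_identical(kit_a[i], kit_b[i])
        ok = status is True or (status is None and tolerate_no_calls)
        if ok and run_start is None:
            run_start = i                      # a new run begins here
        elif not ok and run_start is not None:
            runs.append((run_start, i - 1))    # the run ended at the previous SNP
            run_start = None
    if run_start is not None:
        runs.append((run_start, n - 1))        # close a run that reaches the last SNP
    return runs


# Toy data: six SNPs, with a no-call at index 2 and a true mismatch at index 4.
kit_a = ["AA", "AG", "--", "GG", "CC", "CT"]
kit_b = ["AG", "GG", "AT", "AG", "TT", "TT"]

lenient = matching_runs(kit_a, kit_b, tolerate_no_calls=True)
strict = matching_runs(kit_a, kit_b, tolerate_no_calls=False)
print(lenient, strict)
```

With the lenient setting the no-call is bridged and the first run covers SNPs 0 through 3; with the strict setting the same data splits into two shorter runs. Real kits have hundreds of thousands of SNPs, and a real algorithm would also apply cM and SNP-count thresholds, which this sketch leaves out.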

There is no “signpost” in our DNA to indicate where an ancestral segment starts or ends. Each algorithm looks for matches between two kits, and it may well run beyond the boundaries for a particular ancestor, and in these cases pick up random, but matching, pieces of DNA from other ancestors, which would not be IBD. This makes some shared segments look larger than they should be. A classic example is a parent-child-Match trio where the child-Match shared segment is a little larger than the parent-Match shared segment.

The algorithms may also include some shortcuts. Notice the large number of segments at FTDNA which end in “00”. It appears that part of their algorithm is based on looking at blocks of 100 SNPs at a time – in this case the size of a shared segment would be rounded down to only the blocks of 100 that match. The advantage FTDNA might have because of more SNPs in total is offset by using them in blocks.
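Here is a rough sketch of what that kind of block rounding would do to a segment. To be clear, this is my guess at the effect, not FTDNA’s actual code; the function name, the 0-based SNP indexing and the example numbers are all hypothetical.

```python
def trim_to_whole_blocks(first_snp, last_snp, block_size=100):
    """Keep only the whole blocks of SNPs that lie inside a matching run.

    first_snp and last_snp are inclusive, 0-based SNP indices along a chromosome.
    Returns (start, end_exclusive) in SNP indices, or None if no whole block fits.
    """
    # Round the start UP to the next block boundary.
    start = -(-first_snp // block_size) * block_size
    # Round the end DOWN to the previous block boundary.
    end_exclusive = ((last_snp + 1) // block_size) * block_size
    if end_exclusive <= start:
        return None
    return start, end_exclusive


# A run that truly matches from SNP 1234 through SNP 5678 (4,445 SNPs)...
trimmed = trim_to_whole_blocks(1234, 5678)
# ...is reported as SNPs 1300 up to (not including) 5600, i.e. 4,300 SNPs.
print(trimmed, trimmed[1] - trimmed[0])
```

Under this assumption every reported SNP count is a multiple of 100, which would explain the “00” endings, and up to 99 matching SNPs can be shaved off each end of the true run.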

Here is a quote from the 23andMe Family Inheritance: Advanced page: “… segments can be measured in centiMorgans or in base pairs for mapping onto the genome. 23andMe rounds the segment length to the nearest tenth of centiMorgan and segment start and end coordinates to the closest millionth base pair to reflect the uncertainty in the exact locations of the segment boundaries.” [bolding by me]

AncestryDNA states their algorithm culls out segments based on “population” phasing. They also eliminate some pile-up segments, although it’s not clear (to me anyway) what size range they consider for this culling process. This process eliminates some IBS segments, but it also eliminates some IBD segments.

All of the above factors, and probably more, result in the DNA data being a little different, depending on the company – we can say the data is a little fuzzy, and not quite as precise as it appears to be. With this knowledge, I’m pretty sure cM values are not accurate to two decimal places, and shared segments are not precise to a particular base pair.

Clearly this fuzzy data leads to fuzzy segments. The statement by 23andMe is a good one – round the segment start and end locations to the nearest million base pairs.

So let’s look at all of this fuzziness from the Big Picture of Chromosome Mapping with Triangulated Groups – is it significant?

Each TG is a collection of overlapping shared segments (which match each other). As far as defining a TG is concerned, the only two SNPs that count are the first one and the last one. These SNPs are often from one of the segments that overran a little – either at the start or the end location. So the TG may be a little larger than indicated – the ends of the shared segments are a little fuzzy, so the TG is a little fuzzy, too.
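In spreadsheet terms, the TG boundaries are just the smallest start and the largest end among the segments already confirmed to triangulate. A minimal sketch, assuming the segments have been triangulated beforehand; the Segment type, the Match names and the Mbp values are made up for illustration.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    match_name: str
    start_mbp: float
    end_mbp: float


def tg_bounds(segments):
    """Return (start, end) of a TG: the earliest start and latest end of its segments."""
    if not segments:
        raise ValueError("a TG needs at least one segment")
    tg_start = min(s.start_mbp for s in segments)
    tg_end = max(s.end_mbp for s in segments)
    return tg_start, tg_end


# Three hypothetical Matches whose segments overlap and match each other.
tg = [
    Segment("Match A", 10.2, 25.1),
    Segment("Match B", 12.0, 31.4),
    Segment("Match C", 9.8, 22.7),
]
bounds = tg_bounds(tg)
print(bounds)
```

Here the TG runs from 9.8 to 31.4 Mbp, and both ends come from a single Match each (C at the start, B at the end). If either of those segments overran a little, the TG inherits that fuzziness.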

When Chromosome Mapping is complete, you should have a bunch of TGs that cover each chromosome, from one end to the other (see Segments: Bottom-Up). I often use chromosome 5 in examples because it is about 200cM, and averages about 2 crossover points per generation. In my Chromosome Map, so far, I have 9 paternal TGs and 11 maternal TGs that cover my two chromosome 5s. From the detailed data, the tips of these TGs may overlap a little. By a little I mean maybe up to 1 or even 2 Mbp. Don’t let it worry you. In the Big Picture, if you have 10 TGs with various Common Ancestors on chromosome 5, you’re doing great! You’ve won! The fact that you don’t know precisely where each shared segment or TG starts or ends, pales when you run Kitty’s Chromosome Mapper [4] and see your ancestors mapped across chromosome 5. The little fuzziness is lost in the Big Picture. When you finish a jigsaw puzzle and step back to admire the full picture, you don’t even notice the outlines of each puzzle piece. For genealogy, you have achieved your objective – notwithstanding the fuzziness.

Another clue that fuzzy data and fuzzy segments are not an issue, is that my Matches who have tested at multiple companies, still share virtually the same segments with me. The shared segments are almost never identical in start/end locations, cMs or SNP counts. However, they always show up within a few rows of each other in my sorted spreadsheet. And they always wind up in the same TG!

So if you want to make life (TGs and Mapping) easier, use Mbp for segments. And since they are just guidelines, round off cMs to the nearest whole number. It won’t hurt anything. For me, overlapping segments and cM thresholds are just guidelines to group segments, and then form TGs (see How To Triangulate).
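If your segment data is in base pairs and decimal cMs, the rounding is a one-liner per column. A small sketch with made-up values; the function name is mine.

```python
def round_segment(start_bp, end_bp, cm):
    """Round start/end to the nearest million base pairs (Mbp) and cM to a whole number."""
    start_mbp = round(start_bp / 1_000_000)
    end_mbp = round(end_bp / 1_000_000)
    return start_mbp, end_mbp, round(cm)


# A hypothetical segment as a company might report it.
rounded = round_segment(12_345_678, 23_456_789, 11.37)
print(rounded)
```

This segment becomes 12 to 23 Mbp at 11 cM. Python’s round() sends exact halves to the nearest even number, but at this level of fuzziness that detail does not matter.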

And if, someday, you decide to tackle genes, and you need to know exactly which ancestor gave you the gene that appears to straddle two TGs (meaning two ancestors), you can reexamine that junction more carefully. You can always refine the crossover points later, if you want. But for now, spend your energies in forming the TGs and determining Common Ancestors for them! I’m curious about which ancestors gave me which genes, but I doubt I’ll ever go back once my chromosomes are properly mapped to distant ancestors – I’ll be pooped and ready to try something else…

Reference links:

[1] http://www.isogg.org/wiki/CentiMorgan

[2] http://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC52322&blobtype=pdf

[3] http://www.isogg.org/wiki/Autosomal_DNA_testing_comparison_chart

[4] http://kittymunson.com/dna/ChromosomeMapper.php

03A Segmentology: Fuzzy Data, Fuzzy Segments – No Worry by Jim Bartlett 20150529

19 thoughts on “Fuzzy Data, Fuzzy Segments – No Worry”

Peter Dukes on August 20, 2015 at 5:16 am said:

Thank you for this helpful information, Jim! My spreadsheet displays Mbp with one decimal point, and — although I’m in early stages of finding TGs — I have a general sense that the uncertainty in locations can be around 1 or 2 Mbp. Your post helps me confirm this.

I am curious about the converse question, though. Suppose several matches (let’s say on FTDNA) are reported to begin (or end) at the exact same location (to the nearest 100 bp). Is this meaningful? Is this location a good candidate for a crossover? And are these matches any more likely to be from the same chromosome (mom/dad)?

More generally, I wonder if key crossover points might be guessed as the locations (albeit fuzzy) which minimize the number of straddling matching segments? I ran a “straddling segments” count as a script on my spreadsheet and the results appeared to be interesting. That is, for each 1 Mbp interval on each chromosome, it reports the number of segments which contain it. Is this useful?

- jim4bartletts on August 20, 2015 at 1:41 pm said:
  
  Peter – Thanks for the positive feedback. My spreadsheet is also in Mbp to one decimal point – I’ve even thought about dropping the decimal. When you think about 300-400 segments (TGs) spread over 45-46 chromosomes (each from an Ancestor), the big picture, with these broad strokes, is really all you need for genealogy.
  
  The crossover points are random, and different for each person. Yes, if you have several Match-segments that start or end at the same location, that is usually caused by a crossover point on your chromosomes. Think about it… The segment you got from an ancestor has a fixed location on your chromosome. All of your Matches (cousins from that ancestor) probably got a different, but overlapping, version of that segment. So what you “see” in a shared segment is only the overlap part. So you would not “see” any of your Matches’ segments earlier than your start or later than your end locations. So yes, the shared segments will tend to define your segment. In fact, that’s one of the benefits of Triangulated Groups – each TG really is the definition of the segment you got from the ancestor (as far as your Matches’ segments are spread out enough to allow you to define that segment). One interesting way to demonstrate this overlapping picture is to look at how the overlapping Matches in a TG compare to each other. You can do this at GEDmatch and 23andMe – just do a 1-to-1 at GEDmatch or compare them in FI:A at 23andMe. You’ll often see that they overlap beyond the boundaries of your ancestral segment. In fact, using the data from all the TG Match comparisons, you can determine a segment that your ancestor had to have had. He or she started with that larger segment, which got smaller as it descended to each of you.
  
  And yes, exact shared segment starts and ends may indicate the start/end of your ancestral segment – which segment would be on one chromosome. This works because of the randomness of crossovers. So it IS unusual to have the same start/end location on both chromosomes, but it is not a requirement or a hard “rule”. Certainly, you could make an initial grouping based on identical start/end points, and then confirm with Triangulation. It’s a judgment call. But, I do have areas in my spreadsheet where similar start locations alternate between paternal and maternal TGs. After you get a skeleton of TGs worked out (a partial chromosome map), you begin to see where the crossover points fall on the two sides, and new shared segments start to fall into place more easily.
  
  I, too, have pondered a “straddling” concept. Again, as the chromosome map develops, this straddling becomes clearer, and which side the segment is on is obvious. So, yes, this works most of the time. But Triangulation is the test. You have two things to watch out for (at least): (1) IBS – some segments below 10cM are IBS, they shouldn’t be on either chromosome or TG, regardless of the “fit”; (2) Close cousins – they will share larger segments with you which overlap two TGs. Think of it this way: two ancestors pass down small segments, one to a husband and one to a wife, who then pass these in separate chromosomes to their child, who then recombines them into a chromosome that he/she passed down. Matches to (cousins of) the distant ancestors could only get the small segment from that ancestor; whereas a Match to (cousin of) the parent who passed down the DNA from both ancestors could see them both in one larger segment. To grasp this concept, think about the very large segment your father passed down to you – called chromosome 1 – which segment spans all smaller segments on Chr 1. Likewise the large segments from your grandparents or great grandparents are actually made up of smaller segments from more distant ancestors. The line is drawn when our segments get to the point that further subdivision results in segments below threshold.
  
  - Peter Dukes on August 20, 2015 at 10:32 pm said:
    
    Thank you so much, Jim. Your posts and comments have a way of burying down to the exact details I need help understanding. The last paragraph was especially helpful.
    
- Ann Turner on August 20, 2015 at 2:03 pm said:
  
  Peter, one addendum to Jim’s response: FTDNA “rounds” results to pre-determined blocks of 100 SNPs, so the segment boundaries aren’t as exact as they seem on the surface.
  
Ann Turner on May 31, 2015 at 11:44 pm said:

Just a couple of minor corrections, which don’t affect your main conclusions at all. Both 23andMe and AncestryDNA use Build 37. FTDNA uses Build 37 for calculating matches but displays segment boundaries with Build 36 numbers. GEDmatch converts everything to Build 36.

No-calls don’t break up segments. 23andMe and FTDNA (and probably Ancestry) treat them as if they would match when they occur in the middle of a long continuous run of half-identical genotypes. GEDmatch doesn’t give credit for the SNP threshold, though.

Miscalls don’t necessarily break up segments, either. 23andMe and FTDNA (and probably Ancestry) have a built-in allowance for an occasional miscall due to genotyping error. Microdeletions can cause apparent miscalls (because the missing parental allele gets recorded as a homozygous genotype), which do break up segments.

The CRC data is very old, predating the completion of the human genome sequence. I would discount their numbers.

- jim4bartletts on June 1, 2015 at 3:12 am said:
  
  Ann – thanks for your input. I’ll try to update my post. As you note, the Big Picture is that the data is fuzzy and the segments are fuzzy, but for genealogy, it’s OK to use the data and segments that are reported.
  
jim4bartletts on May 30, 2015 at 11:44 pm said:

Bev
Great – sounds like you’ve got the bug.

Bev Lang on May 30, 2015 at 9:49 pm said:

Thanks for that post Jim, it was very clear. Even more excited that I formed 9 TGs from my MIL’s chromosome 5 results. Even if three of them only include one person, they are clearly separate! I can also see the fuzziness of the start & end location values, and where some of the ‘missing’ matches might fit 🙂 Thanks again, Bev

jim4bartletts on May 30, 2015 at 6:14 pm said:

Yes. The DNA is a tool. An accurate and pretty versatile tool. But the genealogist still has lots of work to do.

caith on May 30, 2015 at 4:15 pm said:

Jim, thank you for this information. This complements what I already knew about the randomness of DNA, and how it is further compounded by the methods of testing, chips, algorithms…

Well, yes, DNA does not lie, BUT it is really random and incumbent upon us individually to sift and sort, and consider the variables to find the truth/s.

- jim4bartletts on May 31, 2015 at 6:05 pm said:
  
  Caith – I have all of my Match/segments in a spreadsheet. I work with each one separately, and keep track of my correspondence with the Matches. Each TG takes on its own “personality”. Although tedious, I like working with data (rather than using a third party tool), because it gives me a sense of how the DNA works.
  