Pile-ups

What’s all the buzz about “pile-ups”?  In my mind there are three kinds of pile-ups: small, medium and large. They are different, so it’s important to understand each one. In this case Goldilocks should prefer the large pile-ups, but let me go through my views of all three kinds.

Alert: This post contains my opinions about small pile-ups and AncestryDNA (based on my own experience) so you should make your own judgments.

Background

I think the two keys to success with autosomal DNA lie in a robust Tree (as many ancestors out to 13 generations as possible) and as many Match-segments as possible (including as many close relatives as you can get). I spent about a year expanding my Tree as best as I could, and then posted that GEDCOM in several places. I’ve tested at all three companies and use GEDmatch.  I put every single shared segment I can find over 7cM into my spreadsheet, and I periodically run a Quality Control check against a fresh download to pick up any missed Matches or segments. I currently have 5,000 different individuals with segment data in my spreadsheet, and have determined a Common Ancestor (CA) with 309 of them.

I have compared virtually every segment against other overlapping segments, and formed Triangulated Groups (TGs) that cover over 90% of my 45 chromosomes. It is now rare for me to get a new shared segment that changes my chromosome map in any way. This process has provided some insights on medium and large pile-ups.

Pile-ups

My definition of pile-up sizes:

  1. Small is smaller than 5cM
  2. Medium is 5-10cM
  3. Large is greater than 10cM
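
As a minimal sketch, the three size bands above can be written as a small Python helper. The function name `pile_up_size` is hypothetical, and treating exactly 5cM and exactly 10cM as "medium" is an assumption taken from the "5-10cM" wording.

```python
def pile_up_size(cm: float) -> str:
    """Classify a shared segment length (in cM) into the size bands above."""
    if cm < 5:
        return "small"    # smaller than 5cM
    if cm <= 10:
        return "medium"   # 5-10cM (boundaries assumed inclusive)
    return "large"        # greater than 10cM
```

For example, a 7.8cM segment falls in the medium band.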

Small pile-ups – by my definition, these pile-ups are composed almost entirely of IBS shared segments. When AncestryDNA first rolled out their autosomal DNA test, their threshold was 5Mbp. This threshold included many shared segments well below 5cM, and resulted in many thousands of bogus Matches. To their credit, they provided a caution about these. When AncestryDNA revised their threshold to 5cM, many of these Matches went away. Part of their explanation was the elimination of “pile-ups”.  I agree that these “small pile-ups” should be eliminated, and resetting the threshold to 5cM should have taken care of this problem. However, their explanations continue to stress the elimination of “pile-ups”. I just hope they don’t also toss out Matches in larger pile-ups – throwing the baby out with the bath water.

Medium pile-ups – 5-10cM range. As I gathered as many segments over 5cM as I could and sorted them in my spreadsheet, I noticed a few areas that had many such segments, all in a very narrow chromosome area. Very clearly a pile-up! Virtually none of them matched each other, although they had almost the same segment start/end locations. And there were a lot of them – many more than in large TGs.

In discussions on various email lists, we compared notes, and found that most of these areas were unique to our own experience. In general they were not due to some common feature of most human genomes. A notable exception to this blanket statement is the HLA Region on Chromosome 6 – roughly from 29.8 to 33.1Mbp.

However, most of the other areas were not tied to known issues like the HLA Region. In my analysis, it was not possible for me to link these to one parental side or the other. The fact that these areas include so many IBC segments indicates to me that it’s the combination of both of my chromosomes (maternal and paternal) that allows the “matches”. It’s the unique combination of alleles in these small stretches of DNA that makes matching much easier. And this unique combination is only in my genome. On chromosome 18, I have 307 segments in the 7 to 11 cM range. They are all in a very tight area:  from location 5,800,000 to 8,700,000bp.  Very few of them triangulate.

Sometimes the pile-up area has been documented. On chromosome 15, I have 281 segments in the 7 to 10cM range. They are at: 24,000,000 to 28,000,000 bp. This area partly overlaps a known pile-up area (20,100,000 to 25,200,000). But the known pile-up area is only partly the cause in my case. See 14 small pile-up areas found by Li et al (2014), listed at the ISOGG Wiki: http://www.isogg.org/wiki/Identical_by_descent

These medium pile-up areas, and a few others in my experience, are characterized by a very tall pile-up of many segments about the same size in a narrow area just a little larger than the segments. The Li et al (2014) article refers to “regions where excess IBD is detected…” Virtually all of the segments I have noted above are IBS/IBC – they do NOT triangulate with the other segments.

A few segments in these regions do triangulate with known close relatives, and each other. I’ve kept those segments in maternal and paternal TGs, as appropriate, covering that area. After all, both my mother and father gave me those areas, and they in turn got them from their parents, etc.  It is very probable that these few triangulating segments are IBD and come from a CA.

My experience is that these are areas with a lot of shared segments in the 7-10cM range that are in a tight area, usually just 10cM wide, and a very high proportion of these segments are IBS/IBC.  A few segments in these areas will be IBD, but they will tend to be larger than the 7-10cM segments.
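
One way to spot such areas in a spreadsheet export is to count the 7-10cM segments that fall in each stretch of a chromosome. The sketch below is an illustration only: the tuple layout, the 5 Mbp window, and the cutoff of 100 segments are all assumptions, not values from my spreadsheet, and it bins by base pairs rather than cM to avoid needing a genetic map.

```python
from collections import Counter

def flag_pile_ups(segments, window_bp=5_000_000,
                  min_cm=7.0, max_cm=10.0, min_count=100):
    """Return {(chromosome, window_start_bp): count} for crowded windows.

    segments: iterable of (chromosome, start_bp, end_bp, cm) tuples.
    Each segment in the cM range is counted once, in the fixed-width
    window that contains its midpoint.
    """
    counts = Counter()
    for chrom, start_bp, end_bp, cm in segments:
        if min_cm <= cm <= max_cm:
            midpoint = (start_bp + end_bp) // 2
            window_start = (midpoint // window_bp) * window_bp
            counts[(chrom, window_start)] += 1
    # Keep only windows that hold at least min_count segments.
    return {key: n for key, n in counts.items() if n >= min_count}
```

Windows flagged this way are candidates to skip over; any larger segments in the same area can still be triangulated separately.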

My bottom line for these pile-ups: Unless you have a lot of free time, skip over these areas – particularly the shared segments under 10cM. Concentrate on triangulating any larger segments in these areas and then move on to other areas.

Large pile-ups – these are my favorites. Larger shared segments (over 10cM) that spread out and overlap each other over wider areas.  These segments tend to triangulate with each other, forming TGs on both sides.  I have some of these TGs which include over 50 shared segments.  Since the shared segments triangulate with each other, this is a good pile-up. These TGs are large because more people have these shared segments – probably because the Common Ancestors had large families in Colonial America, leaving us with many, many cousins. Another reason could be a more distant Common Ancestor, who would also leave us a large number of cousins.

In some cases we can use this observation to our advantage. I have a 2nd cousin, on his mother’s side, who is also an 8th cousin, on his father’s side. Our close Common Ancestor was an immigrant to the US in the mid-1800s, and I get relatively few Matches on the segments I share with him. However, on one segment, we have many Matches – it turns out our Common Ancestor is on his father’s side. The tip-off should have been the size of the TG (measured by the number of Matches).

Another observation about large pile-ups…. They will get larger. The number of folks taking an atDNA test is about doubling every 12 months. A consequence of this is that all of our TGs will also double in the next 12 months. So, if you have pile-ups now, they will about double by this time next year. Use these larger TGs to your advantage – work with the Matches to investigate place/time matches, if a Common Ancestor is not easily determined.

Summary

  1. In general, don’t work with shared segments below 5cM. Most are IBS/IBC – even if they appear to triangulate. We don’t have a good test below 5cM to indicate IBD.
  2. Watch for, and avoid, pile-ups in the 5-10cM range. These are characterized by many shared segments in the 5-10cM range in a very tight location, usually only 10 or 11cM wide. Move on to larger shared segments in other locations.
  3. Embrace the large pile-ups. They may come from Common Ancestors with large families and/or more distant Common Ancestors. In either case, work with the Matches in these TGs as a Team to determine the Common Ancestor.

18 Segment-ology: Pile-ups by Jim Bartlett 20151007

Anatomy of an IBS segment

This is a guest blog post by Dr. Ann Turner, who has been a great mentor for me.

 Ann Turner

DNACousins@gmail.com

October 1, 2015

Jim Bartlett, my host for this blog post, shares a 7.8 cM segment at 23andMe with my nephew Larry. This was a serendipitous find, for Jim broke down a brick wall for me with records from an orphan’s court. In turn, I provided a solution to a minor mystery for Jim – where did John Henry go when he disappeared from Frederick County, Virginia?

That discovery was back in 2011, before we had developed much in the way of techniques to analyze segment data. There was one troubling aspect:  Jim did not match my sister (or her husband, either). This could be explained away if there was a false negative in my sister. Fast forward to 2015. Jim’s intensive work on triangulated segments has filled in the section containing Larry’s segment with more cousins. Larry did not match anyone on either one of Jim’s chromosomes.

Is it possible that this match was not Identical by Descent (IBD), but just Identical by State (IBS)?

A Terminology Detour

The terms “Identical by Descent” and “Identical by State” predate their application to segmentology, Jim’s felicitous term for analyzing autosomal DNA. The glossary in Human Evolutionary Genetics[1] contrasts the two phrases:

Identity by Descent: Property of alleles in an individual or in two people that are identical because they were inherited from a common ancestor; as opposed to identity by state

Identity by State: Property of alleles in an individual or in two people that are identical because of coincidental mutational processes, and not because they were inherited from a common ancestor (identity by descent)

In effect, “identical” is the more general word, and the phrase describes two mutually exclusive ways of achieving identity – BY state or BY descent.

Also, the definition is about alleles, alternative versions of a single marker. There are examples in genetic genealogy when we look at the type of DNA that follows one line, the straight paternal line or the straight maternal line.

For the Y chromosome, the ancestral haplotype may sometimes be deduced from multiple lines of descent. The question then becomes whether a variation on the theme marks a specific line: does the fact that two individuals both share a one-step difference from the ancestral haplotype on DYS19 mean that they have identified a branch tag to a more recent common ancestor (the mutation is identical by descent), or did the mutation occur independently in two different lines of descent (the mutation is identical by state)? The mutation rate is high enough that either explanation could hold true.

For mtDNA, there are certain hotspots where a mutation is not a reliable indicator for defining haplogroups or even genealogical relationships. A mutation 16519C has occurred independently hundreds of times in different haplogroup subclades, and insertions at 309.1C (and 309.2C) are frequent enough that even siblings are known to differ.

Adapting the two terms IBS and IBD for segmentology stretches the original context to include regions of the genome, not just single markers. Furthermore, the mutation rate for autosomal DNA is orders of magnitude lower than Y-STRs or mtDNA. Differences in two autosomal markers are not likely to be due to a recent mutation.

With this shift to testing multiple autosomal markers, some authors began to employ the phrase Identical by State as the broader concept. Then some, but not all, IBS regions would also be Identical by Descent. That leaves a vacuum – what should we call regions that are IBS but not IBD? Charles Brenner created his own term, which is not particularly evocative but illustrates the frustrating dilemma:

“Identical by state” (IBS) as used here is synonymous with “identical”, an umbrella meaning in that IBS  thus includes IBD as a subset. Adopting the umbrella definition for IBS means some other term may be needed to mean IBS but not IBD and for this purpose I use the word “strict.”[2]

Indeed, it appears that many technical articles avoid the term IBS entirely. A search of Google Scholar  yields 17,100 citations for Identical (or Identity)  by Descent but only 4,700 citations for Identical (or Identity) by State. Scanning a small sample of those articles reveals that they often describe a segment as IBD or “not IBD”, period.

My personal preference is to hew to the original concept, where identity is the broader, more general term. It avoids the awkward need for a special term to describe IBS but not IBD. Plus in the future, when we can do whole genome sequencing, reserving IBS for accidental identity due to a parallel mutation may become more relevant. In spite of the low mutation rate, the vast number of loci and (perhaps) the large number of tested people will result in a certain number of recurrent mutations. We are already seeing this with more comprehensive sequencing of the Y chromosome.

I have no objections to those who prefer IBS for the more general term, but for the purposes of this blog post, I mean Identical “just/merely/only” by State. For further clarity, we need to emphasize that we are speaking of HALF identity, where at least one of the two alleles in one party’s genotype matches at least one allele in the other party. Leon Kull coined the acronym HIR for Half-Identical Region. That obviously leaves a lot of wiggle room, as shown in the next section.
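
The half-identity test for a single SNP can be sketched in a few lines of Python. This is an illustration of the definition only, assuming each genotype is a two-letter string of A/C/G/T; no-calls and company-specific matching rules are not handled, and the function name is hypothetical.

```python
def half_identical(genotype_a: str, genotype_b: str) -> bool:
    """True if the two unphased genotypes share at least one allele."""
    return bool(set(genotype_a) & set(genotype_b))
```

Under this test CT against TT is a match, while the opposite homozygotes CC and GG are not.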

Dissecting the Segment

Jim graciously shared his raw data with me so I could use Excel to view each and every one of the 850 SNPs in the segment. (See Supplemental data file.) The segment boundaries are opposite homozygotes (e.g. CC and GG) – they do not match at all. Figure 1 shows some of the column headers in the spreadsheet with a few sample rows of data.

Columns A, B, and C give the chromosome number, chromosome position, and SNP ID as found in the raw data download. They are redacted here for privacy reasons, but the column labels are preserved for those who would like to use the spreadsheet as a template for their own analyses.

Column D is for Jim’s genotype data. If Jim is homozygous for a marker (e.g. CC), then he obviously received a C from his father and a C from his mother. If Jim is heterozygous (e.g. CT), the alleles are always listed in an arbitrary order (often alphabetical). The C allele could have come from his mother and the T allele from his father, or vice versa. Columns E, F, and G are genotype data for Larry, his mother, and his father.

I also phased Larry’s data so I could tell which allele came from which parent, using David Pike’s utility Phase a Child when given data for child and both parents for the calculations. In a separate step (not shown here), I reformatted the results and loaded them into the spreadsheet so the rows aligned with Jim’s results. Column H has the maternal allele (from my sister) and Column I has the paternal allele. The results could not be phased in cases where all three parties were heterozygous, so the unphased genotype is retained for those SNPs. A heterozygous result is a universal match – no matter what Jim’s genotype is, at least one of Larry’s alleles will match at least one of Jim’s alleles, because each SNP has only two possible versions.[3] The full spreadsheet can be seen at this link.
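
The logic of trio phasing for one SNP can be sketched as follows. This is not David Pike’s utility, only a minimal illustration under the assumption that each genotype is a two-letter string; the function name `phase_child` is hypothetical. It tries both possible assignments of the child’s two alleles and keeps the result only when exactly one assignment is consistent with the parents.

```python
def phase_child(child: str, mother: str, father: str):
    """Return (maternal_allele, paternal_allele), or None if not phaseable.

    None covers both the ambiguous case (child and both parents
    heterozygous) and genotypes that are inconsistent with the parents.
    """
    a, b = child
    consistent = {
        (maternal, paternal)
        for maternal, paternal in ((a, b), (b, a))
        if maternal in mother and paternal in father
    }
    return consistent.pop() if len(consistent) == 1 else None
```

For a child CT with mother CC and father CT, the only consistent assignment is a maternal C and a paternal T; when all three are CT, both assignments fit and the SNP stays unphased.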

Figure 1

Columns J and K use Excel formulas to show whether Jim matches the maternal allele and/or the paternal allele (coded with a “1”). Conditional formatting shows pink for a maternal match and blue for a paternal match. It’s readily apparent that a mismatch in the maternal side is filled in by a match in the paternal side, and vice versa. Figure 2 shows this pink and blue pattern horizontally (similar to a chromosome browser) for a somewhat longer stretch of 31 SNPs.

Figure 2

The remaining columns in the spreadsheet (L through T) contain calculations used to generate some summary statistics:

1) The apparent long run of 850 half-identical SNPs is broken up into 61 shorter runs on the maternal side and 31 shorter runs on the paternal side. It is entirely possible that these runs would be fragmented even further if Jim also had phased data.

2) Jim and Larry are both homozygous for the same allele for 368 of the SNPs. If Jim inherited the same allele from his father AND his mother, and ditto for Larry, it seems likely that the allele is rather common in the general population. That makes for easy pickings.

3) Jim is heterozygous for 310 SNPs and Larry for 311 SNPs, about 36%. There are 482 SNPs where at least one party is heterozygous (57%). These are universal matches.
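
The match flags and run counts behind these statistics could be reproduced outside Excel along the following lines. This is a sketch, not the formulas in the supplemental spreadsheet; the function names are hypothetical, and the counts at the bottom are simply the figures quoted above.

```python
from itertools import groupby

def match_flags(genotypes, phased_alleles):
    """1 where the unphased genotype contains the phased allele, else 0."""
    return [int(allele in genotype)
            for genotype, allele in zip(genotypes, phased_alleles)]

def count_runs(flags):
    """Number of maximal runs of consecutive 1s in a list of 0/1 flags."""
    return sum(1 for value, _group in groupby(flags) if value == 1)

def percent(count, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)

# Counts quoted in the text for the 850-SNP segment.
TOTAL_SNPS = 850
BOTH_HOMOZYGOUS_SAME_ALLELE = 368
AT_LEAST_ONE_HETEROZYGOUS = 482
# The two categories together account for every SNP in the segment.
assert BOTH_HOMOZYGOUS_SAME_ALLELE + AT_LEAST_ONE_HETEROZYGOUS == TOTAL_SNPS
```

Running `count_runs` on the maternal and paternal flag columns separately is what breaks one long half-identical run into many shorter phased runs.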

Most segments of this length will actually be IBD.[4] This example is somewhat exceptional, deliberately chosen to dramatize the possible pitfalls and serve as a warning about smaller segments. One explanation may be that the 36% level of heterozygosity happened to be particularly high for this one region. The overall average for Jim and Larry was 28.2% and 30.7% respectively.

Red flags were waving for this segment: the lack of triangulation and the lack of a match with either of Larry’s parents. Is the converse true? Can triangulation or a match in a parent prove IBD? No, many counter-examples can be found, especially at shorter segment lengths.[5]

Phasing is our most pressing need, yet it is not always available.[6] Any alternative methodology for claiming that certain short HIRs are IBD must be able to demonstrate that the segment survives in test cases where the phase is known.

One more moral of the story: a genealogical connection can be made without DNA!

[1] M.A. Jobling et al, Human Evolutionary Genetics: Origins, Peoples & Disease, Garland Science, 2004.

[2] Brenner CH. Understanding Y haplotype matching probability. Forensic Sci Int Genet. 2014 Jan;8(1):233-43. http://dna-view.com/downloads/documents/Understanding%20Y%20haplotype%20matching%20probability.pdf

[3] It is possible to have three or four alleles (A/C/G/T) for a SNP, but these are rare and SNP chips tend to avoid them.

[4] According to 23andMe’s simulations “IBD segment lengths [i.e. HIRs] greater than 7 cM were observed 90% of the time in at least one parent. Preliminary data suggest that 7 cM segments shared between a distant cousin and child that were not observed in the parents were due to false negatives in the parents.” Henn BM et al, “Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples.” PLoS One. 2012;7(4):e34267. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3317976/

[5] See my blog post http://www.thegeneticgenealogist.com/2015/03/30/guest-post-what-a-difference-a-phase-makes/ for details on how an experimental phased data file eliminated a large number of small segments reported by Family Tree DNA.

[6] AncestryDNA phases data for its internal calculations, but the raw data download shows genotypes with the alleles in an arbitrary order.

Small Segments and Triangulation

How small can we go with triangulation?

We have anecdotal information to indicate “almost all” shared segments above 15cM are Identical By Descent (IBD). There is always a tail on the distribution curve of random events, so we cannot say 100%.

From my experience (mapping over 90% of my chromosomes) I am confident that triangulation can tighten the distribution curve so that “almost all” segments down to 7cM in a Triangulated Group (TG) are IBD. I say this because I find some 7-10cM shared segments which do not triangulate with the TG on either the maternal or paternal sides. Although several segments in each TG triangulate with each other, some shared segments, with the same “address”, do not. To me this is proof positive that these shared segments which do not triangulate must be Identical by State (IBS), meaning, in this case, not-IBD. And the number of such 7-10cM shared segments which don’t triangulate, and are thus IBS, seems to generally agree with the percent IBS in the ISOGG/Wiki: http://www.isogg.org/wiki/Identical_by_descent

However, the fact that the segments in a TG do triangulate does not, in my mind, provide a 100% guarantee that they are all IBD. The same is true for a random shared segment in the 10-15cM range – most, but not all, are IBD. But in the aggregate, when we have say 20 shared segments in a TG, usually of various cMs, this pretty much defines that area of the chromosome as coming from an ancestor. If 1 or 2 of those triangulated shared segments turn out to be IBS, it’s not harmful in the grand scheme – we are looking for a Common Ancestor (CA) for the TG, and generally find only a few Matches in the TG who have a robust enough Ancestral Tree to help with this goal. We are looking for several such Matches to confirm the same CA. Having a close cousin in the TG increases our confidence in the CA. As our Match list doubles over the next 12 months, so too should the number of Matches in each TG, adding to the preponderance of evidence for both the TG and CA. The key is that several distant cousins all agree on the same CA for the TG – this, too, adds to our confidence level.

My chromosome mapping has resulted in about 350 defined TGs which are adjacent to each other (“heel and toe”) – covering long stretches of each of my 45 chromosomes, with only a few bare spots over 10cM. All new Matches have shared segments which easily “fit” into, and triangulate with, existing TGs – except a small percentage in the 7-10cM range which don’t and are then labeled IBS. This has also added to my confidence that triangulation, down to 7cM shared segments, is a good process. The outline of my chromosome map is coming into sharper focus, with fairly well defined crossover points, and ambiguities are fading away.
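
The triangulation test itself can be sketched as an interval check: two Matches triangulate with me when their segments overlap on my chromosome and they also match each other over that same stretch. The code below is a simplified illustration, assuming all three segments are on the same chromosome and given as (start_bp, end_bp) pairs; the function name and the `min_overlap_bp` parameter are hypothetical, and a real check would use each company’s own matching thresholds.

```python
def triangulates(me_a, me_b, a_b, min_overlap_bp=1):
    """Check whether three pairwise shared segments cover a common stretch.

    me_a, me_b: (start_bp, end_bp) segments I share with Match A and Match B.
    a_b: (start_bp, end_bp) segment A and B share with each other on the
         same chromosome, or None if they do not match each other there.
    """
    if a_b is None:
        return False          # same "address" but no A-B match: no triangulation
    common_start = max(me_a[0], me_b[0], a_b[0])
    common_end = min(me_a[1], me_b[1], a_b[1])
    return common_end - common_start >= min_overlap_bp
```

A pair that overlaps on my map but returns False here is the kind of segment that gets labeled IBS.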

With this “success”, I’ve been including shared segments in my analysis down to 500 SNPs and 5cM by adjusting the thresholds at GEDmatch. Almost all of the 5-7cM segments do NOT triangulate, and are thus IBS. A few do triangulate – my guess is about 5-10%. This seems reasonable to me as there are 5cM shared segments which are IBD. I’m adding these into my TGs, but color coding the small cM value to highlight it. To date I cannot recall any which have resulted in a confirming CA. Most of these 5-7cM IBD segments may well be from an even more distant CA…  I also include shared segments down to 5cM from close, known cousins. Most are also IBS, but a few of them, so far, agree with the TG CA, and are probably IBD.

The problem is we don’t have a good test for IBD vs IBS. Some have used results from phased data to develop rough percentages for IBD/IBS ratios vs cMs for shared segments. See http://www.isogg.org/wiki/Identical_by_descent. I’ve seen no distribution “curve” yet. We don’t have such data for triangulated segments, so we really don’t know what effect triangulation has. Triangulation depends, in part, on using long shared segments. This, coupled with widely separated cousins who got exactly the same long segment, increases the odds that the shared segments are IBD. These two factors (length of segment and a match) combine to increase the probability of IBD. But as we decrease the shared segment size, we reduce that factor. We don’t know, yet, by how much this affects the curve.

Clearly, very small segments (under 5cM) are much easier to match, although most are IBS. Also, many of these very small segments will also triangulate. Triangulation is not a guarantee of IBD. We cannot use triangulation to prove triangulation. In other words, if segment length is a key factor in triangulation, we cannot say that triangulation itself proves smaller shared segments are IBD – it’s a circular argument. We need more corroborating data.

I am hesitant about establishing “rules” for segment sizes for triangulation. We are dealing with distribution curves – with tails. We have not yet drawn these curves, but at some point (as the segment size is reduced), the false positives will occur, even with triangulation. I am confident that triangulation shifts the IBD/IBS-vs-cM distribution curve “to the left”. Triangulation definitely culls out many (most?) IBS segments in the 7-10cM range. Thus the IBD/IBS ratio for a given cM must increase. To what extent is yet to be determined.

Triangulation is a tool. Use judgment when using it.

For me, shared segments below 5cM are uncharted territory for triangulation. I am confident of a Triangulation “guideline” for shared segments down to 7cM. Based on my experience with most segments in the 5-7cM range being IBS, I’m now fairly confident that triangulation also works down to 5cM. At the least, triangulation culls out most of the IBS shared segments. I think most of the few remaining 5-7cM shared segments which triangulate are IBD. For me, it’s at least worth the chance to include them in a TG and enlist the help of those Matches in finding the CA.

 

13 Segmentology: Small Segments and Triangulation by Jim Bartlett 20150930

VUCA DNA

Recently I was at a meeting with some fellow retired Naval Officers. The subject came around to a concept several of us had learned at the National Defense University in Washington, DC.

VUCA

There is now a Wikipedia Article about VUCA – Volatility, Uncertainty, Complexity, Ambiguity. It was coined to refer to wars, but was later applied to many other situations. As I thought about it, VUCA can be used to describe atDNA.

1. Volatility – The basis of human DNA, the “Build” is always being updated and changed – it was recently changed from Build 36 to Build 37 (by most companies). The chips used to determine SNP values have undergone change. FTDNA reran all of their tests when they changed to the Illumina chip. 23andMe has changed chips (and SNPs) at least twice in the last few years. AncestryDNA significantly changed their matching algorithm recently. There will be more change as newer technology or processes are developed.

2. Uncertainty – Is a shared segment IBD or not? Is a Common Ancestor the genetic ancestor or not? How many cMs should we expect to share with a 4th cousin?

3. Complexity – A matching algorithm may take SNPs from either side. Shared segments may be from either parent’s chromosome. Can it get more complex? Well… sure it can. How distant could the Common Ancestor be? Does the shared segment span two Common Ancestors or not? Which ancestors are not genetic ancestors? Which genetic ancestors passed down segments above a matching threshold? How does endogamy affect our shared segments?

4. Ambiguity – The cM measurements are based on an average of observed values for crossovers by males and females. Base pairs span areas of chromosomes that have not been sequenced. There is no sign post to indicate where a segment from an ancestor starts or stops – so shared segments are often reported as longer or shorter than they really are. Company algorithms are different – what are the criteria for a Match? How do they handle no-calls?

I think the VUCA acronym describes atDNA pretty well…

06Z Segment-ology: VUCA DNA by Jim Bartlett 20150823

The Porcupine Chart

Genetic Ancestors – the Porcupine Chart

First let me define Genetic Ancestors. These are the ancestors who passed DNA down to you. Your DNA includes some DNA from each genetic ancestor. But, as we’ll see, not all of your ancestors contributed to your DNA. Some of your distant ancestors passed DNA down to their descendants, but that DNA never made it all the way to you. So let’s delve a little more into this concept, so you’ll know what to expect as you form Triangulated Groups, work with your Matches to find Common Ancestors, and fill out your chromosome map.

You get exactly 1/2 of your autosomal DNA (chromosomes 1-22) from each parent. Each parent has used the two chromosomes they got from their parents (your grandparents) to create one chromosome for you. Actually this is the fundamental way DNA is passed down to you and to each of your ancestors so let’s take a closer look at this process in Figure 1.

Figure 1

Here are some important points about Figure 1:

  1. The parent has two of each chromosome – one from each of his/her parents (the child’s grandparents)
  2. The parent takes part of each of the two chromosomes and makes a new chromosome (this process is called recombination)
  3. The parent then passes this new chromosome to a child
  4. Clearly, there is a wide range of alternative possibilities for the new chromosome – I’ve shown 3 alternatives: one where the new chromosome is roughly a 50/50 split between the grandparents; one with a larger split; and one where there is no split (no recombination), and the child gets all the DNA in this chromosome from one grandparent. All of these possibilities occur in nature. And in fact the odds are that one of your smaller chromosomes (18-22) is probably all from one grandparent.
  5. Note that the child got one chromosome from the parent who started with two chromosomes – the child got exactly 1/2 of the DNA from the parent.
  6. This figure shows the total amount of DNA in the new chromosome, not necessarily how it is split up into segments. For more on segments read: Segments: Bottom-up
  7. This figure is based on one chromosome for the child. The recombination process happens for all of the 22 chromosomes, for each of the two parents.

So, since you got exactly 1/2 of your atDNA from each parent, and they each got exactly 1/2 of their atDNA from their parents, wouldn’t you be getting exactly 1/4 of your DNA from each grandparent? Well… no!  As shown in Figure 1, the child could get any mix of DNA from the grandparents – just as long as it adds up to 100%. On individual chromosomes the mix can vary quite wildly, but in the aggregate over all 22 chromosomes, the average tends toward 50/50, but with a range of possibilities. However, in each case the two percentages will add up to 100%.

Also you can re-read: Measuring Segments to see that you can measure and total the two grandparents’ segments by base pairs (bp), centiMorgans (cM) or SNPs – you’ll get the same percentages  and totals with any method.

So let’s continue this story by looking back one more generation – to the contribution by the great grandparents. Let’s continue to look at one chromosome and assume the mix from the grandparents is 45/55. To keep the description brief and the graphics clear, we’ll look at the 55% from the grandmother. Just like the standard process in Figure 1, this 55% portion will be composed of contributions from the two great grandparents. The great grandparent mix over this portion could range from 50/50 to 0/100, and the two numbers will always total 100. These two great grandparents cannot contribute to the 45% area (two other great grandparents do that), so their total contribution will only be of the 55% portion. So let’s say their mix is 60/40. So over this chromosome, these two great grandparents contributed 33% and 22% (for a 55% total). Note that the 60/40 split is wider than the 45/55 split. This actually happens in nature – the splits tend to get wider [or wilder, or more random, or have more deviation from the average] the farther back you go.
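
The arithmetic in this example is just a multiplication of shares, which a short Python sketch makes explicit. The function name `split_share` is hypothetical, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction

def split_share(share, mix):
    """Split one ancestor's share of a chromosome between their two parents.

    share: the fraction of the chromosome that came through this ancestor.
    mix: a pair of fractions summing to 1, e.g. 60/40.
    """
    return [share * part for part in mix]

grandmother_share = Fraction(55, 100)
great_grandparents = split_share(
    grandmother_share, (Fraction(60, 100), Fraction(40, 100))
)
# great_grandparents holds 33/100 and 22/100, which sum back to 55/100.
```

Applying the same function again to either great grandparent’s share gives the next generation back, with each couple’s contributions always summing to the share they inherited.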

It’s time for another visual depiction – see Figure 2:

Figure 2

Important points from Figure 2:

  1. You see the total chromosome (100%) the child got from one parent. [The vertical black “tic” is at 50%]
  2. Under that is the 45/55 split between grandparents.
  3. The third row shows the 55% contribution of the grandmother, being split between her parents, by 60/40. Think of blue as the paternal side, and pink as the maternal side in each succeeding generation. I don’t show the other grandparents – it gets too messy. I just want to follow some ancestral path back, to show how the genetic (DNA) contribution of some ancestors gets smaller.
  4. Note that although this is shown for one chromosome, the same principle applies to the aggregate for all chromosomes.
  5. Note the use of Ahnentafel numbers to easily keep track of the ancestors.

So, let’s carry this story further back in Figure 3:

Figure 3

Important points in Figure 3:

  1. You see the diminishing amounts of DNA that are passed down by more distant ancestors.
  2. In the last two lines (4G and 5G grandparents) you see the percent of each couple still totals 100% (of that portion on this chromosome), but the farther back you go (down the chart) the split between the ancestral couple tends to get wider.
  3. In fact, for the 5G grandparents, one of them drops out altogether. That 5G grandparent that dropped out (#172) probably contributed to all of the other generations down to and including your parent, but when your parent recombined his or her parents’ DNA, it just didn’t include any of the small contributions from this particular 5G grandparent.
  4. On a chromosome level, this 5G grandparent (#172) may not have contributed to other chromosomes either. Once an ancestor drops out of all chromosomes, their contribution to you becomes 0, and this ancestor is then not a genetic ancestor!
  5. Re-read: Segments: Top-Down to see how the DNA of some ancestors drops out of the mix.
  6. Also note that this 5G grandparent (#172) probably did contribute some DNA to many of his other 5G grandchildren, just not to you.

As we look farther up the ancestral Tree, we find more and more ancestors drop out of the mix – you don’t have any DNA from them.

Another important point in this genetic ancestor analysis is that at each generation, going back, at least one ancestor in each couple has to be there to pass the DNA down. Another way to put this is that one parent in a generation may drop out of the DNA mix, but the other one cannot. One of the two of them had to pass down the DNA that the child (your ancestor) got and passed along, eventually reaching you. And that distant ancestor (who passed down the DNA) had to get that DNA from at least one of their parents. Theoretically, your DNA goes all the way back to DNA Adam and Eve. In a more practical timeframe, as in your genealogy, there will be genetic ancestors at each generation who contributed to your total DNA. This is true whether you have identified them in your ancestral Tree, or they are behind a brick wall – whether they are known to you, or not. At each and every generation, you will have genetic ancestors whose DNA contribution to you will add up to 100%. But not every ancestor will contribute – only the genetic ancestors will… This leads us to what I call the Porcupine Chart in Figure 4.

Figure 4: The Porcupine Chart:

07A Fig 4

Used by Permission

This wonderful chart was developed by The Coop Lab – see http://gcbias.org/2013/11/11/how-does-your-number-of-genetic-ancestors-grow-back-over-time/ for their article and another chart of genealogical and genetic ancestors vs generations. It shows a standard ancestral fan chart, colored in with only genetic ancestors, moving out from you at the center. Figure 4 is an approximation based on simulations. It is not “the” chart for everyone. Your results will vary, just as your random DNA, and the contribution by your ancestors, will vary. This chart is intended to illustrate several key points:

  1. Most of your closer ancestors contribute to your DNA.
  2. At some point a few ancestors drop out of the mix.
  3. When an ancestor drops out, his/her spouse/mate stays in the mix.
  4. Every ancestor who has contributed to your DNA has a porcupine “quill”.
  5. The “quills” extend forever – there is always another ancestor who passed down the DNA (theoretically back to DNA Adam and Eve, but in a practical sense, back farther than you can go on your genealogy).
  6. Although ancestors who contribute DNA to you continue to drop out with each succeeding generation going back, the number of contributing ancestors at each generation going back cannot get smaller. NB: some of the individuals may repeat as multiple ancestors, but the number of positions for ancestors in the Tree who contribute to your DNA never gets smaller. In the extreme, the number of genetic individuals gets very small at bottlenecks and deep ancestry; but the number of “slots” in the Tree is very great.
  7. In a practical sense – in your genealogy timeframe – there will be a growing number of ancestors who contributed to your DNA in each generation [each will have a different Ahnentafel number]; and some of them will be repeat ancestors [one individual ancestor may have multiple Ahnentafel numbers].

Above-threshold segments:

Generally, the closer genetic ancestors will contribute a lot to your DNA, and more distant genetic ancestors will pass down a smaller contribution to your DNA. But a genetic ancestor will always contribute something. We can divide our genetic ancestors into two groups: those who pass down “above-threshold” segments, and those who pass down smaller segments. When all the DNA from a genetic ancestor falls below the threshold value, you won’t see any cousins from this ancestor on your Match list. This is a limitation of our programs today. This is what happens when a true 3rd cousin doesn’t show up as a Match. You and the 3rd cousin probably share some DNA from a common 2x great grandparent, but not enough to meet the matching criteria. However, if you compare yourself with this 3rd cousin at GEDmatch, and lower the threshold to 300 SNPs and 3cM, you will usually find matching IBD segments. Also, by testing and comparing siblings and other close relatives with this 3rd cousin, you may well find that they have above-threshold segments and match. The point is that some genetic ancestors may be above-threshold ancestors with you, and not with others; and vice versa.
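To picture how a threshold hides real shared segments, here is a minimal Python sketch. The segment values and the default thresholds are made up for the illustration; they are not real Match data and not any company’s actual settings. Only the lowered values (3cM and 300 SNPs) come from the text above.

```python
def reported_segments(segments, min_cm=7.0, min_snps=700):
    """Keep only the shared segments that meet both thresholds.

    segments: list of (cM, SNP count) pairs for one Match.
    The default thresholds are illustrative assumptions.
    """
    return [
        (cm, snps) for cm, snps in segments
        if cm >= min_cm and snps >= min_snps
    ]


# Hypothetical segments shared with a true 3rd cousin (invented numbers).
with_third_cousin = [(5.8, 650), (4.2, 480), (3.1, 350)]

print(reported_segments(with_third_cousin))                           # []
print(len(reported_segments(with_third_cousin, min_cm=3.0, min_snps=300)))  # 3
```

With the higher thresholds this cousin is not a Match at all; with the lowered thresholds all three small segments appear. The sketch says nothing about whether such small segments are IBD, which still has to be checked separately.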

OK – we’ve now seen that you have many ancestors, but only some of them are genetic ancestors. And only some of your genetic ancestors will pass down above-threshold segments to you.

The most important group of ancestors to genetic genealogists is a subset of your genetic ancestors – it’s those genetic ancestors who passed down to you, and a Match, at least one DNA segment which is over the threshold amount.

So there is another chart which is based on ancestors who contributed DNA segments over the threshold amounts. The chart will be of a similar form to Figure 4, but the missing ancestors will occur closer to you, and the quills will be truncated when the segments get too small (the quills don’t go back forever). The chart will look like a skinnier porcupine with a crewcut – stop to visualize this….  These are the Common Ancestors we are looking for – this is the portion of our genealogy and our ancestry that we are working with. And, yes, some (many?) of these Common Ancestors will be beyond our known ancestral Trees. Someday… much of this chart will be drawn from our completed chromosome maps.

From my experience, it appears that the number of ancestors who contribute above-threshold DNA segments will usually include all of our 16 2G grandparents, maybe all of our 32 3G grandparents, and most of our 64 4G grandparents (5th cousin level). I think that most of our Matches are in the 6th to 8th cousin level (where some of our 7G grandparents have passed “sticky” segments down to us); and that it drops off after that, with some Matches out to 10-12th cousin level with a few of our more distant grandparents. That would be what the skinny porcupine with a crewcut would look like.

A final note: the ancestors we “see” through matching algorithms may be only part of the genetic ancestors who contribute above-threshold DNA to us. We can only compare with Matches who have taken an atDNA test. Some of our ancestors may have very few descendants, or be from an area or country where few folks take DNA tests. So the fact that we don’t find Matches to some ancestors, doesn’t necessarily mean they didn’t pass down sufficient DNA. I’ll have to explore this more, in a different blog post. Chromosome mapping will also help resolve this.

Summary Observations:

  1. Our genealogy ancestors fill up every slot in every generation of our ancestry Tree – doubling with each generation [each with a different Ahnentafel number] – forever.
  2. Individuals will repeat as ancestors [each may have multiple Ahnentafel numbers].
  3. Our genetic ancestors begin to drop out of our Tree at some point.
  4. Our genetic ancestors who have passed down above-threshold DNA segments begin to drop out of our Tree even sooner.
  5. At each and every generation, there are genetic ancestors whose DNA contributions total 100% – for all of our atDNA and for each chromosome.
  6. The number of genetic ancestors will increase with each generation. And like genealogy ancestors, individual genetic ancestors can also repeat.
  7. Each genetic ancestor will have an ancestral “quill” of genetic ancestors.
  8. Only genetic ancestors who pass down enough above-threshold DNA will be seen as Common Ancestors between Matches.
  9. The number of above-threshold genetic ancestors will increase for perhaps 7-9 generations, and then decrease for the remaining generations.
  10. The above-threshold genetic ancestor chart will look like a skinny porcupine with a crew cut. The crew cut “quills” will include “sticky” segments which survive for several generations.
  11. A porcupine “quill” is not necessarily just one segment. The “quills” are for ancestors, and an ancestor may pass down multiple segments.
  12. At each generation, and on each chromosome, the DNA from a parent will be from some mix from his/her parents – ranging from an even 50/50 split to an all or nothing 0/100 split.

 

 

07A Segment-ology: Genetic Ancestors – the Porcupine Chart by Jim Bartlett 20150806

Why Upload to GEDmatch or FTDNA?

What is the advantage of uploading AncestryDNA results to GEDmatch and/or FTDNA? Let me count the ways… Here are my top 10 reasons.

[1]. To get additional Matches. (from other companies, including Matches below thresholds)

[2]. To get Matches with emails. And most at FTDNA have real names; many at GEDmatch have real names.

[3]. To get cooperative Matches. A much higher percentage of folks who test at FTDNA will work with you on genealogy. Same with folks who have taken the trouble to upload to GEDmatch.

[4]. To see the shared DNA segment. This is probably the most important reason, IMO! For each shared segment with a Match, you see the chromosome number, start and end locations, cM value, and number of SNPs included. This is technical DNA info, but it is invaluable to those who utilize the DNA beyond just a list of Matches (who may or may not be related – read on…)

[5]. The shared segment data allows the tester, or a Match, to confirm the segment is a true segment from an ancestor (that the segment is Identical By Descent, IBD) – this is done by Triangulation with other shared segments

[6]. The shared segment data allows the tester, or a Match, to evaluate the segment – a small segment indicates a distant relationship; a large segment, or multiple segments, indicates a closer relationship.

[7]. The shared segment data allows the tester, or a Match, to group segments from Common Ancestors – you will tend to have only one (or very few) different segments from a distant ancestor, and this is where you will find other cousins from that ancestor.

[8]. With Colonial American ancestry (and other endogamous populations), you may have multiple Common Ancestors with a Match. The shared segment data will allow you to determine which ancestor the DNA came from, because all who have the same shared segment data should descend from the same Common Ancestor.

[9]. Admixture (ethnicity, ancestral geography) reports are different at different companies. GEDmatch, in particular, has several utilities with a range of admixture evaluations that target different areas.

[10]. GEDmatch has other utilities, including seeing if your parents were related.

Readers are invited to add other reasons to upload AncestryDNA results to FTDNA and/or GEDmatch in the comments section.

21 Segmentology: Why Upload to GEDmatch or FTDNA by Jim Bartlett 20150611

Segments: Top-Down

In blog post Segments: Bottom-Up we looked at ancestral segments from the bottom up. That is, we started with you and the very large segments (aka chromosomes) you got from each parent. Then we continued with two crossovers per generation to the 3G grandparent level. Of course, this is not the way DNA works! DNA comes from our ancestors to us – Top-Down. So let’s see if we can reconstruct the segments we had in Figure 6 of Segments: Bottom-Up. Here is that Figure again as Figure 1:

05A Figure 1

To do that, we have to start with the chromosomes of the 16 3G-grandparents; and take them Top-Down a pair at a time. Ready…? Here we go – Figure 2:

05B Figure 2

Observations from Figure 2:

I’ve highlighted the segments (in yellow) that we wound up with in Figure 1, just so you (and I ;>j) can keep track of them as we go through the recombination process.

Note that with only 2 crossovers per generation, there are not a lot of subdivided segments.

In the left-hand area, “3G gp” is Great, Great, Great grandparent; the numbers 48 to 63 are Ahnentafel numbers; and M means mother; F means father.

So the next step is recombination of each pair of parents, as shown below. This means 2 crossovers for each pair (noted by the vertical lines), and flipping the 2 chromosomes after each crossover. Then one of these 2 chromosomes is not used (noted by the X). See Figure 3:

05B Figure 3

Observations from Figure 3:

The highlighted segments are all in the chromosome which gets passed on to the child.

Note that at each generation, the highlighted areas “cover” the whole chromosome.

One of the two chromosomes is X’d as it is not passed to the child.

The remaining chromosome in each pair is passed to the child.

The child is noted by an Ahnentafel number which is half – e.g. 28 is the child of 56 and 57.
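The Ahnentafel bookkeeping used in these Figures is easy to sketch in Python. The function names below are made up for this illustration; the numbering rules themselves (father is double, mother is double plus one, child is half) are the standard Ahnentafel convention.

```python
def child_of(ahnentafel):
    """Ahnentafel number of the child of this ancestor (half, rounded down)."""
    if ahnentafel < 2:
        raise ValueError("number 1 is the starting person and has no child in the chart")
    return ahnentafel // 2


def parents_of(ahnentafel):
    """Ahnentafel numbers of the father (2n) and mother (2n + 1)."""
    return 2 * ahnentafel, 2 * ahnentafel + 1


def generation_of(ahnentafel):
    """Generations back from the starting person: 0 = you, 1 = parents, 2 = grandparents."""
    return ahnentafel.bit_length() - 1


print(child_of(56), child_of(57))            # 28 28
print(parents_of(28))                        # (56, 57)
print(generation_of(48), generation_of(63))  # 5 5 (both are 3G grandparents)
```

Generation 5 is the 3G grandparent level, which matches the numbers 48 to 63 shown for the 3G grandparents on the mother’s side.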

Let’s see how this looks with the remaining ancestors in pairs in Figure 4:

05B Figure 4

As before these pairs of chromosomes are recombined at the crossover points as shown in Figure 5:

05B Figure 5

Observations from Figure 5:

Same as observations from Figure 3.

Compare these segments (Ahnentafel numbers and locations) with Figure 1.

Let’s see how this looks with the remaining 4 G-grandparents, arranged in pairs in Figure 6:

05B Figure 6

As before these pairs of chromosomes are recombined at the crossover points as shown in Figure 7:

05B Figure 7

Grouping these two remaining maternal grandparent chromosomes in the mother for recombination should seem very familiar from the Segments: Bottom-Up blog post. Let’s see in Figure 8:

05B Figure 8

See if you can determine where the 2 recombination points have to be, in order to wind up with the same results we show in Figure 1 – which was the final result of a Bottom-Up look. Here is Figure 9:

05B Figure 9

Finally we have the chromosome from your mother in Figure 10:

05B Figure 10

Final Observations:

A comparison with Figure 1 shows the same outcome with a Bottom Up review.

Many of the “random” recombination points above could have been different because, in the end, that portion of the DNA was not passed down.

But certain recombination points had to be where I selected them (not at random), because I was working toward a known chromosome map – the one we came up with in the Bottom-Up review. If I had been truly random in the Top-Down analysis, I could go back and recreate the same answer using Bottom-Up. The point is that you can arrive at the same answer using a Bottom-Up or a Top-Down analysis.

Your DNA already has all of these recombination points in it. Triangulated Groups will help you define these points.

For you, each recombination point has already been determined, some at one generation and some at another. How to determine which generation it is will be a separate blog post… (when I figure it out ;>j – think MRCA with each Match…)

Remember there are about 70 recombination points created across all of your chromosomes in one generation (roughly 35 on the set from each parent) – usually a few more from your mother’s side, compared to your father’s, but it’s an average, and DNA is random.

Note that your recombination points will not be the same as your Match’s recombination points – your Match almost certainly did not get the same ancestral segment you did (re-read What is a Segment? for the difference between ancestral segments and shared segments). This is one of the key reasons why spreadsheets (with shared segments) for different people should be kept separate. Although you may have a shared segment with a Match, there is no correlation between your ancestral segments/crossover points and your Match’s.

We can see above the generation where a particular ancestor’s DNA is no longer included. We noted this in the Bottom-Up post as well. Although that ancestor is no longer on this chromosome, he/she could well be on a different chromosome.

Note that when an ancestor’s DNA is no longer included on your chromosome, it is not the result of segments being subdivided into oblivion (or even small segments that don’t show up as a match). Ancestors dropped out of the picture largely because their DNA was on the chromosome that was not used. At every generation, half of the DNA involved is not used! It’s no wonder some ancestors drop out. Again, we are not talking about little pieces of DNA, but large chunks in many cases.

You are encouraged to take pencil and paper (or a spreadsheet), and try variations on crossover points and recombination. Play with crossover points at various locations. Put them close together in one generation to create a very small ancestral segment. Or try dividing segments in half with each generation (averaging, say, 2 crossovers per generation) and see how many generations there are before all the segments are below a threshold (say 7cM). I think you’d be surprised at how many generations back you can go. And if you unbalance the process (not cutting each segment in half (50/50), but say 25/75, or even 10/90), you will get even more generations with above threshold segments.
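If you would rather let a computer do the pencil work, here is a minimal Python sketch of the experiment described above. It is a toy model under stated assumptions: one cut per generation, only the larger piece is followed, and the starting segment is a whole chromosome taken as 200cM. Real recombination is random, so treat the numbers as a feel for the idea, not a prediction.

```python
def generations_above_threshold(start_cm, larger_share=0.5, threshold_cm=7.0):
    """Count how many generations the largest remaining piece stays at or above the threshold.

    Toy model: in every generation the segment is cut once, and we follow
    the larger piece (larger_share of the previous size).
    """
    if not 0.5 <= larger_share < 1.0:
        raise ValueError("larger_share must be at least 0.5 and below 1.0")
    generations = 0
    size = start_cm * larger_share
    while size >= threshold_cm:
        generations += 1
        size *= larger_share
    return generations


# 50/50, 25/75 and 10/90 splits of a 200cM chromosome with a 7cM threshold.
for share in (0.5, 0.75, 0.9):
    print(share, generations_above_threshold(200, share))
# 0.5 4
# 0.75 11
# 0.9 31
```

Even in this crude model, unbalancing the split lets an above-threshold piece survive many more generations than straight halving does.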

Have fun with it.

05B Segmentology – Segments: Top-Down by Jim Bartlett 20150601

Fuzzy Data, Fuzzy Segments – No Worry

The data we get from our atDNA test is not as precise as it may appear to be. But don’t worry, it doesn’t make any difference…

Let’s start with the 3 ways we measure our DNA first (see also my blog post Measuring Segments).

Base Pairs: Our DNA has about 3.2 billion base pairs in each set of chromosomes. The Human Genome Project sequenced about 99 percent of our chromosomes in 2003. Since then, scientists have continued to refine the structure and arrangement of the base pairs. Most of our atDNA tests have been based on “Build 36”, but FTDNA shifted to “Build 37” a year or so ago. Because the atDNA tests only look at about 700,000 of those base pairs, the change from Build 36 to 37 makes little difference. But most of us “lost” a few Matches and “gained” a few Matches as a result. To my knowledge, 23andMe and AncestryDNA still use Build 36, as does GEDmatch. The differences are slight.

cMs: The cM is an empirical measurement. There are differences in the observed cMs for males and females over the same segment (same start and end location). An average is used because the companies just don’t know the male/female ancestry of each shared segment. Besides, an average is much easier to work with… Even the tables of averages differ by company. Here are the totals per the ISOGG/wiki (without Chr X) [1]:

23andMe     3,537 cMs
FTDNA       3,384 cMs
GEDmatch    3,587 cMs

Per the CRC Research Group [2] we have very different totals of the average for males and females in the 22 autosomes:

Male        2,809 cMs
Female      4,782 cMs
Average     3,795 cMs

The reason this last average is greater than the averages used by the three companies is that there are certain areas of certain chromosomes that are blocked out from atDNA testing for genealogy (see the greyed area in your chromosome browser). So the companies only take the average of the areas covered (sampled) by the SNPs.

The important point to understand is that there is a wide variation between males and females. Using an average pretty much guarantees some inaccuracy. So definitely use cMs as a guideline, but don’t split hairs or make hard and fast “rules” with the cM values we get from the testing companies.
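To put a number on that variation, here is a small Python sketch using only the male and female totals quoted above from reference [2]. The variable names are just for this illustration.

```python
# Total autosomal map lengths quoted above from the CRC Research Group reference, in cM.
male_cm = 2809
female_cm = 4782

sex_averaged_cm = (male_cm + female_cm) / 2
print(sex_averaged_cm)  # 3795.5

# How far each sex-specific total sits from the average, as a percentage.
for label, total in (("male", male_cm), ("female", female_cm)):
    deviation = (total - sex_averaged_cm) / sex_averaged_cm * 100
    print(f"{label}: {deviation:+.0f}%")
# male: -26%
# female: +26%
```

On these totals the sex-averaged figure is about a quarter away from either sex-specific figure, which is a good reason to treat any single cM value as a guideline.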

SNPs: Each of the atDNA companies uses a different number of SNPs per the ISOGG/wiki Comparison Chart [3]:

23andMe       577,382 SNPs
FTDNA         708,092 SNPs
AncestryDNA   682,549 SNPs

Note that 23andMe now uses a somewhat lower number of SNPs, and some of them are different from the other companies. But since the SNPs are basically a sampling technique over all of our DNA, we see little difference in the shared segments now reported by 23andMe.

So the start and end location of shared segments will tend to be different depending on where the terminal SNPs are. You might think with the largest number of SNPs, FTDNA can report a more accurate shared segment, but read on…

Shared segments are determined by a proprietary algorithm at each company. I don’t know them. And even if I did, I shouldn’t report them.

When determining 700,000 or so SNPs it’s hard to get all of them read correctly. There are invariably a few miscalls and no-calls. Each company’s algorithm has an instruction as to how to handle these: ignore one (or a few?) of them and report a longer shared segment; or let a miscall or no-call break up an otherwise long segment. Based on GEDmatch examples of kits from 2 companies for the same person, each company probably does it differently.

There is no “signpost” in our DNA to indicate where an ancestral segment starts or ends. Each algorithm looks for matches between two kits, and it may well run beyond the boundaries for a particular ancestor, and in these cases pick up random, but matching, pieces of DNA from other ancestors, which would not be IBD. This makes some shared segments look larger than they should be. A classic example is a parent-child-Match trio where the child-Match shared segment is a little larger than the parent-Match shared segment.

The algorithms may also include some shortcuts. Notice the large number of segments at FTDNA which end in “00”. It appears some of their algorithm is based on looking at blocks of 100 SNPs at a time – in this case the size of a shared segment would be rounded down to only the blocks of 100 that match. The advantage FTDNA might have because of more SNPs in total, is offset by using them in blocks.

Here is a quote from the 23andMe Family Inheritance: Advanced page: “… segments can be measured in centiMorgans or in base pairs for mapping onto the genome. 23andMe rounds the segment length to the nearest tenth of centiMorgan and segment start and end coordinates to the closest millionth base pair to reflect the uncertainty in the exact locations of the segment boundaries.” [bolding by me]

AncestryDNA states their algorithm culls out segments based on “population” phasing. They also eliminate some pile-up segments, although it’s not clear (to me anyway) what size range they consider for this culling process. This process eliminates some IBS segments, but it also eliminates some IBD segments.

All of the above factors, and probably more, result in the DNA data being a little different, depending on the company – we can say the data is a little fuzzy, and not quite as precise as it appears to be. With this knowledge, I’m pretty sure cM values are not accurate to two decimal places, and shared segments are not precise to a particular base pair.

Clearly this fuzzy data leads to fuzzy segments. The statement by 23andMe is a good one – round the segment start and end locations to the nearest million base pairs.

So let’s look at all of this fuzziness from the Big Picture of Chromosome Mapping with Triangulated Groups – is it significant?

Each TG is a collection of overlapping shared segments (which match each other). As far as defining a TG is concerned, the only two SNPs that count are the first one and the last one. These SNPs are often from one of the segments that overran a little – either at the start or the end location. So the TG may be a little larger than indicated – the ends of the shared segments are a little fuzzy, so the TG is a little fuzzy, too.
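As a rough illustration of “only the first and last SNPs count”, here is a minimal Python sketch that takes the extent of a TG from its shared segments. The segment locations are hypothetical, and the sketch assumes the segments have already been shown to match each other; it does not do the Triangulation itself.

```python
def tg_extent(shared_segments):
    """Return (start, end) of a Triangulated Group from its shared segments.

    shared_segments: list of (start_bp, end_bp) pairs on one chromosome, all
    already known to match each other (triangulated) and to overlap.
    """
    if not shared_segments:
        raise ValueError("a TG needs at least one shared segment")
    start = min(seg_start for seg_start, _ in shared_segments)
    end = max(seg_end for _, seg_end in shared_segments)
    return start, end


# Hypothetical segments, made up for this illustration only.
segments = [
    (52_300_000, 71_900_000),
    (50_800_000, 66_400_000),
    (55_100_000, 73_200_000),
]
print(tg_extent(segments))  # (50800000, 73200000)
```

Because the result depends only on the earliest start and the latest end, a single segment that overran a little is enough to stretch the whole TG.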

When Chromosome Mapping is complete, you should have a bunch of TGs that cover each chromosome, from one end to the other (see Segments: Bottom-Up). I often use chromosome 5 in examples because it is about 200cM, and averages about 2 crossover points per generation. In my Chromosome Map, so far, I have 9 paternal TGs and 11 maternal TGs that cover my two chromosome 5s. From the detailed data, the tips of these TGs may overlap a little. By a little I mean maybe up to 1 or even 2 Mbp. Don’t let it worry you. In the Big Picture, if you have 10 TGs with various Common Ancestors on chromosome 5, you’re doing great! You’ve won! The fact that you don’t know precisely where each shared segment or TG starts or ends, pales when you run Kitty’s Chromosome Mapper [4] and see your ancestors mapped across chromosome 5. The little fuzziness is lost in the Big Picture. When you finish a jigsaw puzzle and step back to admire the full picture, you don’t even notice the outlines of each puzzle piece. For genealogy, you have achieved your objective – notwithstanding the fuzziness.

Another clue that fuzzy data and fuzzy segments are not an issue, is that my Matches who have tested at multiple companies, still share virtually the same segments with me. The shared segments are almost never identical in start/end locations, cMs or SNP counts. However, they always show up within a few rows of each other in my sorted spreadsheet. And they always wind up in the same TG!

So if you want to make life (TGs and Mapping) easier, use Mbp for segments. And since they are just guidelines, round off cMs to the nearest whole number. It won’t hurt anything. For me, overlapping segments and cM thresholds are just guidelines to group segments, and then form TGs (see How To Triangulate).
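The rounding suggested above takes one line in Python. This is a minimal sketch with invented segment values for the same Match tested at two companies; the function name is just for this illustration.

```python
def fuzzy_segment(start_bp, end_bp, cm):
    """Round a shared segment to whole Mbp locations and a whole cM value.

    Note: Python's round() sends exact halves to the nearest even number,
    which does not matter at this level of fuzziness.
    """
    return round(start_bp / 1_000_000), round(end_bp / 1_000_000), round(cm)


# Hypothetical data: the same Match as reported by two different companies.
company_a = (52_312_877, 71_904_120, 18.43)
company_b = (52_101_500, 72_210_344, 19.07)

print(fuzzy_segment(*company_a))  # (52, 72, 18)
print(fuzzy_segment(*company_b))  # (52, 72, 19)
```

After rounding, the two reports land on the same start and end locations, so they sort next to each other in a spreadsheet and fall into the same TG.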

And if, someday, you decide to tackle genes, and you need to know exactly which ancestor gave you the gene that appears to straddle two TGs (meaning two ancestors), you can reexamine that junction more carefully. You can always refine the crossover points later, if you want. But for now, spend your energies in forming the TGs and determining Common Ancestors for them! I’m curious about which ancestors gave me which genes, but I doubt I’ll ever go back once my chromosomes are properly mapped to distant ancestors – I’ll be pooped and ready to try something else…

Reference links:

[1] http://www.isogg.org/wiki/CentiMorgan

[2] http://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC52322&blobtype=pdf

[3] http://www.isogg.org/wiki/Autosomal_DNA_testing_comparison_chart

[4] http://kittymunson.com/dna/ChromosomeMapper.php

03A Segmentology: Fuzzy Data, Fuzzy Segments – No Worry by Jim Bartlett 20150529

Segments: Bottom-Up

This blog post looks at the segments you got from your ancestors. It will be an effort to outline what you should expect from various ancestors – to give you an overview of how you got your segments, and how they are arranged. Your DNA is like a big jigsaw puzzle, with many different and unique pieces – this will let you see a picture that might help you solve the puzzle. Of course, your picture will be somewhat different. DNA is very random, and there is wide variation in what you actually got. Nevertheless, there are averages, and there are some rules. I hope this post will give you an understanding of the big picture, as well as some detail, and help you work with atDNA segments.

There are two ways to look at your DNA: Top-Down and Bottom-Up.

Top-Down is the way you got your DNA segments – from your distant ancestors (from the top of your Tree), down to you (at the bottom of your Tree). This Top-Down explanation often includes many ancestors and DNA segments (lots of colors in the diagrams), and can get quite complex after a very few generations. I’ll attempt a simple version of the Top-Down look in a separate blog post.

Bottom-Up tends to be the way we look at our DNA – we start with all of our own DNA and divide it into maternal and paternal sides; and then determine the segments from grandparents and Great grandparents, etc. We work from ourselves (at the bottom of our Tree), up the Tree as far as we can go. This is the “look” that will be described below.

Before we start, there are three points to make:

  • We are talking about ancestral segments (see What is a segment). These are all segments you got from your ancestors.
  • We are not talking about shared segments with Matches. This discussion is only about you and your ancestors and the segments from them. There will be more about shared segments in other, later blog posts.
  • DNA is random. This is a rough model; a general picture of our DNA segments. Please don’t get lost in the technical details or in an unusual situation. This is all about what you are likely to encounter – it’s definitely not a one-size-fits-all description. Your DNA is different, but the general principles below will apply.

Before we discuss shared segments (in a later blog post), it’s important first to understand how our DNA is made up of segments from different ancestors, and generally what to expect about these segments – how they are arranged in each chromosome. This understanding of ancestral segments will help you understand shared segments later.

Three ground rules:

  • This discussion will be about autosomal DNA (atDNA) – the numbered chromosomes (chromosome 1 to 22)
  • This discussion will focus on one parent – Mother. The concepts apply equally to either parent.
  • Examples will usually use chromosome 5.

So let’s get started…

Parents

You get many large segments from your Mother. They are exactly the size of each chromosome – because each segment is a chromosome. Your Mother gave you one of each of the 22 autosomes (chromosomes 1 to 22). In each case this is a large segment from the beginning of a chromosome (the first base pair) to the end of each chromosome (the last base pair). You probably already knew that you got one set of chromosomes from your Mother, but you may not have thought about them as very large segments. But this is all about segment-ology. You get your DNA segments from your ancestors – your Mother is an ancestor, and she gave you the ultimate segments – entire chromosomes.  See the chromosome 5 example in Figure 1.

05A Figure 1

So where did your Mother get this large segment? She got it from her parents – your two grandparents – through a process called recombination. Read on…

Recombination and Crossovers

Here is a very brief overview of recombination and crossover for genealogists:

A parent takes parts (segments) of the two chromosomes from her parents, and creates one new chromosome which she passes on to a child. Basically, when recombination occurs, a parent starts with one of their parent’s chromosomes and then shifts, or crosses over, to the other parent’s chromosome. This recombination results in two segments separated by one crossover. This process may be repeated several times on one chromosome. We’ll talk more about the probability of recombination below.

Recombination is a very complex process that is the cornerstone of life and diversity. You can google “DNA recombination” for more, but this brief summary is all you really need to know for genetic genealogy. See the Figures below to see examples of how this works.

Three important points:

  • After recombination, the new chromosome is exactly the same size as each of the two chromosomes which were used to form it.
  • The segments are “heel-and-toe” – that is, they are adjacent. When one segment ends, the next segment starts. There is no gap between segments.
  • The crossover point marks the point between segments from two different ancestors. You change from one ancestor to another at this point. 
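The three points above can be seen in a minimal Python sketch of recombination. It is a simplified model, not real biology: positions are in cM, the chromosome is assumed to run from 0 to 200, and the crossover points are picked by hand for the example. The function names and data layout are made up for this illustration.

```python
def slice_chromosome(chromosome, lo, hi):
    """Return the parts of a chromosome's segments that fall between lo and hi."""
    pieces = []
    for start, end, ancestor in chromosome:
        s, e = max(start, lo), min(end, hi)
        if s < e:
            pieces.append((s, e, ancestor))
    return pieces


def recombine(chrom_a, chrom_b, crossovers):
    """Build one new chromosome from two, switching source at each crossover point.

    Each chromosome is a list of (start, end, ancestor) segments that are
    heel-and-toe, start at 0, and cover the same total length. The new
    chromosome starts on chrom_a and switches at every crossover.
    """
    length = chrom_a[-1][1]
    boundaries = [0] + sorted(crossovers) + [length]
    sources = [chrom_a, chrom_b]
    new_chromosome = []
    for i in range(len(boundaries) - 1):
        source = sources[i % 2]
        new_chromosome.extend(slice_chromosome(source, boundaries[i], boundaries[i + 1]))
    return new_chromosome


# Mother's two copies of a chromosome, one from each of her parents.
from_grandparent_1 = [(0, 200, "grandparent 1")]
from_grandparent_2 = [(0, 200, "grandparent 2")]

child = recombine(from_grandparent_1, from_grandparent_2, crossovers=[60, 150])
print(child)
# [(0, 60, 'grandparent 1'), (60, 150, 'grandparent 2'), (150, 200, 'grandparent 1')]
```

Two crossovers give three segments, the new chromosome is still 200 long, and each segment starts exactly where the previous one ends.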

Grandparents

So let’s look at your maternal chromosome 5 at the grandparent level. That is chromosome 5 with segments from the two maternal grandparents. See the chromosome 5 example in Figure 2.

05A Figure 2

There is lots of information here:

  • Segments from the two grandparents “fill up” the entire chromosome.
  • The segments from the grandparents alternate.
  • There are three segments and two crossovers (we will look more into the number of segments and crossovers below).
  • There are no gaps between segments.
  • These segments tend to be large, and the crossover points tend to be widely separated.
  • Note: Only ancestors from grandparent 1 can contribute to the segments from grandparent 1. In other words the segments for grandparent 1 can only come from the ancestors of grandparent 1. Ditto for grandparent 2.

Crossover Points

OK – before we continue, we need to look at the realistic number of segments and crossover points we should expect in each generation. Science has found that in one generation (Mother to you, for example), there are about 35 crossover points spread out over all 22 chromosomes. In fact the cM is defined by the probability of a crossover: 1cM is roughly a 1% chance of a crossover in one generation, so on average there is one crossover for every 100cM. So let’s look at Table 1:

05A Table 1

Each atDNA testing company shows a slightly different table of cMs for each chromosome. You can see a report of cM for various companies at http://www.isogg.org/wiki/CentiMorgan. Don’t get hung up on the exact numbers – it’s the overall concept that counts. The average number of crossovers is the cMs in that chromosome divided by 100. Since the number of crossovers per chromosome must be a whole number, I’ve shown several alternatives above. Although the average may be 1 or 2 or 3, possibilities may include 0 or 4 (or more sometimes). The point is that there are only a few crossovers expected for each chromosome, in one generation, and if more occur on some chromosomes, there tend to be fewer on some other chromosomes. This process occurs in each generation – for instance when a parent recombines the grandparents’ chromosomes and passes a single chromosome to you. It also happens when a grandparent recombines the Great grandparents’ chromosomes and passes a single chromosome to your parent. So there are about 35 crossover points at each and every generation, and they add up, depending on which generation is under consideration. This will be described for each generation in more detail below.

Important points from the above info:

  • Recombination does not “puree” the DNA into tiny pieces (segments). Recombination tends to divide each chromosome into a few segments.
  • Clearly with only a few recombinations, the resulting segments in one generation will tend to be large.
  • Clearly with only a few crossover points, they tend to be distributed over the chromosome.
  • DNA is random, and the number of crossovers in your chromosomes may vary. If you have more than average on some chromosomes, you will probably have fewer than average on some other chromosomes.
  • Regardless of the number of crossovers, or their locations, all the resulting segments on a chromosome will fill the chromosome, with adjacent segments, from one end to the other.
  • When there are 0 crossovers, there was no recombination, and that chromosome was passed intact to the next generation. Given the probabilities in Table 1, there is a high probability that at least one of the smaller chromosomes will be passed intact with each generation.
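
To put rough numbers on the last point, here is a hedged Python sketch. It assumes the number of crossovers on a chromosome follows a simple Poisson distribution with an average of cM/100, and that chromosomes behave independently. That is a simplification I am adding for illustration (real recombination does not follow it exactly), and the cM values in the example list are hypothetical, not taken from Table 1.

```python
import math


def expected_crossovers(chromosome_cm):
    """Average number of crossovers in one generation: cM divided by 100."""
    return chromosome_cm / 100.0


def prob_no_crossover(chromosome_cm):
    """Chance the chromosome is passed intact (ASSUMES a Poisson model)."""
    return math.exp(-expected_crossovers(chromosome_cm))


def prob_at_least_one_intact(chromosome_cms):
    """Chance that at least one listed chromosome has 0 crossovers.

    ASSUMES the chromosomes recombine independently of each other.
    """
    prob_all_recombined = 1.0
    for cm in chromosome_cms:
        prob_all_recombined *= 1.0 - prob_no_crossover(cm)
    return 1.0 - prob_all_recombined


# Hypothetical cM sizes for a few small chromosomes (illustration only).
small_chromosomes_cm = [75.0, 80.0, 100.0]
print(prob_at_least_one_intact(small_chromosomes_cm))
```

Under these assumptions a 100cM chromosome averages one crossover, yet still has a meaningful chance of getting none in a given generation.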

Great Grandparents

So, given the probability that there are two additional crossovers in each generation for chromosome 5, let’s look at a probable scenario from the Great grandparent’s perspective in Figure 3.

05A Figure 3

Important information from Figure 3:

  • Again, segments from the four Great grandparents “fill up” the entire chromosome.
  • The “new” segments from the Great grandparents alternate; and they alternate within their respective child. That is Ggp1 and Ggp2 are parents of grandparent 1; Ggp3 and Ggp4 are parents of grandparent 2.
  • There are two new crossover points (shown by large vertical lines); and the previous crossover points are still there.
  • There are no gaps between segments.
  • Again, these segments tend to be large, and the crossover points tend to be widely separated.
  • Note: Only grandparent 1 ancestors (Ggp1 and Ggp2) can contribute within the segments from grandparent 1. In other words the segments for grandparent 1 can only come from the ancestors of grandparent 1. Ditto for grandparent 2.
  • We only had two new crossover points for chromosome 5, so they could only subdivide two of the three grandparent segments. Sometimes there may be 3 crossover points, but then, sometimes there may only be one crossover point. Even with 2 crossover points, they could have both occurred within one grandparent segment. When dealing with random DNA, there are many possibilities. We used the average of 2 crossover points to paint the best overall picture. You are invited to print Figure 3 and randomly place one, two, three or four crossover points anywhere you want. If you put two, or more, crossovers in one grandparent segment, be sure to alternate the Great grandparents’ segments: Ggp1-Ggp2-Ggp1…
  • Note that there is no crossover point through the last segment for grandparent 1. That means the second grandparent 1 segment was passed down, intact, from one of the Great grandparents. There is a 50% probability for either one. But only for one. We sometimes refer to such a segment as a “sticky segment”, because it appears to stick together through a generation.
  • We now show five, rather large, segments at this Great grandparent level of chromosome 5.
  • Note that the first new crossover point divides the grandparent 1 segment into two segments, one for each parent of grandparent 1. On the other hand, the second crossover point is separating two segments from different parents of the grandparents. These two parents, labeled Ggp2 and Ggp3 are not related to each other by marriage or otherwise. So some adjacent segments may be for husband and wife; and some may be for distant, unrelated ancestors.

2Great Grandparents

OK – moving on to the next generation back – continuing our Bottom-Up look… Again, we will add two more crossover points to get Figure 4:

05A Figure 4

Important information from Figure 4:

  • As always, segments from the 2G grandparents “fill up” the entire chromosome.
  • The “new” segments from the 2G grandparents alternate; and they alternate within their respective child. For instance 2Ggp5 and 2Ggp6 are parents of Ggp3.
  • There are two new crossover points (shown by large vertical lines); and all the previous crossover points are still there.
  • There are no gaps between segments.
  • Again, these segments tend to be large, and the crossover points tend to be widely separated. But in this example the first new crossover point occurs fairly close to an existing crossover point, so a relatively small segment is created for 2Ggp3. It happens… and small segments are created.
  • Again, only segments from parents can contribute to a child’s segment. So 2Ggp1 is a parent of Ggp1; 2Ggp3 & 4 are parents of Ggp2; 2Ggp5 & 6 are parents of Ggp3; 2Ggp8 is a parent of Ggp4; and, for the last segment, 2Ggp3 is a parent of Ggp2. Note that there are no segments for 2Ggp2 or 2Ggp7 shown. Those ancestors did not contribute any DNA to chromosome 5.
  • Note that there are no crossover points through the first, fourth and fifth segments at the Great grandparent look. That means these Great grandparent segments were passed down, intact, from one of the 2G grandparents. There is a 50% probability for either one, and I selected one of those in each case for example. Now we have two “sticky segments” from the previous generation; and one “sticky segment”, 2Ggp3, which survived three generations.
  • We now show seven, mostly rather large, segments at this 2G grandparent level of chromosome 5.
  • Again, note that the first new crossover point divides a segment into the two parents.

3Great grandparents

Let’s look at one more generation – to get the hang of it – and then draw some general conclusions about what to expect.

05A Figure 5

Important information from Figure 5:

  • As always, segments from the 3G grandparents “fill up” the entire chromosome.
  • The “new” segments from the 3G grandparents alternate; and they alternate within their respective child. For instance 3Ggp1 and 3Ggp2 are parents of 2Ggp1.
  • There are two new crossover points (shown by large vertical lines); and the previous crossover points are still there.
  • There are no gaps between segments.
  • At this generation going back (Bottom-up), these segments still tend to be large, but with each generation a few segments are split into smaller segments. Another relatively small segment has been created.
  • Now there are 8 of the 16 3G grandparents missing (3Ggp3, 4, 5, 8, 9, 13, 14, and 15).
  • The last, 3Ggp6, segment has now survived, intact, from the 3G grandparent level down to you.
  • We now show nine segments at this 3G grandparent level of chromosome 5.
  • Again, note that the first new crossover point divides a segment into the two parents.

So what are the big-picture observations:

  • At each generation going back, each chromosome is made up 100% by segments from that generation.
  • The segments at each generation are adjacent to each other; there are no gaps.
  • On average, there are only two new crossovers at each generation. So only two segments are subdivided at each generation.
  • From here on out, at each generation, most of the segments will remain the same size; and only a few will be subdivided.
  • Some segments, particularly the smaller ones, will appear to be “sticky” and survive for several generations without being subdivided.
  • More and more ancestors, at each generation, will drop out of the picture as you move to more distant ancestors. This applies only to the chromosome under consideration. These ancestors may well be found on other chromosomes. Because the DNA is random, many of your ancestors will be represented on some chromosomes, for many more generations.
  • This should dispel the common idea that all segments are cut in half with each generation. This may be true for averages, but in practice, we found above that only a very few segments are subdivided each generation.
  • “Sticky segments” are normal. In fact, in dealing with the smaller segments and comparing with a parent, you’ll often find you have virtually the same segment as your parent, or none at all. More on this in a later blog post on shared segments.
  • When a husband and wife have adjacent segments, then that crossover was created in their child.
  • Although this “picture” was developed for Mother, the same principles apply to the father’s side of autosomal DNA. And, of course, everyone’s version would be uniquely different. But, on a big picture level, it would be somewhat similar.
  • You can use Kitty’s chromosome mapping program to show your results at any generation up to 20 ancestors. Just list the ancestors of that generation in the MRCA column. See http://kittymunson.com/dna/ChromosomeMapper.php
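
The Bottom-Up walk through Figures 2 to 5 can also be sketched in Python. This is a toy simulation under assumptions of my own: exactly two new crossovers per generation placed at random, an assumed 181 Mbp chromosome, and ancestors labeled with Ahnentafel-style numbers (Mother is 3, her parents are 6 and 7, and so on). The function names step_back and simulate are hypothetical.

```python
import random


def step_back(segments, new_crossovers, rng):
    """Relabel each segment with ancestors one generation further back.

    A segment from ancestor n can only come from that ancestor's parents
    (2n and 2n+1). New crossovers inside a segment split it, alternating
    between the two parents; a segment with no new crossover is passed
    intact from one parent, picked with 50% probability.
    """
    result = []
    for start, end, ancestor in segments:
        inside = sorted(p for p in new_crossovers if start < p < end)
        edges = [start] + inside + [end]
        parents = (2 * ancestor, 2 * ancestor + 1)
        first = rng.randrange(2)  # which parent supplies the first piece
        for i in range(len(edges) - 1):
            result.append((edges[i], edges[i + 1], parents[(first + i) % 2]))
    return result


def simulate(length_mbp=181.0, generations=4, crossovers_per_generation=2, seed=5):
    """Return the segment list at each generation back from Mother (3)."""
    rng = random.Random(seed)
    segments = [(0.0, length_mbp, 3)]  # the whole chromosome from Mother
    history = []
    for _ in range(generations):
        points = [rng.uniform(0.0, length_mbp)
                  for _ in range(crossovers_per_generation)]
        segments = step_back(segments, points, rng)
        history.append(segments)
    return history


for generation, segments in enumerate(simulate(), start=1):
    ancestors = sorted({ancestor for _, _, ancestor in segments})
    print(generation, len(segments), ancestors)
```

With two new crossovers per generation the segment count grows as 3, 5, 7, 9, just as in the Figures, while the list of ancestors represented falls further and further behind the number of ancestors in that generation. Changing the seed gives a different, equally plausible picture.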

Final Thoughts

Remember this whole discussion is based on your ancestral segments. Your ancestral segments are defined by the crossover points. The crossover points are locked into your DNA when you were conceived. They never change. They define the picture of your segments in each of your chromosomes. They define which ancestral lines contribute to which segments on your chromosomes. We’ll talk about this more in discussions about shared segments with Matches. We don’t really see the picture of our own segments in a chromosome browser. The browser doesn’t know where your crossover points are. What we see in a chromosome browser are shared segments. By grouping and Triangulating these shared segments, we can learn where the crossover points are. Much more in future blog posts….

05A Segmentology: Segments: Bottom-Up by Jim Bartlett 20150523

Epilogue

In my spreadsheets and analysis, I use Ahnentafel numbers. They are a standard numerical code for each ancestor: I am 1; my father is 2, my mother is 3; my 4 grandparents are 4-7; etc. They offer a unique shorthand for identifying ancestors. Here is a summary of the chromosome 5 charts in this post, using Ahnentafel numbers for ancestors. Starting with 3 for my mother…

05A Figure 6
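
For anyone who keeps their data in a spreadsheet or script, the Ahnentafel arithmetic is easy to sketch in Python. In the standard scheme the father of person n is 2n and the mother is 2n+1; the function names below are just my own choices for this illustration.

```python
def father(n):
    """Ahnentafel number of the father of person n."""
    return 2 * n


def mother(n):
    """Ahnentafel number of the mother of person n."""
    return 2 * n + 1


def child(n):
    """Ahnentafel number of the child through whom ancestor n descends."""
    return n // 2


def generations_back(n):
    """0 for me (1), 1 for parents (2-3), 2 for grandparents (4-7), etc."""
    return n.bit_length() - 1


def ancestors_at(generations, start=1):
    """All Ahnentafel numbers that many generations back from person start."""
    return list(range(start * 2 ** generations, (start + 1) * 2 ** generations))


print(ancestors_at(2))           # my four grandparents
print(ancestors_at(1, start=3))  # my mother's parents
```

Passing start=3 limits the list to my mother's side, which is the side used for the chromosome 5 charts in this post.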

Measuring Segments

For autosomal DNA, segments are measured 3 ways:

base pairs (bp) – these are the individual building blocks (molecules) that form each chromosome. Over the entire set of chromosomes there are 3.2 billion base pairs. Each chromosome has from 48 to 250 million base pairs. So a segment can be defined by the Start Location and End Location. Think of base pairs as a physical picture of a segment – its physical length and location on a chromosome.

Example: Start at 23,500,000; End at 46,300,000

Many of us round to the nearest tenth of a Megabasepair (Mbp). A Mbp is 1,000,000 bp.

Example: The segment is 23.5-46.3Mbp

Rounding makes these numbers much easier to read and to type. And, in my opinion, there is virtually nothing lost in accuracy. Mbp is just fine for genealogy, triangulation and chromosome mapping. If you want to do some analysis of a particular part of a segment for some scientific or medical reason, you may want to use bp. (I’ll discuss “fuzzy data” and “fuzzy segments” in a separate blog post)

centiMorgans (cM) – I think of cM as a “quality” factor, or a “genetic distance” of sorts. The cM is the best measure we have of genetic distance, but it is far from perfect. The cM is empirically derived – that is, scientists have recorded many observations and put them into tables. From these tables the cM value between any two points on a chromosome (as measured by bp) can be determined. In very general terms, more is better, and the larger the cM of a shared segment, the closer the Match would be. DNA is very random, and there are wide ranges of cM vs cousinship (including much overlap). See these references for more info [1], [2] and [3].

Example: A segment may be 15.4cM
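
To show what “determined from tables” can look like, here is a small Python sketch that reads cM values off a lookup table by straight-line interpolation between entries. The table below is entirely hypothetical (made-up numbers for illustration), and linear interpolation is my assumption about one simple way to use such a table; the testing companies may do it differently.

```python
from bisect import bisect_right

# HYPOTHETICAL genetic map for one chromosome:
# (position in bp, cumulative cM from the start of the chromosome).
HYPOTHETICAL_MAP = [
    (0, 0.0),
    (10_000_000, 20.0),
    (30_000_000, 30.0),
    (50_000_000, 60.0),
]


def cm_at(position_bp, genetic_map):
    """Cumulative cM at a bp position, interpolated between map entries."""
    positions = [pos for pos, _ in genetic_map]
    if position_bp <= positions[0]:
        return genetic_map[0][1]
    if position_bp >= positions[-1]:
        return genetic_map[-1][1]
    i = bisect_right(positions, position_bp)
    (pos0, cm0), (pos1, cm1) = genetic_map[i - 1], genetic_map[i]
    return cm0 + (cm1 - cm0) * (position_bp - pos0) / (pos1 - pos0)


def segment_cm(start_bp, end_bp, genetic_map):
    """cM length of a segment: difference of the two cumulative values."""
    return cm_at(end_bp, genetic_map) - cm_at(start_bp, genetic_map)


# Two segments of the same physical length (20 Mbp) but different cM.
print(segment_cm(10_000_000, 30_000_000, HYPOTHETICAL_MAP))
print(segment_cm(30_000_000, 50_000_000, HYPOTHETICAL_MAP))
```

The two example segments are the same size in base pairs but not in cM, which is why bp and cM have to be treated as different measurements.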

SNPs – the single molecules (nucleotides), or base pairs, which show some amount of variation in human DNA. Most (99%) of our DNA is the same. For genealogy, we are looking for SNPs (sometimes referred to as markers) which are known to vary. The difference in our SNPs is what sets us apart. Basically each SNP can have one of 4 values: A, C, G or T. Each of the autosomal DNA testing companies uses a slightly different “chip” to determine these values, and they each effectively test a different number of SNPs – usually in the range of 600,000 to 700,000 SNPs. These are spread out over all of your chromosomes – think of them as a sampling of your DNA (a sampling of the most variable parts of your DNA). This might range from about 10,000 SNPs on the smallest chromosomes to over 58,000 on Chromosome 1.

Example: A segment may include 2,451 SNPs

Note that there is no firm correlation between these measurements. We can convert temperature measured in Centigrade to Fahrenheit because they both measure the same thing. All of the above measurements measure different things. However, on average, 1cM corresponds to roughly 1 Mbp (about 100cM per 100 Mbp), although the ratio varies widely along each chromosome.

So, in summary:

A segment may be described: Chr 6: 23,500,000 to 46,300,000; 15.4cM; 2451 SNPs

A short cut description may be 6: 23.5-46.3Mbp; 15.4cM; 2451 SNPs
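
If you keep segments in a script rather than a spreadsheet, both descriptions can be produced from the same record. This is a minimal Python sketch using the example segment above; the class name Segment and its field and method names are my own assumptions, not any company's file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    chromosome: int
    start_bp: int
    end_bp: int
    cm: float
    snps: int

    def full(self):
        """Full description with exact base-pair locations."""
        return (f"Chr {self.chromosome}: {self.start_bp:,} to {self.end_bp:,}; "
                f"{self.cm}cM; {self.snps} SNPs")

    def short(self):
        """Short cut description, rounded to a tenth of a Mbp."""
        return (f"{self.chromosome}: {self.start_bp / 1e6:.1f}-"
                f"{self.end_bp / 1e6:.1f}Mbp; {self.cm}cM; {self.snps} SNPs")

    def cm_per_mbp(self):
        """cM per Mbp for this segment (often far from the 1:1 average)."""
        return self.cm / ((self.end_bp - self.start_bp) / 1e6)


example = Segment(chromosome=6, start_bp=23_500_000, end_bp=46_300_000,
                  cm=15.4, snps=2451)
print(example.full())
print(example.short())
print(round(example.cm_per_mbp(), 2))
```

The last line shows the cM per Mbp for this particular segment, a reminder that the 100cM per 100 Mbp figure is only an average.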

References:

[1] http://www.isogg.org/wiki/CentiMorgan

[2] http://en.wikipedia.org/wiki/Centimorgan

[3] http://compgen4.rutgers.edu/mapinterpolator

03 Segmentology: Measuring Segments; by Jim Bartlett 20150513