Confusion about Base Pairs

A Segment-ology TIDBIT

Let’s sort this out. A Chromosome is a long string of DNA – which has the form of the famous double helix. If we flattened out the double helix it would look like a ladder, with two sides connected by lots of rungs. On each end of every rung is a molecule we call a base – called A, C, G or T for short. The two ends of each rung are always paired, with A on one end and T on the other end, or C on one end and G on the other. That’s because in chemistry, the A molecule bonds much more readily with a T; and a C bonds easily with a G. They form what is called a base pair. And if you know one end of each rung, you know the other end. 23 chromosomes make up a genome, and a genome has about 3 billion of these base pairs*.
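To make the "know one end, know the other" point concrete, here is a minimal Python sketch. The names `COMPLEMENT` and `other_side` are made up for illustration; the only fact it relies on is the A-T, C-G pairing described above.

```python
# Each base on one end of a rung fixes the base on the other end.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def other_side(forward: str) -> str:
    """Return the base at the opposite end of each rung, rung by rung."""
    return "".join(COMPLEMENT[base] for base in forward)


if __name__ == "__main__":
    one_side = "AAGCT"
    # Prints the two sides of the ladder, aligned rung by rung.
    print(one_side)
    print(other_side(one_side))
```

Applying `other_side` twice gives back the original string, which is another way of saying the second side of the ladder adds no new information.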

As we look at one side of the chromosome “ladder” we see one of these molecules at every rung. Important: There is no hard and fast rule about the order of the ACGTs along one side of the ladder.

In our bodies we have two genomes – one set of chromosomes from the father and one set from the mother.

For atDNA testing a laboratory looks at, say, 600,000 specific base pairs called SNPs (pronounced snips). Each of these SNPs is at a specific location on a chromosome, and the lab looks at one side (the “forward” side) and determines if it is an A, C, G, or T. Because we have two of each chromosome, they actually get two values (called alleles), one from the paternal chromosome, and one from the maternal chromosome. Because all these SNPs are floating around in a soup, we don’t know which one is from Mom and which one is from Dad. One convention is to list them alphabetically, resulting in ten possibilities: AA, AC, AG, AT, CC, CG, CT, GG, GT, and TT. You can see that in these “pairs” the A is not necessarily paired with T. That’s because the DNA from each parent came to you from very different, and usually very distant, paths – they don’t touch or interact with each other. And the SNP base pairs were chosen for an atDNA test because they offer variability.
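The count of ten can be checked with a few lines of Python. This is only a sketch of the alphabetical-listing convention mentioned above; the function names `unordered_genotypes` and `normalize` are hypothetical and not taken from any testing company's software.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def unordered_genotypes(bases: str = BASES) -> list[str]:
    """List every two-allele result where order does not matter."""
    return ["".join(pair) for pair in combinations_with_replacement(bases, 2)]


def normalize(allele_1: str, allele_2: str) -> str:
    """Write two alleles alphabetically, since we cannot tell Mom's from Dad's."""
    return "".join(sorted(allele_1 + allele_2))


if __name__ == "__main__":
    print(unordered_genotypes())
    print(normalize("T", "G"))
```

Because the order is dropped, a T from one parent and a G from the other is written the same way whichever parent supplied which.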

A shared DNA segment between you and a Match consists of a long string of SNPs (usually 1,000 or more) where you have at least one of your two alleles match at least one of your Match’s two alleles. The longer the shared segment, the greater the probability that it had to come from a Common Ancestor.
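The "at least one allele in common" rule can be sketched in Python as follows. This is a toy illustration under stated assumptions, not how any company actually computes segments: real tools also apply cM and SNP-count thresholds and tolerate occasional miscalls, none of which is modeled here. The names `shares_allele` and `longest_half_identical_run` are invented for this example.

```python
def shares_allele(mine: str, theirs: str) -> bool:
    """True if at least one of my two alleles equals one of theirs."""
    return bool(set(mine) & set(theirs))


def longest_half_identical_run(my_snps: list[str], their_snps: list[str]) -> int:
    """Length of the longest run of consecutive SNPs that share an allele.

    Both lists are assumed to cover the same SNPs in chromosome order.
    """
    best = current = 0
    for mine, theirs in zip(my_snps, their_snps):
        current = current + 1 if shares_allele(mine, theirs) else 0
        best = max(best, current)
    return best


if __name__ == "__main__":
    me = ["AG", "CC", "GT", "AA", "CT"]
    match = ["AA", "CT", "TT", "GG", "CC"]
    print(longest_half_identical_run(me, match))
```

In the small example, the first three SNPs each share an allele, the fourth (AA versus GG) does not, so the longest run is three SNPs. A real shared segment is the same idea stretched over a thousand or more SNPs.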

BOTTOM LINE – As genetic genealogists we are not concerned with the “base pairs” on each end of a rung, we are very much interested in the two SNP alleles we got from our two parents – not called “base pairs.”

*[An experiment you can do at home: at GEDmatch compare your kit to your kit in the one-to-one utility – you’ll match on 22 chromosomes, from start to finish. Add up the “End Locations” and see how close to 3 billion you come – add in about 155 million for Chr X to get a full genome].

 

[22U] Segment-ology: Confusion about Base Pairs TIDBIT by Jim Bartlett 20180502

31 thoughts on “Confusion about Base Pairs”

  1. I try to explain meticulous genealogy, and my two relatives want to hear a scientific explanation and a study to explain why it’s not just a money-making scam. Can you please point me to something I can hand them? The genetic MD has seen the Ancestry.com printouts and laughs. The engineer just asked for a second test to compare results. I don’t debate well.

    Thanks, Margery Webb


    • Marjorie, I, too, am an engineer – registered in Virginia for almost 40 years. I think that’s why I like genetic genealogy so much – a confluence of our genealogy puzzle (filling in all the ancestral blocks) and the genetic puzzle (linking specific pieces of our DNA back to specific ancestors – aka chromosome mapping). To me this is the ultimate puzzle: where did my building blocks come from? I try to keep it on a plain English level, but I have had to study a lot of biology – including a free online semester from MIT: DNA the Secret of Life. About the best single resource for us is http://www.ISOGG.org/wiki – particularly the autosomal tab – with pages with many references and links. You can go about as deep as you want. But I always come back to the fundamental tenet: if two people share enough matching DNA, the only way that could happen is if they got it from a Common Ancestor.

      Jim


  2. Hi Jim
    Thank you for the explicit information in this blog. However, it is my understanding that although we have 46 chromosomes per human cell, this constitutes 1 genome per cell, not two. Am I misunderstanding the description? Thank you for your help.
    Sue L


    • Susan, I’ve seen “genome” defined both ways and used both ways – many times over. Maybe I should have used different wording so as not to distract from the important message. Jim


      • Jim, That is good to know. I always want to know what is out there and I am more isolated than you. Now I’ll know what people mean when discussing a single individual’s genomes. Thank you for taking the time to clarify the situation for me.
        Sue L


    • Doris, the rs numbers represent the unique address for each SNP. Some companies use a different mix of SNPs. It’s important to identify exactly which ones were tested, so they are always compared correctly with other tests. I think the rs numbers refer to some look-up table that tells a scientist how to “find” each SNP. Think of the rs numbers as the number in your street address. Jim


      • Thank you, Jim. That’s exactly what I wanted to know. I was confused by the 3 billion vs. 600,000 base pairs and didn’t equate that to SNPs. Now I understand.
        Doris


  3. Thanks for clarifying this, Jim. However, we are concerned with base pairs in one sense, aren’t we? The start and stop numbers for matching segments are in units of base pairs, right? Another kind of address.


  4. A great explanation to distinguish base-pairs from the pair of alleles one sees in RAW data files. Just a few quick comments (FYI, I am an engineer also and not a biologist so forgive as I am learning as well):
    (a) You need to include Y and its roughly 58 million base pairs.
    (b) While most/all report 3.2 billion base pairs in the genome, that is only the haploid count (or counting half the chromosomes that exist in each cell). We actually have ~6.4 billion base pairs (roughly) in the nuclear DNA (the diploid chromosomes). Males a bit less. This may be the source of confusion when seeing the genome described both ways.
    (c) The GEDmatch tool is reporting on centimorgans (cM). While many like to equate 1 centimorgan to 1 million base pairs, this is not really true. (Note: deserving of its own post to simplify and disambiguate!) Overall, it comes out about the same. But not when looking at individual sections of DNA. I have seen 10 centimorgan match strands that have over 25 million base pairs. And the reverse. Some tools report a match between parent and child as ~3,400 cM but most closer to ~3,600 cM. The latter sometimes when including X matching in the report. GEDmatch reports ~3,600 cM for autosomes (1-22) and another ~200 cM for X. Almost always, the cM total count is higher than the base pair count (after multiplying cM by 1 million).
    (d) The GEDmatch chart is really only showing the half-identical match process. It indicates, in the graphic form, full-identical match areas. This is not really the same as haploid and diploid but is comparable. Not the same because the matching segments are not created using just the mother’s or just the father’s contributed chromosomes. The tool is aggressive and tries to make as long a matching segment as possible from either allele in each SNP. So just to reiterate, half-identical matching in genetic genealogy is not haploid chromosome match reporting. As close as we can generally get, but not the same. This is mostly an issue when comparing full siblings and others that have full-identical regions. Half-identical and full-identical are terms created in genetic genealogy to help simplify the awkward test results and process.
    (e) You hear 3.2 billion often but the number is not quite 3.1 billion by current counts. Or rounds to 3.3 billion depending on what you count in HG38. The answer depends on how you count bases. See https://www.ncbi.nlm.nih.gov/grc/human/data But I like to stick with 3.2 billion base pairs for the haploid genome and 3,600 cM for the autosomes “full” match with a parent as the rough numbers to use.

    Hopefully, this did not muddle it up even more. Biology is messy and computer algorithms try to make predictive sense of it. The testing tools (lab process) add their own mess of probabilities into the process.


    • I appreciate all your input. I already know what you posted, but I’m trying to keep this blog to the essential things genealogists will find useful. My main point was to clear up the confusion about base pairs (I had the same confusion when I started, and it’s easy to get off track with that term). Some of my earliest blog posts talked about fuzzy data and the fact that there is no conversion factor among bp, SNP and cM. The total number of base pairs in our DNA is not much help to genealogists, and the number has changed every year since the full genome was first mapped. I was just trying to let folks look at their DNA and see that it roughly meshes with the big picture data they hear about. Your struggle to state the number of cMs in half our DNA illustrates the point – the total isn’t too important from a genealogy point of view. As I’ve posted before, our Shared Segments and Triangulated Groups are fairly large “targets” that link to our Ancestors. The fuzziness is almost never critical to our genealogy objectives.


  5. Oh, meant to include in the clarification above also (not able to edit it). The “end locations” just happen to be where the last measured SNP is. Not the end of the chromosome in a base-pair count. Ditto for the start location, which is never shown as zero. The actual starts and ends of the chromosomes vary and decrease with the age of the cell / human body. So while your “end locations” parlor trick comes somewhat close, it is not an accurate representation.


    • Randy, Acknowledged – different companies measure shared segments different ways – you will almost never see exactly the same segment (from the same data) at different companies. In some cases the shared segments run long, sometimes they are shortened. It makes almost no difference in a genealogy context. Accept that the shared segment ends are a little fuzzy, and focus on the bulk of the shared segments and who Triangulates. I would ask if you know of a way to get a more “accurate representation” of segment ends, but it doesn’t make any difference. With all the fuzzy data we have now, the Triangulated Groups are going to overlap a little. The method (“parlor trick”) I came up with is a big help when sorting spreadsheets so that all the shared segments fall into a Triangulated Group. Jim


  6. Jim,
    I first discovered your blog a few days ago, and went to the beginning in 2015 and have read all of your posts. You have helped to clear up so much confusion that I have gotten from reading other genetic DNA posts. Your explanations have been so much easier to understand. I am new to genetic DNA (6 months) but have been doing genealogy for over 35 years. I am a visual person when I read, especially if someone gives examples. Because of what I have previously read in reads and other blogs, I am struggling with the explained description at the beginning of this post on the chromosome. I know it is a spiral ladder, and the end of the rungs is where the molecules ATCG base pairs are located. My confusion comes in on how they are combined/read. Is only the side of the ladder being read as one genome or are the matched molecules on the opposite side included? Like you list AA, AC, AG. Or is it AT, AT, AT, CG etc? I’m confused because I thought that A always paired with T and C always paired with G. That is what is confusing me: are these other combinations of these molecules because one is from dad and one is from mom? Hope this is understandable. Thank you for your clarification.


    • Barbara,
      Thanks for your feedback. I was hoping this “Confusion” post would clear this up. Let me try it another way.
      You are correct that the two ends of each rung in the double-helix chromosomes of DNA are always paired A-T or C-G. This is due to the chemical bonds that make up the ladder rung – A only bonds with T, C only bonds with G. But in “reading” a chromosome, the process only examines one side of the ladder – one end of each rung. For example, Chromosome 01 has about 250 million rungs; so there are 250 million “reads” from one end to the other. We usually use the “forward” read – which in chemistry is in the 5′ to 3′ direction – don’t worry, the process knows how to do this [there is a free MIT course in Biology (and DNA) that is just starting up again. I’ll email the link, if you want to get into 5′, etc.] And as you point out, once you know this string of 250 million ACGTs, you also know the complementary side. So the other side of each rung doesn’t add any new information. All we have to know is that everyone uses the same forward read. In our case, genetic genealogy, the microchips only read a total of 600,000 “special” or “variable” base pairs, called SNPs. We don’t need to know how much alike we are, we want to know the differences. If you and I are a Match, most of our SNPs will be different (or long stretches – segments – will be different) – it’s that one shared segment we are looking for.
      So while the two ends of a rung are always complementary, the adjacent rungs can be anything. The forward end of a string of ladder rungs could be AAAA, or AGCT, or any combination – they are basically independent. [That’s not entirely accurate, as each forward base pair (SNP) has a preferred value and a secondary value, and is very rarely one of the other two values of A, C, G, or T, but that’s another story.] The main point is that adjacent SNPs are not constrained by the A-T, C-G “rule”. And when a read is done, it actually looks at both of your chromosomes – on a forward path. Your two chromosome 01s have never touched each other chemically. They were independently combined in your mother and your father, and passed on to you. So the values (forward A, C, G, T) from one parent would be randomly different from the values in the same forward read on the chromosome from the other parent. This is where we get the 10 combinations: AA, AC, AG, AT, CC, CG, CT, GG, GT and TT. In this case GT is the same as TG, because we don’t know which parent’s chromosome is which. All those A, C, G, T forward-end-of-rung SNPs from both parents are floating around in the chemistry of reading the SNPs. They’ve been multiplied many times over. So they actually look at the percent of each one. If for one SNP you had 99% C and the other 1% was one value or a mix, we’d say that SNP result was CC for you – you got a C from each parent. If another SNP was 48% G and 49% T, we’d say that SNP result was GT for you (we don’t know which parent gave you the G and which provided the T). So in this part of the discussion we are looking at two values for each SNP, but we shouldn’t call them “pairs”.
      Hope this clears it up. Jim
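The percentage idea in the reply above can be sketched in Python. This follows only the simplified description given in the comment; the 15% cutoff, the `"--"` no-call placeholder, and the function name `call_genotype` are all assumptions made for illustration and do not describe any lab's actual method.

```python
def call_genotype(fractions: dict[str, float], min_fraction: float = 0.15) -> str:
    """Turn the observed share of each base at one SNP into a two-allele result.

    `fractions` maps a base (A, C, G or T) to its share of the reads.
    `min_fraction` is a made-up cutoff for treating a base as really present.
    """
    present = sorted(base for base, share in fractions.items() if share >= min_fraction)
    if len(present) == 1:
        # Only one base seen in bulk: the same value came from both parents.
        return present[0] * 2
    if len(present) == 2:
        # Two bases seen in bulk: one from each parent, listed alphabetically.
        return present[0] + present[1]
    # Nothing clear, or more than two bases: no result for this SNP.
    return "--"


if __name__ == "__main__":
    print(call_genotype({"C": 0.99, "T": 0.01}))
    print(call_genotype({"G": 0.48, "T": 0.49, "A": 0.03}))
```

The two calls mirror the two examples in the comment: 99% C is reported as CC, and 48% G with 49% T is reported as GT.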


      • Jim, you said “your two chromosome 01s have never touched each other chemically”. I thought this is what happens with recombination – a segment “jumps” or switches from one chromosome to the other. Can you clear this up for me?


      • Sure! The chromosomes in 99.999% of your cells are separate – think of them as worms in a garden – they don’t interact. In the very unique and specific case of a sperm or egg, they do touch and recombine. But what we test for genealogy is saliva – the chromosomes don’t interact. A SNP on a maternal chromosome and a SNP on the exact same location on a paternal chromosome are separate and independent.

        Jim Bartlett – atDNA blog: http://www.segmentology.org



    • Quickly, there are two chromosomes of each of the autosomes (1-22) and two X chromosomes in biological females. One value from each chromosome is being read. Hence the mixing of base pair names in the pair reported. You get each chromosome of the same type from a different parent.


      • Randy,
        I think we should refrain from calling the two reported SNPs at the same location a “pair”. There are two values – but they are completely independent values. They show up as two values in our raw data, but they are not the same as the base pairs in one chromosome. Jim


      • Good point. I usually use the term “unordered set”. You can delete the whole stream from me. Your reply had not come through when I replied.


  7. This discussion is helpful, both Jim’s explanation and the questions/answers. The terminology has been a stumbling block for me too. Another basic idea I have never understood is this: if all these ACGTs are floating around in a soup during the analysis, how do they know what sequence they are in on the chromosome? How do they know which is the “front end”? Maybe this is not for us to understand; I just need to accept that technically they know how to do this.


    • This was what sequencing the genome was all about. Each SNP has a unique address – both in terms of how many base pairs from the first one on a chromosome; as well as knowing the unique base pairs that lead up to the SNP of interest. It’s all done through chemistry – targeting each SNP and then reading it. And getting many reads to make sure one isn’t a fluke.


  8. Jim,
    Thank you thank you, this soo nailed it for me. I am very grateful for you taking the time to explain. This really removed a lot of confusion for me. I would love if you could send me that link to the MIT course. Again, thanks so much. Barb


    • I’ve decided to post the link here: https://www.edx.org/course/introduction-to-biology-the-secret-of-life-1 This is a full semester class from MIT and it is FREE – the Professor is world renowned, and a great teacher. There is a wealth of information, labs, etc., and you can log in 24/7 and work at your own pace – each hour lecture is broken into about 10 minute segments with a one question quiz for you to see if you are grasping each concept. At one point my wife asked me what I was doing up at 4am – I told her I was folding proteins in a lab project! What a hoot. The first part is the history of biology, starting with the Big Bang and working through the elements and then cells and organisms, etc. In case you can’t tell, I really enjoyed this class.

