Confusion about Base Pairs

A Segment-ology TIDBIT

Let’s sort this out. A Chromosome is a long string of DNA – which has the form of the famous double helix. If we flattened out the double helix it would look like a ladder, with two sides connected by lots of rungs. On each end of every rung is a molecule we call a base – called A, C, G or T for short. The two ends of each rung are always paired, with A on one end and T on the other end, or C on one end and G on the other. That’s because in chemistry, the A molecule bonds much more readily with a T; and a C bonds easily with a G. They form what is called a base pair. And if you know one end of each rung, you know the other end. 23 chromosomes make up a genome, and a genome has about 3 billion of these base pairs*.
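To make "if you know one end of each rung, you know the other end" concrete, here is a minimal Python sketch. The names PAIR and other_side are made up for this illustration; the function simply looks up the partner of each base, rung by rung.

```python
# Pairing rule from the paragraph above: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def other_side(strand: str) -> str:
    """Return the base across from each base, rung by rung (hypothetical helper)."""
    return "".join(PAIR[base] for base in strand.upper())


print(other_side("GATTACA"))  # CTAATGT
```

Applying the function twice returns the original string, which is another way of saying that one side of the ladder fully determines the other.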

As we look at one side of the chromosome “ladder” we see one of these molecules at every rung. Important: There is no hard and fast rule about the order of the ACGTs along one side of the ladder.

In our bodies we have two genomes – one set of chromosomes from the father and one set from the mother.

For atDNA testing a laboratory looks at, say, 600,000 specific base pairs called SNPs (pronounced snips). Each of these SNPs is at a specific location on a chromosome, and the lab looks at one side (the “forward” side) and determines if it is an A, C, G, or T. Because we have two of each chromosome, they actually get two values (called alleles), one from the paternal chromosome, and one from the maternal chromosome. Because all these SNPs are floating around in a soup, we don’t know which one is from Mom and which one is from Dad. One convention is to list them alphabetically, resulting in ten possibilities: AA, AC, AG, AT, CC, CG, CT, GG, GT, and TT. You can see that in these “pairs” the A is not necessarily paired with T. That’s because the DNA from each parent came to you from very different, and usually very distant, paths – they don’t touch or interact with each other. And the SNP base pairs were chosen for an atDNA test because they offer variability.
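The count of ten can be checked with a few lines of Python. This is only a sketch of the alphabetical-listing convention described above; the names BASES, genotypes and as_reported are invented for the example and do not come from any testing company's file format.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"

# Every unordered pair of alleles, written alphabetically.
genotypes = ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]
print(genotypes)       # ['AA', 'AC', 'AG', 'AT', 'CC', 'CG', 'CT', 'GG', 'GT', 'TT']
print(len(genotypes))  # 10


def as_reported(allele_1: str, allele_2: str) -> str:
    """List two alleles alphabetically, since we can't tell Mom's from Dad's."""
    return "".join(sorted((allele_1, allele_2)))


print(as_reported("T", "C"))  # CT
```

Because the order of the two alleles carries no information, "TC" and "CT" are the same result, which is why there are ten possibilities rather than sixteen.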

A shared DNA segment between you and a Match consists of a long string of SNPs (usually 1,000 or more) where you have at least one of your two alleles match at least one of your Match’s two alleles. The longer the shared segment, the greater the probability that it had to come from a Common Ancestor.
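The "at least one of your two alleles matches at least one of theirs" rule can be sketched in Python as follows. This is a toy illustration under simplifying assumptions, not the algorithm any particular company uses: it ignores chromosome positions, cM lengths and tolerance for testing errors, and the six SNP values are invented for the example.

```python
def half_identical(mine: str, theirs: str) -> bool:
    """True if at least one of my two alleles equals at least one of theirs."""
    return bool(set(mine) & set(theirs))


def longest_matching_run(my_snps, their_snps) -> int:
    """Length of the longest run of consecutive SNPs that are half-identical."""
    best = current = 0
    for mine, theirs in zip(my_snps, their_snps):
        current = current + 1 if half_identical(mine, theirs) else 0
        best = max(best, current)
    return best


# Made-up allele pairs at six consecutive SNPs (illustration only).
me = ["AG", "CC", "AT", "GG", "CT", "AA"]
match = ["GG", "CT", "TT", "AA", "CC", "AG"]

print(longest_matching_run(me, match))  # 3
```

The run breaks at the fourth SNP, where GG and AA have no allele in common. A real shared segment is the same idea stretched over a thousand or more SNPs.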

BOTTOM LINE – As genetic genealogists we are not concerned with the “base pairs” on each end of a rung, we are very much interested in the two SNP alleles we got from our two parents – not called “base pairs.”

*[An experiment you can do at home: at GEDmatch compare your kit to your kit in the one-to-one utility – you’ll match on 22 chromosomes, from start to finish. Add up the “End Locations” and see how close to 3 billion you come – add in about 155 million for Chr X to get a full genome].
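If you would rather let a computer do the adding, the arithmetic is just a sum. In this sketch the function name estimate_genome_bp is made up, and the list you pass in is whatever "End Location" values you copy from your own one-to-one report; the 155 million default for Chr X is the rough figure quoted above.

```python
CHR_X_BP = 155_000_000  # rough Chr X length used in the footnote above


def estimate_genome_bp(end_locations, chr_x_bp=CHR_X_BP):
    """Add the End Locations for Chr 1-22, then add Chr X (hypothetical helper)."""
    return sum(end_locations) + chr_x_bp
```

Call it with the 22 End Location values from your own report, for example estimate_genome_bp(my_end_locations), and compare the result to 3 billion.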


[22U] Segment-ology: Confusion about Base Pairs TIDBIT by Jim Bartlett 20180502

18 thoughts on “Confusion about Base Pairs”

  1. I try to explain meticulous genealogy, and my two relatives want to hear a scientific explanation and a study to explain why it’s not just a money-making scam. Can you please point me to something I can hand them? The genetic MD has seen the printouts and laughs. The engineer just asked for a second test to compare results. I don’t debate well.

    Thanks, Margery Webb


    • Margery, I, too, am an engineer – registered in Virginia for almost 40 years. I think that’s why I like genetic genealogy so much – a confluence of our genealogy puzzle (filling in all the ancestral blocks) and the genetic puzzle (linking specific pieces of our DNA back to specific ancestors – aka chromosome mapping). To me this is the ultimate puzzle: where did my building blocks come from? I try to keep it on a plain English level, but I have had to study a lot of biology – including a free online semester from MIT: DNA the Secret of Life. About the best single resource for us is – particularly the autosomal tab – with pages with many references and links. You can go about as deep as you want. But I always come back to the fundamental tenet: if two people share enough matching DNA, the only way that could happen is if they got it from a Common Ancestor.



  2. Hi Jim
    Thank you for the explicit information in this blog. However it is my understanding that although we have 46 chromosomes per human cell this constitutes 1 genome per cell not two. Am I misunderstanding the description? Thank you for your help
    Sue L


    • Susan, I’ve seen “genome” defined both ways and used both ways – many times over. Maybe I should have used different wording so as not to distract from the important message. Jim


      • Jim, That is good to know. I always want to know what is out there and I am more isolated than you. Now I’ll know what people mean when discussing a single individual’s genomes. Thank you for taking the time to clarify the situation for me.
        Sue L


    • Doris, the rs numbers represent the unique address for each SNP. Some companies use a different mix of SNPs. It’s important to identify exactly which ones were tested, so they are always compared correctly with other tests. I think the rs numbers refer to some lookup table that tells a scientist how to “find” each SNP. Think of the rs numbers as the number in your street address. Jim


      • Thank you, Jim. That’s exactly what I wanted to know. I was confused by the 3 billion vs. 600,000 base pairs and didn’t equate that to SNPs. Now I understand.


  3. Thanks for clarifying this, Jim. However, we are concerned with base pairs in one sense, aren’t we? The start and stop numbers for matching segments are in units of base pairs, right? Another kind of address.


  4. A great explanation to distinguish base-pairs from the pair of alleles one sees in RAW data files. Just a few quick comments (FYI, I am an engineer also and not a biologist so forgive as I am learning as well):
    (a) You need to include Y and its roughly 58 million base pairs.
    (b) While most/all report 3.2 billion base pairs in the genome, that is only the haploid count (that is, counting half the chromosomes that exist in each cell). We actually have roughly 6.4 billion base pairs in the nuclear DNA (the diploid chromosomes). Males have a bit less. This may be the source of confusion when seeing the genome described both ways.
    (c) The GEDmatch tool is reporting on centimorgans (cM). While many like to equate 1 centimorgan to 1 million base pairs, this is not really true. (Note: deserving of its own post to simplify and disambiguate!) Overall, it comes out about the same, but not when looking at individual sections of DNA. I have seen 10 centimorgan match strands that have over 25 million base pairs, and the reverse. Some tools report a match between parent and child as ~3,400 cM, but most are closer to ~3,600 cM, the latter sometimes when including X matching in the report. GEDmatch reports ~3,600 cM for autosomes (1-22) and another ~200 cM for X. Almost always, the cM total count is higher than the base pair count (after multiplying cM by 1 million).
    (d) The GEDmatch chart is really only showing the half-identical match process. It indicates, in the graphic form, full-identical match areas. This is not really the same as Haploid and Diploid but is comparable. It is not the same because the matching segments are not created using just the mother’s or just the father’s contributed chromosomes. The tool is aggressive and tries to make as long a matching segment as possible from either allele in each SNP. So just to reiterate, half-identical matching in genetic genealogy is not Haploid chromosome match reporting. It is as close as we can generally get, but not the same. This is mostly an issue when comparing full siblings and others who have full-identical regions. Half-identical and full-identical are terms created by genetic genealogy to help simplify the awkward test results and process.
    (e) You hear 3.2 billion often but the number is not quite 3.1 billion by current counts. Or rounds to 3.3 billion depending on what you count in HG38. The answer depends on how you count bases — wait, what?. See But I like to stick with 3.2 billion base pairs for the Haploid Genome and 3,600 cM for the autosomes “full” match with a parent as the rough numbers to use.

    Hopefully, this did not muddle it up even more. Biology is messy and computer algorithms try to make predictive sense of it. The testing tools (lab process) add their own mess of probabilities into the process.


    • I appreciate all your input. I already know what you posted, but I’m trying to keep this blog to the essential things genealogists will find useful. My main point was to clear up the confusion about base pairs (I had the same confusion when I started, and it’s easy to get off track with that term). Some of my earliest blog posts talked about fuzzy data and the fact that there is no conversion factor among bp, SNP and cM. The total number of base pairs in our DNA is not much help to genealogists, and the number has changed every year since the full genome was first mapped. I was just trying to let folks look at their DNA and see that it roughly meshes with the big picture data they hear about. Your struggle to state the number of cMs in half our DNA illustrates the point – the total isn’t too important from a genealogy point of view. As I’ve posted before, our Shared Segments and Triangulated Groups are fairly large “targets” that link to our Ancestors. The fuzziness is almost never critical to our genealogy objectives.


  5. Oh, I meant to include this in the clarification above also (I’m not able to edit it). The “end locations” just happen to be where the last measured SNP is, not the end of the chromosome in a base-pair count. Ditto for the start location, which is never shown as zero. The actual starts and ends of the chromosomes vary and decrease with the age of the cell / human body. So while your “end locations” parlor trick comes somewhat close, it is not an accurate representation.


    • Randy, Acknowledged – different companies measure shared segments different ways – you will almost never see exactly the same segment (from the same data) at different companies. In some cases the shared segments run long, sometimes they are shortened. It makes almost no difference in a genealogy context. Accept that the shared segment ends are a little fuzzy, and focus on the bulk of the shared segments and who Triangulates. I would ask if you know of a way to get a more “accurate representation” of segment ends, but it doesn’t make any difference. With all the fuzzy data we have now, the Triangulated Groups are going to overlap a little. The method (“parlor trick”) I came up with is a big help when sorting spreadsheets so that all the shared segments fall into a Triangulated Group. Jim

