How to make sense of 23andme raw data

The title of this post is not quite honest. This guide will not teach you how to make sense of your 23andme raw data. What it will do is supply you with the tools you need to do so yourself.

This guide is written for Linux noobs, a level only slightly below my own. The approach works for me in a hobby setting, so I would welcome tips from seasoned bioinformaticians.

What this guide will not do is explain genetics or how to understand genetic variants. It is assumed that the reader is familiar with genetics and SNP data.

To follow these steps, you will need to be able to navigate a Linux system from the command line (it is certainly possible to do this on Windows with Linux emulation, but that is probably even messier).

For simplicity, this guide runs all programs on files located in the same folder.

The main steps are:

1. Download your raw data
2. Use 23andme2vcf to generate a .vcf file
3. Organize your data for Annovar input
4. Use Annovar to annotate the .vcf file and make a .csv file
5. Use Excel or a different analysis tool to navigate your data

I will go through these steps and supply the commands needed for each.

Disclaimer: This guide is for educational and research purposes only. 23andme tests about 600-900k genetic variants. Some of these have known clinical associations, and others will be found to have clinical associations in the future. Consider this before looking at your raw data with these steps. You might find associations that are more confusing (!) than the report generated automatically by Promethease. If you are new to this, or to personal genetics, I strongly suggest trying out a Promethease report first.

You can also test this on the Mendel family raw data.


1. Download your raw data from 23andme

Give it a name – for example mendel.txt.

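It is a plain text file: a few comment lines starting with #, followed by one tab-separated row per variant with the columns rsid, chromosome, position and genotype.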

Note that the positions refer to human genome build 37 (GRCh37, which Annovar calls hg19). This is not the newest build, but it is still the “standard” reference for most tools. Always check that you are looking at build 37 when looking up variants or positions from 23andme online.

This text file is about 15MB (5MB compressed) and contains all the information on your SNPs, including rs-IDs for the variants. Individual rsids can be looked up in dbSNP.
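
Before converting anything, it can help to check that the file reads the way you expect. Below is a minimal Python sketch, assuming the usual layout of '#' comment lines followed by tab-separated rsid, chromosome, position and genotype columns. The Call type, the read_calls function and the sample rows are made up for illustration; they are not part of any 23andme tool.

```python
import io
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Call(NamedTuple):
    """One row of the raw data file (hypothetical helper type)."""
    rsid: str
    chrom: str
    pos: int
    genotype: str


def read_calls(lines: Iterable[str]) -> Iterator[Call]:
    """Yield one Call per data row, skipping '#' comment lines and blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.split("\t")[:4]
        yield Call(rsid, chrom, int(pos), genotype)


# Made-up rows for illustration only; this is not real genotype data.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\tX\t3000\t--\n"
)

calls = list(read_calls(io.StringIO(SAMPLE)))
per_chrom = Counter(call.chrom for call in calls)
print(len(calls), "rows;", dict(per_chrom))

# For a real file, pass an open file handle instead of the sample:
#   with open("mendel.txt") as handle:
#       calls = list(read_calls(handle))
```

Counting rows per chromosome is just a quick sanity check; any other summary would do as well.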


2. Use 23andme2vcf to generate a .vcf file

Download 23andme2vcf.

perl 23andme2vcf.pl mendel.txt mendel.vcf 4

You have now generated the file mendel.vcf.

Note: Indels are not supported by 23andme2vcf, so these are lost in the conversion. The resulting file contains only single nucleotide variants.
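
To get a feel for how much is lost, you can count the genotype types in the raw file. This is a rough Python sketch, assuming that indels are written with the letters I and D (for example II or DI) and no-calls as dashes in the genotype column. The classify and summarize functions and the sample rows are invented for illustration.

```python
import io
from collections import Counter


def classify(genotype: str) -> str:
    """Sort a raw genotype string into a rough category."""
    letters = set(genotype)
    if not letters:
        return "other"
    if letters <= {"-"}:
        return "no-call"
    if letters & {"I", "D"}:  # assumed indel encoding
        return "indel"
    if letters <= set("ACGT"):
        return "snv"
    return "other"


def summarize(lines) -> Counter:
    """Count genotype categories over all data rows of a raw data file."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows
        counts[classify(fields[3])] += 1
    return counts


# Made-up rows for illustration only.
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tAG\n"
    "rs0000003\t2\t3000\tDI\n"
    "rs0000004\t2\t4000\tII\n"
    "rs0000005\t3\t5000\t--\n"
    "rs0000006\tX\t6000\tT\n"
)

counts = summarize(io.StringIO(SAMPLE))
print(dict(counts))
```

On a real file, pass an open handle to mendel.txt instead of the sample. The "indel" count is then a rough estimate of what will not make it into the .vcf file.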


3. Organize your data for Annovar input

Download Annovar, which is free for personal, academic and non-profit use.

I used mendel.vcf to generate an input file for Annovar called mendel.avinput.

The input should be organized in five columns as:
| CHR | START | END | REF | ALT |
(chromosome number, start position, end position, reference allele and alternate (variant) allele).

Reference nucleotides can be entered as 0.

You can use LibreOffice to do this. Remember to save the file as tab-separated text with the .avinput extension!
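
If you would rather script this step than use a spreadsheet, the five-column layout can be sketched in Python as below. This is only an illustration under my own assumptions: the vcf_to_avinput function is hypothetical, it emits one row per alternate allele, and it skips rows whose ALT field is "." (no variant allele). To my knowledge Annovar also ships its own converter, convert2annovar.pl, which is probably the safer choice.

```python
import io


def vcf_to_avinput(lines):
    """Yield tab-separated CHR, START, END, REF, ALT rows from VCF lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\r\n").split("\t")
        chrom, pos, _rsid, ref, alt = fields[:5]
        if alt == ".":
            continue  # no alternate allele, nothing to annotate
        start = int(pos)
        end = start + len(ref) - 1  # equal to start for single nucleotides
        for allele in alt.split(","):
            yield "\t".join([chrom, str(start), str(end), ref, allele])


# Made-up VCF rows for illustration only.
SAMPLE = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tGENOTYPE\n"
    "1\t1000\trs0000001\tA\tG\t.\t.\t.\tGT\t0/1\n"
    "1\t2000\trs0000002\tC\tA,T\t.\t.\t.\tGT\t1/2\n"
    "1\t3000\trs0000003\tG\t.\t.\t.\t.\tGT\t0/0\n"
)

avinput_rows = list(vcf_to_avinput(io.StringIO(SAMPLE)))
for row in avinput_rows:
    print(row)

# For real data, read mendel.vcf and write the rows to mendel.avinput:
#   with open("mendel.vcf") as src, open("mendel.avinput", "w") as dst:
#       for row in vcf_to_avinput(src):
#           dst.write(row + "\n")
```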


4. Use Annovar to annotate the .vcf file and make a .csv file

This is a very simple way to annotate, based on the Annovar guide, which gives much better advice on how to do this well. You may add or remove annotation sources as you like; I find that this approach works fine for my purposes:

perl table_annovar.pl mendel.avinput humandb/ -buildver hg19 -out mendel_annotated -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500si_all,1000g2012apr_all,snp138,ljb23_all -operation g,r,r,f,f,f,f -nastring . -csvout

You should now have the file mendel_annotated.csv, an annotated version of most of your 23andme data.

Congratulations, you now have a file with information about your genetic variants (well, a couple of them!).

It should look something like this (some identifying information is covered by white-out in this image. Also, some rows are colored in grey because I found them uninteresting).


5. Use Excel or a different analysis tool to navigate your data

There exist many sophisticated programs for filtering these files to clear away the findings that don’t actually mean much. For most practical hobby purposes, however, a spreadsheet works very well.

Any spreadsheet program with a filter function should work. I would recommend Excel, simply because LibreOffice is really clunky to filter with.

For example, you can choose to look at only exonic variants, or only stop gains. You can also sort these by “pathogenicity”, that is, the predicted severity of the variant’s effect on the encoded gene product (for example by sorting on PolyPhen2 or SIFT scores).
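
The same filters can be scripted. Below is a small Python sketch using the standard csv module. I am assuming the annotated file has columns named Func.refGene, Gene.refGene and ExonicFunc.refGene, so check the header of your own file. The gene names, the example_score column and all rows are made up for illustration.

```python
import csv
import io

FUNC_COL = "Func.refGene"                # assumed column name
EXONIC_FUNC_COL = "ExonicFunc.refGene"   # assumed column name


def is_exonic(row) -> bool:
    """True if 'exonic' is one of the ';'-separated function labels."""
    return "exonic" in (row.get(FUNC_COL) or "").split(";")


def is_stopgain(row) -> bool:
    """True for stop gains (matches 'stopgain' and 'stopgain SNV')."""
    return (row.get(EXONIC_FUNC_COL) or "").startswith("stopgain")


def score(row, column):
    """Return the cell as a float, or None for '.' and other non-numbers."""
    try:
        return float(row.get(column, ""))
    except (TypeError, ValueError):
        return None


# Made-up annotation rows; 'example_score' stands in for a real score column.
SAMPLE = """Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,ExonicFunc.refGene,example_score
1,1000,1000,A,G,exonic,GENE_A,nonsynonymous SNV,0.40
1,2000,2000,C,T,intronic,GENE_B,.,.
2,3000,3000,G,A,exonic,GENE_C,stopgain SNV,.
3,4000,4000,T,C,ncRNA_exonic,GENE_D,.,.
4,5000,5000,G,C,exonic,GENE_E,nonsynonymous SNV,0.02
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
exonic = [r for r in rows if is_exonic(r)]
stopgains = [r for r in rows if is_stopgain(r)]

# Sort exonic variants that have a numeric score, lowest score first.
scored = sorted(
    (r for r in exonic if score(r, "example_score") is not None),
    key=lambda r: score(r, "example_score"),
)

print([r["Gene.refGene"] for r in exonic])
print([r["Gene.refGene"] for r in stopgains])
print([r["Gene.refGene"] for r in scored])

# For real data: rows = list(csv.DictReader(open("mendel_annotated.csv")))
```

Keep in mind that the direction of the sort depends on the score: for SIFT, lower values are predicted to be more damaging, while for PolyPhen2 it is the higher values.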

Note: I have not figured out how to carry genotype information (homozygosity/heterozygosity) over to the .csv file. The Annovar documentation indicates that all columns after the first five are kept as is, but I have not found this to be dependable. To check a genotype, either use 23andme.com or look in the .vcf file (for example, by searching for the position or rsid).
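
Looking up a genotype in the .vcf file can also be scripted. Here is a minimal Python sketch, assuming a single-sample VCF with a FORMAT column containing GT followed by one sample column. The find_genotypes and zygosity functions and the sample rows are hypothetical and only meant to show the idea.

```python
import io


def zygosity(gt: str) -> str:
    """Describe a GT value such as 0/1, 1/1, 0|0 or a single allele."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles or "" in alleles:
        return "unknown"
    if len(alleles) == 1:
        return "hemizygous"  # one allele only, e.g. X in males
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


def find_genotypes(lines, rsid=None, chrom=None, pos=None):
    """Return (chrom, pos, rsid, ref, alt, GT, zygosity) for matching rows."""
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\r\n").split("\t")
        if len(f) < 10:
            continue  # no FORMAT/sample columns on this row
        by_id = rsid is not None and f[2] == rsid
        by_pos = (chrom is not None and pos is not None
                  and f[0] == chrom and int(f[1]) == pos)
        if not (by_id or by_pos):
            continue
        sample = dict(zip(f[8].split(":"), f[9].split(":")))
        gt = sample.get("GT", ".")
        hits.append((f[0], int(f[1]), f[2], f[3], f[4], gt, zygosity(gt)))
    return hits


# Made-up VCF rows for illustration only.
SAMPLE = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tGENOTYPE\n"
    "1\t1000\trs0000001\tA\tG\t.\t.\t.\tGT\t0/1\n"
    "1\t2000\trs0000002\tC\tT\t.\t.\t.\tGT\t1/1\n"
)

by_rsid = find_genotypes(io.StringIO(SAMPLE), rsid="rs0000001")
by_position = find_genotypes(io.StringIO(SAMPLE), chrom="1", pos=2000)
print(by_rsid)
print(by_position)
```

With real data, open mendel.vcf and pass the file handle instead of the sample text.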
