DeepVariant training data

WGS models

version Replicates #examples
v0.4 9 HG001 85,323,867
v0.5 9 HG001
2 HG005
78 HG001 WES
1 HG005 WES(1)
115,975,740
v0.6 10 HG001 PCR-free
2 HG005 PCR-free
4 HG001 PCR+
156,571,227
v0.7 10 HG001 PCR-free
2 HG005 PCR-free
4 HG001 PCR+
158,571,078
v0.8 12 HG001 PCR-free
2 HG005 PCR-free
4 HG001 PCR+
(and, more dowsample_fraction since last version)
346,505,686
v0.9 10 HG001 PCR-free
2 HG005 PCR-free
2 HG006 PCR-free
2 HG007 PCR-free
5 HG001 PCR+
325,202,093
v0.10 10 HG001 PCR-free
2 HG005 PCR-free
2 HG006 PCR-free
2 HG007 PCR-free
5 HG001 PCR+
339,410,078
v1.0 11 HG001
2 HG005-HG007
2 HG002-HG004(7)
317,486,837
v1.1 12 HG001
3 HG002
3 HG004
3 HG005
3 HG006
3 HG007(9)
388,337,190
v1.2 12 HG001
6 HG002(12)
6 HG004(12)
3 HG005
3 HG006
3 HG007
518,709,296
v1.3 Same model as v1.2  
v1.4 12 HG001
6 HG002(12)
6 HG004(12)
3 HG005
3 HG006
3 HG007
517,209,566
v1.5 13 HG001
14 HG002
8 HG004
9 HG005
4 HG006
4 HG007
815,200,320

WES models

version Replicates #examples
v0.5 78 HG001
1 HG005
15,714,062
v0.6 78 HG001
1 HG005(2)
15,705,449
v0.7 78 HG001
1 HG005
15,704,197
v0.8 78 HG001
1 HG005(3)
18,683,247
v0.9 81 HG001
1 HG005(3)(4)(5)
61,953,965
v0.10 Same model as v0.9  
v1.0 32 HG001
9 HG002
6 HG003
6 HG004
12 HG005
9 HG006
9 HG007(7)
10,716,281
v1.1 41 HG001
9 HG002
6 HG004
12 HG005
9 HG006
9 HG007(9)
13,450,688
v1.2 41 HG001
9 HG002
9 HG004
12 HG005
9 HG006
9 HG007(11)
22,288,064
v1.3 Same model as v1.2  
v1.4 41 HG001
9 HG002
9 HG004
12 HG005
9 HG006
9 HG007(11)
21,212,424
v1.5 40 HG001
9 HG002
9 HG004
12 HG005
9 HG006
9 HG007
21,027,625

PACBIO models

version Replicates #examples
v0.8 16 HG002 160,025,931
v0.9 49 HG002 (6) 357,507,235
v0.10 49 HG002, 2 HG003, 2 HG004, 1 HG002 (amplified) (6) 472,711,858
v1.0 1 HG001
2 HG002
2 HG003
2 HG004
1 HG005 (8)
302,331,948
v1.1 1 HG001
9 HG002
2 HG004
1 HG005(9)
569,225,616
v1.2 1 HG001
19 HG002
2 HG004
1 HG005(10)
1,036,056,726
v1.3 1 HG001
19 HG002
3 HG004
1 HG005
1 HG006
1 HG007
1,177,109,190
v1.4 1 HG001
19 HG002
3 HG004
1 HG005
1 HG006
1 HG007
1,177,596,708
v1.5 3 HG001
29 HG002
7 HG004
2 HG005
3 HG006
2 HG007
1,729,659,396

HYBRID models

version Replicates #examples
v1.0 10 HG002
1 HG004
1 HG005
1 HG006
1 HG007
193,076,623
v1.1 Same model as v1.0  
v1.2 10 HG002
1 HG004
1 HG005
1 HG006
1 HG007
214,302,681
v1.3 Same model as v1.2  
v1.4 10 HG002
1 HG004
1 HG005
1 HG006
1 HG007
215,863,645
v1.5 10 HG002
1 HG004
1 HG005
1 HG006
1 HG007
215,863,664

(1): In v0.5, we experimented with adding whole exome sequencing data into training data. In v0.6, we took it out because it didn’t improve the WGS accuracy.

(2): The training data are from the same replicates as v0.5. The number of examples changed because of the update in haplotype_labeler.

(3): In v0.8, we used the Platinum Genomes Truthset to create more training examples outside the GIAB confident regions.

(4): Previously, we split train/tune by leaving 3 WES for tuning. Starting from this release, we leave out chr1 and chr20 from training, and use chr1 for tuning.

(5): Starting from this version, we padded (100bps on both sides) of the capture BED and used that for generating training examples. We also added more downsample_fraction.

(6): (Before v1.0) PacBio is the only one we currently uses HG002 in training and tuning.

(7): In v1.0, we train on HG002-HG004 for WGS as well, but only using examples from the region of NIST truth confident region v4.2 subtracting v3.3.2.

(8): In v1.0, PacBio training data contains training examples with haplotag sorted images and unsorted images.

(9): In v1.1, we exclude HG003 from training. And we use all NIST truth confident regions for HG001-HG007 (except for HG003) for training. We’ve always excluded chr20-22 from training.

(10): In v1.2, we include new PacBio training data from Sequel II, Chemistry 2.2.

(11): Between v1.1 and v1.2, we fixed an issue where make_examples can generate fewer class 0 (REF) training examples than before. This is the reason for more training examples in v1.2 when number of samples didn’t increase.

(12): In v1.2, we created BAM files with 100bp reads and 125bp reads by trimming to augment the training data.

Training data:

See “An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development” for a publicly available set of data we released. Data download information can be found in the supplementary material.