DeepVariant training data
WGS models
version | Replicates | #examples |
---|---|---|
v0.4 | 9 HG001 | 85,323,867 |
v0.5 | 9 HG001, 2 HG005, 78 HG001 WES, 1 HG005 WES (1) | 115,975,740 |
v0.6 | 10 HG001 PCR-free, 2 HG005 PCR-free, 4 HG001 PCR+ | 156,571,227 |
v0.7 | 10 HG001 PCR-free, 2 HG005 PCR-free, 4 HG001 PCR+ | 158,571,078 |
v0.8 | 12 HG001 PCR-free, 2 HG005 PCR-free, 4 HG001 PCR+ (and more downsample_fraction values since the last version) | 346,505,686 |
v0.9 | 10 HG001 PCR-free, 2 HG005 PCR-free, 2 HG006 PCR-free, 2 HG007 PCR-free, 5 HG001 PCR+ | 325,202,093 |
v0.10 | 10 HG001 PCR-free, 2 HG005 PCR-free, 2 HG006 PCR-free, 2 HG007 PCR-free, 5 HG001 PCR+ | 339,410,078 |
v1.0 | 11 HG001, 2 HG005-HG007, 2 HG002-HG004 (7) | 317,486,837 |
v1.1 | 12 HG001, 3 HG002, 3 HG004, 3 HG005, 3 HG006, 3 HG007 (9) | 388,337,190 |
v1.2 | 12 HG001, 6 HG002 (12), 6 HG004 (12), 3 HG005, 3 HG006, 3 HG007 | 518,709,296 |
v1.3 | Same model as v1.2 | |
v1.4 | 12 HG001, 6 HG002 (12), 6 HG004 (12), 3 HG005, 3 HG006, 3 HG007 | 517,209,566 |
v1.5 | 13 HG001, 14 HG002, 8 HG004, 9 HG005, 4 HG006, 4 HG007 | 815,200,320 |
WES models
version | Replicates | #examples |
---|---|---|
v0.5 | 78 HG001, 1 HG005 | 15,714,062 |
v0.6 | 78 HG001, 1 HG005 (2) | 15,705,449 |
v0.7 | 78 HG001, 1 HG005 | 15,704,197 |
v0.8 | 78 HG001, 1 HG005 (3) | 18,683,247 |
v0.9 | 81 HG001, 1 HG005 (3)(4)(5) | 61,953,965 |
v0.10 | Same model as v0.9 | |
v1.0 | 32 HG001, 9 HG002, 6 HG003, 6 HG004, 12 HG005, 9 HG006, 9 HG007 (7) | 10,716,281 |
v1.1 | 41 HG001, 9 HG002, 6 HG004, 12 HG005, 9 HG006, 9 HG007 (9) | 13,450,688 |
v1.2 | 41 HG001, 9 HG002, 9 HG004, 12 HG005, 9 HG006, 9 HG007 (11) | 22,288,064 |
v1.3 | Same model as v1.2 | |
v1.4 | 41 HG001, 9 HG002, 9 HG004, 12 HG005, 9 HG006, 9 HG007 (11) | 21,212,424 |
v1.5 | 40 HG001, 9 HG002, 9 HG004, 12 HG005, 9 HG006, 9 HG007 | 21,027,625 |
PACBIO models
version | Replicates | #examples |
---|---|---|
v0.8 | 16 HG002 | 160,025,931 |
v0.9 | 49 HG002 (6) | 357,507,235 |
v0.10 | 49 HG002, 2 HG003, 2 HG004, 1 HG002 (amplified) (6) | 472,711,858 |
v1.0 | 1 HG001, 2 HG002, 2 HG003, 2 HG004, 1 HG005 (8) | 302,331,948 |
v1.1 | 1 HG001, 9 HG002, 2 HG004, 1 HG005 (9) | 569,225,616 |
v1.2 | 1 HG001, 19 HG002, 2 HG004, 1 HG005 (10) | 1,036,056,726 |
v1.3 | 1 HG001, 19 HG002, 3 HG004, 1 HG005, 1 HG006, 1 HG007 | 1,177,109,190 |
v1.4 | 1 HG001, 19 HG002, 3 HG004, 1 HG005, 1 HG006, 1 HG007 | 1,177,596,708 |
v1.5 | 3 HG001, 29 HG002, 7 HG004, 2 HG005, 3 HG006, 2 HG007 | 1,729,659,396 |
HYBRID models
version | Replicates | #examples |
---|---|---|
v1.0 | 10 HG002, 1 HG004, 1 HG005, 1 HG006, 1 HG007 | 193,076,623 |
v1.1 | Same model as v1.0 | |
v1.2 | 10 HG002, 1 HG004, 1 HG005, 1 HG006, 1 HG007 | 214,302,681 |
v1.3 | Same model as v1.2 | |
v1.4 | 10 HG002, 1 HG004, 1 HG005, 1 HG006, 1 HG007 | 215,863,645 |
v1.5 | 10 HG002, 1 HG004, 1 HG005, 1 HG006, 1 HG007 | 215,863,664 |
(1): In v0.5, we experimented with adding whole exome sequencing data into training data. In v0.6, we took it out because it didn’t improve the WGS accuracy.
(2): The training data are from the same replicates as v0.5. The number of examples changed because of the update in haplotype_labeler.
(3): In v0.8, we used the Platinum Genomes Truthset to create more training examples outside the GIAB confident regions.
(4): Previously, we split train/tune by leaving 3 WES for tuning. Starting from this release, we leave out chr1 and chr20 from training, and use chr1 for tuning.
(5): Starting from this version, we padded the capture BED (100bp on both sides) and used the padded BED for generating training examples. We also added more downsample_fraction values.
(6): (Before v1.0) PacBio is the only model type for which we use HG002 in training and tuning.
(7): In v1.0, we train on HG002-HG004 for WGS as well, but only using examples from the NIST truth confident regions of v4.2 minus those of v3.3.2.
(8): In v1.0, PacBio training data contains training examples with both haplotag-sorted images and unsorted images.
(9): In v1.1, we exclude HG003 from training, and we use all NIST truth confident regions for HG001-HG007 (except for HG003) for training. We’ve always excluded chr20-22 from training.
(10): In v1.2, we include new PacBio training data from Sequel II, Chemistry 2.2.
(11): Between v1.1 and v1.2, we fixed an issue where make_examples could generate fewer class 0 (REF) training examples than before. This is why v1.2 has more training examples even though the number of samples didn’t increase.
(12): In v1.2, we created BAM files with 100bp reads and 125bp reads by trimming to augment the training data.
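Footnote (5) describes padding the capture BED by 100bp on both sides. A minimal sketch of that step in Python is shown here, assuming BED-style 0-based, half-open coordinates. The names `pad_bed_intervals` and `parse_bed_line`, the optional `chrom_sizes` clamp, and the sample intervals are hypothetical illustrations and are not taken from the DeepVariant codebase.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (chrom, start, end) in BED convention: 0-based start, exclusive end.
Interval = Tuple[str, int, int]


def parse_bed_line(line: str) -> Interval:
    """Read the first three tab-separated BED columns (hypothetical helper)."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def pad_bed_intervals(
    intervals: Iterable[Interval],
    padding: int = 100,
    chrom_sizes: Optional[Dict[str, int]] = None,
) -> List[Interval]:
    """Extend each interval by `padding` bases on both sides.

    Starts are clamped at 0. Ends are clamped to the chromosome length only
    when `chrom_sizes` is given (an assumption; the source does not say how
    chromosome ends were handled). Overlapping results are not merged.
    """
    padded = []
    for chrom, start, end in intervals:
        new_start = max(0, start - padding)
        new_end = end + padding
        if chrom_sizes is not None and chrom in chrom_sizes:
            new_end = min(new_end, chrom_sizes[chrom])
        padded.append((chrom, new_start, new_end))
    return padded


# Made-up capture regions, for illustration only.
capture = [parse_bed_line(l) for l in ["chr2\t50\t300", "chr2\t1000\t1200"]]
padded = pad_bed_intervals(capture, padding=100, chrom_sizes={"chr2": 1250})
for chrom, start, end in padded:
    print(f"{chrom}\t{start}\t{end}")
```

In practice, an existing tool such as `bedtools slop` performs the same kind of padding; the source does not say which tool was used to produce the padded BED.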
Training data:
See “An Extensive Sequence Dataset of Gold-Standard Samples for Benchmarking and Development” for a publicly available set of data we released. Data download information can be found in the supplementary material.