HTSlib 64 bit reference positions
HTSlib 64 bit reference positions
HTSlib version 1.10 onwards internally use 64 bit reference positions. This is to support analysis of species like axolotl, tulip and marbled lungfish which have, or are expected to have, chromosomes longer than two gigabases.
File format support
Currently 64 bit positions can only be stored in SAM and VCF format files. Binary BAM, CRAM and BCF cannot be used due to limitations in the formats themselves. As SAM and VCF are text formats, they have no limit on the size of numeric values. Note that while 64 bit positions are supported by default for SAM, for VCF they must be enabled explicitly at compile time by editing Makefile and adding -DVCF_ALLOW_INT64=1 to CFLAGS.
Compatibility issues to check
Various data structure members, function parameters, and return values have been expanded from 32 to 64 bits. As a result, some changes may be needed to code that uses the library, even if it does not support long references.
Variadic functions taking format strings
The type of various structure members (e.g. bam1_core_t::pos
) and return
values from some functions (e.g. bam_cigar2rlen()
) have been changed to
hts_pos_t
, which is a 64-bit signed integer. Using these in 32-bit
code will generally work (as long as the stored positions are within range),
however care needs to be taken when these values are passed directly
to functions like printf()
which take a variable-length argument list and
a format string.
Header file htslib/hts.h
defines macro PRIhts_pos
which can be
used in printf()
format strings to get the correct format specifier for
an hts_pos_t
value. Code that needs to print positions should be
changed from:
printf("Position is %d\n", bam->core.pos);
to:
printf("Position is %"PRIhts_pos"\n", bam->core.pos);
If for some reason compatibility with older versions of HTSlib (which do
not have hts_pos_t
or PRIhts_pos
) is needed, the value can be cast to
int64_t
and printed as an explicitly 64-bit value:
#include <inttypes.h> // For PRId64 and int64_t
printf("Position is %" PRId64 "\n", (int64_t) bam->core.pos);
Passing incorrect types to variadic functions like printf()
can lead
to incorrect behaviour and security risks, so it important to track down
and fix all of the places where this may happen. Modern C compilers like
gcc (version 3.0 onwards) and clang can check printf()
and scanf()
parameter types for compatibility against the format string. To
enable this, build code with -Wall
or -Wformat
and fix all the
reported warnings.
Where functions that take printf
-style format strings are implemented,
they should use the appropriate gcc attributes to enable format string
checking. htslib/hts_defs.h
includes macros HTS_FORMAT
and
HTS_PRINTF_FMT
which can be used to provide the attribute declaration
in a portable way. For example, test/sam.c
uses them for a function
that prints error messages:
void HTS_FORMAT(HTS_PRINTF_FMT, 1, 2) fail(const char *fmt, ...) { /* ... */ }
Implicit type conversions
Conversion of signed int
or int32_t
to hts_pos_t
will always work.
Conversion of hts_pos_t
to int
or int32_t
will work as long as the value
converted is within the range that can be stored in the destination.
Code that casts unsigned uint32_t
values to signed with the expectation
that the result may be negative will no longer work as hts_pos_t
can store
values over UINT32_MAX. Such code should be changed to use signed values.
Functions hts_parse_region() and hts_parse_reg64() return special value
HTS_POS_MAX
for regions which extend to the end of the reference.
This value is slightly smaller than INT64_MAX, but should be larger than
any reference that is likely to be used. When cast to int32_t
the
result should be INT32_MAX
.
Upgrading code to work with 64 bit positions
Variables used to store reference positions should be changed to
type hts_pos_t
. Use PRIhts_pos
in format strings when printing them.
When converting positions stored in strings, use strtoll()
in place of
atoi()
or strtol()
(which produces a 32 bit value on 64-bit Windows and
all 32-bit platforms).
Programs which need to look up a reference sequence length from a sam_hdr_t
structure should use sam_hdr_tid2len()
instead of the old
sam_hdr_t::target_len
array (which is left as 32-bit for reasons of
compatibility). sam_hdr_tid2len()
returns hts_pos_t
, so works correctly
for large references.
Various functions which take pointer arguments have new versions which
support hts_pos_t *
arguments. Code supporting 64-bit positions should
use the new versions. These are:
Original function | 64-bit version |
---|---|
fai_fetch() | fai_fetch64() |
fai_fetchqual() | fai_fetchqual64() |
faidx_fetch_seq() | faidx_fetch_seq64() |
faidx_fetch_qual() | faidx_fetch_qual64() |
hts_parse_reg() | hts_parse_reg64() or hts_parse_region() |
bam_plp_auto() | bam_plp64_auto() |
bam_plp_next() | bam_plp64_next() |
bam_mplp_auto() | bam_mplp64_auto() |
Limited support has been added for 64-bit INFO values in VCF files, for large
values in structural variant END tags. New functions bcf_update_info_int64()
and bcf_get_info_int64()
can be used to set and fetch 64-bit INFO values.
They both take arrays of int64_t
. bcf_int64_missing
and
bcf_int64_vector_end
can be used to set missing and vector end values in
these arrays. The INFO data is stored in the minimum size needed, so there
is no harm in using these functions to store smaller integer values.
Structure members that have changed size
File htslib/hts.h:
hts_pair32_t::begin
hts_pair32_t::end
(typedef hts_pair_pos_t is provided as a better-named replacement for hts_pair32_t)
hts_reglist_t::min_beg
hts_reglist_t::max_end
hts_itr_t::beg
hts_itr_t::end
hts_itr_t::curr_beg
hts_itr_t::curr_end
File htslib/regidx.h:
reg_t::start
reg_t::end
File htslib/sam.h:
bam1_core_t::pos
bam1_core_t::mpos
bam1_core_t::isize
File htslib/synced_bcf_reader.h:
bcf_sr_regions_t::start
bcf_sr_regions_t::end
bcf_sr_regions_t::prev_start
File htslib/vcf.h:
bcf_idinfo_t::info
bcf_info_t::v1::i
bcf1_t::pos
bcf1_t::rlen
Functions where parameters or the return value have changed size
Functions are annotated as follows:
[new]
The function has been added since version 1.9[parameters]
Function parameters have changed size[return]
Function return value has changed size
File htslib/faidx.h:
[new] fai_fetch64()
[new] fai_fetchqual64()
[new] faidx_fetch_seq64()
[new] faidx_fetch_qual64()
[new] fai_parse_region()
File htslib/hts.h:
[parameters] hts_idx_push()
[new] hts_parse_reg64()
[parameters] hts_itr_query()
[parameters] hts_reg2bin()
File htslib/kstring.h:
[new] kputll()
File htslib/regidx.h:
[parameters] regidx_overlap()
File htslib/sam.h:
[new] sam_hdr_tid2len()
[return] bam_cigar2qlen()
[return] bam_cigar2rlen()
[return] bam_endpos()
[parameters] bam_itr_queryi()
[parameters] sam_itr_queryi()
[new] bam_plp64_next()
[new] bam_plp64_auto()
[new] bam_mplp64_auto()
[parameters] sam_cap_mapq()
[parameters] sam_prob_realn()
File htslib/synced_bcf_reader.h:
[parameters] bcf_sr_seek()
[parameters] bcf_sr_regions_overlap()
File htslib/tbx.h:
[parameters] tbx_readrec()
File htslib/vcf.h:
[parameters] bcf_readrec()
[new] bcf_update_info_int64()
[new] bcf_get_info_int64()
[return] bcf_dec_int1()
[return] bcf_dec_typed_int1()