December 30 2025

A trio-binning approach for genome assembly reveals extensive structural variation between two Cannabis cultivars: Punto Rojo and Cherry Pie

Abstract

With the advent of long-read DNA sequencing technologies, assembling eukaryotic genomes has become routine; however, properly phasing the maternal and paternal contributions, which is of great value for breeding programs, remains technically challenging. Here, we use the trio-binning approach to separate Oxford Nanopore reads derived from a Cannabis F1 wide cross, made between the Colombian landrace Punto Rojo and the Colorado CBD clone Cherry Pie #16. Reads were obtained from a single PromethION flow cell, generating assemblies with coverage of just 18 × per haplotype, but with good contiguity and gene completeness, demonstrating that it is a cost-effective approach for genome-wide and high-quality haplotype phasing. Evaluated through the lenses of disease resistance and secondary metabolite synthesis, both being traits of interest for the Cannabis industry, we report copy number and structural variation that, as has recently been shown for other major crops, may contribute to phenotypic variation along several relevant dimensions.

Article type: Research Article

Keywords: genome assembly, Oxford Nanopore, trio binning, Cannabis, terpene synthases, NLR genes

Authors: Brett Pike ⚬, Alexander Kozik ⚬, Wilson Terán ⚬

Affiliations: Biología de Plantas y Sistemas Productivos, Departamento de Biología, Pontificia Universidad Javeriana, Bogotá, Distrito Capital 110231, Colombia; Medicamentos de Cannabis SAS, Bogotá, Distrito Capital 111111, Colombia; Genome Center and Department of Plant Sciences, University of California, Davis, California, CA 95616, United States

License: © The Author(s) 2025. Published by Oxford University Press on behalf of The Genetics Society of America. CC BY 4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Article links: DOI: 10.1093/g3journal/jkaf286 | PubMed: 41467882 | PMC: PMC12869074

Introduction

Cannabis is a dioecious annual crop, and its closest relative is Humulus, a genus of three species whose most famous member is H. lupulus, or brewers’ hops. Divergence from their common ancestor is thought to have taken place about 28 MYA in what is today northeast Tibet (ref. McPartland et al. 2019). Cannabis landraces spread to Southeast and Southwest Asia (ref. Ren et al. 2021), and later, among other dispersals, to Africa and then South America (ref. Warf 2014).

Following 100 years of prohibition, Cannabis is again legal in many countries and jurisdictions, driven by its growing acceptance and awareness of its potential therapeutic benefits. This has boosted cannabis research and given rise to the medical cannabis industry, with a market valued at $21.4 billion for 2025 and expected to surpass $200 billion in the next decade (ref. Metatech Insights 2024). Despite the economic and cultural importance of Cannabis, genetic resources are notably scant (ref. Kovalchuk et al. 2020), which also highlights the need for modern breeding programs to accompany this global market growth. Cannabis genomics has therefore emerged as a field that can address this lack of genetic knowledge.

The first Cannabis genome to be anointed as the reference by NCBI, a CBD type from Colorado called cs10 (ref. Grassa et al. 2021), offers good contiguity and genic content, and so we have used it as the primary point of comparison in our analyses. However, as a collapsed pseudohaploid, its scaffolds cannot represent the true range of variation found within an individual, and as a modern polyhybrid, it cannot inform as to the ancestral state of the Cannabis population’s founders. In an effort to address this lacuna, we have sequenced an F₁ derived from two distantly related parents, which vary for several agronomic traits of interest: height, flowering time, cannabinoid content, terpene content, and fungal susceptibility.

To facilitate comparative genomics and establish a genome-wide resource for trait mapping and marker development, we assembled both haplotypes of this wide cross via trio-binning of Oxford Nanopore reads. This approach allowed us to obtain fully phased chromosome-scale assemblies with good contiguity and gene completeness, which provide accurate catalogs of important gene families, specifically disease resistance genes of the Nucleotide-binding, Leucine-rich Repeat type (NLRs) and terpene synthases (TPS).

Materials and methods

Breeding materials

The sequenced individual was an F₁ hybrid between the psychoactive Colombian landrace “Punto Rojo #3” (PR) and the nonpsychoactive Coloradan line “Cherry Pie #16” (CP). Both parental clones have been formally characterized and registered with the Instituto Colombiano Agropecuario (ICA) by Medicamentos de Cannabis SAS.

Punto Rojo is thought to descend from dual-use (drug and fiber) African cannabis introduced to Colombia in the 17th century (ref. Warf 2014), and has acclimatized almost entirely in the absence of irrigation, fertilization, and agrochemicals. It has good resistance to fungi and grows well in high heat and low-nutrient soil. The name translates as “Red Point” and refers to the unusual levels of anthocyanin sometimes seen in new shoots and receptive calyces (Fig. 1). In the 1960s and 1970s, illicit shipments of Type I (THC-dominant) Punto Rojo found favor among American consumers due to its special effects, which were thought to be more psychedelic and less soporific than other imports (ref. Kala 2021).

PMC12869074 – jkaf286-F1 — **Fig. 1.:** The Punto Rojo phenotype may describe anthocyanin deposition in the calyxes (left) or new shoots (right). Photos by Brett Pike.

Cherry Pie (Fig. 2) is one of several Type III (CBD-dominant) strains in the Cherry family, bred in the American state of Colorado following legalization. Cherry Blossom (ref. Anderson et al. 2021) and Cherry Wine (ref. DiMatteo et al. 2020) have been the subjects of recent reports, and the initial NCBI reference for Cannabis, CBDRx (ref. Grassa et al. 2021), falls into this clade as well. All display fast flowering and high CBD content, as well as a pleasant cherry aroma. The CP-16 individual was selected for consistently containing less than 1% THC at maturity, which enables its registration as non-psychoactive under Colombian law. This permits unlimited cultivation for any licensed cultivator, without diminishing Colombia’s share of the global THC quota established by the United Nations Office on Drugs and Crime.

PMC12869074 – jkaf286-F2 — **Fig. 2.:** Clones of Cherry Pie #16 flowering in Fuente de Oro, Meta, Colombia. Photo courtesy of Medcann Pharma.

Plant growth

The F₀s and the F₁ were grown at the licensed farm of Medicamentos de Cannabis SAS near Fuente de Oro, Meta, Colombia, as approved by the Ministry of Justice in Resolution 1164 of August 19, 2021. At this latitude (3.47° N), the photoperiod is consistently 12 h, and therefore always inductive for Cannabis flowering. At this altitude (400 m), the average day and night temperatures are 30 °C and 21 °C. The F₀ clones had previously been selected from seed and then propagated clonally.

Clones were rooted in Oasis-type plugs under fluorescent lamps and then transplanted to 15 L containers filled with 70% coco fiber and 30% worm castings, watered by hand, in a trailer about 2 m × 4 m, fitted with two 1,000 W HPS lamps and an air conditioner set to 16 °C. The CP-16 female was induced to produce female (XX) pollen via two applications of 0.03% silver nitrate, at 0 and 7 d of flowering, which was then blown towards a group of females, including PR, with the aid of an oscillating fan. F₁ seeds were sown in two 144-cell trays and, after 21 d, 250 seedlings were transplanted to 3 L containers filled with a mix of 70% coco fiber and 30% worm castings. These plants grew vegetatively for a total of 60 d with 12 h of sunlight and supplemental lighting from 6 pm to midnight. They were next transplanted to the field at a density of 2 plants per square meter into holes amended with one handful of a mix consisting of 50% worm castings, 20% rock phosphate, 20% dolomite lime, and 10% Peruvian bat guano. The plants were rain-fed, with additional watering by hand as needed.

About 40 d after transplant to the natural inductive photoperiod, an individual (PC-67) was chosen that was approximately average for the population in terms of height, flower development, leaf morphology, internode spacing, and degree of branching. As well, its flowers produced an aroma that evoked both the red fruit odor of Cherry Pie and the citric tanginess of Punto Rojo. PC-67 was cloned and propagated vegetatively, and about 12 wk later, new shoots consisting primarily of unexpanded leaves were sampled for DNA sequencing.

DNA purification

DNA from the F₀s was extracted from new shoots dried over silica with a Quick-DNA Plant/Seed Miniprep Kit (Zymo Research, Irvine, California, USA). For the F₁, HMW DNA was purified from clean nuclei as described previously (ref. Pike et al. 2021) and then size-selected via the Short Read Eliminator XL kit (Circulomics, Baltimore, Maryland, USA). Several replicates were combined to yield a sufficient quantity. DNA concentration and purity were estimated through the use of NanoDrop 1000 (Thermo Scientific), and two additional ethanol washes on SPRI beads were performed to meet sequencing standards.

F₀ Illumina library prep and sequencing

The F₀s were prepared as Illumina TruSeq libraries and sequenced as part of a NovaSeq PE150 lane. Illumina reads were filtered with BBDuk (ref. Bushnell 2018), using default parameters, to remove adapter sequences, low-quality reads, and short reads. These reads were filtered against Cannabis chloroplast and mitochondrial genomes using CLC Genomics Workbench, for subsequent assembly into contigs. The resulting sequences were used as BLAST queries, using MegaBLAST with default parameters in Geneious Prime, against a custom database comprising the genomes of seven fungi known or suspected to be present in the field: Aspergillus fumigatus, Botrytis cinerea, Cercospora beticola, Fusarium oxysporum, Pseudocercospora fijiensis, Pseudocercospora musae, and Sclerotinia sclerotiorum. Following these results, reads were mapped with BBSplit (ref. Bushnell 2018) to the genomes of P. fijiensis and P. musae as the final filtering step, in order to remove these contaminating sequences.

F₁ Oxford Nanopore library prep and sequencing

The HMW sample was analyzed for length distribution via Agilent Femto-Pulse. Then, an Oxford Nanopore library was prepared (ligation kit LSK-0110) and sequenced in one PromethION R9.4.1 cell. After 24 h, a nuclease flush was performed, and the library was then reloaded and sequenced for another 72 h. Basecalling was performed by Guppy 5.1.12 in “super-accurate” mode.

All library preparation and sequencing took place at The Genome Center at the University of California, Davis.

Genome assemblies

Precise syntax for each command may be found at https://github.com/COMInterop/PRCP. Specific versions of programs used are listed in Supplementary Table 6.

Genome size estimation

21-mers were counted in both sets of F₀ short reads with jellyfish (ref. Marçais and Kingsford 2011) and histograms evaluated with findGSE (ref. Sun et al. 2018) in homozygous and heterozygous mode, with the latter using expected homozygous coverage of 18 (exp_hom = 18). This process was repeated with the binned, error-corrected F₁ long reads.
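The principle behind such k-mer-based estimates can be illustrated with a minimal Python sketch. It is not the findGSE model, which fits the histogram with skew-normal distributions; it simply divides the total number of counted k-mers by the depth of the main peak. The function names, the `min_depth` cutoff for discarding likely error k-mers, and the toy histogram are assumptions made for illustration. The parser assumes the two-column (depth, count) text layout written by `jellyfish histo`.

```python
def parse_histogram(text):
    """Parse 'depth count' lines into a {depth: count} dictionary."""
    histogram = {}
    for line in text.splitlines():
        if line.strip():
            depth, count = line.split()
            histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    min_depth is an assumed cutoff: k-mers seen fewer times are treated
    as sequencing errors and ignored.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Each distinct k-mer at depth d contributes d observations.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram, not real data.
toy = parse_histogram("1 1000\n2 200\n17 50\n18 100\n19 50\n")
print(estimate_genome_size(toy))
```

A single-peak estimate of this kind is only reasonable for a largely homozygous sample; heterozygous samples produce a second peak at half depth, which is one reason a model-based tool is used in practice.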

Assembly

Trio binning was performed with scripts written for the purpose (ref. Rice 2019). Briefly, 21-mers were counted with KMC (ref. Kokot et al. 2017), unique parental 21-mers were derived by “find-unique-kmers,” and 21-mers containing homopentamer repeats were deleted with a simple grep command. These lists were then used with “classify_by_kmers” to sort long reads into PR, CP, and unknown bins.
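The logic of this binning step can be sketched in a few lines of Python. This is an illustration and not the code of the scripts cited above: it ignores reverse complements and canonical k-mers, holds everything in memory, and assigns a read by a raw count of parent-specific k-mers with no normalization. All function names are hypothetical.

```python
import re

# Any run of five identical bases (a homopentamer).
HOMOPENTAMER = re.compile(r"A{5}|C{5}|G{5}|T{5}")


def kmers(sequence, k=21):
    """Return the set of k-mers in a sequence (forward strand only)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_kmers(own, other):
    """K-mers seen in one parent only, minus those with homopentamers."""
    return {kmer for kmer in own - other if not HOMOPENTAMER.search(kmer)}


def classify_read(read, a_only, b_only, k=21):
    """Assign a read to parent 'A', parent 'B', or 'unknown'."""
    read_kmers = kmers(read, k)
    a_hits = len(read_kmers & a_only)
    b_hits = len(read_kmers & b_only)
    if a_hits > b_hits:
        return "A"
    if b_hits > a_hits:
        return "B"
    return "unknown"


# Toy example with k=5; the sequences are invented.
parent_a = kmers("ACGTACGGTCA", k=5)
parent_b = kmers("ACGTACCGTCA", k=5)
a_only = unique_kmers(parent_a, parent_b)
b_only = unique_kmers(parent_b, parent_a)
print(classify_read("GTACGGTC", a_only, b_only, k=5))
```

Reads with no parent-specific k-mers, or with a tie, fall into the unknown bin, which corresponds to the third bin described above.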

Binned reads were assembled into contigs with NECAT (ref. Chen et al. 2021), and the unbinned reads were ignored. Assembly included all reads longer than 3 kb with the default parameters and “polish contigs = false”. Contigs identified as mitochondrial by NCBI were removed. Assembly ran on an AWS EC2 “m6gd.metal” instance, with 64 ARM cores and 256 GB RAM.

Polishing

Each haplotype’s binned raw reads were filtered to a minimum quality score of 7 and aligned to their assembly with Minimap2 (ref. Li 2021), with options “-aL -z 600,200 -x map-ont”. One round of polishing then took place with Racon (ref. Vaser et al. 2017) with the “-u” option. Next, the appropriate F₀ short reads were mapped to each haplotype with BWA MEM (ref. Li and Durbin 2009) and polished with Clair3, twice. In the first round, Clair3 used the options “--haploid_precise --no_phasing_for_fa,” which only generates well-supported 1/1 calls. In the second round, all variants were called: 0/1 calls were deleted, 1/1 calls were applied, and where possible the shorter allele in 1/2 calls was applied with the command “bcftools consensus -H SR”. Finally, each assembly was polished 4 times with its F₀ kmers with ntEdit (ref. Warren et al. 2019), using default settings and kmer lengths of 40, 26, 40, and 26.
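The second-round genotype rule can be written out as a small decision function. The Python sketch below only illustrates the rule as described in the text; it does not reproduce bcftools or Clair3 behaviour in detail. The function name is hypothetical, and the tie-break (falling back to the reference allele when both alternate alleles have the same length) is an assumption based on our reading of the “SR” mode.

```python
def choose_allele(ref, alts, genotype):
    """Pick the allele to write into the consensus for one variant.

    ref      -- reference (draft assembly) allele
    alts     -- list of alternate alleles, in VCF order
    genotype -- pair of allele indices, e.g. (0, 1), (1, 1), (1, 2)
    """
    alleles = [ref] + list(alts)
    first, second = genotype
    if first == second:
        # Homozygous call: 1/1 applies the alternate allele (0/0 keeps ref).
        return alleles[first]
    if 0 in genotype:
        # 0/1 calls are discarded, so the draft sequence is kept.
        return ref
    a, b = alleles[first], alleles[second]
    if len(a) == len(b):
        # Assumed tie-break: keep the reference allele.
        return ref
    # 1/2 call with alleles of different length: apply the shorter one.
    return min(a, b, key=len)


print(choose_allele("AT", ["A", "ATT"], (1, 2)))
```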

Polishing and other post-assembly processing took place on a 2012 Mac Pro 5,1 with 2 Xeon X5690 processors and 64 GB RAM.

Scaffolding

Scaffolding was performed with ntJoin (ref. Coombe et al. 2020) with options “nocut = True” and “overlap = False,” and a maximum gap of 100,000 bp. The substrate was derived from the Salk Institute’s recent release of many phased haplotypes (ref. Lynch et al. 2025), which was subsetted to include 8 drug haplotypes assembled with the benefit of Hi-C libraries. PR and CP contigs were first aligned to each haplotype, and alignments were inspected visually with dotplotly (ref. Poorten 2017). For each chromosome, the homolog with the most diagonal alignment was chosen. Then, a small number of additional substitutions were made to reduce interchromosomal translocations. The superscaffolds ultimately used for each genotype are listed in Supplementary Table 1.

Finally, the chromosome-scale pseudomolecules were aligned to the cs10 reference genome and, where necessary, reverse complemented to maintain a consistent orientation. For PR, chromosomes 2, 3, 4, 8, and 9 were reversed, and for CP, 1, 6, 7, 8, 9, and X.
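For illustration, the reorientation step amounts to reverse complementing the listed pseudomolecules and leaving the rest untouched. The Python sketch below assumes sequences are held in a dictionary keyed by name; the function names and the sequence names in the example are hypothetical.

```python
# Translation table for complementing DNA, preserving case and N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def orient(records, to_flip):
    """Reverse complement only the records whose names are in to_flip."""
    return {
        name: reverse_complement(seq) if name in to_flip else seq
        for name, seq in records.items()
    }


# Toy records; real pseudomolecules are tens of megabases long.
toy = {"chr1": "AACGTN", "chr2": "AACGTN"}
print(orient(toy, to_flip={"chr2"}))
```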

Evaluation

Assemblies were evaluated for contiguity with the BBTools script stats.sh (ref. Bushnell 2018), for completeness with compleasm (ref. Huang and Li 2023) using the eudicots_odb10 5.4.6 database, and for correctness, including phasing accuracy, with Merqury, after counting 20-mers in F₀ short reads and error-corrected F₁ long reads with Meryl (ref. Rhie et al. 2020).

Analysis of the contigs’ long-read coverage was performed with Flagger (ref. Liao et al. 2023). Assemblies were screened with “yak qv”, and high-error-rate subsequences (HERS, ref. Chen et al. 2021), here defined as the basespace unable to be verified by comparison with short-read 21-mers, were compiled and exported as a BED.

Organelles

F₀ short reads identified as organellar were mapped to the Yunma-7 chloroplast and Carmagnola mitochondrion with the Geneious Prime 2023.0.4 mapper using default settings. The consensus sequences for each were generated and appended to each long-read assembly.

Diploid assembly

To test diploid-aware assembly methods, drafts were assembled in PECAT (ref. Nie et al. 2024) and Shasta (ref. Lorig-Roach et al. 2023). PECAT used the configuration for Arabidopsis (cfg_arab_ont) with some modifications. Briefly, PECAT’s block size for correction and assembly, and Minimap2’s index and minibatch size, were raised to 40 Gb to enable true all-vs-all alignment; for correction, minimum coverage was lowered to 2 for correcting and to 8 for calling SNPs for haplotypes; for assembly, the contig duplication rate was set to zero and only contigs over 4 kb were outputted; and for phasing, minimum coverage was lowered to 16. The primary assembly was purged of haplotigs with purge_haplotigs (ref. Roach et al. 2018), and the purgate was combined with the alternate assembly, which was purged a second time.

Shasta used the Nanopore-Phased-May2022 configuration, and its output was further processed to resolve haplotypes: Assembly-Detailed.gfa and parental 31-mer databases generated with KMC (ref. Kokot et al. 2017) were analyzed with GFAse (ref. Lorig-Roach et al. 2023) to produce unphased, maternal, and paternal FASTAs.

Haplotype resolution at the contig level was visualized with Merqury. These assemblies went unpolished, and so QV is not reported. Contiguity and completeness were measured as above.

Diploid-aware analyses were performed on the “pyky” node of the ZINE high-performance compute cluster at the Pontificia Universidad Javeriana, which includes 192 CPUs and 2 TB of RAM.

Annotation

Whole genome

Gene annotations were transferred from the cs10 reference to these drafts with Liftoff (ref. Shumate and Salzberg 2021), with options “-f features.txt -chroms chroms.txt -copies -sc 0.99,” where features.txt includes all annotation types except “regions”, chroms.txt lists the most likely homolog for each pseudomolecule, based on a preliminary synteny analysis with SyRI (ref. Goel et al. 2019), and “-copies -sc 0.99” seeks to find paralogs that have at least 99% exonic identity to the primary annotation.

Cannabinoid synthases

Cannabinoid synthases were predicted ab initio in the assemblies listed in Table 2 by using the “Annotate From…” function in Geneious Prime 2023.0.4 (https://www.geneious.com), using the full-length CDS for either THCAS from Skunk #1 (ref. Weiblen et al. 2015) or the 6-3 allele of CBDAS (ref. Onofri et al. 2015), a similarity threshold of 85%, and the “All matching annotations” option. Gene clusters were then visualized in Geneious Prime.

Table 2.: Assembly statistics for PR, CP, and other Cannabis chromosome scale assemblies published since 2020.

Genotype	Punto Rojo	Cherry Pie	Abacus	Cannbio-2	cs10	JL
Reference:	This study		–	ref. Braich et al 2020	ref. Grassa et al 2021	ref. Gao et al 2020
GenBank	JBDLLE000000000	JBDLLD000000000	GCA_025232715.1	GCA_016165845.1	GCA_900626175.2	GCA_013030365.1
Long read platform	ONT		PacBio Sequel	PacBio SMRT	ONT	PacBio Sequel
Long read coverage	18×		83×	86×	36×	153×
Short read platform	pe150		pe250	–	pe150	pe150
Short read coverage	18×		234×	–	100×	118×
Assembler	NECAT		unk.	HGAP4	miniasm	Wtdbg, SMARTdenovo, Quickmerge
Long read polishing	Racon		unk.	HGAP4	3 × Racon	blasr, Arrow
Short read polishing	2 × Clair3, 4 × ntEdit		unk.	–	3 × Pilon	Pilon
Scaffolder	ntJoin		Hi-C	RaGOO	Hi-C	Hi-C
Scaffold reference	Salk assortment (Supplementary Table 1)		de novo	cs10	de novo	de novo
Linkage map	–		–	–	Skunk × Carmen	–
Haplotyping	trio-bin		unk.	–	–	–
# contigs	867	1,171	1,023	8,919	831	2,978
contig size (Mb)	740	724	796	914	714	812
contig N50 (Mb)	2.12	1.65	3.17	0.17	2.14	0.51
contig N90 (Kb)	413	349	343	46	459	126
contig max (Mb)	9.84	7.86	16.96	1.56	10.06	2.87
# scaffolds	10	10	160	147	10	483
scaffold size (Mb)	794	774	797	914	854	813
scaffold N50 (Mb)	87.6	81.1	80.6	91.5	91.9	83.0
scaffold N90 (Mb)	66.4	68.9	63.0	71.6	64.6	69.1
N %	6.71%	6.40%	0.01%	0.10%	16.34%	0.09%
BUSCO score (% complete)	98.6%	94.5%	97.2%	97.1%	94.0%	92.8%
BUSCO single	2,173	2,149	2,008	1,373	2,031	1,794
BUSCO duplicated	121	49	252	886	156	364
BUSCO fragmented	7	21	9	16	16	23
BUSCO missing	25	107	57	51	123	145
Merqury Quality Value (QV)	24.41	24.38

High marks are bolded where relevant. All BUSCO scores were newly calculated in compleasm with the eudicots_odb10 5.4.6 database.

Terpene synthases

The cs10 annotations were filtered for the presence of the following descriptive terms: farnesene, geraniol, germacrene, humulene, limonene, linalool, myrcene, nerolidol, pinene, terpene, terpenoid, or terpinolene. The 47 annotations thus labelled were then transferred with Liftoff to both drafts, with stringency relaxed via “-copies -sc 0.50,” to locate any additional paralogs that have similar structure and share at least 50% exonic identity. To predict products, a custom BLAST database was built in Geneious Prime 2025.1.2 using the amino acid sequences of 33 TPS characterized via heterologous expression (ref. Booth et al. 2020). Predicted TPS were queried against this database with blastx, and in some cases, multiply aligned with Clustal Omega (ref. Sievers and Higgins 2014).

NLR genes

The NBS_712 HMM (ref. Kozik 2001), which covers the highly conserved nucleotide binding site (NBS) region of NLRs and was initially derived from the Arabidopsis genome (ref. Meyers et al. 2003), was queried with BLAST against the cs10 reference to create an initial list of candidates. These regions were extracted, aligned with Clustal Omega, and used to create a Cannabis-specific NBS Hidden Markov Model (CsNBS HMM) via the hmmbuild and hmmemit modalities of the HMMER (ref. Finn et al. 2011) software package. The DNA consensus of the HMM was then BLASTed against the PR and CP drafts, and hits, after merging overlaps, were taken as putative NLR loci. In addition, NLR-Annotator (ref. Steuernagel et al. 2020) was used to make a set of predictions, and the intersection of the two callsets was taken, so that full-length gene predictions from NLR-Annotator, verified by CsNBS HMM hits, are reported.
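The merge-and-intersect logic can be sketched as follows. This Python example is an illustration under stated assumptions, not the pipeline used here: intervals are taken to be half-open (start, end) coordinates, a prediction is considered verified if it overlaps any merged hit on the same sequence, and all function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def supported_predictions(predictions, hits):
    """Keep gene predictions that overlap at least one merged hit.

    predictions, hits -- lists of (chrom, start, end) tuples
    """
    hits_by_chrom = {}
    for chrom, start, end in hits:
        hits_by_chrom.setdefault(chrom, []).append((start, end))
    merged = {c: merge_intervals(v) for c, v in hits_by_chrom.items()}

    kept = []
    for chrom, start, end in predictions:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(h_start < end and start < h_end
               for h_start, h_end in merged.get(chrom, [])):
            kept.append((chrom, start, end))
    return kept


# Toy coordinates, not real loci.
predictions = [("chr1", 0, 100), ("chr1", 500, 600), ("chr2", 0, 50)]
hits = [("chr1", 90, 120), ("chr2", 60, 80)]
print(supported_predictions(predictions, hits))
```

For genome-scale callsets, a dedicated interval tool would be the more practical choice; the linear scan here is kept only for readability.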

Repetitive elements

Each haplotype was analyzed with EDTA, the Extensive de novo Transposable element Annotator (ref. Ou et al. 2019), with settings “--force 1 --sensitive 1 --anno 1,” and incorporating the CDS from cs10 to avoid calling genes as repeats. The LTR Assembly Index (LAI, ref. Ou et al. 2018) was calculated from the EDTA output.

Comparison

Drafts of PR and CP were each scaffolded to and then aligned against the collection of pseudomolecules listed in Supplementary Table 1. Alignments were performed with Minimap2 with options “-cx asm5 --cs --eqx” and visualized as a dotplot with dotplotly (ref. Poorten 2017). The resultant PAFs were analyzed with SyRI (ref. Goel et al. 2019) with default options, and visualized as a synteny map by plotting the SyRI calls with plotsr (ref. Goel and Schneeberger 2022). The PR and CP haplotypes were also compared to each other, and visualized in Circos (ref. Krzywinski et al. 2009). The two assemblies, along with the genomes used for scaffolding and the current and prior NCBI references, were analyzed with ntSynt (ref. Coombe et al. 2024) with a minimum block size of 100 kb, and visualized with ntSynt-viz (ref. Coombe et al. 2025), with PR specified as the target genome.

Results

HMW gDNA

Each prep of one gram of young shoots provided about 4 μg high-quality DNA, with 260/280 of 1.8 and 260/230 of 2.0. Analysis via Agilent Femto-Pulse showed that this method retains many fragments over 100 kb, and the steep decline in fragments below ∼19 kb suggests that the Short Read Eliminator XL kit did function as advertised (Supplementary Fig. 1).

F₀ Illumina sequencing

The Illumina libraries yielded 52.6 M and 47.6 M read pairs for PR and CP. Filtering the reads resulted in sets mapping to the chloroplast and mitochondria, as well as to two species of Pseudocercospora. Compared to a reference mitochondrion from the hemp line Carmagnola, PR contained 197 SNPs and CP 80. Compared to a reference chloroplast from Yunma-7, PR contained 9 SNPs and CP 125. About 0.5% of reads mapped to Pseudocercospora, with P. musae appearing to be about 50% more abundant than P. fijiensis in both F₀s (data not shown). After trimming and decontamination, 85.0 and 81.1% of base space remained, providing 16.7 × and 14.4 × of coverage for polishing.

F₁ Oxford Nanopore sequencing

The PromethION cell yielded 34.6 Gb of data, with an N₅₀ of 23.6 kb, and 15.5% of bases contained in reads over 50 kb.

Estimation of genome size

Estimates of genome size were derived from both short and binned, corrected long reads. Results from findGSE are summarized in Table 1.

Table 1.: Genome size estimates in Mb, derived from F₀ short and binned, corrected F₁ long reads.

readset	findGSE (hom)	error-excluded	findGSE (het)	error-excluded
PR-ilmn	857.661	823.229	857.661	819.607
PR-ONT-bin-corr	827.774	784.051	fail	fail
CP-ilmn	78.037	42.676	994.974	955.885
CP-ONT-bin-corr	794.321	751.697	fail	fail

Assembly

Assembly statistics, including the two drafts presented here, a recent NCBI upload, and three previously published chromosome-scale long-read assemblies (ref. McKernan et al. 2018; ref. Gao et al. 2020; ref. Grassa et al. 2021), are summarized in Table 2.

Trio binning

The “classify_by_kmers” script produced a PR bin containing 17,605 Mb of sequence in 1,238,187 reads, and a CP bin containing 15,942 Mb of sequence in 1,156,998 reads. The unknown bin contained 889 Mb in 322,224 reads, which did not assemble into contigs and were not analyzed further. The split among PR, CP, and unknown was 51.1%, 46.3%, and 2.6%. After assembly and polishing, the switch rates for PR and CP were estimated by Merqury as 1.00 and 0.62%, kmer completeness was 97.81 and 98.25%, and the content of other-parent hapmers was 0.29 and 0.32%.

Contiguity

The drafts of Punto Rojo and Cherry Pie contain 867 and 1,171 contigs, with N₅₀ of 2.12 and 1.65 Mb, N₉₀ of 413 and 349 Kb, and a longest contig of 9.84 and 7.86 Mb.
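As a reminder of how these contiguity statistics are defined, N50 is the length of the contig at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the assembly size; N90 uses 90%. The Python sketch below implements that definition on a toy list of lengths. The function name is hypothetical, and this is not the stats.sh implementation.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic (N50 by default) for a list of lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    raise ValueError("lengths must not be empty")


# Toy contig lengths, not taken from the assemblies.
toy_lengths = [10, 8, 6, 4, 2]
print(nx(toy_lengths), nx(toy_lengths, fraction=0.9))
```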

We verified the integrity of the contigs with Flagger, which identified 66 and 178 potential error regions in PR and CP (Supplementary Table 2 and *-flagger.bed annotations), of which 64 and 177 were at the ends of contigs, where a drop in coverage is not unexpected. To evaluate the 3 intra-contig error regions, of which one contained one gene and two were non-genic, we ran BLAST queries with the closest genes on either side, which confirmed that gene order was conserved (relative to ERBxHO40_23, data not shown), and so we have elected to leave them in their original state.

Scaffolding PR and CP with ntJoin resulted in placement of 97.7 and 96.4% of contig sequence on the 10 chromosome-scale pseudomolecules, with N content of 6.71 and 6.40%.

Completeness

PR and CP have compleasm BUSCO scores of 98.6 and 94.5%, with duplication ratios of 5.2 and 2.1%. The full BUSCO output is summarized in Table 2 and Fig. 3.

PMC12869074 – jkaf286-F3 — **Fig. 3.:** Completeness of long-read Cannabis assemblies. BUSCO scores are expressed in percent of total plant orthologs, with different colour labels for single, duplicated, fragmented and missing genes. Previously published assemblies were newly evaluated with the eudicots_odb10 5.4.6 dataset.
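The reported percentages follow directly from the category counts in Table 2. The short Python sketch below recomputes them; the function name is hypothetical, and the counts are those listed for PR and CP.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts to percentages of all orthologs."""
    total = single + duplicated + fragmented + missing
    return {
        "total": total,
        # Complete = single-copy plus duplicated.
        "complete": 100 * (single + duplicated) / total,
        "duplicated": 100 * duplicated / total,
    }


# Counts from Table 2.
pr = busco_summary(single=2173, duplicated=121, fragmented=7, missing=25)
cp = busco_summary(single=2149, duplicated=49, fragmented=21, missing=107)
for name, summary in (("PR", pr), ("CP", cp)):
    print(name, round(summary["complete"], 1), round(summary["duplicated"], 1))
```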

Correctness

For PR and CP, Merqury estimates QV at 24.42 and 24.35, corresponding to base level precision of 99.64 and 99.63%. Yak QV annotated 19.85 Mb and 21.67 Mb in 801k and 884k high-error-rate subsequences, with the large majority of HERS (736k and 815k) being under 50 bp (*-yak-hers.bed annotations), and just 1 and 2 being over 1 kb.
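The conversion between QV and base-level accuracy uses the Phred relationship, in which the per-base error probability is 10^(−QV/10). A minimal Python sketch, with a hypothetical function name:

```python
def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to percent base accuracy."""
    error_rate = 10 ** (-qv / 10)
    return 100 * (1 - error_rate)


# QV values reported in the text for PR and CP.
for qv in (24.42, 24.35):
    print(qv, round(qv_to_accuracy(qv), 2))
```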

Diploid assembly

PECAT + purge_haplotigs produced a primary and an alternate assembly. Shasta produced a diploid draft, which was subsequently binned by GFAse into maternal, paternal, and unknown compartments. The size, contiguity, and completeness are reported in Table 3.

Table 3.: Assembly statistics for trio-binning, PECAT, Shasta, and Shasta + GFAse using F₀ kmers.

Assembly	Total size (Mb)	Contigs	N50 (Kb)	BUSCO-total	BUSCO-duplicate
triobin_pr	740	867	2,120	98.60%	5.60%
triobin_cp	724	1,171	1,650	94.50%	2.90%
PECAT_pri	723	464	2,336	96.70%	4.47%
PECAT_alt	717	1,556	908	90.70%	3.57%
Shasta-dip	877	742,931	65	94.24%	60.20%
GFAse-mat	435	3,249	234	83.87%	1.07%
GFAse-pat	439	3,250	235	83.88%	0.86%
GFAse-unphased	272	24,384	51	16.25%	4.43%

Haplotype separation was visualized in Merqury, based on per-contig counts of parental short-read 20-mers as tabulated by Meryl (Fig. 4).

PMC12869074 – jkaf286-F4 — **Fig. 4.:** Merqury plots, where the X and Y axis represent the number of unique PR and CP 20-mers. Please note that scaling varies among drafts.

Annotation

Liftoff

Nearly all of the reference annotations could be placed on both drafts. Table 4 summarizes the drafts’ annotations.

Table 4.: Accounting of cs10 annotations transferred to PR and CP with Liftoff, with options “-copies -sc 0.99”.

Feature	cs10	CP	CP (%)	PR	PR (%)
gene	29,807	29,008	97.32%	29,851	100.15%
pseudogene	1,363	783	57.45%	824	60.45%
mRNA	33,639	31,804	94.55%	32,854	97.67%
CDS	33,674	31,734	94.24%	32,823	97.47%

CN synthases

The primary location for CN synthases, which includes 6 to 13 paralogs with identity from 85.3 to 99.9%, is the previously identified B locus (ref. de Meijer et al. 2003; ref. Grassa et al. 2021) on chr7, which varies in size, location, and copy number among assemblies (Supplementary Table 4 and Fig. 5). We note here that JL numbers its chromosomes in order of length, so that its chr1 is the homolog of chr7 in cs10 and the other listed assemblies. Because PR and JL do not include a CBDAS above 95% identity, and CP and Abacus do not include a THCAS above 95% identity, we report only the relevant CN synthase query and homology scores for paralogs of the putative active gene, which in all cases shares >99% identity with the query. However, we note that in no case does a query with the other CN synthase return a different copy number (data not shown). Because Cannbio-2 is a pseudohaploid representation of a B_D/B_T genotype, its results are reported for both queries.

PMC12869074 – jkaf286-F5 — **Fig. 5.:** Visualization of the a) Bt and b) Bd alleles on chromosome 7 from the published assemblies and the haplotypes used for scaffolding (BOAXa for PR and GRb for CP). The active synthase is marked in green, while inactive paralogs are in red.

The arrangement of CBDAS copies appears to offer more variability. While most drafts contain all synthase copies in one cluster of 5 Mb or less, CP has two clusters, both on chr7: a group of 5 containing the active synthase at 61.7 to 62.7 Mb, and a group of 8 paralogs with 88 to 89% identity that spans from 39.8 to 40.9 Mb. The Golden Redwood B haplotype to which it is scaffolded appears similar, but contains 7 and 10 copies in similarly situated clusters.

TPS

The annotations transferred from cs10 were mined for descriptions that included the name of a terpene. We located 45, 41, and 47 TPS in PR, CP, and cs10, respectively (Supplementary Table 3). The TPS are unevenly distributed, with clusters of monoterpene or sesquiterpene synthases lying in distal regions of chromosomes 5, 6, and 9. We denote these as Major Terpene Clusters (MTC, Table 5), defined here as a group of at least 4 TPS genes separated from one another by no more than 2 Mb.

Table 5.: Terpene synthases found in clusters.

Genotype	chr	Start	Stop	TPS	Content
PR	5	0.9 Mb	2.6 Mb	10	3 × TPS10, 3 × Myrcene, 2 × Limonene, 2 × Myrcene
CP	5	1.4 Mb	2.7 Mb	10	3 × TPS10, 3 × Myrcene, 2 × Limonene, 2 × Myrcene
PR	6	80.4 Mb	82.9 Mb	9	2 × Humulene, 4 × Germacrene, 3 × Humulene
CP	6	75.3 Mb	78.3 Mb	10	2 × Humulene, 5 × Germacrene, 3 × Humulene
PR	9	59.2 Mb	59.4 Mb	5	5 × probable monoterpene synthase
CP	9	62.9 Mb	63.0 Mb	4	4 × probable monoterpene synthase

The TPS10 triplet in MTC5 includes one TPS10 and two TPS10-like predictions in both haplotypes.
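The MTC definition can be expressed as a simple single-linkage grouping along a chromosome. The Python sketch below is one possible reading of that definition, in which “separated by no more than 2 Mb” is applied to neighbouring genes; the function name and the toy positions are hypothetical.

```python
def find_clusters(positions, max_gap=2_000_000, min_size=4):
    """Group gene positions on one chromosome into clusters.

    Neighbouring genes closer than or equal to max_gap join the same
    group; groups with fewer than min_size genes are discarded.
    """
    clusters = []
    current = []
    for position in sorted(positions):
        if current and position - current[-1] > max_gap:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(position)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


# Toy gene start coordinates (bp) on a single chromosome.
toy_positions = [100_000, 900_000, 1_500_000, 2_600_000,
                 10_000_000, 50_000_000, 50_500_000]
print(find_clusters(toy_positions))
```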

To corroborate the predicted products, we queried a custom BLAST database, composed of 33 TPS characterized by heterologous expression, with the CDS of TPS found in cs10, PR, and CP. Where a gene contains multiple isoforms, we took isoform X1. To quantify similarity, we report the “Grade”, a proprietary metric within Geneious Prime that incorporates the length, e-value, and percent identity of the hit (Supplementary Table 4). We identified two notable polymorphisms in MTC5. The cs10 gene XP_030500628.1, predicted as “(−)-limonene synthase, chloroplastic like,” was polymorphic, with cs10 and CP having the best (99.8%) hit to CsTPS14: Canna Tsu (−)-limonene, while PR best matched (99.2%) to CsTPS1: Skunk (−)-limonene. Aligning the limonene synthases revealed, among other polymorphisms, a proline-serine substitution shared between PR and Skunk (Fig. 6).

PMC12869074 – jkaf286-F6 — **Fig. 6.:** Clustal Omega alignment of limonene synthases from cs10, PR, CP, Canna Tsu, and Skunk.
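For readers who wish to repeat the best-hit comparison without Geneious Prime, the following sketch selects one best hit per query from BLAST tabular output (the standard 12-column `-outfmt 6` layout). It is an assumption-laden stand-in: hits are ranked by bitscore rather than by the proprietary Grade, the `best_hits` function is hypothetical, and the two rows in `demo` are made-up placeholders, not results from this study.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query id: row} keeping the highest-bitscore hit for each query."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=OUTFMT6_FIELDS, delimiter="\t")
    for row in reader:
        query = row["qseqid"]
        score = float(row["bitscore"])
        if query not in best or score > float(best[query]["bitscore"]):
            best[query] = row
    return best


# Made-up placeholder rows for illustration; identifiers and values are not real data.
rows = [
    ["query_1", "ref_A", "99.2", "1800", "14", "0", "1", "1800", "1", "1800", "0.0", "3200"],
    ["query_1", "ref_B", "97.0", "1800", "54", "0", "1", "1800", "1", "1800", "0.0", "3000"],
]
demo = io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")
hits = best_hits(demo)
print(hits["query_1"]["sseqid"], hits["query_1"]["pident"])
```

Because bitscore and Grade weight length, e-value, and identity differently, the two rankings need not agree for closely related references.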

Within the same MTC, we also found that the cs10 gene XP_030501051.1, a predicted “myrcene synthase, chloroplastic,” in all cases matched to CsTPS15: Canna Tsu Myrcene; however, the Grade in cs10 and CP was quite good (96.7 and 96.6%), while in PR the Grade was much lower (75.7%). Aligning these synthases revealed several nonsense mutations in the PR allele (Fig. 7).

PMC12869074 – jkaf286-F7 — **Fig. 7.:** Clustal Omega alignment of myrcene synthases from cs10, PR, CP and Canna Tsu.

NLRs

We report 227 predicted NLRs in PR and 240 in CP, all of which are placed on the 10 chromosomes. Many of these predictions occur in clusters, which we call Major Resistance Clusters (MRC; ref. Christopoulou et al. 2015). Because NLRs are more abundant and more diffusely distributed than TPS, we forgo a formal definition and instead rely on simple visual inspection. Typically, clusters have 5 or more members and an NLR density of at least one NLR per 2 Mb.

In PR, 176 NLRs are found in 9 clusters, and in CP, 188 in 11 clusters, representing 77.5 and 78.3% of the total (Table 6). While most MRC have similar location and copy number between drafts, MRC1a has 8 NLRs in PR compared to just 2 in CP, and MRC5 and MRC7, which contain 4 and 14 NLRs in CP, appear to be absent from PR.

Table 6.: Location and copy number of major resistance gene clusters.

MRC	PR Start	PR Stop	PR NLRs	CP Start	CP Stop	CP NLRs
1a	37,732,869	38,500,894	8	33,106,874	33,180,695	2
1b	65,934,737	87,214,810	18	62,988,874	68,532,952	11
2a	3,093,234	3,150,113	5	1,080,960	1,154,696	2
2b	87,474,025	88,351,507	16	79,836,788	80,842,432	18
3a	39,013	4,472,107	49	8,181	7,630,351	48
3b	76,654,694	85,346,627	20	79,166,000	81,714,877	20
5				80,286,892	80,574,472	4
6a	804,274	9,998,555	30	806,742	10,013,473	35
6b	56,255,482	83,005,153	16	56,615,863	78,421,634	14
7				60,457,601	73,427,240	14
9b	66,855,455	67,250,977	14	69,189,104	71,136,706	20
TOTAL			176			188
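The cluster totals and percentages quoted above can be re-derived from the per-cluster counts in Table 6. The short check below transcribes the NLR columns of that table; the dictionary layout and the `summarize` helper are illustrative choices, not part of the original analysis.

```python
# NLR counts per cluster as (PR, CP), transcribed from Table 6.
# None marks a cluster that was not observed in that haplotype.
NLR_COUNTS = {
    "1a": (8, 2), "1b": (18, 11), "2a": (5, 2), "2b": (16, 18),
    "3a": (49, 48), "3b": (20, 20), "5": (None, 4), "6a": (30, 35),
    "6b": (16, 14), "7": (None, 14), "9b": (14, 20),
}
# Genome-wide NLR predictions reported in the text.
TOTAL_NLRS = {"PR": 227, "CP": 240}


def summarize(column):
    """Return (number of clusters, clustered NLRs) for column 0 (PR) or 1 (CP)."""
    counts = [pair[column] for pair in NLR_COUNTS.values() if pair[column] is not None]
    return len(counts), sum(counts)


for column, haplotype in enumerate(("PR", "CP")):
    n_clusters, clustered = summarize(column)
    share = 100 * clustered / TOTAL_NLRS[haplotype]
    print(f"{haplotype}: {clustered} NLRs in {n_clusters} clusters ({share:.1f}%)")
# PR: 176 NLRs in 9 clusters (77.5%)
# CP: 188 NLRs in 11 clusters (78.3%)
```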

Repetitive elements

We summarize EDTA and LAI results, and include for comparison EDTA results from the Salk Institute Pangenome (ref. Lynch et al. 2025), which represent the average of 193 assemblies (Table 7).

Table 7.: Quantification of repeat element composition of Punto Rojo, Cherry Pie, and the average of the Salk Institute assemblies.

Element	PR	CP	Salk
LINE
L1	1.56%	2.00%	NR
LTR
Copia	12.16%	13.72%	16.27%
Gypsy	16.33%	11.62%	19.70%
Unknown	32.53%	35.02%	16.51%
TIR
CACTA	0.97%	1.31%	3.12%
Mutator	1.97%	2.98%	6.03%
PIF_Harbinger	0.49%	1.25%	1.09%
Tc1_Mariner	0.07%	0.02%	0.37%
hAT	0.97%	0.93%	1.95%
nonTIR
helitron	1.68%	1.52%	2.84%
repeat_region	2.70%	2.23%	NR
Total	71.41%	72.60%	67.89%
LAI
Raw	23.70	23.12	NR
Final	18.84	19.22	NR

NR: not reported.

Comparative genomics

PR and CP were scaffolded to and then aligned against the set of chromosome-scale pseudomolecules shown in Supplementary Table 1. SNPs and larger variants are summarized in Table 8.

Table 8.: Structural and sequence variation as reported by SyRI for PR and CP.

#Variation_type	PR vs Salk			CP vs Salk			PR vs CP
#Variation_type	Count	Length_ref	Length_qry	Count	Length_ref	Length_qry	Count	Length_ref	Length_qry
#Structural annotations
Syntenic regions	2,932	506 Mb	521 Mb	1,295	662 Mb	671 Mb	2,947	488 Mb	493 Mb
Inversions	81	9.50 Mb	9.80 Mb	48	1.93 Mb	2.02 Mb	88	44.7 Mb	48.5 Mb
Translocations	3,208	39.5 Mb	39.4 Mb	866	20.6 Mb	20.5 Mb	3,077	45.6 Mb	45.3 Mb
Duplications (reference)	451	3.7 Mb	–	126	1.3 Mb	–	761	9.45 Mb	–
Duplications (query)	1,568	–	8.0 Mb	1,169	–	7.2 Mb	1,877	–	10.4 Mb
Not aligned (reference)	5,644	176 Mb	–	1,709	48 Mb	–	5,782	219 Mb	–
Not aligned (query)	7,387	–	219 Mb	3,146	–	74 Mb	7,440	–	181 Mb
#Sequence annotations
SNPs	2,319,292	2.32 Mb	2.32 Mb	2,356,727	2.36 Mb	2.36 Mb	2,527,168	2.53 Mb	2.53 Mb
Insertions	188,373	–	2.37 Mb	242,894	–	1.91 Mb	285,443	–	2.68 Mb
Deletions	290,662	2.99 Mb	–	364,637	2.04 Mb	–	246,441	2.22 Mb	–
Copygains	141	–	0.91 Mb	90	–	0.58 Mb	169	–	2.20 Mb
Copylosses	133	0.55 Mb	–	85	0.53 Mb	–	162	1.70 Mb	–
Highly diverged	38,222	273 Mb	288 Mb	49,826	165 Mb	174 Mb	43,338	317 Mb	325 Mb
Tandem repeats	14	0.01 Mb	0.02 Mb	8	0.00 Mb	0.00 Mb	10	0.02 Mb	0.02 Mb

To visualize macrosynteny, dotplots were generated for each draft relative to its scaffolding substrate (Supplementary Figs. 2 and 3), and common kmers were visualized with ntSynt (Fig. 8).

PMC12869074 – jkaf286-F8 — **Fig. 8.:** Alignment of 24-mers found in all input assemblies. In addition to PR and CP, we include the genomes whose chromosomes contributed to the scaffolding substrate (Supplementary Table 1), and also Pink Pepper (PP) and cs10, the current and previous NCBI reference assemblies. Black arrows indicate reverse-complemented chromosomes.

Variation between the two haplotypes was plotted with SyRI and plotsr (Fig. 9), and a Circos plot was generated that, in addition to synteny and interchromosomal translocations, includes tracks for contig boundaries, gene density, and the location of TPS and NLR genes (Fig. 10).

PMC12869074 – jkaf286-F9 — **Fig. 9.:** Synteny and rearrangement between PR and CP homologs, filtered above 100 kb.

PMC12869074 – jkaf286-F10 — **Fig. 10.:** Circos plot showing, from center outwards, homologous regions (grey) and interchromosomal translocations (green), both filtered above 25 kb, contig boundaries (blue), gene density (heatmap, where red is high and blue is low), and NLRs (red) and TPS (purple).

Discussion

HMW gDNA prep

Our method produced DNA of adequate length but substandard purity. Given the low yield of 34 Gb, it would be beneficial to refine the technique further, as recent reports indicate that PromethION yields of over 100 Gb are now possible (ref. Belser et al. 2021; ref. van Rengs et al. 2022). Following nuclei isolation, performing the organic extraction with phenol:chloroform (ref. Zerpa-Catanho et al. 2021), rather than chloroform alone, may remove carbohydrates and proteins more efficiently. In addition, dark incubation of the shoots for 3 d before purification may reduce carbohydrate content (ref. Li et al. 2020).

The decision not to fragment the HMW DNA surely decreased yield, because nanopores fail faster when reading ultra-long fragments (ref. Wang et al. 2021b). However, as the cannabis genome is known to be littered with repeats of 30 to 45 kb (ref. Grassa et al. 2021), the 4 × coverage in ultra-long (>50 kb) reads obtained here is likely sufficient to resolve some of the long repeats that might otherwise collapse falsely or fail to extend. Sequencing unfragmented DNA therefore appears to be the optimal use of the ONT platform, with the caveat that sequence yield is a function of purity.
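Ultra-long coverage of the kind quoted above is simply the summed length of reads exceeding a threshold divided by the size of the target sequence. The sketch below shows that arithmetic; the `fold_coverage` function, the read lengths, and the toy genome size are hypothetical and chosen only to keep the numbers easy to follow.

```python
ULTRA_LONG_BP = 50_000  # threshold used in the text for ultra-long reads


def fold_coverage(read_lengths, genome_size, longer_than=0):
    """Fold coverage contributed by reads strictly longer than `longer_than` bases."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(n for n in read_lengths if n > longer_than) / genome_size


# Toy values for illustration only (not the read set from this study).
toy_reads = [60_000, 20_000, 120_000, 45_000, 80_000]
toy_genome_size = 100_000

total_cov = fold_coverage(toy_reads, toy_genome_size)
ultra_cov = fold_coverage(toy_reads, toy_genome_size, longer_than=ULTRA_LONG_BP)
print(f"total: {total_cov:.2f}x, ultra-long: {ultra_cov:.2f}x")
```

With binned trio data, the same calculation would be run separately on each haplotype's reads against that haplotype's assembly length.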