
Nanopore Notes

People ask me a few things from time to time about nanopore sequencing. I've started collating my responses and presentations, which you can find here. Consider this an "AQ", rather than a "FAQ".

Walkthroughs / Tutorials / Protocols

Nanopore Papers (peer reviewed, with me as a co-author)

  • Tree Lab in Sub-Saharan Africa paper
  • Learning how to get good yield paper
  • Humboldt colloquium eBook, chapter 5
  • Nippo whole genome paper
  • Chimeric reads from amplicons paper
  • R9 E. coli sequencing paper
  • Nippo mitochondrial genome paper
  • Multi-lab E. coli sequencing paper
  • Influenza paper
  • Mitochondrial transfer paper

Nanopore Posters (not peer reviewed)

  • Transcript expression changes (QRW) poster
  • Cells with/without mtDNA (LC) poster
  • Nippo whole genome (QRW) poster
  • Nippo whole genome (LC) poster
  • Nippo mitochondria (AGTA) poster
  • Music from DNA (LC) poster
  • MinION sequencing (WHBRS) poster
  • Rhapsody of errors (LC) poster

Nanopore Presentations (not peer reviewed)

  • Chimeric Reads and Where to Find Them slides
  • Reflections on repetitive repeats with REPAVER slides
  • A pore update (MIMR) slides
  • Women In Nanopore Sequencing (Otago) slides
  • Sequencing from food to figures (linux.conf.au) video; data
  • Repetitive sequences (QRW) slides
  • Update on nanopore sequencing (MIMR) slides
  • Repetitive sequences (MIMR) slides
  • Cells with/without mtDNA (MIMR) slides
  • Private sequencing in a public world (Humboldt) slides
  • Nippo whole genome (LC Lightning talk) slides
  • The jungle of MinION data (PoreCamp AU) slides
  • Bioinformatics Institute seminar (AUT) video
  • Sensor-stimulating Sequencing (TEDxWellington) video
  • Rhapsody of errors (NZ NGS) slides

Clive Brown's Presentations

Nanopore Community Meeting / London Calling Presentation notes

Email / message discussions

The following bits and pieces are collected from piecemeal discussions that I've had with various people. The comments represent only my side of the discussion, with some additional changes where clarification is necessary.

2024-Mar-25

Short Read Sequencing / Flow Cell Yield

Nanopore sequencers are high-throughput sequencing devices that can handle an extremely wide range of read lengths, from a few tens of bases in "short read mode" (ONT's software claims 20bp) up to a few megabases. The R10.4.1 flow cells have a channel sequencing rate that is fixed (through chemistry and temperature settings) at about 400 bases per second, so the theoretical maximum sequencing yield is about 35 Gb over the course of a 2-day run. A more typical yield for well-prepared flow cells (from experienced users) is 5-15 Gb per run, with the lower yields typically happening for both short and really long reads, for different reasons.

For short reads, switching time (between one sequence finishing and the next starting) has a big impact on sequencing yield. Beyond the time spent switching (when the pores are not sequencing anything), the open pores during the switching period draw many more carrier ions through, which quickly depletes the flow cell.

Assuming a yield of 1 Gb for a first run, and every read 250bp (i.e. 200bp + 50bp for adapters), I would expect around 4 million reads to come off the sequencer, with a mean of about 20,000 reads per amplicon if there are 200 amplicons loaded at identical concentrations. 20,000 times average coverage gives a lot of room for sample preparation errors before they begin to impact the outcome of most downstream analyses.

Processing reads with different composition is something that downstream analysis tools do really well. A typical cDNA sequencing run will have millions of reads from tens of thousands of different transcripts that are binned and separately processed. A 200-amplicon run with easily-distinguishable compositions is a simpler form of that. I'm leaning on "easily distinguishable composition" quite heavily here: if there's a chance that amplicons could be confused, then they need some other distinguishing factor added (e.g. native barcode adapters, or additional primer tails).
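That back-of-envelope arithmetic can be sketched in shell (the 1 Gb yield, 250 bp read length, and 200 amplicons are the assumed figures from above):

```shell
#!/bin/sh
# Back-of-envelope read counts for a short-amplicon run
# (assumed figures: 1 Gb first-run yield, 250 bp reads, 200 amplicons)
yield_bases=1000000000
read_length=250          # 200 bp amplicon + 50 bp of adapters
amplicons=200

reads=$((yield_bases / read_length))
per_amplicon=$((reads / amplicons))
echo "total reads: ${reads}"                # total reads: 4000000
echo "reads per amplicon: ${per_amplicon}"  # reads per amplicon: 20000
```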

2024-Mar-24

cDNA read demultiplexing

My demultiplexing scripts have been detecting SSP-based UMIs and adding them to the demultiplexed fastq metadata for a while now, but I've finally gone the rest of the way and produced a UMI lookup table as one of the default outputs. That table can be used to inject UMIs into the 'RX' field of SAM alignments using another accessory script that I wrote:

# Set up shell variable to script location
$ scripts="$HOME/bioinfscripts"

# Demultiplex reads
$ ${scripts}/fastq-dental.pl called_${x}.fastq.gz -barcode ${scripts}/dental_db/barcode_full_PBK114.fa -adapter ${scripts}/dental_db/adapter_seqs.fa -mat PCB114.mat

# Filter out unnecessary barcodes manually using nano
$ cp demultiplexed/counts.orig.txt demultiplexed/barcode_counts.txt
$ nano demultiplexed/barcode_counts.txt

# Orient reads and generate per-barcode UMI tables
$ for bc in $(awk '{print $2}' demultiplexed/barcode_counts.txt);
     do echo ${bc};
     ${scripts}/fastq-dental.pl -orient demultiplexed/reads_${bc}.fq.gz \
          -muxdir demultiplexed/oriented_${bc};
   done

# Map reads to the transcriptome, and inject UMIs after mapping
mkdir -p mapped
for bc in $(awk '{print $2}' demultiplexed/barcode_counts.txt);
   do echo ${bc};
   minimap2 -t 10 -a -x map-ont reference/gencode.vM34.transcripts.fa.idx \
     demultiplexed/oriented_${bc}/reads_BCoriented.fq.gz | \
     ${scripts}/samAddUMIs.pl -umi demultiplexed/oriented_${bc}/UMIs_all.txt.gz | \
     samtools sort > mapped/reads_${bc}_oriented_vs_M34-t.bam;
done

I presume somewhere out there are downstream tools that will process the RX tags to do per-transcript molecule filtering, but it's unlikely that's done by any nanopore-specific quantification tools (e.g. bambu, oarfish), because UMIs are only a recent addition to nanopore sequencing.

Aside: putting a bit of work into training the read aligner (rather than the basecaller) can help substantially in improving the apparent accuracy of reads. My demultiplexing scripts use last-train to work out appropriate parameters for read alignment, and are here shown to increase the substitution identity for barcode/adapter detection from 99.8797% (q29) up to 99.9445% (q33):

Parsing barcodes and adapters... done; successfully generated LAST database.
.......... Training LAST using first read batch (10000 reads):
  # substitution percent identity: 99.8797
  # substitution percent identity: 99.9309
  # substitution percent identity: 99.9349
  # substitution percent identity: 99.9356
  # substitution percent identity: 99.9357
  # substitution percent identity: 99.9358
  # substitution percent identity: 99.9358
  # substitution percent identity: 99.9445
Done - trained first read batch in 1.1 seconds.
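The q-scores in parentheses are just the Phred transform of the substitution identity, Q = -10·log10(1 - identity/100). A one-liner to convert:

```shell
#!/bin/sh
# Convert a substitution percent identity into a Phred-style Q score
identity_to_q() {  # arg: percent identity
  awk -v pid="$1" 'BEGIN { printf "q%.0f\n", -10 * log(1 - pid / 100) / log(10) }'
}

identity_to_q 99.8797   # before training -> q29
identity_to_q 99.9445   # after training  -> q33
```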

2024-Mar-23

Native barcoding / First Steps

As a general rule for science experiments, ask first before buying reagents, and especially before using (or buying) expensive machines.

Do a lambda run first. That will help you understand the timings, and what a good sequencing run looks like. It will work out cheaper in both time and cost, because you won't be fussing around trying to correct things that are part of normal sequencing, or failing to correct things that are unexpected.

I don't really understand why so many people underestimate the complexity of sample prep and data analysis to such a degree. Lambda sequencing is simple: it's a single sample, and a single moderately long sequence of known length. 200 amplicons is complex; 200 different amplicons is (a bit) easier than a single amplicon in 200 samples. Flow cell washes are complex. Barcoding is complex. For nanopore, a short read sequencing run (< 1kb) is complex.

If at all possible, get someone else who is familiar with nanopore sequencing to do the sample prep the first time, and watch them. This will help you understand the many protocol issues that are unconsciously glossed over by experienced users. Failing that, make your sequencing needs simpler, especially for a first sequencing run.

I understand that most of these suggestions will involve a lot of waiting, or might not be possible (e.g. if samples have already been prepared); that's why it's important to ask questions about the experiment before paying money for reagents and sequencing:

  • Flow cell washes (especially for amplicon or plasmid runs) can be avoided by buying Flongle flow cells (adapter is $1460 USD and comes with 12 flow cells; after that it's $810 per 12 flow cells).
  • If amplicons can be redesigned to be longer, then do that. If you can get them to 400bp or longer, then you'll have fewer issues with short reads rapidly depleting the flow cell sequencing capacity. Depending on applications, you may also be able to use the rapid barcoding [fragmentation] kit instead of the native barcoding kit, which is a simpler, quicker sample preparation process.
  • When sequencing lots of samples, consider redesigning primers to incorporate barcode sequences. For 200 samples, you could do it all in one sequencing run using the 24-barcode NBD kit if you had 12 separate primers with unique barcodes on them. Downstream analysis will be more complex, but the sequencing will be easier.

But... assuming none of that can be done, and you want to leap head-first into this, here are some tips for first runs from short reads:

  • Short reads (especially < 200bp) will use the flow cell up faster. As a consequence, sequencing yield will be reduced, and this cannot be recovered from flow cell washes. Expect sequencing output to drop substantially across multiple washes.
  • One way to alleviate the depletion issue is to spike in a longer sequence, e.g. add 10% lambda DNA to your sample. This will keep pores occupied when they're not being used for sequencing the target DNA, which will increase the life of the flow cell and the yield of the samples.
  • Your first library preparation will take a lot longer than what is mentioned in the protocol. There are "one-pot" approaches (where sample is kept on beads rather than being transferred into new tubes) which will slightly reduce time, but a first run with 24 samples will take ages. Block out at least half a day for sample preparation; all the reagent hunting, tube switching, checking protocols, and pipetting will take a long time.
  • As far as I'm aware, NEB reagents have produced the best results according to ONT's internal research. Going off-protocol for a naive first attempt at nanopore sequencing is a bad idea. There's a huge amount of detail in the on-line methods, and it's quite easy to accidentally miss the most important things (e.g. keeping ethanol away from the adapters) because the protocols make it seem like everything is important.

2024-Mar-16

RBK Multiplexing / Serial Sequencing

Long DNA molecules (>100 kb) won't kill a pore, but they will stop it from sequencing temporarily. A long sample prep combined with regular use of the nuclease wash kit will allow sequencing to continue for a long time.

Pore life is degraded faster by the pores staying unoccupied during a sequencing run, which happens more often with short molecules (<1 kb).

However, if you have different-sized RBK libraries, you'll get higher yield by running them in equimolar concentrations at the same time. One way to achieve this is to make a rough guess, sequence for half an hour, then wash the flow cell and re-adjust the pool (i.e. re-pool, adapt, mix with sequencing buffer) based on those results.
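One way to sketch that re-adjustment: scale each library's volume by (target share / observed share) from the test run. The 60/30/10 read splits, three-library pool, and 10 µl base volume below are made-up numbers for illustration:

```shell
#!/bin/sh
# Re-pooling sketch: after a short test run, scale each library's volume by
# (target share / observed share) to even out the final pool.
rebalance() {  # args: observed_read_fraction base_vol_ul
  awk -v f="$1" -v v="$2" -v n=3 'BEGIN { printf "%.1f\n", v * (1 / n) / f }'
}

# Three libraries that produced 60% / 30% / 10% of the test-run reads:
rebalance 0.6 10   # over-represented library  -> 5.6 ul
rebalance 0.3 10   # about right               -> 11.1 ul
rebalance 0.1 10   # under-represented library -> 33.3 ul
```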

2024-Mar-13

cDNA Sequencing

The cDNA synthesis steps add ONT anchor sequences onto the template sequences. These allow the ONT universal rapid barcode primers to bind and extend. The strand-switch primer includes an additional unique molecular identifier (UMI; represented by 'V' in the SSP sequence below) to help with read deduplication during post-sequencing QC, and also includes three 2' O-methyl RNA Gs to anchor onto the end of sequences that have been created by the reverse transcriptase:

SSP 5' - TT-TCTGTTGGTGCTGATATTGC-TTTVVVVTTVVVVTTVVVVTTVVVVTTT-mGmGmG - 3'
VNP 5' - AC-TTGCCTGTCGCTCTATCTTC-TTTTTTTTTTTTTTTTTTTTVN - 3'

[note: the initial TT/AC were present in earlier barcode primers, but not the primers in the newest V14 kits]

The TTT...TTTVN region of the VNP can be replaced by gene-specific sequences if targeted amplification is desired, bearing in mind that the reverse-transcription reaction happens at 42ºC.

If making these primers yourself (which I would not advise for SSP, due to its more complicated structure), bear in mind that they should be HPLC purified. Based on various protocols, I believe the primers should be diluted / reconstituted to 2µM. Adding an additional 5' phosphate to the primers may also be a good idea, as this will allow them to then be directly ligated to ligation adapters using the ligation sequencing kit.
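A side note on UMI capacity: the SSP above has 16 'V' positions (IUPAC V = A, C or G), so the theoretical UMI space is 3^16:

```shell
#!/bin/sh
# UMI diversity of the SSP: 16 'V' bases, each one of A/C/G (IUPAC V)
awk 'BEGIN { printf "%d distinct UMIs\n", 3^16 }'   # 43046721 distinct UMIs
```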

2024-Mar-07

Flongles should be loaded with 5-10 fmol, according to the LSK amplicon protocol:

Prepare your final library to 5-10 fmol in 5 µl of Elution Buffer (EB). Important: We recommend loading 5-10 fmol of this final prepared library onto the R10.4.1 flow cell.
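Hitting a 5-10 fmol target means converting from the ng/µl that a Qubit reports. A rough conversion, assuming ~650 g/mol per base pair of double-stranded DNA (the example masses and lengths are hypothetical):

```shell
#!/bin/sh
# Approximate fmol from ng for double-stranded DNA:
#   fmol = ng * 1e6 / (length_bp * 650)
ng_to_fmol() {  # args: mass_ng length_bp
  awk -v ng="$1" -v len="$2" 'BEGIN { printf "%.1f\n", ng * 1e6 / (len * 650) }'
}

ng_to_fmol 100 10000   # 100 ng of 10 kb fragments -> 15.4 fmol
ng_to_fmol 20 5000     # 20 ng of 5 kb fragments   -> 6.2 fmol (in range)
```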

Flow cells can be overloaded, which is a particular problem when the input DNA is long and unfragmented. Long DNA can knot up and block pores, requiring wash kits to digest the DNA to clear the pores (assuming the enzymes can get access to the DNA). Too much input DNA also means that not enough will be adapted, meaning that the unadapted DNA competes for pores, and the effective loading fraction will be reduced - this is more of a problem with the rapid adapter kits, where the adapter is a limiting factor in the chemistry.

ONT doesn't believe molarity is important for transposase kits. I do, especially when reads are short enough (which is certainly the case for the 1200bp amplicons in the midnight kit).

Regardless, depth is a matter of sequencing time, not loading concentration. The flow cell should be loaded with DNA that has a high enough molarity (or concentration) that it can sequence at its maximum speed, regardless of how many amplicons are being sequenced.

For planning and grant purposes, assume that the Flongle flow cell will produce 200 Mb of output, and work off that. For example with 50 amplicons, that means that the budget per amplicon is 4 Mb of sequence on average, which gives 3000X worth of coverage using 1200bp amplicons (e.g. 15 samples at 200X average coverage).
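That budgeting can be scripted; the text rounds 3333X down to 3000X and 222X down to 200X, the integer arithmetic below keeps the exact quotients:

```shell
#!/bin/sh
# Flongle planning budget: 200 Mb split across 50 amplicons of 1200 bp
yield_bases=200000000
amplicons=50
amplicon_length=1200
samples=15                # samples sharing each amplicon's budget

per_amplicon=$((yield_bases / amplicons))       # 4000000 bases per amplicon
coverage=$((per_amplicon / amplicon_length))    # ~3333X per amplicon
per_sample=$((coverage / samples))              # ~222X per sample
echo "${per_amplicon} bases -> ${coverage}X (~${per_sample}X per sample)"
```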

2023-Nov-13

Advice for first-time users

  • Sequence lambda first, ideally with the ligation kit if you can stomach the cost of ligation reagents. It will likely be a good run, and will help show you what a good run looks like. A 6-hour run with LSK114 settings is fine; the flow cell can be washed and re-used (using the flow cell wash kit) after the lambda run is over.

  • Flow cell degradation over the course of a run is normal. The degradation can be more or less extreme depending on the length of sequences going through the pore (longer is usually better, up to about 50kb), the number of pores that are occupied and sequencing (more is better), sample contamination, flow cell quality, running temperature, applied voltage, and probably a few other things.

  • Because flow cell and pore degradation is normal, do not expect a flow cell wash to recover all the pores on the flow cell. The wash kit doesn't replenish flow cells back to full sequencing capacity; it just removes the sequencing library and a bit of debris (including DNA) that might be blocking pores. The flow cells contain biological proteins that can be destroyed, and destroyed pores will not come back to life after a flow cell wash.

  • The time you run a flow cell for is experiment dependent. 10-20kb sequences will take less than a minute to get through the sequencing pores (at 400 bases per second), so unless you're sequencing megabase-length DNA, you probably don't have to worry about run time interfering with the ability to sequence DNA. A flow cell can be run for anything from about 5 minutes (to give sequences a chance to get to the pores) up to 72 hours (or longer, in the rare circumstances that the flow cell survives for longer). I usually stop MinION runs once they go below about 10 pores actively sequencing, as there's not much additional benefit gained from sequencing beyond that point.

2023-Nov-01

Bacterial genome assembly

Running multiple samples on the same flow cell is possible using barcoding. The most commonly used options are rapid barcoding (cheaper per run/sample), and native ligation barcoding (more yield per run/sample, potential for higher accuracy). If you don't have much DNA for each sample (unlikely for bacterial isolates), then rapid PCR barcoding can be used instead (shorter reads, but should be plenty long enough for most stuff that gets done with bacteria).

Ryan Wick (from Kathryn Holt's lab) has been doing a lot of research with different base calling models and assemblers for bacterial assembly. Their lab has produced a tutorial on how to assemble bacterial genomes:

https://github.com/rrwick/Perfect-bacterial-genome-tutorial/wiki

Looking at his most recent accuracy post might be helpful as well:

https://rrwick.github.io/2023/10/24/ont-only-accuracy-update.html

2023-May-27

Plasmid sequencing tips

When preparing plasmid DNA for sequencing, I follow one piece of advice that Dr. Divya Mirrington (ONT) gave me about pooling: create a pooled sample with volumes that you're confident about, then remove an aliquot from that for adding the remaining reagents [paraphrased]. I don't do any additional cleanup for purified plasmid DNA; they tend to sequence very well on flow cells without that cleanup.

My key requirement for plasmid sequencing is a concentration of >20 ng/μl (ideally by qubit or quantus). Concentrations over 100 ng/μl should be diluted down. If the plasmids can be diluted down to all exactly the same concentration (but at least 20 ng/μl), or they're all similar lengths, then that makes creating an equimolar pool much easier.

For more details, see my Plasmid Sequence Analysis protocol.

According to ONT's RBK114 gDNA protocol...

We recommend loading a maximum of 200 ng of the DNA library for Flongle flow cells. If necessary, take forward only the necessary volume for 200 ng of DNA library and make up the rest of the volume to 5.5 µl using Elution Buffer (EB).

It does surprise me a little that this amount doesn't seem to have changed for RBK114 vs RBK004. The newer kits are meant to be more sensitive, and may perform better (with less blocking) if less DNA is loaded onto the flow cell.

2023-Apr-03

Library concentration with rapid adapters

The thing that matters most is the number of adapted molecules going onto the flow cell. This molecule count can be reached any number of ways, so for more samples the amount added per sample can be reduced.

With the rapid kits, the amount of adapter added is the limiting factor, and there should always be an excess of molecules with adapter bonding sites in the prepared library. For this reason, it's more important to make sure that samples are loaded in equimolar quantities than any particular concentration (although I have found that 20-100ng/µl seems to work well for most situations with the rapid kits).

2023-Apr-01

Almost any nanopore kit can be used for metagenomics if you're brave enough.

It used to be the case that non-barcoded kits were $600, and barcoded kits were $650. The standard ligation kit (SQK-LSK114) is still $600, but the variation in barcode count and volume now means that other kit costs are less predictable. Prices are always available on the Nanopore store (no login needed):

https://store.nanoporetech.com/productDetail/?id=ligation-sequencing-kit-v14

The ONT kits and flow cells are only part of the cost for metagenomics, and for the ligation and native barcoding kits you'll need to purchase additional reagents for sequencing, in particular ligation enzyme, and end repair / tailing enzyme.

For our river water studies, we used a custom enzyme cocktail to break open the cells, followed by a Qiagen Power water kit to extract DNA, followed by the rapid PCR barcoding kit for sequencing:

https://doi.org/10.1093/gigascience/giaa053

Nowadays Sigma sells Metapolyzyme ($491 NZD), which is a pre-prepared enzyme mix that does a better job than the cocktail we used:

https://www.sigmaaldrich.com/NZ/en/product/sigma/mac4l

The Power water kit for DNA extraction is here ($671). Depending on your application, there may be a better kit (e.g. PowerSoil), or you might be able to get away with a simpler DNA extraction protocol:

https://www.qiagen.com/us/products/discovery-and-translational-research/dna-rna-purification/dna-purification/microbial-dna/dneasy-powerwater-kit?catno=14900-50-NF

The rapid PCR barcoding kit is here ($650). Note that this uses Kit9 adapters, so will only work with R9.4.1 flow cells, either MinION flow cells ($900 each), or Flongle flow cells ($810 for 12, or $1500 for the Flongle adapter (a one-time purchase) plus 12 flow cells), depending on the required yield. The rapid PCR barcoding kit doesn't include DNA polymerase, so that will need to be purchased as well if you don't have it:

https://store.nanoporetech.com/productDetail/?id=rapid-pcr-barcoding-kit

https://store.nanoporetech.com/productDetail/?id=flongle-flow-cell-pack

As ONT is in the process of shifting over to kit14 and R10.4.1, I expect that the rapid PCR barcoding kit will eventually be replaced with a kit14 equivalent. The cost of the kit should be the same, though.

In addition to that, you'll need to purchase other general lab consumables and have access to the required general lab equipment (and staff labour) for doing all these things. Beyond that, there are file storage costs and computation costs (both for sequencing and data analysis) that may need to be considered, especially if you've never done any sequencing before.

2023-Mar-31

Rapid barcoding

When pooling multiple samples together on the same run, one easy approach is to dilute all samples to 20ng/μl, create a pool with equal amounts for all samples, then take 10μl from the pool.

If one of the samples has a lower concentration than 20 ng/µl then more maths and more pipetting of small[er] volumes is needed.

What I find easiest is to work out how much of each other sample I would need to add to match the amount of DNA in 1 µl of the lowest-concentration sample.

As one example, let's say there are 12 samples: one is at a concentration of 12 ng/µl, and the others are all at 20 ng/µl. The amount of DNA in 1 µl of the 12 ng/µl sample is the same as in 0.6 µl of the other samples. If there is 10 µl of everything, then I would add 10 µl of the 12 ng/µl sample to 6 µl of each of the other samples*, and then take 10 µl from the resultant pool.

The most important thing to get right is equimolar concentration for all samples. For targeted sequencing (like 16s), this is easier because the same length can be assumed for all samples, so equimolar = equal concentration = equal ng quantity in an arbitrary volume.

* or proportionally less for all samples, depending on how much excess liquid I want and how confident I am about small-volume pipetting
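Generalising the example above: the volume of each sample to add is a chosen base volume scaled by (lowest concentration / sample concentration). A hypothetical sketch:

```shell
#!/bin/sh
# Volume of each sample to pool, normalised to the lowest concentration:
#   vol = base_vol * lowest_conc / sample_conc
pool_volume() {  # args: base_vol_ul lowest_conc sample_conc
  awk -v v="$1" -v lo="$2" -v c="$3" 'BEGIN { printf "%.1f\n", v * lo / c }'
}

pool_volume 10 12 12   # the 12 ng/ul sample  -> 10.0 ul
pool_volume 10 12 20   # each 20 ng/ul sample -> 6.0 ul
```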

2023-Mar-26

Amplicon / plasmid sequencing on Flongle flow cells

Even when we have wildly different read counts for 12 samples, we'll usually get 100X coverage from the lowest count sample after an overnight Flongle run. 100X coverage for the highest count sample is often reached in about half an hour. I think 24 samples per run is feasible on a Flongle, especially if low-coverage samples can be re-run separately on another flow cell. More than that is getting in the realm of what might be better using a 96-barcode kit and a MinION flow cell (but it still might be worth making the attempt on Flongle first).

My Flongle yield estimate for the purpose of funding and experiment planning is 200 Mb. If an experiment can work with 200 Mb yield, it's probably going to be okay on a Flongle, and if 200 Mb is plenty, it makes a lot more sense to use a Flongle flow cell rather than a MinION flow cell. Actual yields are often higher than that, but I find it better to prepare for the [almost] worst case.

2023-Mar-16

Chimeric reads

Chimeric reads can happen during sample preparation (where sequences are enzymatically ligated together prior to sequencing), and also in-silico as a result of the pores getting reloaded too fast and the software not detecting the switch.

Most of the time this can be picked up by mapping reads to the ONT adapter sequences, and discarding any reads with internal adapter sequences (rather than the expected end-joined adapter sequences).

Our chimeric reads paper demonstrates this phenomenon on the early ONT ligation kits. The kits are different now, but the ligation part of sample prep is still essentially the same:

https://doi.org/10.12688%2Ff1000research.11547.2

The occurrence of sample prep chimeric reads can be substantially reduced (or possibly eliminated) by using rapid adapters, which use a chemical reaction to join adapters onto DNA that has added adapter bonding sites. Those sites can be added via PCR, fragmentation, ... or ligation [probably not the best idea due to the increased risk of chimeric reads].
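A hedged sketch of the internal-adapter filter mentioned above (the PAF column positions are standard minimap2 output; the file names and the 50 bp end margin are assumptions for illustration):

```shell
#!/bin/sh
# Sketch: list reads whose adapter hit is internal rather than at an end.
# The PAF input could come from mapping reads against the adapter sequences:
#   minimap2 adapters.fa reads.fq.gz > adapter_hits.paf
internal_adapter_reads() {  # arg: PAF file of adapter hits on reads
  # PAF columns: $1 read name, $2 read length, $3/$4 hit start/end on read
  awk -v margin=50 '{
    if ($3 > margin && $4 < $2 - margin) print $1
  }' "$1" | sort -u
}
```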

Sample QC

If you want to do QC on nanopore reads (especially ones that don't have constant lengths), please use something that is specifically designed for nanopore reads. NanoPlot, for example:

https://github.com/wdecoster/NanoPlot

https://doi.org/10.1093/bioinformatics/bty149

FastQC is designed for short reads. There is an expectation with FastQC that all reads will have identical length, which makes it easy to see from a position / quality plot where the issues are in reads. For long-read sequencing (in this case nanopore), this does not apply: reads are variable length with no length-based quality degradation (i.e. similar base calling qualities across the full length of the read). This means that the average base quality of a read is a good enough measure for QC, which is why it is used by nanopore-specific QC programs. Position-based QC is not appropriate, because the different read lengths lead to shot noise effects.

[While it might be useful to look at position-specific read qualities near the start and end of a read (where pore loading and ejecting may contribute to lower qualities), I'm not aware of any programs that do this.]
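On "average base quality": the long-read convention (and, as far as I know, what tools like NanoPlot do) is to average the per-base error probabilities and convert back to a Phred score, rather than averaging Q values directly, since the worst bases dominate the error rate. A minimal awk sketch over a FASTQ quality string:

```shell
#!/bin/sh
# Mean read quality: Phred+33 characters -> error probabilities -> mean -> Q
mean_read_q() {  # arg: a FASTQ quality string
  awk -v q="$1" 'BEGIN {
    for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    n = length(q)
    for (i = 1; i <= n; i++) {
      phred = ord[substr(q, i, 1)] - 33
      sum += 10^(-phred / 10)      # per-base error probability
    }
    printf "%.1f\n", -10 * log(sum / n) / log(10)
  }'
}

mean_read_q '??????'   # six Q30 bases -> 30.0
mean_read_q '++555'    # two Q10 + three Q20 -> 13.4 (worst bases dominate)
```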

2023-Mar-12

Flow cell storage

I only remove waste liquid before returning flow cells, or if there's a chance it could overflow if I put more liquid into the sample loading port. Leaving the liquid in there provides a bit of protection against the flow cell drying out, although it's nowhere near as good as a piece of good sealing lab tape over the waste ports.

If you want to store exactly at 4ºC, you should make sure that your fridge doesn't dip below that temperature (i.e. use a temperature logger), because flow cells can be destroyed by getting too cold - I had this problem with a bar fridge that I used for demonstrating live sequencing during a conference. If you can, store the flow cells in the middle of the temperature range indicated on the flow cell packet.

2023-Mar-09

Wildflower flow cell patterns

As I've reported briefly in the past, we've been seeing a lot of R9.4.1 flow cells with what I like to call a "wildflower" pattern, because the detailed channel plot looks a bit like a field of wildflowers, with a large mix of different types of channel classification and no obvious pattern.

What happens with these flow cells is that they tend to crash quickly. We see steep decreases in sequencing capacity, in this most recent case losing about 50% of sequencing capacity overnight, despite having a well-loaded flow cell (this was with a rapid adapter cDNA kit; it's quite rare for us to see loading over about 75%). I've noticed that this sharp decrease is commonly linked to the "active feedback" and "channel disabled" pore states.

This looks quite different from the more homogeneous channel state that we expect to see on flow cells (here shown is a reasonably good channel state image; I think this was prior to loading anything on the flow cell, which is why the total read count is so low).

Unfortunately, the wildflower pattern isn't recognised as bad in the initial flow cell QC check (the QC message for the first flow cell was "Flow Cell FAT13376 has 1059 pores available for sequencing"), and it's only once we start sequencing that we notice this bad patterning.

I haven't yet worked out why this is happening, but it's not to do with sample prep. In the past, we have seen the same effect from flow cells when there's nothing loaded on them, and also prior to loading any priming mix (i.e. when doing a 5 minute sequencing run before doing any priming). This does mean that we can do an additional channel state QC prior to priming to identify these cells... but we need to remember to do that.

It also doesn't seem to be related to storage time (we've had this happen on flow cells that were checked the day after they arrived) or shipping conditions (about a year ago we had two flow cells that had a long stopover in Australia, one of which exhibited this problem, one which didn't).

It has been challenging discussing this issue over email with ONT in the past, because every support email seems to end up with a different person. They aren't generally aware of this problem, so almost always fall back on it being a sample loading issue (which is certainly not the case in situations where we haven't done priming or loading). This means it can be a challenge getting replacements for these flow cells: they're showing fine on the initial count-based QC, so don't count as poor quality as part of the standard "fewer questions asked" warranty replacement.

This does seem to be more of an issue currently with R9.4.1 flow cells, but I thought I'd do a more full writeup just in case it ends up as an issue with R10.4.1 flow cells later on.

Flow cell operation

Flow cells function a bit like a battery. They have a certain amount of sequencing capacity in them (about 1-3 days, depending on the quality of the sample prep and flow cell), and that capacity can't be recharged by washing. The purpose of the wash kit is to clear DNA, RNA and other debris from pores so that they can be rerun on other samples.

Unlike batteries, flow cells actually perform better if they are working under high load as much as possible (i.e. the pores are fully occupied and actively sequencing). For the best sequencing yield, it is better to multiplex samples and run them at the same time on a fresh flow cell.

2022-Mar-07

Flow cell storage

5 months should be fine, assuming the fridge is a commercial one which doesn't drop below 4ºC. It's less likely that ONT will give you a refund for a bad flow cell, but we've found that as long as flow cells don't dry out or freeze, they still seem to be usable after a year or so.

I don't recommend ordering more than is needed in the next 1-2 months (bearing in mind that an order shipment can be spread over a year, with payment on delivery), but don't chuck away flow cells that are past their expiry date without running a QC check on them first.

2022-Mar-03

ONT Customer support

Unfortunately, this is par for the course for ONT: gaslighting users that complain until there's a sufficient amount of supporting posts on the community, and then suddenly "We have seen that a small number of users are experiencing issues. Here's a software / hardware / protocol change that definitely fixes the problem. No, you can't see our data that demonstrates this; just trust us."

Based on how they have dealt with issues in the past, I don't think much will change with regards to support.

It's a shame, because it ends up being a substantial time and money-saving approach on all sides to pay attention to assumed-rare concerns early on in the process of development. It'd be a much better process to work with the community to understand the problems, rather than ONT doing their own internal research and coming back with an imperfect fix that they believe is perfect, but fails to account for something obvious.

2022-Feb-25

Flow cell degradation

Flow cells get worse over time, and washing out flow cells doesn't improve the time-related degradation. To get the maximum yield from multiple samples, it's best to multiplex and run them all on a fresh flow cell.

There are multiple reasons for this. On MinION and PromethION flow cells, there are four sequencing wells electrically linked up to each signal sensing circuit. When a run starts, it chooses a working well from any available working wells. Over the course of the run, pores stop working for various reasons (they're biological systems), so the channel will switch over to another working pore. Eventually all four pores will stop working. If a run has shifted from 1500 available pores down to 500 available pores, there's a good chance that most channels will only have one or two linked working sequencing wells, increasing the chance the entire channel will stop working in the subsequent runs.
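The channel-level effect of those shared sensing circuits can be sketched with a toy calculation. This is illustrative only, assuming each well survives independently with the same probability q (a big simplification of the real failure modes); a channel stays usable while at least one of its four wells still works:

```shell
# Toy model: a channel is usable with probability 1 - (1-q)^4 when each
# of its 4 wells independently survives with probability q.
awk 'BEGIN {
  channels = 512
  split("0.9 0.6 0.3", qs, " ")
  for (i = 1; i <= 3; i++) {
    q = qs[i]
    printf "well survival %s -> ~%d usable channels\n", qs[i], channels * (1 - (1-q)^4)
  }
}'
```

Even when well survival drops to 30%, most channels still have at least one working well, which is why the decline in usable channels lags well behind the decline in working pores.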

Another reason is carrier ion depletion. Flow cells start with a certain amount of carrier ions above the pores, and the electrical current shifts these ions through the pores together with the DNA / RNA sequences. Over time, the ratio of carrier ions on each side changes, requiring more energy to push the carrier ions and polymers through at the same rate. The flow cells compensate for this by making the voltage more negative, but this change in voltage is also more damaging for the pores. Eventually the carrier ions on the input side will be so depleted that no voltage will be able to shift more of them across without destroying all the pores, and further sequencing becomes impossible.

2022-Dec-22

Metagenomic sequencing

MinION flow cells sequence DNA at about 400 bases per second across up to 512 different sequencing channels, so at 75% efficiency a flow cell will sequence about 150,000 bases per second, equivalent to a ~1.5 kb 16S amplicon being covered about 100 times per second (or about 350,000 times per hour).

Decisions on how long the sequencing should be run for are dependent on the specifics of the project. My rule of thumb is to require a factor of 10 higher coverage than the minimum needed for a target threshold, i.e. if I want to check for presence/absence of something that's in 1% of the sample, then I would prefer about 100 x 10 = 1,000 times coverage (which would be easily reached within the first minute or so of 16S sequencing, assuming everything's going okay).
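Spelled out, the arithmetic behind those numbers is (a sketch; real pore occupancy varies from run to run):

```shell
# Back-of-envelope: 512 channels at 400 b/s with 75% pore occupancy,
# against a ~1,500 base full-length 16S amplicon.
awk 'BEGIN {
  bases_per_sec = 512 * 400 * 0.75          # ~150,000 b/s
  amplicon = 1500
  printf "coverage per second: ~%d\n", bases_per_sec / amplicon
  printf "coverage per hour:   ~%d\n", bases_per_sec / amplicon * 3600
}'
```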

However, bear in mind that 16S is targeted sequencing, not metagenomic sequencing, and can only hint at what's in a sample. 16S is also quite a conserved gene, with a fair amount of horizontal transfer happening between bacteria, so even a perfect match does not necessarily confirm the population composition.

For species or strain-level identification, it would be better to use a whole genome shotgun sequencing approach, such as the rapid PCR barcoding kit (which has almost the same sample preparation protocol as the 16S kit, but with an additional fragmentation step at the start).

2022-Dec-15

Sequencing yield

Nanopore sequencers are continuous-throughput devices, unlike pretty much every other sequencer on the market. It's better to think of them in terms of "bases per second", rather than total yield, although sequencing rate does decrease over time as the flow cells get worn out.

MinION flow cells can have up to 512 channels running at 400 bases per second, so the maximum theoretical sequencing rate is 204,800 bases per second, or about 700 Mb per hour, or about 17 Gb per day. Due to the decrease in sequencing rate over time, and variation in sample type that influences flow cell performance, a yield of 5-20 Gb seems to be typical (based on a poll I did on Twitter in June 2022).

Flongle flow cells (need a Flongle adapter) can have up to 126 channels running (i.e. theoretical maximum sequencing rate of 50,400 bases per second), but due to the design of the flow cell (no "backup" sequencing wells, vs the MinION's 3 backup wells), yield is less than what would be expected based on the performance of a MinION flow cell. I've found that most of my runs are 0.1 - 1 Gb for a 1-day Flongle run, although I don't put much effort into getting the best out of Flongle flow cells because even 0.1 Gb is plenty for our most common use of plasmid sequencing (which also applies to amplicon sequencing).

PromethION flow cells (need a PromethION sequencing device) can have up to 2675 channels running (i.e. theoretical maximum sequencing rate of 1,070,000 bases per second), with similar performance per well to MinION flow cells. They're more commonly used for cleaner samples by established sequencing centres, so yields tend to be higher than MinION flow cells (even after adjusting for pore count); about 50-150 Gb per run seems to be typical.
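The theoretical ceilings for the three flow cell types are just channel count times channel speed (real-world yields fall well short of these, as the numbers above show):

```shell
# Theoretical maximums from a 400 b/s per-channel rate, assuming every
# channel is sequencing continuously with no pauses or pore loss.
awk 'BEGIN {
  rate = 400                                  # bases per second per channel
  n["MinION"] = 512; n["Flongle"] = 126; n["PromethION"] = 2675
  for (fc in n)
    printf "%-10s %7d b/s  ~%.1f Gb/day\n", fc, n[fc]*rate, n[fc]*rate*86400/1e9
}'
```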

These sequencing rates and yields are likely to increase in the future, as sequencing reagents, pore chemistry, and software improve. In particular, the yields I'm familiar with are based on R9.4.1 flow cell chemistry, and ONT is now shifting to R10.4.1 chemistry (and associated sequencing kits), which improves yield by making the pores better at capturing sequences, and increasing their endurance over long sequencing runs.

Because nanopore is a single-molecule sequencer, error rates are consistent across the entire length of the read (as is also the case with PacBio), and sequenced read lengths tend to map quite well to input DNA length. Due to the way the sequencer works, nanopore sequencing actually gets better yield with longer reads (up to about 50kb), because there's a small delay between one read leaving the pore and another one loading on.

2022-Nov-29

Flow cell depletion

Run time definitely affects flow cell use. There are carrier ions in the flow cell that travel from the input to the output channel over the course of a sequencing run (or multiple sequencing runs). As the relative carrier ion concentration changes, the voltage needs to be made more negative to maintain a proper current flow for sequencing. Eventually, the carrier ions are depleted so much that this current can't be maintained, and the flow cell stops sequencing.

This depletion cannot be recovered from (or reset) by end-users, so a flow cell will eventually become useless.

FWIW, read length also influences the number of times a flow cell can be re-used. There's a small delay between when one DNA strand finishes sequencing and the next one loads into the pore, during which time the pore is unblocked and letting carrier ions rush through. Longer sequences (as long as they don't knot up and block the pores) mean that fewer carrier ions will get through the pores in the same amount of time, meaning the flow cell will last longer.
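The open-pore gap can be put into rough numbers with a toy duty-cycle calculation. This is a sketch with an assumed ~1 second reload gap between reads (an illustrative figure, not a measured one):

```shell
# Fraction of time a pore spends blocked (i.e. sequencing, not leaking
# carrier ions) for different read lengths, assuming 400 b/s and an
# assumed 1 s reload gap between reads.
awk 'BEGIN {
  rate = 400; delay = 1.0
  split("1000 10000 50000", lens, " ")
  for (i = 1; i <= 3; i++) {
    t = lens[i] / rate
    printf "%5d b reads: pore busy %.0f%% of the time\n", lens[i], 100*t/(t+delay)
  }
}'
```

Whatever the real gap is, the shape of the result is the same: longer reads keep the pore occupied for a larger fraction of the run, so fewer carrier ions leak through per base sequenced.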

Because the pores are proteins, they can also be destroyed irreversibly by other means, which is another reason why flow cells will eventually degrade over time.

2022-Nov-11

cDNA Assembly from nanopore reads

This is one of those rare circumstances where sequencing accuracy is going to be a problem. Here are some of my thoughts on the matter:

  • According to Callum Parr, "The direct cDNA kit is trash" [ref]. This might be related to accuracy issues [ref].
  • By design, direct cDNA has low yield; it's explicitly not an amplification protocol. The idea is that the lack of amplification means that reads will be longer, and read distributions will be more representative of the actual composition. For de-novo assembly it's better to have sufficient coverage than a perfect compositional match, but PCR amplification could lead to truncated sequences if the polymerase falls off before the reverse transcription is complete. Some people have found direct RNA sequencing produces longer reads than direct cDNA, but the base calling algorithms are worse; ONT hasn't updated their RNA callers in tandem with their DNA callers.
  • As far as I'm aware, there is no established protocol/workflow for transcriptome assembly from nanopore reads alone (except possibly RATTLE). Because there is a big dynamic range in transcript abundance, one suggestion I've seen is to use Flye in metagenomic assembly mode.
  • If you have Illumina cDNA reads available, then you can use Trinity in long-read mode. This requires error-corrected long reads (e.g. processed through the read correction step of Canu). The long reads are used to cluster short reads for subsequent assembly.
  • The most common nanopore base calling error is short deletions (e.g. 1-2bp), likely due to variability in the translocation speed leading to dropped signal. For reference-based mapping this is not usually a problem, but it is a problem for de-novo assembly, especially when base-accurate protein translation is needed.

2022-Nov-03

small-scale reference-based analysis

For amplicon mapping, where mapping at the single base level matters, and there are only a few samples, my recommendation would be to use a trained LAST model, create a BAM file using maf-convert and samtools, then eyeball the results in a graphical BAM viewer like Tablet, IGV or JBrowse2:

# create index
lastdb -uRY4 -R01 reference.fa reference.fa
# train mapping model
last-train -Q 1 -P 10 reference.fa reads.fq.gz > trained.mat
# do the mapping (lastal writes MAF by default)
lastal -P 10 -p trained.mat reference.fa reads.fq.gz > mapped.maf
# convert output to BAM format [i.e. draw the rest of the owl];
# the .fai index gives samtools the sequence headers that the SAM lacks
samtools faidx reference.fa
maf-convert sam mapped.maf | \
  samtools view -bt reference.fa.fai - | \
  samtools sort > reads_vs_mapped.bam
# index created BAM file
samtools index reads_vs_mapped.bam
# view the mapped reads in your favourite BAM browser
Tablet reads_vs_mapped.bam reference.fa

This approach won't give any quantifiable numbers (although there are ways to extract those from the BAM files), but should quickly get to answering specific questions about disease-causing mutations.

2022-Nov-01

Metagenomic Assembly

Kat Holt has a lot of useful information about microbial assembly and nanopore sequencing; I recommend having a look:

https://holtlab.net/category/software/

30-50X is a good target for genome assembly. Really high coverage can cause rare artefacts to pop out in the final assembly, and lead to assembly taking heaps longer than it does at lower coverage. One thing to bear in mind is that nanopore reads have a long tail of low-quality reads, so it can help to do some light filtering based on average base quality prior to assembly. I've found that aligned accuracy is well correlated with the accuracy predicted by the basecaller, and have a plasmid assembly protocol here which discusses how to do quality filtering:

https://dx.doi.org/10.17504/protocols.io.by7bpzin
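That light quality filtering can be done straight from the basecaller's sequencing summary. A minimal sketch, assuming Guppy's `sequencing_summary.txt` column names (`read_id`, `mean_qscore_template`); column positions vary between Guppy versions, so they're looked up from the header rather than hard-coded. The `printf` just fakes a three-read summary so the sketch runs on its own:

```shell
# In a real run this file comes from guppy; faked here so the sketch runs.
printf 'read_id\tmean_qscore_template\nr1\t12.3\nr2\t7.9\nr3\t10.0\n' \
  > sequencing_summary.txt
# keep the IDs of reads with mean estimated quality >= Q10
awk -F '\t' '
  NR == 1 { for (i = 1; i <= NF; i++) { h[$i] = i }; next }
  $h["mean_qscore_template"] >= 10 { print $h["read_id"] }
' sequencing_summary.txt > keep_ids.txt
```

The kept IDs can then be pulled out of the FASTQ with something like `seqkit grep -f keep_ids.txt reads.fq.gz` before handing the reads to the assembler.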

From what I've heard from others, it seems that Flye is pretty good for metagenomic assembly, especially for mixed samples where there will be multiple genomes with different average coverage levels. Racon probably isn't needed with recent nanopore basecalling [i.e. the sup model], and if any correction is used, it's best to concentrate on INDELs and stay away from SNPs. Nanopore SNP consensus accuracy, especially when it's not mixed with a homopolymer sequence, is really good, and there are a lot of programs that will over-correct nanopore assemblies at SNPs.

If possible, I'd recommend separating out forward-mapped and reverse-mapped reads, carrying out separate assemblies, and concentrating on fixing any places where the assemblies don't agree. The different electrical profiles of the two strands should increase the likelihood of finding a correct consensus assembly.

Another thing to try is Trycycler (Holt lab again...), which uses the results from multiple assemblies to create a more complete/correct assembly: "if all goes well when running Trycycler, small-scale errors will be the only type of error in its consensus long-read assembly."

https://github.com/rrwick/Trycycler/wiki#an-important-caveat

For final QC and annotation, it seems like Bakta is the recommended approach, according to the creator of Prokka (which was the previous gold-standard annotation software):

https://twitter.com/torstenseemann/status/1565471892840259585

https://github.com/oschwengers/bakta/releases

cDNA Data Analysis

I do things quite a lot differently with ONT vs Illumina data.

For Illumina data, I need to apply transcript length normalisation (I call it VSTPk) when comparing different genes within the same dataset (some details here). Illumina can't really do isoform expression properly, so I collapse the reads down to gene level prior to doing a differential expression analysis.

For ONT data, the sampling carried out by the sequencer is at the transcript level, so I don't apply any transcript length normalisation: at similar expression levels, a 300-base transcript will appear in the reads at around the same proportion as a 2000-base transcript. Reads produced by the sequencer are full-length transcripts (with some filtering to make sure the only analysed reads are full-length), which means there's no guessing about what the full transcript would look like (as is needed with Illumina reads). Expression (in terms of reads mapped to each base position) is also very flat across the entire length of the transcript. This means that I can be confident about doing isoform-level differential expression analysis (down to very low counts), doing the collapsing to gene level after results have been produced. The closest/best thing I have to a reference for that is here, but it's still a work in progress. Our first nanopore sequencing cDNA studies led to this paper.

The sample prep for Illumina vs ONT is different as well. The most common stranded Illumina prep involves fragmented cDNA amplified from random hexamers, which means that you get lots of different types and stages of RNA (including ribosomal RNA, microRNA, and unprocessed intronic "straight off the genome" RNA). Some people want that (e.g. estimating RNA velocity by comparing unprocessed vs mature mRNA), but most don't, and it's a bit of work at the sample prep level and in analysis to clean the data to get rid of the unwanted RNA.

All ONT RNA and cDNA sequencing kits use polyA anchors (somewhat similar to how most single cell sequencing sample prep is done), so non-polyA RNA needs to be polyadenylated prior to sequencing. There's a lot less noise, but the information produced by default is less complete. If you have a particular gene target that you want to sequence, the polyA anchor can be replaced with a gene-specific target and amplified by PCR, meaning that only that target will get seen / sequenced by the sequencer. This target-specific cDNA sequencing is the method that would be best used to look at isoform expression of a specific gene.

Reads can be mapped to the target using minimap2 or LAST, and the results of that mapping used to filter the reads specifically for that target. For a handful of targets, I have previously used CD-HIT-EST to isolate different isoforms, but expect there are more recent & better tools available. I'm not really sure what the next step after that is; I usually just hand the transcript isoforms over to the biologists and let them peer at the sequences to make their own discoveries.
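The filtering step is easy to do on minimap2's PAF output with awk. A sketch under assumptions: the target sequence is hypothetically named `mytarget` in the reference FASTA, the PAF would really come from something like `minimap2 -x map-ont target.fa reads.fq.gz > mapped.paf`, and "good" is taken to mean ≥80% of the aligned block matching (PAF columns 1, 6, 10 and 11 are query name, target name, matching bases and alignment block length). The `printf` lines fake a small PAF so the sketch runs on its own:

```shell
# Faked PAF lines standing in for real minimap2 output
# (real input: minimap2 -x map-ont target.fa reads.fq.gz > mapped.paf).
printf 'r1\t900\t0\t900\t+\tmytarget\t2000\t100\t1000\t850\t900\t60\n' >  mapped.paf
printf 'r2\t900\t0\t900\t+\tother\t2000\t100\t1000\t850\t900\t60\n'    >> mapped.paf
printf 'r3\t900\t0\t900\t+\tmytarget\t2000\t100\t1000\t500\t900\t60\n' >> mapped.paf
# keep reads hitting the target with >= 80% matches over the aligned block
awk -F '\t' '$6 == "mytarget" && $10 / $11 >= 0.8 { print $1 }' \
  mapped.paf | sort -u > target_read_ids.txt
```

The identity threshold and target name are the knobs to tune per experiment; the resulting ID list can then be used to subset the FASTQ for downstream isoform work.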

2022-Aug-05

Adaptive Sampling

For adaptive sampling, scale matters: the length of non-target regions, the length of the Regions Of Interest (ROIs), the N10/L10 of the reads, and the separation distance of the ROIs. Using the ONT recommendations, if your regions of interest are less than twice the N10/L10 away from each other, then they should be combined into a single region. That will give the best chance of capturing target reads.

Scale will influence the frequency of the following situations:

  1. Outside ROI, outside buffer [non-target; discarded]
  2. Starting bases within ROI [target; retained]
  3. Partially within ROI, starting bases in buffer region [target; retained]
  4. Outside ROI, starting bases in buffer region [non-target; retained]
The ratio of Situation #1 to Situation #2 can help to work out if adaptive sampling is useful at all.

Situation #4 is the only situation that is a potential problem (and would be influenced by changes in the buffer region size). A smaller buffer region will reduce the occurrence of non-target retained reads, but it will also reduce the occurrence of target retained reads (i.e. situation #3).

It's a false positive / false negative tradeoff. If you want to guarantee that the only retained reads are those that include the ROI, then set the BED region to be exactly the same as the region of interest; this will drop some reads with partial hits. If you want to keep as many reads including the ROI as possible, then set the BED region to include a buffer on both ends that is the maximum expected read length; this will include some reads with no hits.
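Generating the buffered BED is a one-liner, with the buffer size as the tunable false positive / false negative knob described above. A sketch (the `printf` fakes two ROIs so it runs standalone; in practice `rois.bed` is your target list, and 20 kb is just an example buffer):

```shell
# Two fake ROIs standing in for the real target BED file.
printf 'chr1\t5000\t15000\nchr2\t100000\t120000\n' > rois.bed
# pad each ROI on both sides by the chosen buffer, clamping starts at zero
awk -v OFS='\t' -v buf=20000 \
  '{ s = $2 - buf; if (s < 0) s = 0; print $1, s, $3 + buf }' \
  rois.bed > rois_buffered.bed
```

Setting `buf=0` gives the strict "only reads containing the ROI" behaviour; setting it to the maximum expected read length gives the permissive one.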

2022-Jul-30

Reduced representation sequencing (ddRAD-Seq)

From what I can see about the method on the Illumina website, ddRAD-Seq looks similar enough to rapid PCR barcoding, which is an established MinION protocol that works well for metagenomic purposes. Unlike ddRAD-Seq, the transposase fragmentation is essentially unbiased (from what I can tell in the data I've looked at), so it should work well for low coverage sequencing.

For sequencing particular genomic regions, it may be better to design PCR primers that span regions of at least 600bp (probably what would be used for Sanger sequencing - although the longer the better), then use a Rapid Barcoding kit to fragment the amplicons. This will allow for a lot more samples from the same total yield, potentially opening up the possibility of using a Flongle flow cell ($67.50 per run, rather than $900 per run) to still get enough data.

It would be technically possible to take the product of a restriction enzyme double digest, ligate nanopore sequencing adapters, then sequence on a MinION, but there's not really much point in doing that with other options available. There are a few other targeted sequencing approaches (including a CRISPR/Cas9 kit, and an adaptive sequencing approach that does target selection during sequencing); I've just discussed the cheapest / easiest ones to get started with.

Aside: ONT is in the process of improving their kit chemistry over the next few months; there should be better fragmentation kits available by the end of the year.

2022-Jul-01

Plasmid sequencing on Flongle

The standard ONT Flongle sequencing protocol is to do half-reactions and load half libraries, so I keep with ONT’s suggestions in that regard (including using 0.5µl adapter).

I increase the incubation time versus the ONT protocol because I don’t heat in a PCR machine – I use my hands and a heating block. The increase in time was a recommendation from Dr. Divya Mirrington during a sequencing workshop she led at our institute.

Another tip from Dr. Mirrington was to pool samples together and take the required amount from that pool. As long as the input DNA is sufficiently concentrated (which is usually the case), then this should work fine. It is possible to eke out a little bit more sequencing from small samples by multiplexing them together, in which case concentrating with XP beads is needed, but I’ve found that I don’t usually need to do this for plasmid sequencing because the yield requirements are so low (i.e. only 100 reads from a 10kb sequence, which is easily within the Flongle output, even when multiplexing 12 samples for a single run).

ONT recommends a few things that I haven’t found to substantially change our Flongle output. This includes always using XP beads after pooling (even when sequencing a single sample), and using the Flongle reagents in glass vials. For us, the most significant change that gave us consistently good results was to load Flongle flow cells using the negative pressure “gentle loading” technique.

2022-Jun-02

metagenomic sequencing

RBK004 is a multiplexing kit, but it's less so than RBK110.96.

The statistic that matters is the number of adapted molecules that are loaded onto the flow cell. Given that RBK004 is designed for up to 12 samples and has an input requirement of 400 ng, the input requirement for the [up to] 96-sample RBK110.96 would be expected to be (12 / 96) * 400 = 50 ng... which is precisely what you state.

These barcoding kits are typically used to sequence multiple samples at the same time; sequencing fewer samples will require more input DNA to compensate. The flowcell-loaded input requirements for the kits are identical: a single sample on either of the RBK kits will have the same input requirement.

As an alternative, consider the rapid PCR barcoding kit (SQK-RPB004), which fragments DNA and simultaneously adds annealing sites for the rapid PCR primers. The primers are barcoded, the fragmentation is not, which means that separate PCR reactions need to be carried out for each sample. After amplification, the PCR products are quantified, pooled, and then rapid adapters are added. It's a quick and fairly cheap protocol with minimal external costs (the most significant being polymerase), and will give a higher yield than the unamplified barcoding kits. However, because it's a PCR amplification kit, reads will be shorter (most relevant for large genomes; less relevant for bacterial genome assembly), and it will not be possible to do any methylation calling on assembled genomes.

2022-May-20

Accuracy

Given that ONT are claiming Kit14/R10 at 520 bases per second gives 99.0% modal accuracy, which is already better than any R9.4.1 kit, it's not so clear to me why 260 bases per second at 99.6% accuracy is needed at all. Not for genome assembly, at least, or anything else that uses consensus results. I expect a doubling of coverage should be able to bump up the consensus accuracy a bit higher than a halving of sequencing speed would.
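For comparing the two modes, it helps to put the quoted modal accuracies on the Phred scale (Q = -10 * log10(1 - accuracy)):

```shell
# Convert the two quoted modal accuracies to Phred quality scores.
awk 'BEGIN {
  split("0.990 0.996", accs, " ")
  for (i = 1; i <= 2; i++)
    printf "%.1f%% modal accuracy ~= Q%.0f\n", 100*accs[i], -10*log(1-accs[i])/log(10)
}'
```

So the slow mode buys roughly four extra Phred points per read, which is the number to weigh against the halved throughput.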

Regarding amplicon sequencing and consensus variation, it depends on the variants. I like to dig deep into read mappings and look at the sequence context. I also don't trust minimap2 to give reliable local alignments.

It has been my experience that ONT reads are very reliable for SNPs, especially when ensuring strand consistency. For single-base INDELs, especially deletions, especially homopolymers, I have less confidence about treating nanopore reads as correct. The sequencing and basecalling model is prone to skipped signal.

If I'm concerned about an INDEL, then I'll use another technology to confirm.

Once I get beyond single base changes (excluding homopolymers) in a consensus sequence, it's more difficult to justify a sequencing error, given how sequencing and basecalling is done. It's more likely that there's an unmodeled base modification present, or a PCR artefact.

The concerns I have relate to systematic error, which can't be brushed away with more reads.

It's a software problem: a basecalling problem, to be specific. There's a lot of sequence-dependent variation in the electrical signal that leads to unmodeled variation being interpreted as something else. I still believe that this systematic error can be dealt with by more careful observation of the sequence.

For example, maybe there's a particular sequence 25bp upstream that leads to the helicase skipping a bit, causing a rush of nucleotides and/or misclassified bases.

2022-May-05

Ligation kits

Ligation is a slower protocol with more hands-on time, increases the chance of library prep read chimerism, and is more expensive.

The primary advantage of the ligation kit is that it produces a higher pore occupancy under typical sample prep conditions, leading to a better sequencing yield per flow cell (i.e. usually cheaper per base, more expensive per run). A secondary advantage is that reads are full-length.

From a downstream analysis perspective, full-length reads only matter when single molecules matter. If consensus is good enough (which is very common), then fragmented reads are fine as well.

2022-May-02

Mitochondrial Genome Assembly

I've found that for mitochondrial assembly, mapping is better with uncorrected reads, at least for high-accuracy called reads. I've yet to check for super-accuracy called reads, but I expect it to be similar:

https://github.com/marbl/canu/issues/1715

Curiously, I've subsequently found that while mapping is better with uncorrected reads (suggesting they are more accurate), canu performs slightly better with its "corrected" reads. In hindsight, this should have been obvious, as the people developing their assembler will be familiar with the error modes of their corrected reads, and therefore understand how to fix them.

After revisiting a bunch of approaches with my nippo reads, my current best results have been achieved by filtering reads down to the most accurate sequences (using the mean estimated accuracy reported by guppy, e.g. as found in the sequencing summary file) to a depth of 40-100X, then assembling using canu with default parameters.

Assembled sequences can be corrected using medaka:

https://github.com/nanoporetech/medaka

While Medaka has been trained to correct draft sequences output from the Flye assembler (mapped with minimap2), it seemed to work okay for our mitochondrial genome when using LAST for mapping and canu for the draft assembly. It still didn't produce a 100% accurate genome; there were 10 errors in ~13 kb.

Regardless of assembly method, you should be manually curating assemblies prior to submission. The mitochondrial genome is small enough that it can be checked for obvious gene sequence issues, which is probably why NCBI have their automated submission checks.

If there is an internal stop codon issue with mitochondrial genomes, check to make sure that the nucleotide translation code is correct. While the standard code works for most nuclear DNA, the mitochondrial translation is different:

https://www.insdc.org/genetic-code-tables

What we also noticed with our worm mitochondrial genome (as has been observed in other organisms) was that some genes end before a complete stop codon, with the final codon completed by the addition of the polyA tail. 'TAA' is a worm stop codon, so if the genomic copy of a gene ends on 'T' or 'TA', translation will still be stopped in the processed mRNA. More details about this can be found in our mitochondrial assembly paper:

https://doi.org/10.12688/f1000research.10545.1

2022-Apr-21

Bacterial Sequencing

In my opinion, two important questions for bacterial genome assembly are:

  • How precious is my sample?
  • How much am I willing to pay to get sequencing done?

If samples are difficult to obtain, or money is tight, then it's important to get the best performance out of a single flow cell. That means taking more care with sample prep, quantifying input DNA amounts, running a TapeStation/Bioanalyzer to work out size distributions, excluding any potentially contaminating samples, carefully balancing barcodes based on average fragment size, and underestimating the total yield (i.e. the 24-genome scenario I've described above).

If, on the other hand, samples are not precious and it's okay to run multiple flow cells, a more haphazard approach can be carried out. Run 96 samples on the first flow cell, and set aside any samples that reach the genome coverage threshold (e.g. 20X from >10kb reads). Run the remainder of the samples on subsequent flow cells, continuing to exclude any samples that exceed the coverage threshold from the total reads from all sequenced runs, or any samples that are clearly not going to hit their target after any amount of sequencing. This approach will probably end up being more successful and cheaper per sample, but requires more time, and more thinking / analysis between runs.

cDNA Sequencing

For differential expression from mouse cDNA with the PCR-cDNA barcoding kit, I've been happy with samples that have >1M reads. We can usually get this from a MinION flow cell when multiplexing six samples together - our usual flow cell yield is 8-12M reads. I expect results would be similar for human cDNA.

The most important thing for good run yield is to make sure RNA extraction is carried out from a fresh sample, and converted to cDNA as quickly as possible. We have needed to play around with the number of PCR cycles, so be prepared to experiment with a non-precious sample first. I've found that sequencing works best when the library quantification for each sample just prior to pooling is at least 20 ng/μl.

2022-Apr-13

cDNA kit ordering

For some odd reason, it seems that ONT only provides their RNA control (RCS) in the PCR-cDNA barcoding kit, SQK-PCB109. I recommend people purchase this kit instead of the PCR-cDNA sequencing kit for a different reason - it has identical sample preparation to SQK-PCS109, but allows for samples to be multiplexed on the same run:

https://store.nanoporetech.com/productDetail/?id=pcr-cdna-barcoding-kit

2022-Apr-12

cDNA Sequencing

The ONT per-base cost for PromethION flow cells is similar to a NovaSeq S2 chip. ONT reads for cDNA average around 600bp for well-prepared fresh samples, so assuming this is being compared to 50M x 75bp single-end reads per sample, you would expect about 5M reads per sample from a PromethION with a similar cost.

Something else that is worth bearing in mind is that the ONT kits don't produce any intronic reads. These can occupy a substantial proportion of Illumina sequencing runs, and have an impact on transcript count estimation. From a base-level transcript coverage perspective, ONT reads have a much flatter profile, which also helps for quantification and isoform discovery.

When I do sequencing on MinION flow cells ($1000 USD / run), I aim for at least 1M reads per sample. This is usually achievable without much effort when multiplexing up to 6 samples per run with the PCR-cDNA barcoding kit. I wouldn't recommend single-sample kits (e.g. PCS111), because I haven't found much benefit for downstream analysis for over 1M reads per sample. ONT claims that their newer Kit14 kits should have even better reliability (thanks to an increased sensitivity adapter and the use of UMIs), so it might be possible to do 12 samples from a single MinION flow cell with >1M reads per sample.

cDNA Pore Blocking

We have found that freshly-extracted RNA converted as soon as possible into cDNA produces the best cDNA runs (in terms of read length distribution). Unfortunately this cDNA also blocks pores, which can be demonstrated by an increase in the teal 'Unavailable' statistic in the extended MUX scan plot. Doing a nuclease flush (i.e. using the ONT wash kit) then library reload will usually fix this, but that requires additional library kept for loading, and monitoring the flow cell to see when the sequencing rate starts to drop off.

I find it a little strange that our amplified cDNA seems to block at shorter lengths than the gDNA blocking reported by the community (~10kb cDNA vs ~100kb gDNA).

It might have something to do with sequence similarity. If there's a high-abundance transcript, it's more likely to pair up with other copies of itself (we noticed this when doing our chimeric read investigation, and in-silico chimerism seems to happen more often with amplicon sequencing), which may lead to pore blocking.

2022-Mar-13

Canu Amplicon Assemblies

Assembly failures with Canu are likely to happen if amplicon reads are generated using the ligation kit. Canu expects that reads are fragmented representations of the reference sequence, and has a threshold beyond which overlaps will not be considered. I can't recall exactly what this is; my memory is that it's at least 50%, and could be as high as 80%.

In any case, if all the sequenced reads are full-length, Canu will not attempt an assembly using the default options. This will also happen if most reads are full-length, there is a very high coverage of the target sequence, and Canu is selecting the longest available reads.

The Canu documentation suggests some tweaks to help improve assembly of amplicons, which tell Canu to stop assuming so much about the input data:

https://canu.readthedocs.io/en/latest/faq.html?highlight=amplicon#can-i-assemble-amplicon-sequence-data

2022-Mar-03

Gentle Loading on Flongle

It does not surprise me that ONT staff have a bias towards preferring their own in-house developed techniques and protocols, which are working great for experienced staff. I wouldn't mind the resistance to community-driven change so much if the protocols had proper version control, or if ONT explicitly allowed other people to create their own version-controlled protocols.

In lieu of that, ONT should show the whole community the data: data relating to gentle loading, and also for the glass vial testing that has been done. I want to see multiple tests and error bars. If I can be convinced that the pipette loading method is better, then I would be interested in revisiting it.

I'm not interested in wasting time & money on method development research for ONT, especially for something as fundamental as how to load a flow cell. That's something ONT should have nailed.

The way I see it, given that complaints keep coming through on the community about poor pore performance between initial QC and loading, it suggests a problem with the protocols, not the users. There needs to be some point (the earlier the better) where ONT steps outside their bias bubble and thinks, "Hmm... maybe this protocol is not as good, easy or clear as we think it is."

Within my own reality bubble I observed a difference when I was first starting out using the Flongle flow cells, and have my own explanation for why it improves flow cell loading. In addition, I have found the SpotON-like gentle loading technique to be a simpler and more failure-resistant process - it reduces forces on the flow cell matrix, and also reduces problems with bubbles being introduced into the flow cell. If your goal is to enable analysis by anyone, shouldn't a more fail-safe technique be used as the standard approach?

As long as other community members keep saying that they're using the dropwise gentle-loading method, and it works for them, I'll keep suggesting other people try it.

2022-Feb-22

A Mental Model for Nanopore Sequencing

The mental model that I have in my head of a flow cell is a dam with a lake / reservoir, and holes in the dam that slowly drain the lake, spinning turbines that produce electricity (i.e. current). The interesting bits in the lake are the things going through the holes that aren't water, but the water flow is the only thing that can be measured: there are no cameras attached to the holes, only current meters attached to the turbines.

It just so happens that those non-water things have a strange enough shape that they interrupt the water flow in large and predictable ways, which means that observing the water flow can be used as a proxy for observing the things themselves.

If the holes get blocked, nothing can move through the holes, so there is no current flow [sequencing cannot happen]. In that case, cleaning the holes allows the dam to work better, because water can once again flow through the holes and generate electricity.

However, eventually that water reservoir will get depleted. If the reservoir is fully drained [ionic balance of the carrier ions in the flow cell is equalised], then it doesn't matter how clean the holes are; the dam will produce no more electricity [sequencing cannot happen].

As an aside, this model helps me to understand why keeping the pores occupied with long DNA prolongs the sequencing life of the flow cell. The flow of carrier ions is reduced when a pore is actively sequencing, so the reservoir drains slower, and can keep sequencing for longer.

2022-Feb-13

Ampure XP Purification

If bead purification is done after adding the rapid adapter (as per the Flongle protocol), that's a problem. It could also be a problem if done before, when some ethanol is still hanging around in the eluate. The sequencing adapters get destroyed by ethanol; that's why different buffers are used after adapters are ligated or bound to the prepared template sequences.

When using a rapid kit with a single sample of clean DNA (such as the lambda DNA provided by ONT), sample preparation can be done without any bead purification at all. I find with the rapid kit that it's best to have all the purification and concentration done before starting the ONT protocol.

2022-Feb-11

Chimeric Reads

Every platform that uses ligation in sample prep has issues with chimeric reads to varying degrees. The difference with nanopore sequencing is that they can [usually] be easily detected and filtered out.

As one example of what can happen, we were recently doing single-cell sequencing of ~2.7k cells on a NovaSeq, but when I looked at cell barcodes, I got over 100,000 unique cell barcode combinations that matched an expected 27bp pattern (most with only one read). What I think happened is that a small proportion of reads from another run happening at the same time as ours bled through into our sequencing run. There was also a recent discovery of SARS-CoV-2 sequences in metagenomic samples collected in Antarctica in 2018, which is most likely explained by barcode spillover, as the samples were sequenced in China near the start of the pandemic.

We did an experiment a few years back that tested a few different approaches, and couldn't find any way to get rid of the chimeric reads prior to sequencing:

https://doi.org/10.12688/f1000research.11547.2

The best way to get rid of chemically ligated reads completely for nanopore sequencing is to not use ligation. In other words, use kits that use the rapid adapter (e.g. rapid PCR barcoding kit, rapid barcoding kit).

I've found that chimeric reads are really difficult to split properly every time. In many instances, the splitting can be done by finding adapters and splitting near them, but that doesn't always work: some sequences might be joined half-way through, and others might have no adapters at the join points.

For me, I try to detect them as best as possible when demultiplexing reads, and exclude any reads that get mapped to multiple barcodes. As there doesn't seem to be any systematic reason for chimeric reads, excluding them only impacts total yield (and it's only a few percent yield loss).

I use LAST for chimeric read detection. I don't have a fully-automated workflow, but it's quick enough for me to copy-paste and change a few bits as necessary. The protocol I've written up to demultiplex and exclude chimeric reads is here:

https://dx.doi.org/10.17504/protocols.io.b28fqhtn
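
As a toy sketch of that exclusion idea (illustrative names only; this is not the LAST-based protocol linked above): a read that matches more than one barcode is treated as likely chimeric and dropped from all barcodes.

```python
# Toy sketch of excluding chimeric reads during demultiplexing: a read
# that matches more than one barcode is treated as likely chimeric and
# dropped from all barcodes. Names here are illustrative only.
from collections import Counter

def exclude_multi_barcode(assignments):
    """assignments: dict mapping barcode -> set of read IDs."""
    counts = Counter(read for reads in assignments.values() for read in reads)
    return {bc: {read for read in reads if counts[read] == 1}
            for bc, reads in assignments.items()}

demo = {"BC01": {"read1", "read2", "read3"},
        "BC02": {"read3", "read4"}}  # read3 matches two barcodes
print({bc: sorted(reads) for bc, reads in exclude_multi_barcode(demo).items()})
# {'BC01': ['read1', 'read2'], 'BC02': ['read4']}
```

Because exclusion only loses a few percent of yield, the simplicity of "drop anything ambiguous" is usually a better trade than trying to split every chimera correctly.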

2022-Feb-10

DNA concentration for Rapid Sequencing

Starting DNA concentration for the rapid kit depends on how many adapted molecules are going through the pores; the longer the sequences are, the more mass is required to make up the numbers.

It's frustrating to me that ONT don't mention ideal molarity for their rapid sequencing kits (at all), but there is information in the LSK protocols [i.e. 1 µg (or 100–200 fmol)], which should work well as a reasonable first guess. There are tables comparing different length/mass combinations with different fmol amounts on the QC pages:

https://community.nanoporetech.com/protocols/input-dna-rna-qc/v/idi_s1006_v1_revb_18apr2016/assessing-input-dna

Many people have recommended the Promega calculator if you want to look at a specific length or mass:

https://worldwide.promega.com/resources/tools/biomath/

My rule of thumb is that at 1.5kb, ng is approximately the same as fmol. I do rough estimates from there (e.g. at 15kb, fmol is about 1/10 of ng amount).

If there's no way molarity can be estimated, even approximately, then my recommendation is to keep concentrations around 20-100 ng/μl; that seems to work well for most applications.
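
As a quick sketch of that rule of thumb (helper names are my own, assuming the usual ~650 g/mol average per base pair of dsDNA):

```python
# Mass <-> molarity conversion for dsDNA, assuming an average ~650 g/mol
# per base pair (helper names are my own, not from any ONT tool).
def ng_to_fmol(mass_ng, length_bp):
    return mass_ng / (length_bp * 650) * 1e6

def fmol_to_ng(amount_fmol, length_bp):
    return amount_fmol * length_bp * 650 / 1e6

print(round(ng_to_fmol(1, 1500), 2))     # 1.03 -- at 1.5 kb, ng ~ fmol
print(round(ng_to_fmol(100, 15000), 1))  # 10.3 -- at 15 kb, fmol ~ ng / 10
```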

2022-Feb-05

Rapid Barcoding Kit

The RBK110.96 is a very new kit (I think it's only been available for a month or so on the store), and we haven't been able to test that out yet.

With the RBK004 kit, we typically get enough plasmid reads for assembly from up to 6 samples (loaded at approx 20 ng/μl) off a Flongle flow cell after about 2 hours of run time. We've done one run with 12 samples; it took about double that length of time (i.e. about 4 hours). I guess that'd be 10-20 mins on a MinION flow cell. Apparently the RBK111 kits coming out in the next month or so will be better.

Median size / N50 / etc. all depends on the input DNA. Most plasmid reads that come off the sequencer are single cuts (i.e. full-length plasmid, with a bit of the ends trimmed off).

Live Sequencing Demonstrations

A prepared library is a great way to do live sequencing demonstrations. Purified / concentrated DNA works very well with the rapid barcoding kit.

My first public demonstration (in 2016) was with frozen tomato that I extracted at home, then purified and prepared in the lab using a Zymo concentration kit and a ligation sequencing kit (which was the only kit available at the time). Sequencing and basecalling was a bit slower back then; I don't recommend using a slow wireless connection for remote basecalling and competing with a hundred other people for Internet access.

In 2019 I attempted a live extraction and sample prep from a salad (with local basecalling), but mixed it in with a sample I'd prepared earlier (I think it might have been tomato and yoghurt), and the prepared sample is what ended up coming out first from the sequencer.

Another option is having a prepared flow cell, fully loaded. This is the best option if you want to guarantee (as much as possible) that the demonstration will work, and it's what I did when carrying out the first public sequencing run on the Linux version of the basecalling software, because there's plenty of other things that can go wrong.

... I really need to do clipped cuts of those videos; there's a lot of waiting involved in live sequencing.

Regardless of what you decide to do, I recommend having lots of backups. For my 2019 LinuxConf talk I had a prepared sample, but I also brought along a loaded flow cell as a backup, a full flow cell snapshot of a yoghurt run I'd done about a week earlier (so I could simulate a sequencing run), and the stored sequencing results from that yoghurt run (yoghurt + Zymo clean & concentrate + rapid barcoding kit).

2022-Feb-04

Flow Cell Yield

I've realised that part of the misunderstanding about flow cell yield may come from experience with Illumina sequencing from large service centres, where sample preparation is carefully managed to produce a high cluster density, meaning that the output from different Illumina sequencing runs ends up being very similar (and close to Illumina's reported maximum yield).

As a real-time "sequence anything" device that is targeted more towards individual labs than experienced service centres, it's not unexpected to see large variation in the performance of MinION across different labs depending on sample preparation experience, sample type, sample quality, run time, and flow cell quality.

For the runs I've done, I've found that I get the best estimate of sequencing yield by looking at how a flow cell is performing in its first hour, and multiplying by 24 to approximately work out yield over a two-day run. That's because I've noticed (at least with 109 kits with R9.4.1 flow cells) that I get about 50% of the ideal performance over the course of a two-day run.

But there's a lot of variation in that number. The numbers I'm most familiar with are our cDNA runs, and we typically get around 4-12 million reads with average read length of about 500-1000 bases (i.e. 2-12 Gb per run). I think our highest yield for cDNA has been something like 15 Gb and 18M reads.
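
The first-hour estimate above can be sketched as follows (my own helper, using the ~50%-of-ideal efficiency I've observed; treat the numbers as rough):

```python
# Sketch of the yield estimate above: ideal yield would be the first-hour
# rate sustained for 48 hours; observed runs hit ~50% of that (hence "x 24").
def estimate_two_day_yield(first_hour_gb, efficiency=0.5, run_hours=48):
    return first_hour_gb * run_hours * efficiency

print(round(estimate_two_day_yield(0.4), 1))  # 9.6 Gb from 0.4 Gb in hour one
```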

Other people have found that HMW DNA will very quickly block sequencing pores, reducing the yield. There are a lot of tricks to work around this (e.g. nuclease digest, needle shearing, buffer tweaking), and it can be a challenge to find information on what can go wrong and how to fix it.

I recommend that new users to nanopore sequencing follow ONT's recommendations and start off with a lambda run, so they can at least know what a good sequencing run looks like. The lambda DNA provided in ONT's control kit is clean and has a reasonably long length (but not long enough to block pores), and works well with most of the sample prep kits that ONT offers.

Knowing what a good run looks like can help users to notice a bad run in the first few minutes of sequencing, stop the run, re-prepare library, and rescue a flow cell, saving thousands of dollars. It's definitely worth it to put in that little bit of extra effort before jumping into the deep end of the flow cell.

2022-Jan-28

cDNA Sequencing

Downstream bioinformatics will be easier and faster when using one of the ONT cDNA kits. I recommend the PCR-cDNA barcoding kit (and associated external reagents), because it allows strand-specific analysis of multiple samples on one MinION flow cell. With the existing kits, I prefer people to do no more than six samples per flow cell, aiming for >1M reads per sample. It may help to do a Flongle run on the same library first to get barcode balancing correct for the MinION run.

Despite having done this for years, we're still struggling with optimising our cDNA sample prep. I'm hoping that the new cDNA kits [SQK-PCB111.12] will fix a lot of the issues we've been having.

I don't recommend using LSK/NBD on cDNA because it loses the advantage of strand-specific sequencing, which makes it more difficult to distinguish between sense and anti-sense RNA.

2022-Jan-27

Base Calling on PromethION P2

I think it's probably worth noting that the effect of using a less-capable GPU with a P2 Solo (without adaptive sampling) is that the basecalling component of the sequencing will take longer. The sequencing should still proceed and finish, and because calling during the run uses a sample of the most recent reads, the calling in MinKNOW should still be a real-time representation of the current state of the flow cell (useful for monitoring Q scores, barcode distribution, translocation speed, etc.).

What this will mean in a practical sense is that there may be a basecalling delay after sequencing is finished (depending on how productive the flow cell is through the entire course of its run). From what I've seen on Twitter, non-ONT PromethION users are currently getting about 200Gb of reads from good runs, which is about ten times the output of good MinION runs. So if a GPU is only just able to keep up with a 2-day MinION run, then the basecalling delay will be a bit over two weeks for a 2-day PromethION run (from a single flow cell).

Miles Benton has indicated that an RTX3090 should at least be able to keep up with 5 MinIONs at once, in which case the basecalling delay from a single PromethION run should be at most about 2 days (but it will likely be much less than that).
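
The arithmetic above can be sketched like this (the throughput figures are the rough estimates from the text, not specifications):

```python
# Back-of-envelope basecalling delay, using the rough figures above
# (assumed throughputs, not specifications).
def basecall_delay_days(run_days, run_gb, gpu_gb_per_day):
    """Extra days of basecalling remaining after the run finishes."""
    return max(0.0, run_gb / gpu_gb_per_day - run_days)

minion_gb, run_days = 20, 2   # a good 2-day MinION run: ~20 Gb
promethion_gb = 200           # a good PromethION run: ~10x that

# GPU that only just keeps up with one MinION (10 Gb/day):
print(basecall_delay_days(run_days, promethion_gb, minion_gb / run_days))      # 18.0
# RTX3090 keeping up with ~5 MinIONs (50 Gb/day):
print(basecall_delay_days(run_days, promethion_gb, 5 * minion_gb / run_days))  # 2.0
```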

Adaptive sampling is another matter, though, because it requires that basecalling can keep up with sequencing in order to work properly. It wouldn't surprise me if adaptive sampling on a P2 won't be practical until there's another order-of-magnitude improvement in calling or sampling (e.g. an efficient squiggle-space adaptive sampling that doesn't require basecalling).

Adaptive sampling and barcode proportions

MinKNOW will be able to show you a very good approximation of barcode distribution even if it can't sequence everything in real time. I've found that barcode distributions after 20 mins of cDNA sequencing are a very close match to the distribution from a two-day run (I remember something like a 2% difference last time I checked). It will also tell you how many reads are being [temporarily] skipped, so it's possible to do a bit of calculation to get an estimate of the sequenced counts per barcode based on the basecalled counts.

The benefit that adaptive sampling gives is that the remainder of each rejected read is not sequenced at all (i.e. all but the first ~500bp is discarded), which can greatly increase the quantitative throughput of the sequencer (with more benefit the longer the average read length), and increases the sequencing time spent on reads that are not rejected. These advantages cannot be recovered by post-hoc analysis after the run has finished, because the accept/reject decisions are made at the time of sequencing.

2022-Jan-07

Flongle as a Sanger competitor

For amplicon sequencing, the cheapest option is to process amplified PCR products (>500bp) with the rapid barcoding kit. Flongle works well for both long amplicons and plasmid DNA, and can be cheaper per sample than Sanger sequencing, especially when sequencing different amplicons in the same run with the same barcode:

  • Flongle flow cells $810 / 12 flow cells
  • Rapid barcoding kit $650 / 12 Flongle runs, 12 samples per run

So that works out to ~$12 per sample (consumable cost only, excluding shipping). As long as amplicons or plasmids attached to the same barcode have distinct sequences (e.g. different genes), the reads can be split out in software after sequencing. I reckon 4 amplicons from 12 samples would still work well on a badly-performing Flongle, which would work out to be $3 per amplicon. This seems to me to be competitive with Sanger, especially if both forward and reverse sequences are needed.
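
The arithmetic works out roughly like this, using the consumable prices quoted above (which will drift over time); the computed figure lands close to the rounded in-text numbers:

```python
# Rough per-sample consumable cost for barcoded amplicons on Flongle,
# using the USD prices quoted above (excluding shipping; prices change).
flongle_cell = 810 / 12       # $67.50 per Flongle flow cell
kit_per_run = 650 / 12        # rapid barcoding kit spread over 12 runs
cost_per_run = flongle_cell + kit_per_run

per_sample = cost_per_run / 12          # 12 barcoded samples per run
print(round(per_sample, 2))             # 10.14 -- close to the ~$12 above
print(round(per_sample / 4, 2))         # 2.53 per amplicon with 4 targets/barcode
```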

2022-Jan-06

Adapters and molarity

There are molarity effects that come into play when attaching adapters. Smaller molecules have less mass on a gel for an equivalent total molecule count, so will appear fainter. This makes it difficult to work out precisely how many small molecules there are, and counts will usually be underestimated for small molecules. This matters when adding in adapters, because the smaller molecules, due to their sheer numbers, are going to take up the majority of adapters. This is why it's important to use bead cleanup (or similar) to filter out small template sequences prior to attaching adapter sequences (assuming the resulting bias is acceptable).

2021-Dec-23

Metagenomic sequencing

Have a look at the methods for the MinION Analysis and Reference Consortium's metagenomic sequencing paper:

https://academic.oup.com/gigascience/article/9/6/giaa053/5855463#204848914

We used a 1.2 μm filter to remove suspended solids, then a 0.22 μm filter for capturing microorganisms. The paper describes a bespoke enzyme cocktail used with the DNeasy PowerWater DNA Isolation Kit, but some of the consortium found that Sigma MetaPolyzyme works well as well (with the same kit):

https://www.sigmaaldrich.com/US/en/product/sigma/mac4l

I used a 3D-printed filter with an attached 10mm vacuum tube for my filtering. It has a base that screws onto a GL45-threaded bottle (e.g. the blue-lidded bottles that are commonly used in the lab), a hexagonal grid that sits on top of the base (to support the filter), and a funnel that screws onto the top. It also needs a 10mm push-fit tube connector (or similar) for connecting up the vacuum:

https://www.prusaprinters.org/prints/79389-vacuum-water-filter

Because we were doing rapid PCR barcoding for sequencing, we weren't too concerned with breaking up the DNA with the PowerWater kit. It may be necessary to modify the preparation process if reads longer than 1-2 kb are needed.

2021-Dec-17

Sequencing Duration

Nanopore sequencers will keep going for as long as you let them run; the user is in control of deciding when to end sequencing. They'll even keep trying after the number of actively sequencing pores drops to zero.

The nanopore sequencers work a little bit like a battery (or maybe a capacitor): they have more capacity for sequencing at the start of the run, and sequencing gradually slows down over time.

If you want to get the best sequencing performance [i.e. fastest time to results], then you can run the flow cells for a short period of time (e.g. 6 hours), and switch over to a new flow cell at the end of that. If you want to squeeze out every last drop of sequencing capacity from the flow cells, you could run the flow cells until the number of available pores drops to zero (possibly with some flow cell washes in the middle to clear away some gunk that clogs the pores during the sequencing run).

Run time depends on application: a confirmatory sequencing of a PCR product could be done in less than half an hour, whereas a human whole genome sequencing run could take a few flow cells and a week's worth of sequencing (depending on sample preparation ability).

Small samples

It's very unlikely that libraries loaded into the flow cell will deplete completely before the flow cell stops being able to sequence anything more. The flow cell just doesn't have the sensitivity to pick up and sequence everything that it's given.

In addition, running a flow cell on empty (i.e. with nothing sequencing) is a quick way to run it down, because the sequencing capacity depletes quicker when the pores are unoccupied. It's almost never a good idea to load low inputs onto a flow cell. If there are fewer than 100 reads after 10 minutes of sequencing, it's much more cost-effective to stop the flow cell, work out what went wrong, and load a better library next time.

If it's not possible to get away from a small sample, a better approach is to load the sample together with a long, high quality DNA library (e.g. ONT's lambda control sample). This will keep the pores occupied, allowing more sequencing time for picking up the rare sample. After sequencing has finished, the sample reads can be fished out in software by discarding any reads that match the lambda genome.

2021-Dec-16

Sample preparation

Performing an additional QC (tape station or gel) is a good idea, especially when it's the first attempt on a new sample type. Understanding the relationship between the library length and the sequenced read length is useful for improving yield.

I've found that 20-100 ng / μl starting concentration seems to work well for sequencing. Most often we're sequencing DNA from plasmids 5-12 kb, which has a very tight length distribution from a single cut from the transposase enzyme.

If you've got plenty of library, only add in what you need for good sequencing. It is possible to give the adapter binding process too much template, and to overload flow cells, so pay attention to the fmol recommendations in the ONT sample preparation guides.

If you do have enough library for reloading, consider doing a quick initial run (e.g. 10-30 minutes) to check barcode proportions. After looking at the results from that quick run, you can rebalance the barcodes via a second adapter preparation and library load (i.e. add more of the samples that are under-performing).

2021-Nov-18

Loading Flongle flow cells

Here's my current approach for loading Flongle flow cells using a P200. I try to avoid pipetting directly into the loading port:

  1. Unseal the flow cell
  2. Add tape to waste port B
  3. Drop 25 μl of flush solution onto the loading port to form a dome
  4. Place pipette into loading port and dial up about 5 μl to check for bubbles (dial up to remove bubbles if they exist)
  5. If liquid is not dropping from the loading port, place pipette into waste port A and dial up 20 μl or until liquid starts dropping from the loading port
  6. Set pipette to 20 μl, press down while in mid-air to expel air, then place into waste port A and slowly release the plunger. If liquid starts dropping from the loading port, stop releasing and lift up the pipette.
  7. Repeat step 6 with a faster release speed until liquid starts dropping from the loading port

Occasionally I notice bubbles at the loading port while I'm in the middle of doing this, in which case I return to step 3.

Just before I'm ready to load library, I do the following:

  1. Remove the tape from waste port B
  2. Drop another 30 μl of flush solution onto the loading port, wait for it to drain through
  3. Drop 30 μl of library onto the loading port, wait for it to drain through
  4. Re-seal the flow cell by rolling my finger across the plastic adhesive cover, trying to avoid putting pressure on the flow cell matrix

2021-Sep-18

Sequencing with low read count / yield

For low concentration sequencing, the best thing to do is spike the sample into a lambda DNA run (e.g. 10% sample, 90% lambda). The lambda will keep the pores occupied, increasing sequencing yield for the remaining samples.

The MinION is a high-throughput sequencing device. Flongle flow cells have an output of 200-2,000 megabases; MinION flow cells have an output of 2,000-20,000 megabases (mostly dependent on sample prep experience). If you only want a few reads, then use Sanger sequencing.

Sequencing with low cost (per sample)

For a single sample, the cheapest possible sequencing cost is $125 (rapid sequencing kit + Flongle flow cell).

But when running multiple samples on the same flow cell, getting below $20 USD / sample is easily possible. As one example, the standard rapid barcoding kit allows multiplexing for up to 12 samples on a single run, so that'd be $10.80 per sample on a Flongle flow cell. We use this kit frequently for plasmid sequencing and assembly, getting >100X coverage from a Flongle run after a few hours of sequencing. It also works well for PCR amplicons that are longer than ~300bp. Flongle flow cells are trickier to get working, but there's a huge cost saving once that process is worked out.

If the sequences are easily distinguishable (e.g. different gene targets), then inputs can be combined without multiplexing, because the different bits can be fished out in software after sequencing. The only limit to that is the output of the sequencer. You could sequence amplicons from a hundred different genes in 12 samples for the same cost as a single gene in 12 samples (with a hundredth of the yield per gene). Twelve targets from 12 samples on a Flongle flow cell would work out to a little under a dollar per target.

There is also now a 96-barcode rapid kit, which can do 6 runs of 96 barcodes on a MinION flow cell (or 12 runs of 96 barcodes on a Flongle flow cell) for $990, so using a bulk-purchased $500 MinION flow cell that works out to be $7 USD per sample, or $11 per sample on a single-purchased $900 MinION flow cell.
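
The per-sample arithmetic for the 96-barcode kit, as a sketch using the prices quoted above (USD, consumables only):

```python
# Per-sample arithmetic for the 96-barcode rapid kit: $990 covers
# 6 MinION runs of 96 barcodes (prices quoted above; USD, consumables only).
kit_per_run = 990 / 6                    # $165 per 96-sample MinION run

def per_sample(flow_cell_usd):
    return (kit_per_run + flow_cell_usd) / 96

print(round(per_sample(500), 2))  # 6.93 -> "~$7" with a bulk-purchased flow cell
print(round(per_sample(900), 2))  # 11.09 -> "~$11" with a single-purchase flow cell
```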

This is all out-of-the-box sequencing; no scaling needed.

Getting a bit more technically challenging, it's possible to also do combinatorial multiplexing, using up to 96 PCR barcodes [kit cost $170 per run] in tandem with up to 96 ligation barcodes [kit cost $100 per run], i.e. 9216 samples in one run. Calculating costs is trickier with that, because there are per-sample costs associated with external reagents required for the ligation kits.

2021-Sep-04

Plasmid Sequencing

I follow one piece of advice that @Divya Mirrington gave me about pooling: create a pooled sample with volumes that you're confident about, then remove an aliquot from that for adding the remaining reagents [paraphrased]. I don't do any additional cleanup for purified plasmid DNA; it tends to sequence very well on flow cells without it.

My key requirement for plasmid sequencing is a concentration of >20 ng/μl (ideally measured by Qubit or Quantus). Concentrations over 100 ng/μl should be diluted down. If the plasmids can all be diluted to exactly the same concentration (but at least 20 ng/μl), or they're all similar lengths, that makes creating an equimolar pool much easier.

When creating the pools, I add at least 1 μl of the sample that needs the smallest volume (more if the total pool volume would be less than 11 μl), then add the corresponding amounts of the other samples to create an equimolar pool. I then take 11 μl from the pool to be used for rapid adapter addition.

If samples are equal concentration:

Add amounts according to the length of the plasmid divided by the length of the shortest plasmid. For example, if there are two plasmids, one with length 3kb and another with length 35 kb, then add 1 μl of the 3kb plasmid, and 35/3 = 11.7 μl of the 35kb plasmid.

If plasmids are roughly equal length (i.e. less than ~10% length difference between plasmids):

Add amounts according to the concentration of the highest-concentration sample divided by the concentration of the plasmid. For example, if there are three plasmids, one with concentration 50 ng/μl, one with concentration 35 ng/μl, and one with concentration 20 ng/μl, then add 1 μl of the 50 ng/μl plasmid, 50/35 = 1.4 μl of the 35 ng/μl plasmid, and 2.5 μl of the 20 ng/μl plasmid. The total volume of this pool will be less than 11 μl (1 + 1.4 + 2.5 = 4.9 μl), so in this case I would triple these volumes (3 μl; 4.2 μl; 7.5 μl) to create a pool of > 11 μl.

If samples are different concentrations and different lengths:

Make the sample prep easier. Use multiple flow cells for different plasmid length ranges. Dilute higher-concentration samples down to match the lowest-concentration sample. I don't recommend trying to do both calculations at the same time to determine added volumes, because there's a much higher chance of getting the amounts wrong, leading to wasted samples or wasted flow cells.

If you have a sufficiently-accurate pipetting robot, a sample sheet, and someone who is comfortable with equations:

Pre-calculate amount to add assuming 12 μl total pool volume:

ratio = length / max(length) * max(conc) / conc

volume = ratio * 12 / sum(ratio)

[That's my guess at the right equations; please let me know if there's an error]
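
Those equations can be sketched in code (my own function, assuming a 12 μl total pool as above):

```python
# The pooling equations above, assuming a 12 ul total pool volume.
def pool_volumes(lengths, concs, total_ul=12):
    """Equimolar pool: ratio = (length / max length) * (max conc / conc)."""
    ratios = [(l / max(lengths)) * (max(concs) / c)
              for l, c in zip(lengths, concs)]
    return [r * total_ul / sum(ratios) for r in ratios]

# Two plasmids, 3 kb and 35 kb, both at 50 ng/ul:
print([round(v, 2) for v in pool_volumes([3, 35], [50, 50])])
# [0.95, 11.05] -- same 1 : 35/3 volume ratio as the worked example above
```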

2021-Aug-11

Pore Blocking

My own experience is that blocking starts becoming a problem when reads are over ~20kb. A nuclease digest and reload is essential for getting decent reads once there are reads over 100kb.

If long reads aren't necessary, but they exist in the sample, pipette or needle shearing can be used to reduce read lengths prior to adapter ligation (e.g. 1000 μl pipette tip, 50 times).

My current hypothesis is that the physical length vs the sequencing well size plays a role in this. Sequencing wells are 100μm across, which works out to about 300kb as a stretched-out linear strand (or 150kb from edge to centre). Any reads longer than that need to bunch up just to get inside the sequencing well, let alone navigate into the pore. I've noticed that DNA that is more repetitive, especially when there are lots of reverse-complement repeats, tends to be more prone to blocking the pores.
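
That estimate follows from the ~0.34 nm per base pair rise of B-form dsDNA; as a quick sanity check (helper names are my own):

```python
# Sanity check for the well-size estimate: B-form dsDNA rises ~0.34 nm
# per base pair, so a straight strand spanning a 100 um well is ~300 kb.
RISE_NM_PER_BP = 0.34

def bp_spanning_um(length_um):
    return length_um * 1000 / RISE_NM_PER_BP  # um -> nm, then nm -> bp

print(round(bp_spanning_um(100)))  # 294118, i.e. ~300 kb across the well
print(round(bp_spanning_um(50)))   # 147059, i.e. ~150 kb from edge to centre
```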

2021-Jun-15

cDNA sequencing

If it's your first cDNA run, you want it done yesterday, and you have extracted RNA sitting on the bench waiting to be converted for sequencing, it might be worth pausing to think about that. It's typical for the yield and quality of the first nanopore runs on any new sample type to be lower than what is obtained after gaining more sample prep experience, so it's worth giving yourself some breathing space to test out cDNA sequencing before jumping all-in.

Even if none of that's the case, downstream bioinformatics will be easier and faster when using one of the ONT cDNA kits - I recommend the PCR-cDNA barcoding kit (and associated external reagents), because it allows strand-specific analysis of multiple samples from one MinION flow cell.

If you're not using the nanopore cDNA kits, you'll still need at least whole-transcriptome amplification reagents. Assuming you have those, you could probably get away with feeding the amplified products from each sample into the rapid barcoding kit, bearing in mind that this will exclude short transcripts, will probably have isoform detection issues due to shorter sequences, and may have other systematic errors that will be difficult (and time-consuming) to work around during downstream analysis.... And you'll probably need to create your own data analysis pipeline, because long nanopore reads don't work well with short-read workflows, and most long-read workflows expect full-length transcripts.

Another alternative would be using a ligation kit on the amplified PCR products, which would substantially reduce length-based issues. As with the rapid barcoding kit (or any non-cDNA kit, for that matter), you won't be able to do any stranded transcript analysis, which makes it more difficult to distinguish between sense and anti-sense RNA.

2021-Apr-20

Indigenous management of data

While I do agree that open sharing of data is a good ideal, I don't think this should be a hard requirement.

It's not great to take control away from people who don't have much control to start with. Many indigenous groups consider themselves to be guardians of endangered species, and have been historically given little control over what is done with the outputs of research on the things that they want to protect.

To further demonstrate the issues around this, perhaps it's helpful to point out the apprehension that ONT has around releasing their sample preparation protocols publicly (especially including the release of old, obsolete protocols), or the concerns Clive Brown has had in the past about releasing his own genome and/or draft data without restrictions.

There is some more information about indigenous management of materials on the Local Contexts website. This is centred around the idea that there should be metadata tags applied to objects to recognize that there could be accompanying cultural rights, protocols and responsibilities that need further attention for future sharing and use of materials:

https://localcontexts.org/

ONT have responded to this critique by commenting that exceptions to their rules would be considered on a case-by-case basis. The problem with such an approach is that subjective filters tend to favour privileged people, possibly because they are familiar with how other privileged people expect the world to work. By stating that public release of data is a 'must have' requirement, people who don't feel comfortable with open sharing of their data are less likely to apply.

Saying something along the lines of "you can ignore some of the rules, and we'll work through that" is not a helpful response to the issue. It creates an additional class of applicants that could be pulled into the project (i.e. those who don't mind ignoring rules in order to achieve their goals), but I'm not convinced they wouldn't have applied anyway.

I liked the approach ONT took with the first MinION Access Program: essentially, anyone who could write an abstract ended up with a MinION (unless they were a competitor). It would have been nice to know that in advance, in which case it would have been a great example of an ideal application filter: a low bar with consistent, declared requirements.

The questions asked on the registration form for Org.one seem like informational questions rather than application filters, which is the preferred approach; it nicely matches my remembered experience of other Nanopore project applications, such as that initial MinION Access Program.

2021-Mar-06

Modified bases

The fields from the modified base table in the fast5 file correspond to the likelihood of a modified base, according to the methylation model specified in the fast5 file.

"The contents of the table are integers in the range of 0-255, which represent likelihoods in the range of 0-100%"

https://community.nanoporetech.com/protocols/Guppy-protocol/v/gpb_2003_v1_revv_14dec2018/modified-base-calling

Note that these might be inverted from what you would expect, depending on how you interpret "likelihood". The explanation for methylated As might help:

"Given that an A was called, the likelihood that it is a canonical A is ~25% (63 / 255), and the likelihood that it is 6mA is ~75% (192 / 255)."

Also note that these are conditional likelihoods; it's necessary to look at the called base (which isn't included in the table) in order to properly interpret the fields.
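As a minimal sketch of the conversion (the 0-255 scale and the 63/192 example come straight from the Guppy documentation quoted above):

```python
def table_entry_to_prob(value):
    """Convert a 0-255 modified-base table entry into a probability.

    These are conditional likelihoods: an entry only makes sense
    alongside the base that was actually called at that position.
    """
    if not 0 <= value <= 255:
        raise ValueError("table entries are unsigned 8-bit integers")
    return value / 255

# The documented 6mA example: given a called A,
# 63/255 ~ 25% canonical A, and 192/255 ~ 75% 6mA.
p_canonical = table_entry_to_prob(63)
p_6mA = table_entry_to_prob(192)
```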

FWIW, I've written a script to convert the probabilities from FAST5 files into modified sequences. Using the new 5mC model, it will convert methylated Cs into a lower-case 'm'. It checks that the predicted modification matches the called base, and applies a probability threshold above which a methylated symbol is emitted:

https://gitlab.com/gringer/bioinfscripts/-/blob/master/multiporer.py

[Note: this python script needs h5py and numpy]

Usage:

./multiporer.py fastq reads.fast5

or

./multiporer.py fastq read_directory

2021-Jan-27

Rapid Barcoding

If you're sequencing from purified mtDNA or long amplicons (e.g. two 8kb amplicons, or anything longer than 1kb for that matter), the rapid barcoding kit works really well.

The rapid barcoding kit produces a higher yield with longer sequences (i.e. >10kb), but the yield doesn't matter as much when sequencing amplicons, plasmids, or other similar short regions. I've found the transposase fragmentation to be essentially random for amplicon sizes down to 600 bp (possibly shorter than that as well).

Reads will be fragmented (because that is required for the adapter attachment to work), but I haven't found read fragmentation to be an issue when looking at assembly and/or variants. As long as the yield is sufficient, the main difference (from a downstream analysis perspective) will be fewer chimeric reads formed during the sample preparation process.

2020-Dec-13

Bubbles

The flow cell is effectively an electrolytic cell: a voltage applied across a current-carrying solution will cause the breakdown of molecules in the solution (including the breakdown of water into gaseous hydrogen and oxygen). There is also some release of dissolved gases in the solution due to the flow cell heating up.

ONT have previously done all they can to reduce the dissolved gases in the flow cell, but given that this is a normal chemical process, I don't think there'll be a way to completely prevent it from happening.

As long as the bubbles stay small and float above the sensor array, there shouldn't be any loss of capability. If they get bigger and stick to the bottom of the array, then there'll be an obvious drop in available sequencing channels where the bubble is (on the channel view in MinKNOW). The best thing to do in that case is to leave the bubbles where they are until sequencing is finished; bubbles that move around the flow cell will destroy more pores.

2020-Dec-01

Plasmid sequencing

Plasmid sequencing is very easy. I ask people to prepare / purify their plasmid DNA to a concentration of at least 20 ng/μl, and run that product through the rapid barcoding kit.

I recommend that people get a rapid barcoding kit rather than a rapid sequencing kit - I find it a bit odd that ONT still offer the rapid sequencing kit. The preparation process is identical, but you get 12-sample multiplexing for six runs for $50 more.

I like running until I get 200X coverage in the least-covered sample, which usually happens in under half an hour on a MinION flow cell, or 2 hours on a Flongle flow cell.
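To make that stopping point concrete, here's the arithmetic (the 5 kb plasmid size and the even barcode split are illustrative assumptions, not fixed values):

```python
def yield_needed_bp(plasmid_bp, target_coverage, n_samples):
    """Total bases required for every sample to reach the target depth,
    optimistically assuming an even barcode distribution."""
    return plasmid_bp * target_coverage * n_samples

# Twelve 5 kb plasmids at 200X needs only ~12 Mb in total, which is
# why a MinION run can usually be stopped within half an hour.
total_bp = yield_needed_bp(5_000, 200, 12)
```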

Here's a video of me attempting to do this demultiplexing protocol for my LinuxConf talk using a 200μl pipette:

Sequencing DNA with Linux Cores and Nanopores

Flongle reloading

Flongle reloading is theoretically possible, but the flow cell construction makes it challenging to do.

The Flongle flow cell has essentially a combined priming / SpotON port, and two waste ports around the outside. The input port is close enough to the flow cell matrix that positive pressure from that port has a good chance of damaging the matrix. To cycle liquid through the flow cell, liquid needs to be dropped onto the input area to form a large meniscus, then pulled through via the waste ports (one waste port might need to be taped up for that to work).

2020-Nov-20

cDNA Sequencing

Direct RNA sequencing is a slower process compared to cDNA sequencing (about 1/4 of the speed), can't be amplified, and is a bit more fiddly to work with (reads are sequenced backwards, i.e. 3' to 5', and RNA modifications are present in the signal).

1-2M reads for one sample would be fine, although on a single flow cell it's a little expensive compared to what could be possible.

I prefer the PCR-barcoded cDNA sequencing kit. Our first cDNA runs yielded 1-2M reads (a few years ago now), but we're getting more reasonable yields now, and I expect yields will get better once the adapter ATP fix moves through the rest of the kits.

My aim is >1M reads per sample; we can usually get that for most, if not all samples, especially if we do a 20min run with half the library at the start in order to gauge barcode distribution, then rebalance samples based on those results.
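One simple way to turn those pilot-run barcode counts into rebalanced pooling volumes is to weight each sample inversely to its observed read count. This is an illustrative sketch of that idea only (the barcode names, counts, and total volume are made up, and this is not an ONT-documented procedure):

```python
def rebalance_volumes(pilot_counts, total_volume_ul=12.0):
    """Suggest per-sample pooling volumes for the main run, inversely
    proportional to the read counts seen in a short pilot run, so that
    under-represented barcodes get topped up."""
    weights = {bc: 1.0 / max(count, 1) for bc, count in pilot_counts.items()}
    scale = total_volume_ul / sum(weights.values())
    return {bc: w * scale for bc, w in weights.items()}

# Hypothetical 20-minute pilot: BC02 is the most under-represented,
# so it gets the largest share of the rebalanced pool.
volumes = rebalance_volumes({"BC01": 40_000, "BC02": 10_000, "BC03": 20_000})
```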

I prefer doing around six samples per run, ideally with similar sample types; that seems to be a good sweet spot for high yield. If there are 12 samples prepared as one sequencing library, my recommendation is to run the six best samples first, stop the run when the yield is sufficient, do a nuclease flush, then run the remainder. This ensures that the pores are kept occupied, which increases the overall yield.

2020-Jun-12

Research paper - metagenomic sequencing

Our multi-continent river metagenomic sequencing paper is finally published, see here:

https://doi.org/10.1093/gigascience/giaa053

This is international research (involving river sampling from three continents) that has been over four years in the making. A lot of time was spent on adapting to a rapidly-changing technology, organising and distributing kits and protocols, basecalling and shifting data around, and finding a good way to report results from a diverse range of river systems throughout the world.

If you're wondering why we didn't do something, the most likely answer is, "Because the paper was big enough as it is." We enthusiastically encourage others to explore our dataset and carry out additional follow-up work; raw signal FAST5 and FASTQ files are available from ENA via accession PRJEB34137 / ERP116996.

Thank you very much to the amazing international team who has helped with river sampling, sample preparation, data analysis, manuscript writing, and additional editorial work to get this marathon research over the finish line. And also, many thanks to Oxford Nanopore for supporting this project (in particular, Rosemary Dokos and Alina Ham), to the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) service in the UK for facilitating the upload and transfer of raw FAST5 files, and to the Oxford Nanopore community for the many and varied discussions around sequencing techniques, troubleshooting, and technological frustrations. It's a big relief to finally be able to say, "We did it!"

For those interested in replicating our process (which will hopefully take a few weeks, rather than a few years), here's the method blurb from our paper (paraphrased / slightly shortened):

Water samples were taken at 0.5–1 m depth during daylight hours at a time when neither drought nor recent excessive precipitation events occurred within 1 week preceding sample collection. River water (2–4 L) was collected for filtration in sterile collection bottles and was processed immediately or stored at 4°C for prolonged transportation time or until ready for filtration.

The water samples were subsequently processed through a GF/C filter to remove suspended solids, particles, etc. (size retention: 1.2 µm). The water recovered after GF/C filtering was subsequently filtered through a 0.22-µm Durapore filter to capture microorganisms present. Upon completion of all filtering, nucleic acid was recovered using a modified procedure combining enzymatic lysis and purification using a DNeasy PowerWater DNA Isolation Kit.

Subsequently 10–50 ng of DNA was used in conjunction with the Rapid Low Input by PCR Barcoding kit (SQK-RLB001, Oxford Nanopore, Oxford, United Kingdom [now Rapid PCR Barcoding kit, SQK-RPB004]) in accordance with manufacturer's protocols to prepare WGS libraries for use with a MinION device, with minor alterations [2.5μl FRM; 20 cycles PCR; PCR reaction volumes doubled to 100μl]. The prepared library was then loaded into the MinION flow cell (R9.4 [now R9.4.1]) in accordance with manufacturer's guidelines and the unit was run for a full 48 hours of sequencing.

Whole Genome Sequencing reads were processed for basecalling and quality control filtering. Reads were processed using Kraken2, and also uploaded via the command-line API and processed using MG-RAST. Pavian plots of the representative taxa for each metagenome were constructed using the Kraken2 output. Using both MG-RAST and Kraken2 taxonomy results and the Bray-Curtis distance matrix among normalized family counts, PCA was implemented. To evaluate putative ecosystem-related functions from the reads, the MG-RAST server was used to compare data sets to 3 controlled annotation namespaces: Subsystems, KO, and COG proteins. Normalized function data for each river sample were compared using PCA in MG-RAST.
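The Bray-Curtis step mentioned above is straightforward to reproduce. Here's a minimal sketch with made-up normalized family counts (the real analysis used the Kraken2 / MG-RAST outputs, and the river names and numbers below are purely illustrative):

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors (0 = identical)."""
    return (sum(abs(x - y) for x, y in zip(a, b)) /
            sum(x + y for x, y in zip(a, b)))

# Toy normalized family counts for three hypothetical river samples.
samples = {
    "river_A": [10.0, 5.0, 0.0],
    "river_B": [8.0, 6.0, 1.0],
    "river_C": [0.0, 2.0, 12.0],
}
names = list(samples)
# Symmetric distance matrix that would then be fed into PCA / ordination.
dist = [[bray_curtis(samples[i], samples[j]) for j in names] for i in names]
```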

Ngā mihi nui [thank you very much],

  • David Eccles

Pore Depletion

I consider pore depletion over time to be normal flow cell behaviour. The way it was explained to me is that the flow cell works a bit like a battery, in that it becomes less efficient over time.

Nanopore sequencing depends on the ability to measure a flow of ions moving through the pores that are embedded in a polymer membrane. Given that there's no way to get ions back into the loading side of the pores, the difference in ions between the two sides will reduce over time. An increase in the magnitude of the voltage applied across the membrane can compensate for this a little bit, but eventually the difference will be so low that ions won't budge regardless of the applied voltage. And without a flow of ions, there's no sequencing.

Assuming this, the best way to extend the life of the flow cell during sequencing is to reduce the number of small ions crossing the pores, and the way to do that is to keep the pores occupied. The more pores that are actively sequencing, the slower a flow cell will deplete, because the flow of "carrying ions" is reduced while long polymers are moving through the pore. Every time a pore unloads and reloads with a new sequence, there'll be another rush of carrying ions, so reducing the number of pore reloads by using longer template polymers will also extend the life of the flow cell.

A flow cell starting a sequencing run with about 65% of pores loaded with actively-sequencing strands means that about 35% of the available pores are unoccupied, acting as big holes for lots of carrying ions to move through. When this is noticed, increasing the amount of library loaded, or loading a more concentrated library, might help to reduce the depletion of the flow cell.

Follow-up comment

My knowledge of this has been acquired from discussions with ONT staff during London Calling, and also a presentation that Divya Mirrington gave during a workshop in Wellington last year (who explicitly mentioned and clarified the ion flow model, which led to an "aha" moment for me). I think the battery concept may have been discussed in Clive Brown's presentations as well, but unfortunately I haven't written it into any of my notes.

... there's also this, mentioned in the Wash kit announcement:

Please also note that the common voltage drifts in the course of a run due to the depletion of the redox chemistry in the nanopore array. If you are reusing a Flow Cell that has already run for a certain amount of time, it is advised to adjust the starting potential to maximise the output.

And, looking again now, some further mentions on the community of the flow cell being a bit like a battery.

The main purpose of refuelling is to provide energy (in the form of ATP) for the helicase to unwind and ratchet the DNA. It might provide a few more ions, but given that the trans side is not accessible, there will always be an increase in electrical resistance over the course of a run.

With regards to the statement about inactive pores, it really depends on how they're inactive. A decrease in the light-green to dark-green proportion over the course of a run might be associated with a loss of available energy, in which case refuelling might work to improve the sequencing rate. If that light-green:dark-green ratio remains constant, but the number of "Unavailable" and/or "Out of Range" pores increases, then it would suggest to me that the pores are getting blocked, in which case a nuclease flush (and library reload) could improve sequencing rates. However, sometimes neither of these situations is happening, and the apparent issue is an increase in the count of "No pores from scan". This could also be happening due to blocked pores, but I expect it will also happen if there's no current flow for other reasons (e.g. ion buildup on the trans side, pore damaged by chemicals). Sometimes the membrane will get damaged as well and let too much current through, in which case the pore reading becomes "Saturated", and is also unusable.

If you're looking for a DOI to cite, I had a go at summarising and explaining that part of Dr. Mirrington's presentation in an internal talk I gave at our institute in June last year:

DOI: 10.7490/f1000research.1116830.1

I did try getting this into the introduction of a research paper, but one of my co-authors thought it was too simple for an academic text.

2020-May-16

[from reddit]

Accessibility

Sequencing-by-service is not possible on cassava farms in Africa. The turnaround time for service-based sequencing is too long to be useful for saving crops:

https://cassavavirusactionproject.com/

Accuracy

Nanopore sequencing can produce results similar to or better than Illumina for most applications, at a similar or lower cost. This doesn't mean that the individual reads have exactly the same per-base quality, just that the end result after downstream data analysis can be similar.

With the new Guppy v3.6 basecaller, it's looking like Illumina polishing is no longer needed. I'm eagerly awaiting an update from Kat Holt's lab to confirm that.

There's a 1D^2 kit that encourages template DNA to be followed shortly after by its reverse complement pair, and other amplification protocols that pre-amplify and ligate copies of template DNA prior to sequencing for getting very high-accuracy reads (R2C2 is one example).

My belief still remains that accuracy is a software problem, rather than a pore problem. The recent improvements in accuracy demonstrate that (i.e. I've been able to re-call reads from 2017 and get 96% mean accuracy). Base calling will always get better over time, because we will never have a complete model of the physics of a DNA sequence, so there will always be room for improvement.

Whether or not the current base calling accuracy is good enough is situation-dependent. Where possible, older reads can be re-called using new basecallers (or custom-trained basecallers) to improve accuracy, assuming fast5 files are kept.

The biggest issue for me with colourspace is/was that representing mapping errors is a challenging problem (i.e. how do you demonstrate in an easily-understandable way that ACTCTGTCTCGATCGATC is very similar to ACTCTGTCTATCGATCGA). Related to this, most software tended to map in base space, rather than colourspace. Nanopore sequencing has a similar issue - mapping would be better in signal space rather than base space - but I think it's a more solvable problem because the underlying system (i.e. ion flow signal) contains a lot more information.

Each nanopore read comes from a single molecule. Pile that up 100 times, and the accuracy is better than 99%. Nanopore error is mostly random, although within that randomness there are a lot of deletion errors. Minor variants can be found / quantified from nanopore sequencing, although I concede that the sensitivity of Illumina for minor variants is higher (e.g. maybe 1 in 10,000 vs 1 in 100 for nanopore) - this is one of the applications where billions of reads are more useful than long reads, and can be important for cancer variant analysis.
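The "pile it up 100 times" claim can be sanity-checked with a simple binomial model. This is a big simplification (it treats errors as independent substitutions and ignores the systematic indel component mentioned above), but it shows why mostly-random errors average away so quickly:

```python
from math import comb

def majority_vote_accuracy(per_read_acc, depth):
    """Probability that a per-base majority vote is correct, modelling each
    read's base as an independent draw that is right with probability
    'per_read_acc' (ties at exactly depth/2 are counted as wrong)."""
    p = per_read_acc
    return sum(comb(depth, k) * p**k * (1 - p)**(depth - k)
               for k in range(depth // 2 + 1, depth + 1))

# Even 90%-accurate reads give an essentially perfect consensus at 100X.
acc_100x = majority_vote_accuracy(0.90, 100)
```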

Illumina sequences are derived from a consensus sequence of hundreds to thousands of individual molecules; that consensus approach gives the output reads high accuracy. Illumina sequencers don't have sufficient optical resolution to capture fluorescence from individual molecules. Sanger sequencing measures fluorescence of all the DNA molecules that have progressed through the gel and ended up at the same final distance at the end of the run. IonTorrent clusters measure local proton counts of multiple molecules, in an electronic fashion that is similar to a single-channel optical measurement.

Sanger sequencing is also a consensus sequencing method. I would expect that anyone who's attempted to sequence regions containing heterozygous INDELs via Sanger might be comfortable with considering Sanger sequencing to be a consensus sequencing method. When I use an alignment program (such as Seaview), align sequences and carry out the operation "Consensus sequence", it will create a consensus sequence from the component sequences that has "a bunch of noise" [i.e. 'N'] where there is no obvious consensus at that base location. This matches the results that I get from Sanger sequencing (and, for what it's worth, from the automated consensus generated from 2D nanopore sequences back when ONT first commercialised their sequencer).

PacBio can generate consensus reads by repeatedly sequencing the same individual molecule, but so can Nanopore (via linear amplification or rolling circle prior to sequencing), with similar improvements in accuracy. Linear amplification was announced by Nanopore last year, and I expect it will eventually make its way into their official protocols. There are already published protocols about using rolling-circle with nanopore sequencing (e.g. R2C2).

The output from each ZMW in a PacBio sequencer is a single synthesised molecule; if the polymerase falls off, then the sequencing stops. Consensus reads are generated by PacBio sequencers in software, optionally, after the sequencing of each molecule has finished. It's possible to retrieve movie files from each ZMW, and re-call to get the individual components of each circular consensus read.

A consensus sequence generated from individual nanopore reads is higher quality than the individual sequences. We can't ever know the true quality of the individual reads (unless called reads get to 100% accuracy), but the basecalled quality is currently lower than Illumina reads and PacBio circular consensus reads.

The current requirement for multi-read consensus for high-quality is not as much of a limiting factor as it appears, because for most applications where high accuracy matters (e.g. variant analysis, de-novo sequencing), pileup is carried out even for Illumina reads. Where that's not done (e.g. cDNA sequencing + gene counting, metagenomics), nanopore accuracy is usually sufficient to produce similar or better results than Illumina sequencing due to the long reads.

Yield

A typical yield for well-prepared sequencing on current technology is 10-15 gigabases on a MinION flow cell, and 70-100 gigabases on a PromethION flow cell.

For a 6 Mb bacterial genome, MinION sequencing would give roughly 1,700-2,500X coverage. For a 3.3 Gb human genome, PromethION sequencing would give 20-30X coverage.
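These coverage figures are just total yield divided by genome size:

```python
def depth(yield_gb, genome_size_bp):
    """Expected average coverage depth: total bases sequenced / genome size."""
    return yield_gb * 1e9 / genome_size_bp

# 10-15 Gb on a MinION over a 6 Mb bacterial genome:
minion_low, minion_high = depth(10, 6e6), depth(15, 6e6)
# 70-100 Gb on a PromethION over a 3.3 Gb human genome:
prom_low, prom_high = depth(70, 3.3e9), depth(100, 3.3e9)
```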

Unlike flow cells for other sequencers, runs on nanopore flow cells can be stopped, and the flow cells flushed and reused. There's still generally an upper limit of total sequencing time of 48-72 hours, and sequencing yield reduces over that time, but it allows for a couple of small-genome runs to be carried out on used flow cells.

Turn-around Time

The fastest possible time from sample to results is about 20 minutes. That's for a presence / absence test for high-concentration plasmid DNA, prepared using a rapid barcoding kit.

A typical run time for large nanopore runs is 2 days, but because nanopore sequencing is a serial process (pores are recycled; reads are available shortly after the templates exit the pore), data can be analysed during the run, rather than having to wait for the run to finish. Illumina sequencing is parallel (all clusters are extended at the same time), so it's usually not useful to analyse the data mid-run.

Cost

A typical well-prepared MinION run might produce a yield of 10-15 gigabases, for a cost of $1000 USD (down to half that price if buying 48 flow cells in bulk). Assuming a read length of 10kb (e.g. for a small genome assembly), that works out to around 1-1.5M reads - a cost of $600-$1000 per million reads.

An Illumina HiSeq run might produce a yield of 120Gb with reads of length 150bp, or about 800M reads per run. Assuming a run cost of $5000 USD (and I have no idea about Illumina run costs any more), that works out to be about $6 per million reads.

Per-base, costs are more similar. Assuming you're ignoring capital expenditure (which is effectively zero for MinION / GridION, and much higher for Illumina), price-per-base for MinION is similar to MiSeq, and price-per-base for PromethION is similar to NovaSeq. The read length is a big advantage in terms of specificity (e.g. a 1kb read is worth about a hundred 100bp reads in terms of covering genomic sequence, and that's ignoring the ability to span repetitive sequence).

That's for the first run on a flow cell. It's possible to squeeze out more use from the flow cells using a nuclease wash kit, and those subsequent runs are a lot cheaper.

The minimum run cost, although a bit more expensive per base (i.e. $90 for a Flongle flow cell + $650 for a rapid barcoding kit that can cover 12 runs of 12 samples), makes the technology more accessible to people who don't have much money.

Base Calling

Illumina sequencing depends on a cluster formed from hundreds to thousands of individual molecules being synthesised in tandem.

Illumina sequencing is possible because most of those individual molecules are lighting up in the same way, and consequently amplifying a fluorescent signal. Phasing / stoichiometric issues mean that those molecules don't all light up in the same way, which puts a limit on the length of the sequence: once synthesis of molecules gets too out-of-phase, the sequencing cannot reliably continue.

Nanopore sequencing works at the single-molecule level. There's no duplication happening; what goes through the sequencer and gets sequenced is a direct sampling of what's put on the flow cell (excluding a few control reads that are used for internal QC).

This allows nanopore sequencing to directly measure the template DNA itself, without any assumptions required around what bases pair with what (as is needed for any synthesis or ligation-based techniques). This near-model-free sequencing* means that the sequencer can capture oddities in the sequence (e.g. from abasic sites, non-standard bases, or methylation), which can be discovered post-hoc and incorporated into base-calling models, increasing base calling accuracy over time.

* near-model-free, because it assumes that the composition of the sequence can be fully-represented by ion flow, a bit like trying to guess the 3D structure of a person based on their shadow, and understanding what people ordinarily look like. A method that would be more model-free would be ultra-high resolution microscopy (e.g. using negative-refraction lenses, optical mirror nanoadjustments, or tunneling electron microscopy).

Data Analysis

For data analysis from called reads, the closest open-source user-friendly-ish thing is probably NanoGalaxy:

https://nanopore.usegalaxy.eu/

For basecalling from raw signal (i.e. generating called reads), I'm not aware of any user-friendly programs, but there are a few command-line programs. The one that seems the most promising to me is Chiron.

Nanopore is now much better on the Open Source aspect of their code, although they could still do more on giving it a proper Free Software license. The code for their newest basecaller, bonito, is available on GitHub:

https://github.com/nanoporetech/bonito

Commercial Use

MinION commercial use is a contractual / legal issue, not a technical issue. The T&Cs that users agree to include a condition that the MinION is only used for research, and not for commercial services. People who want a commercial / service license need to purchase a GridION ($50k USD, with ~$50k of included flow cells and sequencing kits to be used in 4 months), or PromethION ($165k USD, with ~$130k of included flow cells and sequencing kits). To complicate matters further, once licensed they can then use a MinION to provide sequencing services.

Aptamers

Aptamer sequencing is definitely an area that needs work. ONT claim that the sequencer will handle very short sequences fine (i.e. below 50bp), but there are software issues with that: the signal scaling issue that affects normalisation for basecalling (i.e. what you mentioned), and that the MinKNOW control software will skip over and not output any reads that are too short (this is, IMO, a more concerning issue, because the reads can't be recovered and re-called in the future).

One way to deal with the length issue for aptamers is to include a random ligation step prior to sequencing. There are also various methods to pre-amplify along a similar vein to PacBio's HiFi reads (e.g. linear amplification, rolling circle) which can be used to increase template accuracy.

Alternatively, short reads can be annealed to longer adapter sequences. This is the approach taken in this recent paper.

Cancer

Generally, nanopore is a lot better for large structural variations (e.g. deletions of hundreds to thousands of bases, or chimeric recombination events) because of the longer read length. Single-nucleotide changes have reasonable accuracy with nanopore, but the applications matter, and software that distinguishes between forward and reverse-complement mapping (important for nanopore variant discovery) is rare.

Looking for cancer needles in whole-blood haystacks is likely to be challenging with long-read nanopore sequencing; that's one situation where massive numbers of reads are actually helpful. This issue can be reduced a bit via cell-free sequencing, or shorter reads, or targeted sequencing, but that involves a loss of information (i.e. you need to know what you're looking for before you can do the sequencing).

cDNA sequencing

Nanopore devices can definitely be used for cDNA sequencing (and, for that matter, true native RNA sequencing). Accuracy concerns are a moot point for cDNA gene/transcript counting once you get above 90% (and Nanopore is near 95% with their most recent base callers). All that really matters is whether or not a sequence can be reliably assigned to its originating transcript.

It's my opinion that Nanopore cDNA sequencing runs have comparable (or possibly better) sensitivity and specificity than Illumina, with a lower cost, faster turnaround time, and true isoform-level results. This is because there's a lower noise floor (i.e. at most 2-5 reads for Nanopore vs ~100 for Illumina), which compensates for a lower read count (e.g. 1M reads for Nanopore vs 40M reads for Illumina). In addition to that, the longer reads make it more likely that mapped reads will uniquely hit an isoform, and more likely that mapped reads cover the entirety of a transcript. I've got a graph showing different Illumina read lengths vs Nanopore here:

https://twitter.com/gringene_bio/status/1047698591987322882

Install Base

The entirety of the Sequel II install base has a comparable throughput to a single P24 PromethION. According to Albert Vilella's chart, there are 45 PromethIONs out in the wild, and 4500 MinIONs (compared to 125 PacBio Sequel IIs).

Drawbacks

  • The amount of effort I have to put into convincing people to give them a try. Oxford Nanopore rely mainly on word of mouth to market their products, which means they rely a lot on the unskilled marketing ability of research scientists (such as me). Their staff will go to places to give talks, but generally only when they've been invited. I get money by analysing other people's data, so it's beneficial for me to put in that effort.

  • Heated discussions with critics who keep spreading misinformation like, "it'll never be used in a clinical setting", despite years-old published research contradicting those statements. I tend to assume that those statements are personal opinions / expectations, so generally try to avoid directly refuting or bringing attention to those statements.

  • People tend to equate "cheap" with "easy". It's a very different process from other sequencing, even ignoring that labs tend to get sequencing done as a service, rather than doing it themselves. Nanopore offer training, but it's very expensive, so most people try to muddle through themselves (hence the tendency for sequencers to collect dust).

  • Their sample prep protocols are not properly versioned, and have lots of traps for naive users (especially users who aren't used to working within a sequencing service facility).

  • Software for working with nanopore reads is very new, tends to be command-line based, and hence is not very user-friendly.

  • MinION sequencers (the $1000 ones) can't be used to provide commercial sequencing services to other people [Nanopore's response is that other people should just buy their own MinION].

  • The cost for GridION and PromethION is so low that it sets up "too good to be true" suspicions for funding (the commercial cost of flow cells included in the sequencer purchase covers the cost of the sequencer).

  • Technology updates. By the time a particular technology gets into a published paper, it's a common occurrence that Nanopore has moved on and created a better thing (and has "archived" / deleted their protocols for the published technology).

  • Data storage / data transfer, especially for PromethION. The raw signal for a read is 5-10 times the size of the associated called fastq file. A fully-loaded PromethION has a theoretical limit of a few terabases of sequence per day, so storing the raw signal for that sequence is not a trivial exercise. Keeping that raw signal is important, because it allows for recalling old runs with much higher accuracy and/or incorporating additional base change models (e.g. methylation).

  • ONT is a bit over-eager in their estimations of sequencing yield. I think they should be reporting median user yields, rather than their own internal yields.

2020-Apr-23

[from reddit]

PCR

A while ago, a group of people discovered a particular bacterium, Thermus aquaticus, that could survive in hot springs at around 70°C. This bacterium naturally needed to be able to copy its own DNA in order to survive. The protein that does that, Thermus aquaticus DNA polymerase (usually shortened to Taq DNA polymerase), was isolated in 1976 by Alice Chien (and two other researchers). This high-temperature protein (or, more correctly, a modified form of it) is a key part of the polymerase chain reaction (PCR) that is used as a core element of the most common COVID-19 diagnostic tests.

In the cells in our body, DNA is usually copied by the following rough process:

  1. A group of proteins latch onto the DNA and unzip it, locally separating apart the two strands that are stuck together
  2. Primer sequences bind to the DNA
  3. DNA polymerase latches onto both the primer sequence and the DNA strand to be copied, then extends the primer by adding additional blocks that are complementary to the blocks on the original DNA strand

That first step (i.e. unzipping and breaking apart the DNA, also called melting) can also be done by high temperature, as I've already explained. This high-temperature melting idea, combined with a high-temperature DNA polymerase forms the functional loop of the polymerase chain reaction:

  1. Heat (about 95°C) is used to melt the DNA template to be copied, which is also in solution with a mix of primers
  2. The solution is cooled down (to about 65°C), which allows the primers to attach to the DNA template
  3. The solution is heated up again to the ideal temperature for the DNA polymerase (about 72°C), which latches onto the primer sequence and the DNA strand to be copied, then extends the primer by adding additional blocks that are complementary to the blocks on the original DNA strand

This technique is called a chain reaction, because it can be carried out multiple times. Each heat/cool/extend cycle [assuming there are enough available primers and building blocks, and the primers can find DNA template in time] will double the amount of template DNA in solution. Starting from a single viral particle, 32 cycles of PCR (in optimal conditions) will create (or amplify) about 4 billion copies.
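The doubling arithmetic works out as a quick sketch (idealised: real reactions plateau as primers and building blocks run out, so this is an upper bound, not a prediction):

```python
# Idealised PCR amplification: each heat/cool/extend cycle doubles
# the template. Assumes perfect efficiency every cycle.
def pcr_copies(starting_copies: int, cycles: int) -> int:
    """Copies after `cycles` rounds of perfect doubling."""
    return starting_copies * 2 ** cycles

# One viral particle, 32 cycles:
print(pcr_copies(starting_copies=1, cycles=32))  # 4294967296, ~4 billion
```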

More recently, researchers have worked out how to attach fluorophores to DNA primers (also known as fluorescent tags, or probes), which release photons of specific light wavelengths when they're exposed to light of a different wavelength. A commonly-used fluorophore for the COVID-19 diagnostic tests is fluorescein amidite (usually shortened to FAM), which absorbs cyan light of wavelength 494nm, and emits green light of wavelength 518nm.

These fluorophores are combined with quenchers on the DNA primers, which chemically interact with the fluorophores so that they won't light up in solution. When the DNA polymerase tries to extend the primer sequence, it chops off the quencher, which allows the fluorophore to activate. By blasting the solution with light of the absorption wavelength, and measuring the amount of light produced at the emission wavelength, it's possible to get a pretty accurate idea of how many tags have been incorporated into the amplified DNA.

This is how most of the diagnostic PCR tests work for COVID-19: when a sample doesn't have any viral sequence, no amplification happens, and no green light is observed. When a sample has even a small amount of viral sequence, that gets amplified to billions of copies, producing a very obvious green light.

Primer Design

DNA is a very long stringy thing. It's sticky, really sticky. Ordinarily it sticks to a shadow copy of itself (which we call reverse-complement DNA; "reverse" because there's a physical inversion that happens with the chemical structure).

[There's a similar structure called RNA which is even more sticky, so ordinarily sticks to itself, rather than to a reverse-complement shadow copy. For simplicity I'll just write about DNA, but the general ideas here apply to both]

At a molecular scale, DNA is made of long strings of four building blocks, which we symbolise as A, C, G, and T. The shadow copy, or complements, of these bases are, respectively, T, G, C, and A (i.e. the sharp, straight letters stick together, and the roundish letters stick together).

If you heat it up in a solution, it gets less sticky and breaks apart into single strands. If you put some shorter reverse-complement DNA into that solution and cool it down again, then there's a chance that those shorter pieces will stick to the DNA (instead of, or in addition to the shadow copy). We call these short sticky pieces of DNA primers.
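The complement rules just described are easy to sketch in code (a toy illustration, using one of the example sequences from this section):

```python
# Complement pairing as described above: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """The 'shadow copy' of a DNA strand: complement each base, then
    reverse, because the two strands run in opposite directions."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("CTGTCTATCCAGTTGCGTC"))  # GACGCAACTGGATAGACAG
```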

DNA sequences are mostly random, but not quite. We call DNA sequences high complexity when they look random (e.g. CTGTCTATCCAGTTGCGTC), and low complexity when they don't (e.g. AAAAGGAAAAGCTAAAAAA). It just so happens that low complexity sequences, if they exist at all within a really long DNA sequence, tend to exist in many different places. If you're designing a primer to stick to a particular region of DNA, and that region includes low complexity sequence, then the primers will probably also stick to lots of other different regions as well.

One of the difficulties in designing primers is that the composition of DNA sequences changes how sticky they are at particular temperatures. It is generally, but not always, the case that longer sequences will start sticking at lower temperatures (you could see this as the longer sequences have more attachment points). For example, TGGAGCTAAGTTGTTTAACAAGCG and TTCTCCTAAGAAGCTATTAAAATCACATGG have very similar sticking temperatures (but quite different lengths), whereas AGTGAAATTGGGCCTCATAGCA and CGCAGACGGTACAGACTGTGTT have quite different sticking temperatures (but are both of length 22).

The chemical reactions we use to carry out genetic research work best at a specific temperature, so there's a balancing act in choosing primers that are sufficiently long that they stick to the specific DNA sequences of interest, but short enough that they will stick and unstick at a high enough temperature.
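One crude way to see that balancing act in numbers is a textbook GC-content approximation (my choice of formula; real primer design tools use nearest-neighbour thermodynamics and salt corrections, so treat these values as ballpark only):

```python
def melting_temp(seq: str) -> float:
    """Crude Tm estimate (deg C) for oligos longer than ~13 nt:
    Tm = 64.9 + 41 * (GC - 16.4) / length.
    Ignores salt concentration and nearest-neighbour effects."""
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41 * (gc - 16.4) / len(seq)

# The similar-Tm, different-length pair from the text:
for primer in ("TGGAGCTAAGTTGTTTAACAAGCG",
               "TTCTCCTAAGAAGCTATTAAAATCACATGG"):
    print(f"{primer} ({len(primer)} nt): ~{melting_temp(primer):.1f} C")
```

Even this blunt formula puts the 24nt and 30nt primers within a couple of degrees of each other, because composition matters as much as length.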

The polymerase chain reaction (PCR) that is used as a core element of the most common COVID-19 diagnostic tests (which I'm choosing not to explain here for the purpose of brevity) requires two primers to work properly (one at each end of the target region). The ideal temperatures for PCR mean that these primer pairs are often too short to be specific enough for a really accurate detection of the existence/abundance of a target region, so one or more additional tagging primers can be included as well.

[by the way, all of the sequences I've used as examples here come from the SARS-CoV-2 viral genome]

2020-Jan-16

Short 16S fragments

The shorter a fragment is, the less likely it is to be able to distinguish different organisms from similar phylogenies. The benefit of nanopore 16S over Illumina is demonstrably better with 1.5kb reads (e.g. see this paper). But between 250bp (the maximum useful read length for MiSeq, or arguably 300bp) and 1.5kb, there will be a range where MiSeq is better, due to the higher accuracy of a 250bp read balancing out the long-read advantage.

When fragments are shorter than 250bp, and cost/time-related factors are not considered, MiSeq should always be better at classification due to having higher single-base accuracy for cluster-amplified templates.

Short tagmentation

Why would short (100-500bp) amplicons not work for Nanopore sequencing? I can think of at least two reasons:

  • The tagmentation may be sequence-specific, so for short amplicons there is an increased chance that the target sequence is not present in the amplicon.

  • Base calling is more difficult on shorter sequences. The current basecallers require a reasonable length of sequence to determine the current range / distribution for scaling the electrical signal. When the sequence is short, that range might be skewed and throw off the basecaller. This is a software problem and should be fixed some time in the future. One thing that Clive Brown mentioned in the December Community Meeting was the possibility of scaling based on adapter sequence signal (which should have a much more consistent electrical profile).

This will be dependent on the template DNA; it's not a guaranteed failure. I think the best thing to do is just try it (e.g. with a washed, used flow cell) and see if it works.

2019-Dec-04

Shotgun Metagenomics

The costs are easier to calculate for nanopore sequencing because it's all publicly available on their store. The rapid PCR barcoding kit [$650 USD] is probably the best for quantitative shotgun metagenomics, because it gives a reasonable yield in terms of number of reads, but still has a simple sample preparation process (~15 minutes hands-on time + PCR).

For metagenomic assembly, longer reads are better, so the non-PCR rapid barcoding kit [$650 USD] is a better choice. Reads are fewer, but longer (e.g. 15~50kb, vs ~2kb with the rapid PCR kit). Sample prep is fast: I can do it in about 10 mins in the lab, and in about 30 minutes when I'm nervous because Linus Torvalds and Jonathan Corbet are watching.

The biggest benefit of nanopore sequencing is the low per-run cost. The costs of the two rapid kits are similar, because the flow cell and kit prices are identical. Yields are also similar; I'd expect 1-5 Gb from the first few runs with these kits, and maybe 5-10 Gb with a bit more experience. The main costly external reagents are Ampure XP beads (or a cheaper alternative) and Tris-buffered saline (10 mM Tris-HCl pH 8.0 with 50 mM NaCl). Excluding those, it's a kit cost of $9/sample for a fully-loaded run (12 samples). Standard MinION flow cells range in cost from $500 to $900 USD each (depending on how many are purchased in bulk), so the maximum per-sample cost for a fully-loaded run of 12 samples from ONT reagents is about $85 USD (run cost of $1010). But....

  • One of the many neat things about nanopore sequencing is that you can monitor the run when it is sequencing and stop when you've got enough data. It's sometimes the case for metagenomic studies that a few hours of sequencing (or maybe even a few minutes) will be enough. In which case the run can be stopped, flushed with a nuclease wash kit [$100 USD for 6 flushes], and re-used with another sample.

  • Lower-yield "flongle" flow cells are also available. These are $90 each, and with currently-seen yields might be able to squeeze out enough reads over a day if a few minutes on a standard MinION flow cell is good enough. With 12 barcodes per run, that'd be a per-sample cost of $17 from ONT reagents. At the moment, it usually works out cheaper flushing a MinION flow cell that has been used for something else, but that might change in the future.

  • Higher sample multiplexing is possible with longer sample preparation workflows. Those workflows also give higher yields (5-15 Gb, possibly up to 30Gb depending on the cleanliness of the sample, and the nano-fu of the person doing sample prep). There's a ligation-based kit that works with up to 48 barcodes (as of 7 December 2019), and a PCR barcoding kit that has a plate full of 96 barcodes. One of the newest things at the moment that people are trying out is dual-barcoding with inner and outer barcodes, which gives 4608 combinations when the PCR barcoding kit is paired up with 48-barcode ligation.
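The arithmetic behind those per-sample figures, as I work it out (the 72-samples-per-kit assumption is mine, reverse-engineered from the $9/sample figure; prices are the list prices quoted above):

```python
# Back-of-envelope per-sample cost for a fully-loaded MinION run.
KIT_PRICE = 650          # rapid (PCR) barcoding kit, USD
SAMPLES_PER_KIT = 72     # assumption: 12 barcodes x 6 runs per kit
FLOW_CELL_PRICE = 900    # upper end of the $500-900 range
SAMPLES_PER_RUN = 12

kit_per_sample = KIT_PRICE / SAMPLES_PER_KIT              # ~$9
run_cost = FLOW_CELL_PRICE + SAMPLES_PER_RUN * kit_per_sample
per_sample = run_cost / SAMPLES_PER_RUN

print(f"run ~${run_cost:.0f}, per sample ~${per_sample:.0f}")
# matches the ~$1010 run / ~$85 per-sample figures above
```

Swapping in a $90 flongle flow cell instead gives the ~$17/sample figure from the bullet above.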

2019-Sep-17

Sequencing Dynamics

Nanopore sequencing preferentially sequences short reads, so these will turn up in excess of what is observed via gels or similar quantitative methods.

Adding a bit of DNA ladder into the sample (where templates have a known molarity) could be used to generate and apply a standard curve to read counts.

The problem with small molecules is more than just molarity; physics has a role in sequencing as well. The mass (or equivalently the momentum) affects how quickly a polymer can be sucked into an open pore. Smaller sequences are more agile, so will be more quickly affected by rapid changes in current (e.g. just after a sequence is ejected and the pore is in the "open pore" state). Imagine a 100m drag race between a 50-wagon train, an 18-wheeler truck, and a hatchback car. The hatchback will usually get to the finish line first because there's a lot less mass to get moving in order to pick up speed.

If a sequence is really long, then the physical environment plays a role in sequencing as well. The sequencing wells are 100μm in diameter, which works out to about 300kb of linear DNA sequence. Long sequences either need to coil up to fit into the well (increasing the chance of blocked pores), or be draped across multiple wells (increasing the likelihood of the sequence snagging on the edges of the wells).
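That ~300kb figure follows from the roughly 0.34nm rise per base pair of double-stranded DNA (a standard figure for B-form DNA; the calculation is mine):

```python
# How many base pairs of stretched-out linear dsDNA span a well?
RISE_PER_BP_NM = 0.34          # ~0.34 nm per base pair, B-form DNA
well_diameter_nm = 100e3       # 100 um, expressed in nm

bases_across_well = well_diameter_nm / RISE_PER_BP_NM
print(f"~{bases_across_well / 1e3:.0f} kb")   # roughly 300 kb
```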

Base Calling

The most recent Guppy works well for R9.4 data (i.e. from about January 2017). The biggest difficulty is converting the single fast5 files into bulk files. I expect that it will also improve base calls on R9 data (which has a similar electrical profile), although I haven't tested that yet.

Amplicon Sequencing

As far as I'm aware, the current 1D2 kit would not be a good idea for amplicon sequencing (including 16S), because sequence similarity and length are used to determine if one read is linked to the next read. If all the reads are almost identical (as would be the case for amplicon sequencing), then this linking would erroneously pair every read with any preceding read that looks like its reverse complement.

ONT's proposed solution to this issue is to add a Unique Pair Identifier (UPI) tag sequence to the 1D2 adapters, and detecting / matching up that sequence when pairing, but I don't think those kits are available yet.

2019-Sep-16

Read Errors

The error profile for nanopore sequencing is not consistent (unless you want to count deletions on long homopolymers, and methylation), and it's getting more random as ONT improves their algorithms. That's a good thing, by the way, because it makes it more likely that single-molecule errors will be smoothed out on consensus.

Here's an example of that, where I'm showing all the single-molecule variation for reads mapped to the mitochondrial genome of Nippostrongylus brasiliensis. The black lines indicate the read coverage, purple and cyan dots represent insertions and deletions respectively (adding or removing from the coverage at that location), while the blue, yellow, and red dots near the concentric axis are the transition, transversion, and complementary variants respectively from the reference genome. Dots get bigger as they get further away from their baseline, and their radial location relative to the distance between the concentric axis and the coverage line represents the proportion of variation contributed by that variant. Reads mapped to the forward strand appear on the outer chunk of the donut, and reads mapped to the reverse strand appear on the inner chunk of the donut:

Mitochondrial genome variant plot for Nippostrongylus brasiliensis

I haven't been able to see any consistent patterns of variation within this genome. That doesn't mean they're not there; maybe I just haven't looked hard enough.

2019-Sep-13

[More] User-friendly Software

Galaxy has a nanopore analysis section. Here's a tutorial for using nanopore + Illumina data together for creating a bacterial assembly in Galaxy:

https://galaxyproject.github.io/training-material/topics/assembly/tutorials/unicycler-assembly/tutorial.html

You'll probably find that interesting programs will hit the command-line first (because that's the easiest to write code for), and bubble up to Galaxy when a developer gets interested enough to add them to the available tools.

2019-Sep-12

Rapid Barcoding

The rapid barcoding kit (SQK-RBK004) can be used with a single sample. The protocol is identical to the rapid sequencing kit (i.e. no bead cleanup), except you use one of the barcoded fragmentation mixes instead of the non-barcoded fragmentation mix.

2019-Feb-19

You can't avoid transposome-based fragmentation with the rapid kit (as found in the fragmentation mix); that's the whole point of the rapid kit.

The rapid adapter and helicase protein is clicked onto the incorporated adapters from the transposome. See the 'workflow' tab for the rapid adapter kit in the ONT store:

https://store.nanoporetech.com/kits-248/rapid-sequencing-kit.html

ONT provides a protocol selector in the ONT community forums with a drop-down for kits. If you choose the native barcoding expansion [EXP-NBD104] from the drop-down, it comes up with a protocol for 1D Native barcoding of gDNA.

Yes, this requires "a bunch of additional NEB products". Any other barcoding you're doing will probably also require similar reagents. It's a pain to source them, particularly if you live in New Zealand or another remote country. However, sequencing yield is much higher with the ligation-based kits, which can substantially offset the cost, depending on your application.

There's no ONT-specific barcoding protocol for SQK-RAD004 because it's not a barcoding kit, and hasn't been designed to work with any barcoding solutions. If you want rapid barcoding without external reagents, get the rapid barcoding kit (SQK-RBK004), bearing in mind that it has a lower yield:

SQK-RBK004 Rapid Barcoding Kit

2019-Jan-25

If you want to use the MinKNOW software without pinging ONT, I recommend sending an email to support@nanoporetech.com with reasons.

The MinKNOW software pings ONT before the start of every run, and won't start a sequencing run unless it gets a valid response back. There are other alternatives: buying MinIT, or waiting 6 months and getting a MinION Mk1c.

I don't think using MinKNOW requires an account with ONT, but it does require an internet connection with access to non-standard ports. In addition to pinging, I think it will also download run scripts and firmware prior to sequencing (both of which are necessary to run the sequencer). At the end of the day, flow cells and kits still need to be ordered from ONT, so there's not really any practical way to do nanopore sequencing without talking to them.

2018-Nov-21

Active Channels / Mux

QC information is available by looking at the bream log files:

$ grep -e 'group 1.*active' -e 'bulk file name' bream-*.log

The group 1 count matters the most to me, rather than any particular mux. The pores are loaded by flowing them onto a chip, where they settle according to a Poisson distribution; some wells get none, some get more than one. ONT tries to load the flow cells to ensure that they have the best chance of getting 512 channels with only one loaded pore in any of the four muxes.

There's an ideal proportion of active channels in each mux that corresponds to this. Based on pictures I've seen (for flow cells that look good), I'm going to guess that it's about 75% in any one mux. There will be a mathematical formula to work this out precisely.
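That formula is just the Poisson distribution; a quick sketch (my numbers, not ONT's):

```python
import math

def p_single(lam: float) -> float:
    """Poisson probability that a well receives exactly one pore
    when the mean loading is `lam` pores per well."""
    return lam * math.exp(-lam)

# Single-pore occupancy is maximised at a mean of 1 pore per well:
best = p_single(1.0)                  # ~0.368 of wells singly occupied
# A channel is usable if at least one of its four muxes (wells)
# has exactly one pore:
usable_channel = 1 - (1 - best) ** 4  # ~0.84 of channels
print(round(best, 3), round(usable_channel, 3))
```

So with pure Poisson loading, even at the optimum, only about a third of wells end up with exactly one pore, though ~84% of channels still get at least one usable mux.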

I've since found out that ONT has a clever trick involving electrical current that allows them to do better than a Poisson loading of the flow cell. Pores won't load into the membrane unless the electrical conditions are favourable, and loading a pore changes the electrical profile of the sequencing well. By disabling the "pore goes here" signal when a pore load is detected, they can more reliably fill sequencing wells without overloading them.

Flow Cell Storage

ONT claims a shelf life of at least 2 months for the flow cells. There have been people on the ONT community running unused flow cells that had been in their fridge for over a year. Just make sure you keep them above freezing (i.e. not in a frost-free domestic fridge, where temperature cycling can drop below 0°C), because the pores get destroyed by ice crystals.

If in doubt, store in a cool cupboard; they are designed to handle room temperature for reasonably long periods of time.

2018-Oct-27

RNA / cDNA read mapping

We're getting what looks to me like good results from anything over 200k cDNA reads per sample. Here's my approach for doing stranded mapping to a transcriptome using LAST.

As far as I'm aware, LAST won't recognise the 'U' from RNA reads, so they'll need to be converted into 'T's first.
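A minimal conversion sketch (my own illustration, not part of any LAST tooling; assumes a well-formed 4-line-per-record fastq):

```python
# Convert RNA bases to the DNA alphabet so LAST can handle the reads.
# In a fastq record, the sequence is every 4th line, starting from
# the second (index 1), so only those lines are touched.
def rna_fastq_to_dna(lines):
    out = []
    for i, line in enumerate(lines):
        if i % 4 == 1:  # sequence line
            line = line.replace("U", "T").replace("u", "t")
        out.append(line)
    return out

record = ["@read1", "AUGGCUUAG", "+", "IIIIIIIII"]
print(rna_fastq_to_dna(record)[1])  # ATGGCTTAG
```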

2018-Oct-26

Linearising DNA

Our experience is that plasmids will sequence fine with the rapid prep. The transposase seems to work on circularised DNA as well as linear DNA. However, PCR is more efficient on linearised DNA. If you have a low quantity of starting DNA that needs to be amplified, cut it first.

Identifying a single sequence in a set of reads

I do this sequence searching frequently, and use either minimap2 or LAST for it.

LAST is trainable and allows a bit more control over alignment scores. Minimap2 is easier to use, because it doesn't require an index to be generated first, can handle both compressed and uncompressed files, and doesn't care about whether references or queries are fastq or fasta format. Assuming minimap2 is installed (and in the search path), you've got sequenced reads in a file 'reads.fq', and your query sequence in a file 'query.fa':

minimap2 reads.fq query.fa

This requires your query sequence to be in fasta format, with a header line starting with '>' containing a description of the sequence, and the next lines containing the sequence. Line breaks are allowed in the sequence, but every line except the last should be the same length:

>tig00023107:400944-401709#500-760
ATAGAAAAGATGAAACGGTAAAGTAAAGGCGATTTCGATACAAAGAAAGA
TGAACAGAATAGTCAGGTACAATCTCGGTAGTCTTAGAGTTACTGTTAAC
ATAATTAAattaaaataattagcataattaaactAGTCATATTCAGTCGA
ACGACTGAACAGTTCAAGACTACTGTAACCTCGACACTGAAAAATTCCAA
CAAAGTGATGATAATTCCAGAATAGGCAATAATCCTAGTCGTTCGAGGTG
ATTTCGCAAA

If you're searching for a 20-40nt sequence, LAST will probably work better. You can extract out sequences from a BAM file as fasta using samtools fasta. The samtools help suggests removing supplementary alignments (presumably so that sequences are not duplicated):

samtools fasta -F 0x900 in.bam > reads.fa

From there, generate a LAST index for the reads. The extracted file is in fasta format, so no format option is needed (if you index the original fastq file instead, add '-Q 1'):

lastdb reads.fa reads.fa

Then mapping the prepared fasta sequence can be done as before:

lastal reads.fa query.fa

LAST will always produce comment lines starting with '#', even when there are no matches; if nothing appears after those comment lines, the query wasn't found.

2018-Oct-25

Flow Cell Washing

It is possible to wash the flow cell and use it to sequence several times. Information about how much of the previous run is carried over in an ideal situation can be found on the wash kit page of the store.

Note that the flow cell probably won't get better after a wash. It will keep sequencing, at best, at the same rate it did prior to the wash. Because of this, I wouldn't advise washing a flow cell that has been used for whole-genome/transcriptome human sequencing and then trying to use it again for whole-genome/transcriptome sequencing.

If you're doing transcriptome/genome studies on large genomes, either barcode the samples and run together or use separate flow cells.

Starter Kit

The starter kit includes two flow cells in the initial $1k purchase. It's recommended by ONT that you first do a 6h run with lambda DNA to get used to the sample preparation and sequencing process. While you don't strictly need to do lambda on your first run, it's a good idea to get acquainted with the process of sequencing.

If your own sample is not precious, go ahead and use it on your first run, but be prepared that you'll probably have a low-yield run and won't know why. If that happens, one of ONT's first suggestions will be that you run some lambda to demonstrate that your sample preparation process is okay.

Barcoding

RNA can't be barcoded at the moment, as far as I know.

There are two rapid barcoding kits, one uses PCR and one uses transposase fragmentation. It depends on your application which one will be best.

For gDNA, I recommend the transposase-based rapid barcoding kit, if you can handle a lower yield. I haven't observed any sample-prep based ligation with that kit, which makes the bioinformatics side of things much easier & more robust.

For cDNA, we're using the PCR rapid barcoding kit (using rapid attachment primers that click into the adapters). Unfortunately the cDNA kit only includes PCR amplification primers for six samples, whereas that barcoding kit can be used for up to 72 samples (i.e. 6 runs of 12). We're still trying to sort out ordering / using our own custom primers for that.

The ligation-based barcodes will give you a higher yield (particularly with fragmented / sheared DNA). I expect that most of that increase is due to cleaning stages in the sample preparation process. However, the adapter ligation steps used after barcoding can create physical chimeric reads, which can be a pain to deal with in downstream analysis.

2018-Oct-06

cDNA Sequencing and Run Yield

Our last cDNA sequencing run produced 7 million reads with an average length of about 800bp; more than enough coverage to get good transcript depth for three bulk samples. For doing single cells, you'd probably be able to do up to about 1000 cells on a single MinION flow cell and still get reasonable results.

The highest yield (outside ONT) so far from a MinION is 19 Gb, and the new Series D flow cells (which give more consistent output over longer periods of time) should see that approaching 30Gb with no additional chemistry changes.

While the usual recommendation for human is from 10M to 25M reads per individual sample, that's with short Illumina reads. Nanopore gives you full-length sequences with fully-resolved isoforms. Also, the standard Illumina sequencing protocol will mix in unprocessed transcript, which means that intronic sequence also comes through in the reads (and is wasted / confounding).

The long read advantage is substantial. From the work that I've been doing with cDNA, we've been getting reasonable results for the mouse transcriptome from anything over about 150k reads per sample. I've just put up a one-sample example dataset on Zenodo, run in August last year. I think it's about 1M reads (which can be sub-sampled to see if anything changes).

Both Illumina and long-read data have issues with low-expressed transcripts. The only way you'll be able to reliably pick up low-abundance transcripts is with a microarray, because then the high-abundance transcripts won't swamp out the hits for other transcripts.

There's a lot more noise in Illumina SBS reads, which means that it's difficult to make reliable calls for fewer than about 100 reads for a given transcript. The noise floor for nanopore is a lot lower: not more than 5 reads from the data I've been looking at, and it could be as low as two reads. As an example, here are two genes at the lower end of expression on our MinION runs. The first (Adh1) is not expressed; to suggest that it is expressed in the Illumina samples would require justifying why the unannotated regions outside the gene are expressed at a similar level:

Adh1

And the second has two reads oriented in the same direction as the gene; I have trouble convincing myself that these reads don't originate from the annotated transcript:

Ankrd39

When that is taken into account, together with the already mentioned confounding factors, 10-25 million reads on Illumina isn't that far off 1M reads on nanopore.

And that's ignoring the long read advantage for transcripts and isoforms (i.e. "bases covered"), a lack of short-transcript bias, and gene network correlations that can be used to adjust zero-count data (commonly used for single cell sequencing).

There's only one transcript per read, if you're using the standard strand-switch method for cDNA sequencing. There are approaches like R2C2 which incorporate multiple passes of one transcript. Compared to PacBio, nanopore can produce more reads per flow cell, especially for shorter reads of the order of 1kb (e.g. cDNA) - even the new flow cells have an upper limit of 8M reads; we got 7M reads from our last cDNA run.

Barcoding RNA comes under "feasible, but ONT haven't implemented it yet". Given that there's already a chunk of custom sequence in the adapters, it's not much of a stretch to imagine that a barcode could be added to that.

A particular downside of native RNA sequencing is that the sequencing is slower, so it'd be worth thinking about whether you really want to do RNA. You also have to concern yourself with the quirks of RNA: it degrades faster than DNA, and has a habit of folding up and twisting into knots that make sequencing a little bit more difficult.

Yield will be higher, and sequencing will be more reliable with cDNA. I'd recommend only using native RNA if you need it (e.g. for RNA modifications).

Read Errors

The MinION can detect if a base is repeated multiple times in a row. When the number of repeated bases is 5 or greater, the basecaller can both over- and underestimate the length of the run relative to the reference sequence. For what it's worth, PacBio also has problems with long homopolymers, which makes me wonder how much of it is a representation of the true sequence, and how much is introduced by the caller.

Up to a year or so ago the basecallers collapsed any homopolymers longer than 5 bases into 5 bases, but the reads can be re-called using more recent basecallers to recover the missing bases. Nanopore's most consistent / systematic error now is inaccurately calling the length of long homopolymers (over 5bp), especially when methylation-aware consensus calling is used (which gets rid of most of the systematic base call error due to methylation).
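A toy reconstruction of that old collapsing behaviour (not the basecaller's actual code, just an illustration of the information loss):

```python
from itertools import groupby

def collapse_homopolymers(seq: str, max_run: int = 5) -> str:
    """Cap every homopolymer run at `max_run` bases, the way older
    basecallers effectively did -- the run's true length is lost."""
    return "".join(base * min(len(list(run)), max_run)
                   for base, run in groupby(seq))

print(collapse_homopolymers("ACGTTTTTTTTA"))  # ACGTTTTTA
```

Because the collapse is lossy, the only way to recover the true run length is to go back to the raw signal and re-call it.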

Nanopore sequencing is an observational method of sequencing; the error rate doesn't degrade with read length. If it did, it wouldn't be possible to match a 2.3Mb sequenced read to the human genome. PacBio read length is limited; even 100kb would be pushing it on a 10h run.

Flow Cell Re-use

To determine if a read was a residual molecule from a previous run or the current library, wash the flow cell to get rid of most of the old reads, barcode the reads, and dump any reads that have no barcodes, old barcodes, or multiple barcodes. It's not great to re-use a flow cell for high-throughput analysis (e.g. large whole genome or whole transcriptome sequencing). Stick to bacteria, viruses, and amplicons for re-using flow cells.
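The barcode-based filtering described above boils down to a simple rule per read. Here's a sketch, assuming you already have per-read barcode calls from a demultiplexer; the barcode names and function are illustrative, not from any particular tool.

```python
# Barcodes used in the current run's library (illustrative names)
CURRENT_RUN_BARCODES = {"BC01", "BC02", "BC03"}

def keep_read(barcodes_found):
    """Keep a read only if exactly one barcode was found and it belongs
    to the current run. Reads with no barcode, an old barcode, or
    multiple barcodes may be carry-over from a previous run (or
    chimeras), so drop them."""
    return (len(barcodes_found) == 1
            and next(iter(barcodes_found)) in CURRENT_RUN_BARCODES)

print(keep_read({"BC01"}))          # True: current-run barcode
print(keep_read({"BC07"}))          # False: barcode from a previous run
print(keep_read(set()))             # False: no barcode found
print(keep_read({"BC01", "BC02"}))  # False: multiple barcodes
```

The "exactly one barcode" condition is deliberately strict: it costs some good reads, but it's the cheapest way to keep residual molecules out of the analysis.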

Even ONT admits that washing won't completely remove the previous reads.

The problem with re-using flow cells is that many people think a flow cell can be fully recovered by washing, and try to use it again for high-throughput sequencing. The washing [only] removes most of the old bound sequence, so that it interferes less with an additional library; washing can't recover dead pores.

I frequently re-use flow cells for sample QC and low coverage / amplicon sequencing. If you're only interested in read length distribution, a few thousand reads is plenty. When you're sequencing a 10kb plasmid or 16kb mitochondria, only getting 10-100 Mb out of a run is plenty.
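The coverage arithmetic behind that claim is straightforward; the numbers below are the ones from the text.

```python
def coverage(yield_bases, target_bases):
    """Mean depth of coverage: total sequenced bases over target size."""
    return yield_bases / target_bases

# 10 Mb from a washed flow cell over a 16 kb mitochondrial genome:
print(coverage(10e6, 16e3))   # 625x
# 100 Mb over a 10 kb plasmid:
print(coverage(100e6, 10e3))  # 10000x
```

Even the low end of a reused flow cell's output gives hundreds-fold coverage of a small target, which is far more than needed for a consensus assembly.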

2018-Feb-19

Chimeric Reads

I've seen chimeric reads in every nanopore run I've done. In some cases, they are created in silico, and can only be identified by looking at the raw signal associated with the sequenced reads. Between reads there is a brief period of time when the pore needs to change from one adapter to another. At this time the pore is "open" ("open pore" is ONT's terminology), and with nothing blocking the channel the current is much higher. This open-pore signal is what the raw signal looks like whenever there is no DNA or adapter engaged with the pore.

When the raw signal for a single read contains such a break, it suggests that the sequencing software didn't properly split the signal into two separate reads. There's a threshold on the dissociation time: if a new molecule loads very quickly, two reads can end up in the same FAST5 file, with the basecaller ignoring the stall signal and calling them as one chimeric read.

Particularly for the R9 pores, the motor protein will sit on the pore for a short while before processing the DNA. That's the stall signal: a situation where there's DNA in the pore, but it's not moving. If there's a long stall region in the middle of a read, it's likely to be a chimeric read that has formed in silico; if that's the case, there should also be a spike at the start of each constituent read. Sometimes DNA will also stall for a bit within a sequence if the pore current has been reversed (and the DNA/adapter has refused to eject). The current reversals are very obvious: they have a huge spike, then a huge drop in the other direction directly afterwards, possibly followed by an incorrect base current level.
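Screening for in-silico chimeras can be sketched as a scan for open-pore-like stretches in the middle of a read's raw signal. The threshold and window below are illustrative values I've picked for the example, not ONT-calibrated numbers, and the signal is synthetic.

```python
import numpy as np

def open_pore_regions(signal, threshold=220.0, min_len=50):
    """Return (start, end) sample ranges where the current stays above
    an open-pore-like threshold for at least min_len samples."""
    high = np.asarray(signal) > threshold
    regions, start = [], None
    for i, h in enumerate(high):
        if h and start is None:
            start = i
        elif not h and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    if start is not None and len(high) - start >= min_len:
        regions.append((start, len(high)))
    return regions

# Synthetic signal: normal translocation current (~100 pA) with an
# open-pore excursion (~250 pA) in the middle - a chimera candidate.
signal = [100.0] * 500 + [250.0] * 80 + [100.0] * 500
print(open_pore_regions(signal))  # [(500, 580)]
```

Any read whose mid-signal contains such a region is worth splitting (or at least flagging) before downstream analysis.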

2018-Jan-30

How I began in Bioinformatics

I talk about my early research history here [talk script].

But my desire to look at things, see how they work, and change them started in my childhood. I learnt how to use DOS batch files when I was about 8 years old, and had a memory sniffing program that allowed me to cheat when playing computer games. That led to a bit of tinkering with save game files, and messing around with different binary file formats.

I started teaching myself how to program in Pascal when I was about 14, and a friend helped me out with setting up a website a year or so later.

I've sort of always been tinkering, but it was helped greatly by formal education in computer science at University. I took up biology to challenge myself (it was my worst subject at school), and have been fond of genetics because it has a lot of similarities with mathematics and computer programming.

2017-Oct-17

Working as a Bioinformatician

I've always had a somewhat stable job I could fall back on, which generally comprises about 40-60% of the work I do. I'm also struggling a bit in finding additional work, because everyone wants full-time employment, which doesn't fit in with my mode of work. I'm much more productive if I'm able to work on a broad range of projects, because there are the occasional flashes of insight where I can use a solution from one project in a completely different domain. I haven't needed to advertise my services all that much, with most of my jobs obtained through word of mouth.

However, time management for project-based work can be very challenging. It has meant that I've had some nail-biting situations work-wise in being overloaded with work (and not being able to complete jobs on time), and some times where I've had to go for a month or so without income (hunting out and prospecting for work).

Doing work on software projects (e.g. Trinity, Canu) has been incredibly beneficial in increasing my knowledge and capability, but not so great in getting more paid work. That's partly because some job contracts are biased against sole-trader businesses (e.g. requiring $10M liability for damages).

It's frustrating that English doesn't distinguish between free-as-in-facebook and free-as-in-freedom. It's the second one that specifically applies to free and open source software, although most FOSS developers choose to do both. I ask for payment for development time and bug fixing, rather than use. If I'm not doing any work myself, it feels wrong for me to charge money.

2017-Sep-29

Alignment

LAST is quite good for showing differences between individual nanopore reads and a reference sequence, or between two different nanopore reads. The HTML output can be loaded up in LibreOffice Writer and exported to a PDF file.

2017-Apr-29

Ribosomal DNA repeat regions

A couple of years ago the MinION had less than 1/100th the yield it has now, and max read length was capped at about 35kb. It's substantially better now (the read length limit is effectively removed), but it will still be difficult to get 600kb reads due to sample prep issues.

However, we have had some success in assembling some ribosomal repeat regions using Canu. I don't expect it would be any different from other long-read sequencing. Just be careful with sample prep: don't use PCR, and avoid pipetting as much as possible.