
Painting and Bioinformatics

I'm straying from my usual short-form diary-ish entries to talk about my current evangelistic, philosophical world view. This is not about Christianity (although given that it's my "world view", I could easily argue that Christianity is included in it), nor is it about bioinformatics, nor painting. These are just picture hooks that I'm using to hang things on, so that I can relate my own general idea to other things that other people are more likely to understand.

Painting

But why painting?

Because I feel that painting is the quickest way to explain this approach to tackling the world's problems. There's a video I really like that demonstrates it well. It shows someone speed painting for one hour using a Free/Libre program called Krita:

Speedpainting Timelapse, Krita 2.8

I love a lot of things about this video. I'm not going to list them all, but I'll highlight a few that I think are key to the points I want to get across.

1. Everything is created out of something that already works

It could be said that the painting starts with a blank slate, but that's not quite correct because there are things that have been prepared in advance of the painting. If nothing else, the computer program exists as a pre-created environment that many other people have devoted time and effort to improving. The "blank slate" is an already working thing: a featureless image.

2. The working thing is changed in small steps

I love that the painting starts out with broad strokes. These are simple painted lines that I imagine I could create myself. Even if the end product is beyond my own expertise, I can see how it is created by laying small changes over the top of the existing painting.

3. Changes can break things

About 11 seconds into the painting, the artist realises that a darkened blur is the wrong size. They make it smaller, and it breaks the painting, creating something that looks worse overall, but better in the area that the artist is working on. We realise later on that this darkened blur is the main subject of the painting, so it makes sense that the artist cares a lot about getting this bit right.

4. Broken things can be fixed

The broken painting is not a large issue, because the artist understands how to recover from a broken product. They create additional strokes to improve the painting at regions where there are issues, and once it looks okay overall, they get back to improving other areas of the painting.

5. Improvements can always be made

The artist is limited by the one hour they have to make the painting, so there is a fixed end-point for their work. But you might notice that the thumbnail image for this video actually has additions: text and a speech bubble. The general shape of the image seems to me to be there after about 35s, and the artist is happy enough to save a snapshot after about 1m05s (after clipping and smearing the edge). But the artist doesn't stop there; they keep adding until it's good enough.

Bioinformatics

But why bioinformatics?

Because bioinformatics is my working life at the moment. I have found myself frequently applying these painting ideas to the coding work that I'm doing:

  1. I start with something that works
  2. If it doesn't do what I want, then I change the code to tell it what I want it to do.
  3. These changes frequently break the code.
  4. I fix and debug the code so that it works again.
  5. If the code still does what I want, or if I've had enough, stop. Otherwise, return to step 2.

This approach applies all over the place in the things that I do. I've mentioned it a couple of times on Twitter, for creating a wind turbine, and for creating a microfuge tube earring.

Start with something that works

If I start with something that is close to what I want to end up with, and it already "works" (whatever that means), then the effort required to create the thing that I actually want is substantially less.

A Working Blog Post

This document is an example of that. One of the trickiest things I find about writing posts on this web site is the appropriate construction of the header line. It looks like this:

---
title: "Painting and Bioinformatics"
author: "David Eccles"
date: 2020-02-23
---

But I haven't memorised that. I did not create that from scratch. In fact, this blog post has been constructed out of the same philosophy that I'm trying to explain.

I didn't need to do a web search to find out how to put headers into my posts (although that would be something I could have done), because I've previously written other posts on this web site. In this case, I copied the header information from my Chaos Fund post, changed the title and date to more appropriate values, then deleted all the rest.

... then I fixed up the bugs associated with the changes I'd made, because I used "Feb" instead of "02". But that was a much easier fix than starting from nothing.
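The "Feb" versus "02" slip is the kind of breakage that is easy to catch with a small check. The post doesn't say how the site reads its header, so the following is only a sketch that assumes the date field is expected in ISO 8601 form (YYYY-MM-DD); the function name `is_valid_header_date` is made up for this illustration.

```python
from datetime import date


def is_valid_header_date(text):
    """Return True if text parses as an ISO 8601 calendar date (assumed format)."""
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


# The date used in this post's header parses; the "Feb" variant does not.
print(is_valid_header_date("2020-02-23"))
print(is_valid_header_date("2020-Feb-23"))
```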

Bioinformatics - Part 2

But why bioinformatics?

Because I am not a painter; bioinformatics is my working life at the moment. I see bioinformatics as the process, or art, of converting biological research outputs into something that can be better understood by other researchers - research outputs that typically, but not always, involve very large datasets. One of my favourite explanations of bioinformatics is that it’s a bit like surfing: chasing waves of information from an ocean of data, and presenting them in an interesting way before they reach the shore of public knowledge.

I have found myself frequently applying these painting ideas to the coding work that I'm doing as part of bioinformatics projects:

  1. Start with something that works
  2. Change the code to tell it what I want it to do
  3. These changes frequently break the code
  4. Fix and debug the code so that it works again
  5. If the code still does what I want, or if I've had enough, stop. Otherwise, return to step 2

This approach applies all over the place in the things that I do. I've mentioned it a couple of times on Twitter, for creating a wind turbine, and for creating a microfuge tube earring. Those aren't exactly bioinformatics, but the more physical representation of 3D models makes it easier for me to explain this process of gradually building code that works. But here... I'm going to dig a bit deeper and talk about a small bioinformatics task I've been working on.

1. Start with something that works

If I start with something that is close to what I want to end up with, and it already "works" (whatever that means), then the effort required to create the thing that I actually want is substantially less. I have some code that generates a plot of repetitive information in a DNA sequence. Explaining in detail what the plot represents takes a while, so interested people can have a peek at my presentation on the topic.

In any case, I have an image. This is the way I check to make sure that my code still works, or one of the ways that I check to find out what is broken:

[REPAVER plot of an assembled haplotype from a chimpanzee (Pan troglodytes); sequence was created as part of the Vertebrate Genomes Project, combining sequence information from PacBio, ONT, 10x, Bionano, Dovetail, and Illumina reads]

2. Change the code to tell it what I want it to do

In this case, what I want it to do is run faster. The code took over five minutes to process and generate the above image, and I expected that it only needed a few little tweaks to fix that problem.

More specifically, I had code that did something like this:

  1. Start with a hash result of 0
  2. Convert the next base in the kmer to a 64-bit hash
  3. Shift that base hash a position dependent on the base location within the kmer
  4. XOR the shifted base hash with the current result
  5. If all the bases are processed, stop. Otherwise, return to step 2
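The steps above can be sketched roughly as follows. This is not the original code: it assumes that the "shift" is a 64-bit cyclic rotation (a plain shift would lose bits for a 125bp kmer), and the per-base values in `BASE_HASH` are arbitrary constants picked for illustration. The names `rotl` and `slow_hash` are likewise made up here.

```python
MASK = (1 << 64) - 1  # keep every value within 64 bits

# Hypothetical 64-bit value for each base (arbitrary constants, illustration only)
BASE_HASH = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xBF58476D1CE4E5B9,
    "G": 0x94D049BB133111EB,
    "T": 0xD6E8FEB86659FD93,
}


def rotl(value, shift):
    """Rotate a 64-bit value left by shift positions (assumed form of the 'shift')."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK


def slow_hash(kmer):
    """Hash a kmer by visiting every base: one rotate and one XOR per base."""
    result = 0                                       # step 1
    k = len(kmer)
    for i, base in enumerate(kmer):                  # step 5: repeat for all bases
        base_hash = BASE_HASH[base]                  # step 2
        shifted = rotl(base_hash, k - 1 - i)         # step 3: depends on position
        result ^= shifted                            # step 4
    return result


print(hex(slow_hash("ACGTACGT")))
```

Every call walks the whole kmer, so hashing each position in a long sequence costs roughly k operations per position.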

And I wanted it to do something like this:

  1. Start with the hash result of the previous kmer
  2. Shift that result one position
  3. Remove the value of the base that is no longer seen
  4. Add in the value of the new base

[see an explanation of the algorithm here]

This changed an operation that worked on every base within a kmer into an operation that only touches two bases: the one leaving the kmer and the one entering it. When the algorithm is running on hundreds of millions of locations within a chromosome, and the kmer size is moderately large (125bp in my test case), small changes like that can make a big difference in the run time of a program.
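A minimal sketch of that rolling update, under the same caveats as any illustration: it assumes the "shift" is a 64-bit cyclic rotation, uses arbitrary constants for the per-base values, and the names `roll_hash` and `rolling_hashes` are invented here rather than taken from the real code. The slow version is included only so that the two can be compared.

```python
MASK = (1 << 64) - 1

# Hypothetical 64-bit value for each base (arbitrary constants, illustration only)
BASE_HASH = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xBF58476D1CE4E5B9,
    "G": 0x94D049BB133111EB,
    "T": 0xD6E8FEB86659FD93,
}


def rotl(value, shift):
    """Rotate a 64-bit value left by shift positions."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK


def slow_hash(kmer):
    """Reference hash: visits every base in the kmer."""
    k = len(kmer)
    result = 0
    for i, base in enumerate(kmer):
        result ^= rotl(BASE_HASH[base], k - 1 - i)
    return result


def roll_hash(prev_hash, old_base, new_base, k):
    """Update the hash of the previous kmer using only two bases."""
    result = rotl(prev_hash, 1)              # shift the previous result one position
    result ^= rotl(BASE_HASH[old_base], k)   # remove the base that is no longer seen
    result ^= BASE_HASH[new_base]            # add in the new base
    return result


def rolling_hashes(sequence, k):
    """Yield the hash of every kmer; only the first one is computed the slow way."""
    if len(sequence) < k:
        return
    current = slow_hash(sequence[:k])
    yield current
    for i in range(k, len(sequence)):
        current = roll_hash(current, sequence[i - k], sequence[i], k)
        yield current


# Compare the rolling result with the slow result at every position
seq = "ACGTTGCAAGGCTTACGATC" * 10
k = 125
fast = list(rolling_hashes(seq, k))
slow = [slow_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
print(fast == slow)
```

Because XOR is its own inverse, removing the outgoing base is just another XOR, which is what keeps each update at a fixed cost regardless of kmer size.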

3. These changes frequently break the code

... and that's okay.

[git diff of the initial implementation of fast hashing, including additional debug output to show when the fast hash doesn't match the slow hash]

In this case, I encountered situations where my attempts at creating a fast hash led to broken code; in other words, code that didn't produce the correct output. The code above represents the state after I made a few tweaks and is technically correct (i.e. it produces the correct output), but it takes even longer than the initial implementation because it compares the fast hash process with the slow hash process, and uses the slow value if they differ. This was not ideal.
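The compare-and-fall-back idea can be sketched like this. It is an illustration rather than the code from the diff: it assumes a rotate-and-XOR hash with arbitrary per-base constants, and `buggy_roll` is a deliberately broken update (it forgets to rotate the outgoing base) invented here to show what the check catches. All names are hypothetical.

```python
MASK = (1 << 64) - 1

# Hypothetical 64-bit value for each base (arbitrary constants, illustration only)
BASE_HASH = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xBF58476D1CE4E5B9,
    "G": 0x94D049BB133111EB,
    "T": 0xD6E8FEB86659FD93,
}


def rotl(value, shift):
    """Rotate a 64-bit value left by shift positions."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK


def slow_hash(kmer):
    """Reference hash: visits every base in the kmer."""
    k = len(kmer)
    result = 0
    for i, base in enumerate(kmer):
        result ^= rotl(BASE_HASH[base], k - 1 - i)
    return result


def good_roll(prev_hash, old_base, new_base, k):
    """Correct rolling update."""
    return rotl(prev_hash, 1) ^ rotl(BASE_HASH[old_base], k) ^ BASE_HASH[new_base]


def buggy_roll(prev_hash, old_base, new_base, k):
    """Deliberately wrong: the outgoing base is not rotated before removal."""
    return rotl(prev_hash, 1) ^ BASE_HASH[old_base] ^ BASE_HASH[new_base]


def checked_hashes(sequence, k, roll):
    """Hash every kmer with roll, verify against slow_hash, fall back on mismatch."""
    hashes, mismatches = [], []
    if len(sequence) < k:
        return hashes, mismatches
    current = slow_hash(sequence[:k])
    hashes.append(current)
    for start in range(1, len(sequence) - k + 1):
        fast = roll(current, sequence[start - 1], sequence[start + k - 1], k)
        slow = slow_hash(sequence[start:start + k])
        if fast != slow:
            mismatches.append(start)  # debug record: where the fast hash went wrong
        current = slow                # always continue from the trusted value
        hashes.append(current)
    return hashes, mismatches


seq = "ACGTTGCAAGGCTTACGATC"
print(len(checked_hashes(seq, 5, good_roll)[1]))   # number of mismatches found
print(len(checked_hashes(seq, 5, buggy_roll)[1]))
```

The output is always correct because the slow value is the one that gets kept, but the slow hash still runs at every position, so this arrangement can only ever be slower than the slow hash alone.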

4. Fix and debug the code so that it works again

The point of this step is to return the code (or the thing) to a state in which it is once again usable. Ideally, a state that is better than the original state, but not necessarily the same as the goal. In my case, that meant working through the bugs enough that the initial forward repeat hashing was complete, and sufficiently fast (i.e. taking under a minute to complete), but the remaining code was still slow.

5. If the code still does what I want, or if I've had enough, stop. Otherwise, return to step 2

This is an iterative process. In the process of fixing things, I often encounter new bugs. I might discover that the speedup is not actually as large as I had expected, so I need to hunt around for other solutions, like a fast hashmap library which does the even lower-level stuff a bit quicker. Eventually, after many iterations, I got to a stop point; I'd had enough.

There are more things that I can do with this code, but I'm happy with it... for now. I successfully reduced the processing time from "It'll be done when I get back from my break" to "It'll be done after I check a couple of emails".

Summary

This "painting" approach of iterative development, accepting temporary failure as a necessary part of the process of improvement, extends into many different areas of my life. As long as I can keep little pockets of success along the path to my goals, the setbacks along the way can be weathered.