Technology is accelerating bioinformatics needs, again, while the current need isn’t diminishing.
This post was meant to be just a brief snapshot aimed at students wondering where bioinformatics is going in 2008 and beyond. What’s the future of bioinformatics? What kind of focus should you develop in the near future? What kinds of skills will you need?
Recent (but probably solvable) problems in bioinformatics include:
1) Pipelines that generate new de novo copies of the human genome as part of massively parallel sequencing projects. I’m not talking about just aligning bits of nucleotide sequences to a reference sequence, but also using that information to completely re-assemble each new genome, variations and all.
These algorithms must yield, as part of their task, fast and efficient alignment of massively parallel sequence reads (50 bp and above). I believe that 50-100 bp analysis will become important in the near future. For a nice overview of some of the recent developments, see the review page on massively parallel sequence alignment by Heng Li at Sanger.
Alignment problems are partially solved by programs such as ELAND, MAQ, SSAHA2, etc., but there’s another complication to add to the mix: they must be both fast AND efficient. Some systems are drowning in the flood of next-gen sequencing, so all you compbio efficiency geeks who are also interested in being bioinformaticians have a big open playing field there. Even for those of us with clusters, the flood of data will be overwhelming.
So, to regenerate an individual human genome sequence, any pipeline containing these methods must find the most likely location of each read and, on top of this, determine A) single nucleotide (SNP) genetic variation, B) copy number variation, and C) large-scale chromosomal rearrangements from those reads, measured against what is certainly (at this point) once again the draft copy of the human genome. Which brings me to:
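To make the "place the reads, then find the variation" idea concrete, here is a toy sketch in Python. It is not how ELAND, MAQ, or SSAHA2 work internally; it just assumes a simple k-mer hash index for seeding and a majority vote per position for SNP calling. Every function name, the 40 bp reference, and the thresholds are made up for illustration, and it ignores base qualities, indels, heterozygous sites, copy number variation, and rearrangements.

```python
from collections import Counter, defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def align_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k bases, then count mismatches over the whole read.
    Returns (position, mismatches) for the best candidate, or None."""
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

def call_snps(reference, placed_reads, min_depth=3, min_fraction=0.8):
    """Pile up placed reads and report positions where a non-reference base dominates.
    Returns a list of (position, reference_base, observed_base, depth)."""
    pileup = defaultdict(Counter)
    for start, read in placed_reads:
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    snps = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, n = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base, depth))
    return snps

# Hypothetical data: a tiny "reference" and a sample genome with one SNP at position 20.
reference = "ACGTACGTTAGCCGATAGGCTTACGATCGATCGGATTACA"
sample = reference[:20] + "A" + reference[21:]
reads = [sample[s:s + 16] for s in (6, 8, 10, 12, 22)]  # simulated 16 bp reads

index = build_kmer_index(reference, k=8)
placed = []
for read in reads:
    hit = align_read(read, reference, index, k=8)
    if hit is not None:
        placed.append((hit[0], read))

snps = call_snps(reference, placed)
print(snps)
```

The seed-and-extend trick is where the speed comes from: the hash lookup narrows each read to a handful of candidate positions before any base-by-base comparison happens. The catch in this toy version is that a mismatch inside the seed makes the read unplaceable, which is one reason real aligners are a good deal more elaborate.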
2) Integration of sequence and genomic data: Enter the concept of the “Human Statistical Genome”. You will need statistics. How do you integrate data across many different individual genomes? How do you integrate functional genomic information back to these genome sequences?
Who’s to say which chromosomal inversion is really atypical or rare, given only a small sample of the population? A thousand genomes can be sequenced, but that might not be a wide enough sample for an outbred population like humans. Or we might have a hard time estimating the true number of variations from limited samples.
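A quick back-of-the-envelope calculation shows why sample size matters here. The sketch below assumes chromosomes are drawn independently at random from the population, so a variant at allele frequency f is missed on every one of 2n chromosomes with probability (1 - f)^(2n). The function name and the example frequencies are mine, chosen only for illustration.

```python
def prob_variant_observed(allele_frequency, n_genomes, ploidy=2):
    """Probability that a variant shows up at least once among the sampled genomes,
    assuming independent random sampling of chromosomes (a simplification)."""
    chromosomes = ploidy * n_genomes
    return 1.0 - (1.0 - allele_frequency) ** chromosomes

for f in (0.01, 0.001, 0.0001):
    p = prob_variant_observed(f, n_genomes=1000)
    print(f"allele frequency {f}: seen at least once with probability {p:.3f}")
```

Under these assumptions, a thousand genomes will almost certainly pick up a variant at 1% frequency and will catch one at 0.1% roughly 86% of the time, but a variant at 0.01% shows up in only about 18% of such samples. Real populations have structure and relatedness, so treat these numbers as a rough guide only.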
How do we estimate a sequence representing the ‘most popular’ human genome as a new true reference? Our human genome could change from a nice modular set of (mostly) contiguous chromosomes to a draft copy that contains probabilities at each nucleotide position, representing the polymorphism in the population. The highest resolution for the human genome might not stop at the HapMap level (sequence blocks); it may reach down to the nucleotide level as we discover more about human genetic structure.
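Here is a minimal sketch of what a "statistical" reference could look like as a data structure: one table of nucleotide probabilities per position, with the 'most popular' genome read off as the most frequent base at each site. It assumes the individual genomes are already aligned to the same coordinates and have equal length, which glosses over insertions, deletions, and rearrangements. The function names and the four toy sequences are hypothetical.

```python
from collections import Counter

def position_probabilities(genomes):
    """Return one {base: frequency} dict per position across pre-aligned genomes."""
    length = len(genomes[0])
    if any(len(g) != length for g in genomes):
        raise ValueError("genomes must be aligned to the same length")
    profile = []
    for column in zip(*genomes):
        counts = Counter(column)
        profile.append({base: n / len(genomes) for base, n in counts.items()})
    return profile

def consensus(profile):
    """Pick the most frequent base at each position (the 'most popular' genome)."""
    return "".join(max(freqs, key=freqs.get) for freqs in profile)

# Four hypothetical individuals, already in the same coordinate system.
genomes = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTTC"]
profile = position_probabilities(genomes)
print(profile[2])
print(consensus(profile))
```

The consensus string throws away exactly the information the profile keeps, so the two are best thought of as a pair: the string for convenience, the probabilities for anything statistical.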
3) Microarrays. We love them, we hate them. They plague us with variance and give us insights nevertheless. Will microarrays be replaced by sequence methods? Or are things like ChIP-Seq going to become standard? Are there standards in microarrays? Will microarrays be around in 10 years, or will everything fall to massively parallel sequencing? Will microarrays stick around because of their precedents, and because they’re simple to work with and don’t require huge computational facilities to process their data?
4) Computational Systems Biology. The ENCODE project, though not directly a sysbio project, is doing a lot for the field by showing that the big picture is a lot more complex than the early attempts at modeling could hint at. CSB is pretty much where it was a few years ago. This isn’t to say that it’s not going anywhere and that systems biology is ‘dead’; far from it. You’ll find that systems biology (or, if you prefer, integrative biology) is just in that low, slow, lingering dawn right before a summer day.
5) Medical informatics. Integration of clinical data back to biomedical/experimental data. There’s so much to say here, and the field is still in its infancy. If you have any good resources on this, please post them in the comments.