PLANT GENETIC RESOURCES AND GENOMICS: MAINSTREAMING AGRICULTURAL RESEARCH THROUGH GENOMICS

Background Study 5

PLANT GENETIC RESOURCES AND GENOMICS: MAINSTREAMING AGRICULTURAL RESEARCH THROUGH GENOMICS Norman Warthmann

This document has been produced at the request of the Secretariat of the International Treaty, in the context of the first expert consultation on the Global Information System on Plant Genetic Resources for Food and Agriculture, to stimulate the discussion on genomics and to facilitate the consideration of appropriate technical and organizational linkages during the development and implementation of the Global Information System.

www.planttreaty.org


Author: Dr Norman Warthmann, Lecturer in Plant Biology, Genetics and Genomics at The Australian National University

The designations employed and the presentation of material in this information product do not imply the expression of any opinion whatsoever on the part of the Food and Agriculture Organization of the United Nations (FAO) concerning the legal or development status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The mention of specific companies or products of manufacturers, whether or not these have been patented, does not imply that these have been endorsed or recommended by FAO in preference to others of a similar nature that are not mentioned. This study reflects the technical opinions of its authors, which are not necessarily those of FAO or the Secretariat of the International Treaty on Plant Genetic Resources for Food and Agriculture.

© FAO, 2014

FAO encourages the use, reproduction and dissemination of material in this information product. Except where otherwise indicated, material may be copied, downloaded and printed for private study, research and teaching purposes, or for use in non-commercial products or services, provided that appropriate acknowledgement of FAO as the source and copyright holder is given and that FAO’s endorsement of users’ views, products or services is not implied in any way. All requests for translation and adaptation rights, and for resale and other commercial use rights should be made via www.fao.org/contact-us/licence-request or addressed to [email protected].


NOTE FROM THE SECRETARIAT

This study is available online at: http://www.planttreaty.org/content/background-study-paper-5

BACKGROUND

1. Through Article 12.1 of the Treaty, Contracting Parties agreed to facilitate access to plant genetic resources for food and agriculture under the Multilateral System and in accordance with the provisions of the Treaty.

2. Among the conditions of the transfer, Article 12.3.c. of the Treaty states that “All available passport data and, subject to applicable law, any other associated available nonconfidential descriptive information, shall be made available with the plant genetic resources for food and agriculture provided”.

3. Article 12.4 of the Treaty provides that facilitated access under the Multilateral System shall be provided pursuant to a Standard Material Transfer Agreement (SMTA), which was adopted by the Governing Body of the Treaty in its Resolution 1/2006 of 16 June 2006.

4. Article 3 of the SMTA states: “The Plant Genetic Resources for Food and Agriculture specified in Annex 1 to this Agreement (hereinafter referred to as the “Material”) and the available related information referred to in Article 5b and in Annex 1 are hereby transferred from the Provider to the Recipient subject to the terms and conditions set out in this Agreement.”

5. Article 17 of the International Treaty states that “Contracting Parties shall cooperate to develop and strengthen a global information system to facilitate the exchange of information, based on existing information systems, on scientific, technical and environmental matters related to plant genetic resources for food and agriculture”.

6. At its Fifth Session in Muscat in September 2013, the Governing Body of the International Treaty adopted Resolution 10/2013, Development of the Global Information System on Plant Genetic Resources in the context of Article 17 of the International Treaty, and requested the Secretary to call for an expert consultation.

7. In preparation for the expert consultation scheduled for January 2015 in San Diego, California, USA, the Secretariat has requested the preparation of this study as a technical input.

8. The present document is intended to shed light on the importance of plant genomics for food and agriculture and to present some suggestions for the consideration of technical experts. It does not intend to make recommendations on the decisions that the Governing Body will need to take, but to provide information and technical analysis that may help identify both problems and opportunities, and so support the Consultation in its task of providing advice to the Secretary for the development of the Vision that will later be presented to the Governing Body in October 2015.

9. The author would like to thank the Treaty Secretariat for this opportunity and has invited comments from other experts to further elaborate this preliminary study exploring the role of genomics and its potential impact on the development of the Global Information System.


Plant Genetic Resources for Food and Agriculture and Genomics: Mainstreaming Agricultural Research through Genomics

Crop improvement is facilitated by harnessing the gene pool of the species and related species to find genotypes and recombine genes to deliver superior plant performance in agriculture, food, energy and biomaterial production.
Henry, R. J. (2011). Next-generation sequencing for understanding and accelerating crop domestication. Briefings in Functional Genomics.

I believe plant breeders and geneticists will drive the next agricultural revolution via the web by sharing the phenotypes and genotypes of crop plants using a system that can store, manage, and allow the retrieval of data.
Zamir, D. (2013). Where have all the crop phenotypes gone? PLoS Biology, 11(6), e1001595.

But the real revolutionary potential in this method lies in its power to open up the genetic bottleneck created thousands of years ago when our major crops were first domesticated.
Goff, S. A., & Salmeron, J. M. (2004). Back to the future of cereals. Scientific American, 291(2), 42–49.


Table of Contents

Introduction
  Motivation
  The opportunity - The genomics revolution
  The chance
  The challenge
  Genomes and genetic variation

Genomics
  DNA Sequencing
    Technologies and machines
    Sequencing strategies
    SNP genotyping
    File formats
  Data Analysis - Genomic information
    Assembly vs. re-sequencing
    Genome assembly
    Genome assembly quality
    Genome Re-sequencing
    The Transcriptome
    Transcriptomics - Gene Expression
    Epigenetics
  Data sharing
    Data sharing - Technical issues
    Data sharing - other issues
  Cyberinfrastructures for Analysis of Genomic Data of Plants
    transPLANT
    The Integrated Breeding Platform (IBP)
    Other platforms
  Relevant Initiatives
    DivSeek
    Global Alliance for Genomics and Health (GA4GH)
    African Orphan Crop Consortium

Impact of Genomics on Plant Genetic Resources for Food and Agriculture
  The impact of genomics on genebank management
  The impact of genomics on plant breeding
  The impact of genomics on pre-breeding

Recommendations

Bibliography


Introduction

Motivation

The cost of genome sequencing has fallen one-million fold in the past several years. It is now inexpensive to gather genome sequence information from large numbers of individuals in timeframes much shorter than any crop’s life cycle. In principle, this wealth of genome sequence data should accelerate progress in plant breeding, and thereby help to combat hunger and malnutrition. Integrating the genomic information with crop performance, i.e., plant phenotypes, environment (weather, climate, pathogens) and management practices, should transform breeding from an art into a predictable science. Aggregating and analysing large amounts of genomic and phenotypic data across many environments and treatments would make it possible to connect genotypes to phenotypes, discover patterns that otherwise remain obscure, and even predict crop performance, enabling smarter choices and faster breeding.

In practice, however, we are not yet organised to seize this extraordinary opportunity. Currently, for the most part, data are collected and studied on a per-experiment basis: very focused, under specific circumstances, often with unique material and hard to reproduce. The data remain isolated by crop, by environment, by year, by institution, by company, by country, etc., and are also analysed in isolation, often with sample sizes too small to make robust discoveries given the number of environmental variables. Current procedures in plant breeding do not allow for widespread comparison across studies and the sharing of information. It is hence difficult, nearly impossible, to learn across datasets, experiments and breeding trials. The genomic information in its universality can serve as a nucleus and focal point for a much-needed integration.

When drafting Article 17, the fathers of the ITPGRFA probably did not quite anticipate the radical technological developments that have occurred and are occurring; however, they did appreciate the value of data aggregation and sharing.
A Global Information System, as called for in Article 17, if implemented with foresight and as soon as possible, will put us on a path to take full advantage of this genomic revolution. At present, relatively little data on PGRFA have been collected. In the absence of an open and interoperable solution, closed, proprietary systems might be created. This would create a fundamental barrier to reaping the benefits of data aggregation and sharing and would hence slow progress.

It should be pointed out that there is another field that is currently being revolutionised, i.e., disrupted, by the new genomics approaches: biomedicine. Here the goal is to reveal the genetic basis of cancer, inherited disease, infectious diseases and drug responses. Biomedicine also currently needs to build an information system for sharing genomic and phenotypic data. Leading researchers in the biomedical community responded to this task in 2013 with the formation of what is now called the “Global Alliance for Genomics & Health (GA4GH)” [1], also known as the “Global Alliance to Enable Responsible Sharing of Genomic and Clinical Data”. Their white paper states: “[…] the Global Alliance aims to foster an environment of widespread data sharing that is unencumbered by competing, proprietary standards, […]. By creating a standardized framework for sharing and using genomic data, the Global Alliance will enhance the opportunities for broader study of a range of diseases while also improving information sharing globally” [2]. This Alliance adopted a constitution [3] in September 2014 and currently (October 2014) has 191 institutional members from 26 countries, including Google, Inc. Google in turn recently launched a platform, “Google Genomics” [4], which has the potential to revolutionise the field. Other developments in this area include the “Public Population Project in Genomics (P3G)” [5] and Sage Bionetworks [6]. These initiatives have been launched because groups of individuals are convinced of the urgent need and tremendous opportunity.

In contrast to the human medical research community, the PGRFA community is in the favourable position that it has already been agreed at the highest level to develop and implement a Global Information System. Article 17 of the International Treaty on Plant Genetic Resources for Food and Agriculture (ITPGRFA [7]) provides a framework for it, and a survey conducted by the Treaty Secretariat in 2014 indicates that the number of research institutions with genomics programmes on PGRFA is growing, and with it the need for coordination. While the specific challenges in biomedicine and plant breeding will be different, the underlying organising principle of genomic information and the need to compare genetic and phenotypic variation are the same; for some crops even at the same scale: the human genome and the maize genome are of roughly the same size.

[1] http://genomicsandhealth.org
[2] Global Alliance for Genomics and Health, White paper (2013)
[3] http://genomicsandhealth.org/ga-constitution-about
[4] https://cloud.google.com/genomics/
[5] http://p3g.org
[6] http://sagebase.org/
[7] FAO. International Treaty on Plant Genetic Resources for Food and Agriculture. FAO. (Retrieved from ftp://ftp.fao.org/docrep/fao/011/i0510e/i0510e.pdf in July 2014)

The opportunity - The genomics revolution

The biological sciences are currently undergoing a revolution. The genomics revolution is a DNA sequencing revolution. DNA sequencing is a process in which the genetic information, the genome, of an organism is deciphered, i.e., read, letter by letter. The genetic information contained in the genome is the set of instructions for life, and reading this code is now accessible to everyone. In the past 10 years the cost of DNA sequencing has fallen by several orders of magnitude; Figure 1 illustrates this decrease in cost per raw megabase. Incremental improvements to Sanger-type DNA sequencers had produced a moderate reduction in sequencing cost since their invention in the 1980s, but it was the advent of 2nd generation DNA sequencers in 2007 that caused a dramatic drop in price. The main novel features of the 2nd generation sequencing machines were that they sequenced DNA in a highly parallel fashion and that they operated on complex mixtures of DNA molecules as templates. Hence, anything written about the application of genomics prior to 2007, be it in human medicine, nature conservation, or agriculture, requires revision: not so much regarding the chances and opportunities genomics will provide, but certainly regarding time scales, project sizes, project variety, and Research & Development priorities.

Figure 1: Cost per Raw Megabase of DNA Sequence [8]

As dramatic as the cost reduction has been since 2007, close examination of the graph in Figure 1 reveals that the cost reduction has slowed and plateaued, and that the cost has even increased in recent years. This seems to be due to a combination of technical limitations and economic considerations of the machine manufacturers and technology providers involved. It may mean that further revolutionary improvements to 2nd generation sequencing technology are unlikely and that future price reductions will come mainly from incremental improvements. At the same time, the next generation of DNA sequencing technology, so-called 3rd generation, is emerging, but it will likely be a few more years until the technology reaches maturity and suitable analysis tools and capacity are in place to fully capitalise on the 3rd generation sequencing machines. Hence, genomics will make its big impact in the next few years through a combination of the broad, decentralised application of 2nd generation DNA sequencing technology, supplemented by data from more centralised 3rd generation DNA sequencing techniques.

[8] Source: http://www.genome.gov/sequencingcosts


The genomics revolution presents both chances and challenges. This document gives an introduction to why and how genomics will make a difference in the conservation and use of PGRFA and highlights the main challenges that a Global Information System on PGRFA will need to address in relation to genomics. The impact genomics will make is both global and local, and most of the challenges revolve around fostering worldwide cooperation and interoperability. But because genomic information is the blueprint of life and the basis of inheritance, attaching genomic information to accessions and other PGRFA material and incorporating genomic information into the Global Information System should make these challenges more straightforward to address rather than more difficult.

The chance

The opportunities that genomic characterisation will bring to the conservation and use of Plant Genetic Resources have been spelled out in detail frequently in the last 15 years [9]. The novelty brought about by the recent advances in genomics is that there are now fast and cheap methods to assess the genetic makeup of an organism, down to base pair resolution, if desired. Large numbers of individuals can now be assayed within timeframes shorter than the lifespan of any crop plant.

The challenge

Acquiring genomic data is cheap, especially the re-sequencing of genomes. There will be an avalanche of data from re-sequencing studies on PGRFA. The challenge is to establish the framework for data aggregation and sharing, both in general and for specific crops. Despite the clear benefits of data integration, effective procedures are not yet in place to enable the widespread sharing of information and comparisons across studies on PGRFA. The consultation process established by the Governing Body of the International Treaty for the development of the Global Information System foreseen in Article 17 of the ITPGRFA may help to strengthen commitments and to trigger those procedures, which will generate clear benefits for plant breeding. The genomics revolution is not expected to make information sharing more difficult, but rather easier. Genomic information holds the promise of unifying the types of information and approaches and of enabling the integration of information across disciplines.

[9] See, for example, Tanksley, S. D., & McCouch, S. R. (1997) and McCouch, et al. (2013).


Genomes and genetic variation

It is important to realise that the relevant genetic information and variation is not as vast as it may seem and certainly not intractable. As of this year, an estimated 228,000 human genomes have been completely sequenced by researchers around the globe, and the number is expected to double every 12 months and reach 1.6 million genomes by 2017. The price of sequencing a single genome has dropped from the $3 billion spent by the original Human Genome Project 13 years ago to as little as $1,000. “The bottleneck now is not the cost—it’s going from a sample to an answer” [10].

The human genome has a (haploid) size of 3 billion base pairs, which is larger than many crop genomes. The 1000 Genomes Project Consortium reported in 2012 on the genetic variation detected by re-sequencing 1092 human genomes [11]. They list 38 million single nucleotide polymorphisms (SNPs), 1.4 million short insertions and deletions (InDels), and 14,000 larger deletions. Their samples were derived from 14 populations and were specifically chosen to maximise diversity.

A SNP is a variant in which one letter of the DNA code is changed. For example, at a particular position in the genome one individual might have an “A” while another has a “G”. A deletion is a variant where one or several bases are missing; an insertion is one where extra bases are inserted. Whether such a variant is an insertion or a deletion depends on the perspective from which it is viewed: an insertion in one individual can be called a deletion in the other. Hence they are often denoted as InDels, which means both or either. InDels can be very large.

Within a species, large parts of the genomes of individuals will be identical. As seen in the human genome example of the 1000 Genomes Project above, in a diversity-maximised sample of highly heterozygous genomes, 38 million SNPs in 3 billion bp is about 1 variant in every 80 bp, i.e., on the order of 1 in 100 bp. For genetics, it is only these differences between the genomes that are of interest. In addition, most of the genetic variation within a species is shared. This means that not every individual, variety or cultivar (whatever the unit is that is compared) harbours unique genetic variation. For the most part, an individual is a combination of common variation, which presents itself as haplotypes, which are blocks of linked genetic variants.

In the case of self-fertilising crop plants, in contrast to humans, the level of heterozygosity is expected to be low. This is certainly true for the mega-varieties of our major crops, and it is a feature of a uniform crop that all individuals within a cultivar are identical. This is probably less so for landraces. Landraces may harbour residual heterozygosity, and haplotypes will have frequencies in the population different from 100%. Those frequencies can change from year to year in the field, but nonetheless, the genomes of individuals within a landrace will be a combination of common haplotypes. It is this combinatorial nature of genetic diversity that allows geneticists to detect patterns and enables Genome Wide Association Studies (GWAS). Such studies do demand, however, that large numbers of individuals and cultivars be compared with each other and, because the genetic makeup interacts with the environment, in as many environmental conditions as possible.
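The variant counts quoted above lend themselves to a quick arithmetic check. The following minimal Python sketch is an illustration only and is not part of the original study; the function names `variant_spacing` and `classify_variant` are hypothetical. It computes the average spacing between SNPs from the figures given in the text, and it shows why the label "insertion" or "deletion" depends on which genome is taken as the reference.

```python
def variant_spacing(num_variants: int, genome_size_bp: int) -> float:
    """Average number of base pairs between variants (illustrative helper)."""
    return genome_size_bp / num_variants


def classify_variant(ref_allele: str, alt_allele: str) -> str:
    """Label a variant relative to whichever allele is treated as the reference."""
    if len(ref_allele) == 1 and len(alt_allele) == 1 and ref_allele != alt_allele:
        return "SNP"
    if len(alt_allele) > len(ref_allele):
        return "insertion"
    if len(alt_allele) < len(ref_allele):
        return "deletion"
    return "other"


# Figures from the text: 38 million SNPs in a 3 billion bp haploid genome.
spacing = variant_spacing(38_000_000, 3_000_000_000)
print(f"about one SNP every {spacing:.0f} bp")

# The same difference between two genomes, viewed from either side:
print(classify_variant("A", "ATT"))  # second genome carries extra bases
print(classify_variant("ATT", "A"))  # roles swapped: now it reads as a deletion
print(classify_variant("A", "G"))    # single-letter change
```

Swapping the two arguments turns an insertion into a deletion, which is the reason the combined term InDel is used.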

[10] Regalado, A. (2014, September 24)
[11] McVean, et al. (2012)


Genomic information will make it possible to uncouple the haplotypes from the particular individual or variety that was analysed. This makes studies comparable at the haplotype level, even when different sets of individuals and cultivars were analysed. Different experiments will certainly also differ in the environmental conditions the plants experienced. This has always been the challenge of crop phenotyping: environmental conditions are difficult and often impossible to control. The hope is that this can be accounted for by large sample numbers and by monitoring the actual environmental conditions. Recording high-resolution genomic data on PGRFA in a Global Information System will make it possible to integrate data across experiments, which will facilitate reaching the sample sizes needed to make robust discoveries given the number of environmental variables.
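As a rough illustration of what comparison "at the haplotype level" could look like in practice, the sketch below pools phenotype observations from two experiments by haplotype rather than by cultivar. Everything in it is assumed for the purpose of the example: the record layout, the trial and cultivar names, the haplotype labels and the phenotype values are invented, and a real analysis would also have to model the environmental effects discussed above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (experiment, cultivar, haplotype at one locus, phenotype value).
# The two trials share no cultivars, only haplotypes.
observations = [
    ("trial_A", "cultivar_1", "hap1", 4.0),
    ("trial_A", "cultivar_2", "hap2", 6.0),
    ("trial_B", "cultivar_3", "hap1", 5.0),
    ("trial_B", "cultivar_4", "hap2", 7.0),
]


def pool_by_haplotype(records):
    """Group phenotype values by haplotype, ignoring which cultivar carried it."""
    pooled = defaultdict(list)
    for _experiment, _cultivar, haplotype, value in records:
        pooled[haplotype].append(value)
    # Report the sample size and mean phenotype for each haplotype.
    return {hap: (len(values), mean(values)) for hap, values in pooled.items()}


summary = pool_by_haplotype(observations)
for hap, (n, avg) in sorted(summary.items()):
    print(f"{hap}: n={n}, mean phenotype={avg}")
```

Although no cultivar appears in both trials, each haplotype ends up with observations from both, which is how pooling at the haplotype level increases the effective sample size.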


Genomics

DNA Sequencing

A plant’s nuclear genome is organised in chromosomes, which are very long DNA molecules, each millions to tens of millions of base pairs (bp) long. So far there is no sequencing technology capable of producing sequence reads the length of an entire eukaryotic chromosome. At present there exist high-throughput machines that produce short reads, which are generally shorter than 400 bp, and “long read” sequencing technologies, which can read up to several thousand bp of continuous sequence. The different sequencing technologies have their own unique strengths and are used to complement one another. Short read sequencing is cheap, widely available and hence easily accessed. Many universities and research institutes around the world possess their own machines. In addition, there exists an industry of commercial sequencing service providers. Figure 2 shows a map with 2558 machines situated in 920 sequencing centres.

Figure 2: World Map of High-throughput Sequencers [12]

[12] By James Hadfield and Nick Loman. For an interactive map of global sequencing capacity see http://omicsmaps.com/ (last accessed November 2014).


About 90% of the world’s sequencing data today is produced using Illumina’s short read technology [13]. These high-throughput machines produce large amounts of reliable data that readily capture SNP and small InDel variation, which accounts for the bulk of the variation within a species. Even though the reads are short, molecular biology and bioinformatic procedures have been developed that are capable of assembling larger tracts of sequence; see chapter “Genome assembly”. Long read sequencing is currently more expensive, and preparing the samples for sequencing is more challenging. Capacity in this area is mainly found in Europe, the USA, Japan and China at specialised research institutes and commercial sequencing service providers. The technology is hence also widely accessible. The main technology in use in mid-2014 to produce long reads was that of Pacific Biosciences. Despite the high error rate of the current long-read sequencing process, long reads are valuable because they enable reading through complex genomic regions and provide higher confidence and precision for calling structural variants, which are large insertions, deletions and rearrangements. With long read information, variants are more easily phased into haplotypes and, for building reference genomes, long reads improve assemblies into even longer scaffolds, providing order and closing gaps. Long and short reads together provide a very comprehensive view suitable for most genomes. For exceptionally large, repetitive genomes, reduced representation sequencing strategies might be the most cost-effective.

Technologies and machines

Currently, in mid-2014, only three sequencing technologies are available and widely used: Illumina, Ion Torrent, and Pacific Biosciences. The main method of choice is high-throughput short-read sequencing using Illumina sequencers, complemented by long-read sequencing with PacBio sequencing machines. DOE/JGI has operated in this mode since 2012 [14], while the Beijing Genomics Institute (BGI) uses Illumina almost exclusively for its sequencing, with >128 HiSeq2000 machines between the Shenzhen and Hong Kong sites.

[13] Regalado, A. (2014, September 24)
[14] 2013 DOE JGI Progress Report


The most recent account of the history of 2nd generation DNA sequencing instrument development over the last 10 years can be found in McPherson, J. D. (2014) [15]. In addition, Sarah Ayling reviewed the different sequencing technologies from a practical perspective in 2013 for the DivSeek Initiative [16].

Short Read sequencing technology

Illumina/Solexa

Originally developed by Solexa, which was later purchased by Illumina, this is the cheapest technology currently available in terms of price per base pair. The Illumina Genome Analyzer IIx and HiSeq2000 are widely used, and can produce 95 and 600 Gb of data per 11-day run, respectively. Illumina recently released the MiSeq, a benchtop sequencing instrument. Illumina machines can perform single-end and paired-end runs, in which one or both ends of the same DNA fragment are sequenced. Paired-end sequencing offers a significant advantage: in genome assembly, the paired reads should be correctly oriented relative to one another and lie within a certain distance of each other. The error rate is
