Communications Biology volume 6, Article number: 577 (2023)
Genetic mapping to identify genes and alleles associated with or causing variation in economically important quantitative traits in livestock such as pigs is a major goal in animal genetic improvement. Despite recent advances in high-throughput genotyping technologies, the resolution of genetic mapping in pigs remains poor, due in part to the low density of genotyped variant sites. In this study, we overcame this limitation by developing a reference haplotype panel for pigs based on 2259 whole-genome-sequenced animals representing 44 pig breeds. We evaluated combinations of software and breed composition to optimize the imputation procedure and achieved an average concordance rate in excess of 96%, a non-reference concordance rate of 88%, and an r2 of 0.85. We demonstrated in two case studies that genotype imputation using this resource can dramatically improve the resolution of genetic mapping. A public web server has been developed to allow the pig genetics community to make full use of this resource. We expect the resource to facilitate genetic mapping and accelerate genetic improvement in pigs.
The domestic pig (Sus scrofa) is an important livestock species and a model organism for biomedical research1. Historically, domestication and intense artificial selection have created many pig breeds that are genetically and phenotypically distinct from each other and from their wild relatives2,3,4. More recently, high-throughput DNA sequencing and genotyping technologies5 have facilitated the genetic improvement of pigs. For example, hundreds of genome-wide association and quantitative trait locus (QTL) mapping studies have identified numerous genomic regions associated with a variety of production, physiological, and behavioral phenotypes6. These studies are important for understanding the genetic and biological basis of economically and biomedically important traits such as growth7, fertility8, and disease resistance9.
The resolution of genetic mapping in pigs remains poor, due in part to the low density of single nucleotide polymorphism (SNP) genotyping arrays. A proven and cost-effective approach to overcoming this limitation is genotype imputation, which leverages linkage disequilibrium to infer genotypes at unobserved polymorphic loci10. With large haplotype reference panels built from whole-genome sequencing, imputation has the potential to deliver sequence-level genotypes11. In farm animals, where QTL identification and genetic prediction are two major goals and linkage disequilibrium is extensive, sequence-level genotype imputation has been applied successfully with relatively small numbers of reference haplotypes but decent accuracy12,13. In pigs in particular, at least two public imputation servers are available14,15. However, they either contained a very limited number of animals in the reference panel14 or lacked good representation of the major commercial breeds15, limiting their applications. In addition, although many studies have demonstrated improvements in mapping resolution16 and genomic prediction accuracy17, none of them can be publicly accessed.
In this study, we produced whole-genome sequence data for 1530 newly sequenced pigs and combined them with 729 additional animals from public databases to call variants and develop by far the largest and most diverse haplotype reference panel in pigs to date. This substantial increase in the number of available genomes enabled us to impute SNP array genotypes to whole-genome sequences rapidly and accurately. We evaluated the imputation accuracy and demonstrated the utility of this haplotype reference panel in genome-wide association mapping. We introduce a new public web server (swimgeno.org) where users can submit array genotypes and retrieve imputed whole-genome sequence-level genotypes. This resource will greatly improve access to high-accuracy genotype imputation, facilitating genetic mapping at potentially nucleotide resolution in pigs.
[...] 0.5) in 435 Durocs, 522 Landraces, 493 Yorkshires, 36 Meishans, 24 European wild boars, and 27 Asian wild boars. c Scatter plot of first two principal components of genotype matrix for common (MAF > 0.05) and LD-pruned variants. Points are color-coded according to their reported breed information. A preliminary principal component analysis was performed to visually inspect and remove clear outliers from clusters, which indicated errors in breed information. d Ancestries of pigs were estimated with variable (K = 2, 4, 6) numbers of postulated ancestral populations using the ADMIXTURE software. Estimated ancestries were plotted as stacked bar charts with breeds annotated on the top. In addition to annotations above the bar chart, broad geographical locations are also annotated below the bar chart for K = 6.
[...] > 0.005 to construct the haplotype reference panel. To investigate factors that influence imputation accuracy, we considered different combinations of commonly used phasing and imputation software, including SHAPEIT4/IMPUTE5, Beagle5.2/Beagle5.2, and Eagle2.4/Minimac4. We defined imputation accuracy using three metrics: the overall concordance rate between imputed and observed genotypes, the non-reference concordance rate summarizing accuracy for non-reference genotypes only, and the squared correlation (r2) between imputed and observed genotypes. We focused on Landrace as the target set because it has the largest number of animals in the dataset. We held out 100 Landrace pigs sequenced at high coverage (>15X) and compared observed genotypes with imputed genotypes starting from sequencing-based genotypes at sites on a 50K SNP array (GeneSeek GGP). Regardless of breed composition in the haplotype reference panel of fixed size, SHAPEIT4/IMPUTE5 outperformed Beagle5.2/Beagle5.2 and Eagle2.4/Minimac4 in all three metrics (Fig. 2a–c).
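The three accuracy metrics can be illustrated with a minimal Python sketch. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) and treats a genotype as non-reference when the observed call carries at least one alternate allele. Both the coding and that definition are assumptions made for illustration, and the function name `imputation_accuracy` is hypothetical rather than part of any software used in the study.

```python
def imputation_accuracy(observed, imputed):
    """Return (concordance, non_reference_concordance, r2) for paired genotypes.

    Genotypes are alternate-allele counts (0, 1, 2); None marks a missing call.
    """
    # Keep only positions where both the observed and imputed calls are present
    pairs = [(o, i) for o, i in zip(observed, imputed)
             if o is not None and i is not None]
    if not pairs:
        raise ValueError("no genotype pairs to compare")
    n = len(pairs)
    nan = float("nan")

    # Overall concordance: fraction of genotypes that match exactly
    concordance = sum(o == i for o, i in pairs) / n

    # Assumption: "non-reference" means the observed genotype has >= 1 alt allele
    non_ref = [(o, i) for o, i in pairs if o > 0]
    non_ref_concordance = (
        sum(o == i for o, i in non_ref) / len(non_ref) if non_ref else nan
    )

    # Squared Pearson correlation between observed and imputed allele counts
    mean_o = sum(o for o, _ in pairs) / n
    mean_i = sum(i for _, i in pairs) / n
    cov = sum((o - mean_o) * (i - mean_i) for o, i in pairs)
    var_o = sum((o - mean_o) ** 2 for o, _ in pairs)
    var_i = sum((i - mean_i) ** 2 for _, i in pairs)
    r2 = cov ** 2 / (var_o * var_i) if var_o > 0 and var_i > 0 else nan

    return concordance, non_ref_concordance, r2
```

For the toy vectors `observed = [0, 0, 1, 2, 1, 0, 2, 0]` and `imputed = [0, 0, 1, 2, 0, 0, 2, 0]`, seven of eight genotypes match (concordance 0.875), but only three of the four non-reference genotypes match (0.75). This is why the non-reference rate is the stricter metric when most genotypes are homozygous reference.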
SHAPEIT4/IMPUTE5 was therefore chosen for all subsequent analyses.
[...] >94.24%), imputation using the SWIM panel developed in the present study was consistently higher than PHARP within each breed (Fig. 4b). The improvement was much more pronounced when considering the non-reference concordance rate and r2, two metrics that more faithfully reflect the accuracy, especially at low frequency (Fig. 4c, d). The difference between SWIM and PHARP could simply be a sample size difference, especially for the breeds evaluated. The final reference haplotype panel consisting of all 2259 animals is expected to achieve a concordance rate in excess of 95.84%, a non-reference concordance rate of 88.26%, and an r2 of 0.85.
[...] >A) has been suggested as the causative mutation21 and extensively replicated in multiple genetic backgrounds23. Furthermore, mutations in MC4R are strongly associated with early onset obesity in humans24, and its role in the regulation of energy homeostasis is well established25. Importantly, the putative causal mutation in MC4R has been included in one of the commercially available SNP genotyping arrays, the GeneSeek GGP Porcine 50K SNP Chip (Neogen, Lincoln, NE). However, the same SNP is not present in the more widely used Illumina PorcineSNP60 chip. To see if genotype imputation was able to correctly impute the genotypes of this SNP, we excluded the MC4R SNP and imputed whole-genome genotypes from a population of 3769 Duroc pigs genotyped using the GGP Porcine 50K SNP arrays. Remarkably, the concordance rate and r2 between the imputed and array MC4R SNP genotypes were 99.71% and 0.9916, respectively. We performed GWAS using array and imputed genotypes; both showed a major peak on chromosome 1 (Fig. 5a, Supplementary Data 3 and 4) and a clear deviation of P-value distribution from the null (Supplementary Fig. 4a).
Among the imputed SNPs, the top hit (chr1:161511936:T > C, P = 2.98 × 10−13) explained 2.85% of the total phenotypic variance (Fig. 5a). Under this peak in a 4-Mb region (158.5–162.5 Mb), there were 7138 variants within 22 genes. Linkage disequilibrium in this region was extensive, with 1050 variants in strong LD (r2 > 0.8) with the top hit, including the MC4R SNP (Fig. 5b). The top hit was an intronic SNP in the gene CCBE1 (Fig. 5b). However, the extensive LD in this region makes it difficult to pinpoint a causative mutation by genetic data alone. Additional functional information and genetic data that break the LD are necessary to further fine-map causative genes and mutations. Nevertheless, the ability to identify the putative MC4R causative SNP as one of the top associated variants in a long stretch of high LD clearly demonstrated the improvement of resolution using imputed genotypes. In our analysis, the MC4R SNP was initially removed and would have been invisible without imputation, as would be the case if the Illumina PorcineSNP60 chips were used.
[...] > C) is indicated by a gradient of blue color. Locations of genes are indicated in the box below the plot, where blue boxes and gene names with a left arrowhead (<) indicate genes transcribed on the reverse strand, and red boxes and gene names with a right arrowhead (>) indicate genes transcribed from the forward strand. Genes that are not marked do not have gene symbols. Gene locations are based on the Ensembl Release 98 annotation.
[...] >T, P = 3.45 × 10−39). Remarkably, this variant explained 13.65% of the total phenotypic variance, and the homozygous C/C animals were, on average, 4.01 cm longer than the T/T homozygotes (Fig. 6b, c). BMP2 has been repeatedly shown to be associated with growth traits in pigs. A recent study implicated a regulatory variant upstream of the BMP2 gene and validated its functional impact using reporter genes26.
This regulatory variant was the third most significant SNP under this peak in our analysis. Whether one or both of these potentially regulatory variants are the causative mutations remains to be determined. Given the strong association, high MAF of these SNPs, and less extensive LD in this region, it is unlikely that these regulatory variants were tagging protein-coding and less common variants in the BMP2 gene. In addition to the genetic support from this Yorkshire population, the body length increasing C allele was much more prevalent in Landrace than in other breeds. A hallmark of the Landrace breed is its long body; thus, regulatory variation of the BMP2 gene may be a major contributor to the phenotypic differentiation between pig breeds. In contrast, although the SNP chip was able to broadly identify this region, the most significant SNP (chr17:15827832:T>G, P = 1.58 × 10−25) in an SNP chip-based GWAS was about 184 kb away from the lead SNP and explained a substantially smaller share of the variance (8.22% versus 13.65%).
[...] >T) are indicated by a gradient of blue color. Locations of genes are indicated in the box below the plot and according to the Ensembl Release 98 annotation. All three genes are colored in red and transcribed from the forward strand. The only gene with a symbol in this region is BMP2. c Scatter and box plots of body length (in cm) for the three genotypes of the chr17:15643342:C>T SNP. The lower and upper boundaries of the box are, respectively, the 25% and 75% quantiles of the data, the midline is the median, and the whiskers are the minimum and maximum. d Allele frequencies of the chr17:15643342:C>T SNP in different breeds.
[...] > 54.69") were removed. Variant quality score recalibration (VQSR) on SNPs was performed with truth SNP sets compiled from commercial SNP arrays, including 50K, 60K, and 80K SNP chips (prior = 15.0) on the Illumina platform and the 660K (prior = 12.0), SowPro90 (prior = 15.0) SNP chips from the Affymetrix platform.
SNPs were filtered with a truth sensitivity filter level at 99.0. Without a truth set of indels, we applied hard filtering on them by excluding indels with QD < 2.0, QUAL < 50.0, FS > 100.0, or ReadPosRankSum < −20.0, as recommended by GATK's best practices. Additionally, we filtered out animals with a missing rate >0.20 or heterozygosity >0.20 and retained bi-allelic sites with a missing rate <0.2 and mean sequencing depth between 5 and 500. Filtering was performed using a combination of VCFtools 0.1.13 (ref. 32) and BCFtools 1.13 (ref. 33) commands.
[...] > 0.5) and low-frequency variants (MAF < 0.05). To understand the genetic structure in the population, we retained variants with MAF > 0.05 and missing rate <0.1 and LD-pruned the SNPs to pairwise r2 < 0.3 (--indep-pairwise 50 10 0.3) using PLINK 1.9 (ref. 35). Principal component analysis (PCA) was performed on the filtered list of 1,223,882 variants using GCTA 1.93.2 (ref. 36) for all individuals. Ancestries were estimated using ADMIXTURE 1.3 (ref. 37) on 185 individuals randomly selected according to breed representation in the dataset or at least four individuals per breed. The downsampling was necessary to properly visualize population structure.
[...] >0.1 and MAF < 0.005 were removed. Additionally, variants with a Hardy–Weinberg equilibrium test P-value < 10−10 in all three of the Duroc, Landrace, and Yorkshire populations, tested separately in PLINK, were removed. Only autosomal variants were retained for imputation.
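The final variant filter can be sketched in Python as follows. This is an illustration under stated assumptions, not the PLINK implementation. A chi-square goodness-of-fit test stands in for PLINK's Hardy–Weinberg test. The truncated `>0.1` threshold is assumed to refer to the per-variant missing rate. A variant is assumed to be removed for Hardy–Weinberg deviation only if its P-value is below 10−10 in every one of the three breeds. The names `hwe_chi2_pvalue` and `keep_variant` are hypothetical.

```python
from math import erfc, sqrt

MAX_MISSING = 0.1     # assumption: the truncated ">0.1" is a missing-rate cutoff
MIN_MAF = 0.005       # variants with MAF below this are removed
HWE_P_CUTOFF = 1e-10  # Hardy-Weinberg P-value threshold from the text


def hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 df) P-value for Hardy-Weinberg equilibrium from genotype counts.

    Stand-in for PLINK's test; used here only to keep the sketch dependency-free.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return 1.0
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic in this breed: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return erfc(sqrt(chi2 / 2.0))


def keep_variant(missing_rate, maf, counts_by_breed):
    """Return True if a variant passes the missingness, MAF, and HWE filters.

    counts_by_breed maps a breed name to (n_hom_ref, n_het, n_hom_alt).
    """
    if missing_rate > MAX_MISSING or maf < MIN_MAF:
        return False
    # Assumption: remove only if the HWE test fails in every breed tested
    fails_in_all_breeds = bool(counts_by_breed) and all(
        hwe_chi2_pvalue(*counts) < HWE_P_CUTOFF
        for counts in counts_by_breed.values()
    )
    return not fails_in_all_breeds
```

Under these assumptions, a variant with no heterozygotes in Duroc and Landrace but genotype counts at Hardy–Weinberg proportions in Yorkshire would be retained, because it does not fail the test in all three breeds.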