Appendix X - Interbull validation test for genomic evaluations – GEBV test

Document based on Mäntysaari, E., Liu, Z. and VanRaden, P. 2011. Interbull Validation Test for Genomic Evaluations. Interbull Bulletin 41, p. 17-21.

Motivation

The inclusion of genomic information in international comparisons for dairy breeds requires that national genomic breeding values (GEBVs) be validated by Interbull, in much the same way that conventional EBVs are validated as a precondition for participating in the MACE evaluations.

The GEBV test will be applied to validate the national models used to compute the GEBVs that the national genetic evaluation centers (NGEC) publish and will eventually submit to Interbull for international genetic evaluations including genomic information. The GEBV test can also be considered a quality assurance assessment for national genomic evaluations. GEBVs from models that have been tested can be referred to as breeding value estimates with appropriate reliability, and can be converted to breeding values on other country scales using conversion equations derived by Interbull.

Rationale

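The GEBV test evaluates two properties of the national genomic model: the bias of its predictions and the improvement in accuracy over the conventional model.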

The test for bias verifies the ability of a model that includes only the data available 4 years ago to predict current performances. The NGEC has to exclude the last 4 years of data and re-run the analyses on the reduced data, with the same model that is being tested. However, in some cases not all bulls in the generation available for validation have been genotyped. Thus, there are bulls that will get more than 20 daughters in the full data but that have no GEBVs. This is called selective genotyping, and it leads to systematic bias in the validation bull group. In the test, this bias needs to be corrected by accounting for the selection differential between the mean national EBV (current, conventional) of the genotyped bulls and the overall mean national EBV of all potential candidates. This selection differential can be used to derive the expected regression coefficient, which would equal unity if no selective genotyping had taken place.

The improvement in accuracy is tested by comparing the coefficients of determination (R2) obtained when current performances are regressed on the predictions of the reduced genomic model and of the equivalent reduced conventional model (both based on data from 4 years ago). The R2 of the model including genomic information must be higher than that of the model including only parent average information.
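The two checks described above can be illustrated with a pair of simple regressions. The sketch below is only an illustration: it assumes an unweighted ordinary least-squares fit of current performance (DD or D_PGM) on the reduced-data prediction, and any weighting applied in the official test is not shown. The function name `fit_validation_regression` and the toy values are hypothetical.

```python
from statistics import mean


def fit_validation_regression(y, x):
    """Unweighted OLS of y (current DD or D_PGM) on x (reduced-data prediction).

    Returns (b0, b1, r2). Assumption: no weighting by EDC or reliability.
    """
    if len(y) != len(x) or len(y) < 3:
        raise ValueError("need two equal-length series with at least 3 bulls")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                    # slope, compared with its expected value
    b0 = y_bar - b1 * x_bar           # intercept
    r2 = (sxy * sxy) / (sxx * syy)    # coefficient of determination
    return b0, b1, r2


# Hypothetical toy data for five test bulls.
dd = [1.0, 2.1, 2.9, 4.2, 5.0]        # current performance (dependent variable)
gebvr = [1.1, 2.0, 3.2, 3.9, 5.1]     # reduced genomic model
ebvr = [2.5, 2.0, 3.5, 3.0, 4.0]      # reduced conventional model (parent average)

_, b1_genomic, r2_genomic = fit_validation_regression(dd, gebvr)
_, b1_conventional, r2_conventional = fit_validation_regression(dd, ebvr)

# The accuracy criterion: the genomic R2 must exceed the conventional R2.
accuracy_improved = r2_genomic > r2_conventional
```

The slope `b1_genomic` is the quantity that is compared with the expected regression coefficient in the test for bias.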

3. Test data sets

For each trait group, five data sets will be necessary for the GEBV test: two full data sets, two reduced data sets and one additional file with genotyping information on test bulls. In order to facilitate data preparation and reading, Interbull file formats 01X will be used.

3.1 Full data sets

The full data sets include all animals present in the latest international evaluation before the validation data are prepared (CURRENT), without edits. They are of two types: one containing national official genetic values (EBVs) and another containing either de-regressed predicted genetic merits (D_PGMs) or daughter deviations (DDs).

3.1.1 National official genetic evaluation file (fileCxxxf)

The files sent by the NGEC as input for the latest Interbull routine run (formats 010, 115, 015, 016, 017, 018, 019, 020) will be used by the Interbull Centre, so the NGEC does not have to provide these files again.

3.1.2 Daughter deviation file (D01Xf)

The NGEC needs to prepare either DD or D_PGM for the same animals included in 3.1.1. These values represent the currently estimated performance of the animals and will be used as the dependent variable in the validation procedure. EDC and reliability estimates should be exactly the same as in the 01X file in 3.1.1.

3.2 Reduced data sets

The reduced data sets should be prepared by truncating the phenotypes used as input for both the conventional and the genomic evaluations. The NGEC must exclude phenotypic information from the past 4 years and re-run the current models of genetic/genomic evaluation for the traits of interest, keeping the animals without progeny information after truncation (test bulls) in the data in order to obtain genetic merit estimates based solely on parent averages (EBVr) or on parent averages plus genomic prediction equations (GEBVr).

3.2.1. Reduced conventional genetic evaluation file (C01Xr)

The NGEC should carry out a conventional genetic evaluation using truncated data (only phenotypes up to 4 years prior to the date of analysis) but including in the analysis all animals present in the current official evaluations (C01Xf). All animals in the C01Xf must be included in the C01Xr file sent to Interbull.

3.2.2. Reduced genomic evaluation file (G01Xr)

Similarly, new genomic evaluations should be carried out using exactly the same model that is being validated (current) but excluding phenotypic information from the last four years (the same truncated data as for C01Xr). All bulls that did not have a progeny test 4 years ago and that currently have at least 20 daughter-equivalents in the national genetic evaluation (test bulls) need to have a genomically enhanced EBV (GEBVr) estimated and included in the output.
All animals included in the C01Xf must be included in the G01Xr file sent to Interbull. If a significant number of foreign animals are included in the reference population and estimation of genomic prediction equations uses de-regressed MACE values for these animals as input, the reduced genomic evaluation can be achieved in two ways:

Test data files and their specific variables. The EDC, Reliability and EBV columns give the specific variable stored in the equivalent field of the 01X file.

| Test data | Type of information | File types and formats | EDC | Reliability | EBV |
|---|---|---|---|---|---|
| Full data sets | Conventional genetic data | C010f, C115f, C015f, C016f, C017f, C018f, C019f, C020f | EDC | r2EBV | EBV |
| Full data sets | Daughter deviation data | D010f, D115f, D015f, D016f, D017f, D018f, D019f, D020f | EDC | r2EBV | D_PGM (or DD, if available) |
| Reduced data sets | Conventional genetic data | C010r, C115r, C015r, C016r, C017r, C018r, C019r, C020r | EDCr | r2EBVr | EBVr |
| Reduced data sets | Genomic data | G010r, G115r, G015r, G016r, G017r, G018r, G019r, G020r | GEDCr | r2GEBVr | GEBVr |

e. The method of estimation of GEDCr (and/or r2GEBVr) has to be reported in the Interbull GENO form.

f. The GEBVr prediction equations also have to be based on the truncated data. If the GEBVr combines information from DGV and EBV (i.e. PA), the EBV (PA) information also has to come from the truncated data.

g. Bulls with an EBV in the full data sets that had no progeny information four years ago (EDCr=0) should be included in the reduced data set.
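The truncation and the choice of test bulls can be sketched in a few lines. This is a minimal illustration under stated assumptions: "daughter-equivalents" is read as the EDC in the full data, bulls are keyed by a plain ID string, and the helper names `truncation_cutoff` and `select_test_bulls`, the bull IDs and the EDC values are all hypothetical.

```python
from datetime import date


def truncation_cutoff(evaluation_date, years=4):
    """Return the date `years` before evaluation_date (29 Feb falls back to 28 Feb)."""
    try:
        return evaluation_date.replace(year=evaluation_date.year - years)
    except ValueError:
        return evaluation_date.replace(year=evaluation_date.year - years, day=28)


def select_test_bulls(edc_full, edc_reduced, min_edc=20):
    """Return sorted IDs of bulls with no progeny information in the reduced
    data (EDCr == 0) and at least `min_edc` daughter-equivalents in the full data.

    Assumption: a bull missing from edc_reduced is treated as EDCr == 0.
    """
    return sorted(
        bull
        for bull, edc in edc_full.items()
        if edc >= min_edc and edc_reduced.get(bull, 0) == 0
    )


# Phenotypes recorded after this date would be dropped from the reduced run.
cutoff = truncation_cutoff(date(2010, 2, 15))

# Hypothetical EDC values per bull in the full and reduced data sets.
edc_full = {"BULL_A": 85, "BULL_B": 12, "BULL_C": 140}
edc_reduced = {"BULL_A": 0, "BULL_B": 0, "BULL_C": 95}

test_bulls = select_test_bulls(edc_full, edc_reduced)
```

Here only `BULL_A` qualifies: `BULL_B` has too few daughter-equivalents in the full data, and `BULL_C` already had progeny information four years ago.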

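| i | p | x | E(b1), Rb2 = 0.50 | E(b1), Rb2 = 0.55 | E(b1), Rb2 = 0.60 | E(b1), Rb2 = 0.65 | E(b1), Rb2 = 0.70 |
|---|---|---|---|---|---|---|---|
| 0.644 | 60 | -0.253 | 0.594 | 0.619 | 0.646 | 0.676 | 0.709 |
| 0.570 | 65 | -0.385 | 0.626 | 0.650 | 0.677 | 0.705 | 0.736 |
| 0.497 | 70 | -0.524 | 0.660 | 0.683 | 0.708 | 0.735 | 0.764 |
| 0.424 | 75 | -0.674 | 0.697 | 0.718 | 0.742 | 0.766 | 0.793 |
| 0.350 | 80 | -0.842 | 0.736 | 0.756 | 0.777 | 0.800 | 0.823 |
| 0.274 | 85 | -1.036 | 0.781 | 0.799 | 0.817 | 0.836 | 0.856 |
| 0.195 | 90 | -1.282 | 0.832 | 0.846 | 0.861 | 0.876 | 0.892 |
| 0.109 | 95 | -1.645 | 0.894 | 0.904 | 0.914 | 0.924 | 0.934 |
| 0.000 | 100 | | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |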
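The tabulated values can be reproduced numerically. The formula used below is not stated in this section; it is inferred from the table, whose entries appear consistent with E(b1) = (1 - k) / (1 - k * Rb2), where k = i(i - x), x is the truncation point of a standard normal distribution when a fraction p of candidates is genotyped, and i is the corresponding selection intensity. Treat it as a sketch under that assumption; the function name `expected_b1` is hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def expected_b1(p, rb2):
    """Expected slope when a fraction p (0 < p <= 1) of candidates is genotyped.

    Assumed formula (inferred from the table, not stated in the text):
    E(b1) = (1 - k) / (1 - k * rb2), with k = i * (i - x).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    if p == 1.0:
        return 1.0                      # no selective genotyping
    x = _STD_NORMAL.inv_cdf(1.0 - p)    # truncation point
    i = _STD_NORMAL.pdf(x) / p          # selection intensity
    k = i * (i - x)                     # proportional variance reduction
    return (1.0 - k) / (1.0 - k * rb2)


# Reproduce one row of the table (p = 60 %).
row_60 = [round(expected_b1(0.60, rb2), 3) for rb2 in (0.50, 0.55, 0.60, 0.65, 0.70)]
```

Small differences in the third decimal are possible, because the table may have been computed from rounded values of i and x.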

5. Documents to be submitted by participating NGEC

5.1. Interbull GENO form

The methodology for estimation of GEBV and its accuracy (r2GEBV) has to be reported by the NGEC in the Interbull GENO form (Appendix IV).

5.2. GEBV test estimates (File format 731)

The NGEC submitting genomic data for validation is also required to provide Form 731 (Appendix III), which contains the results of the validation test obtained by the applicant, as well as the descriptive statistics needed for the correct treatment of the submitted data. Any discrepancies between the values in file 731 and the estimates obtained by ITBC will be discussed with the NGEC submitting the data prior to publication of results.

APPENDIX I – Definitions:

· EBV – Estimated Breeding Value (conventional national evaluations of the trait, free of genomic information, which are submitted to Interbull to be used in MACE evaluations)
· DGV – Direct Estimated Genomic Value (genomic evaluations based on SNP prediction equations)
· GEBV – Genomically Enhanced Estimated Breeding Value (evaluations that combine EBV and DGV)
· EDC – Effective Daughter Contribution
· GEDC – Genomically Enhanced Effective Daughter Contribution (EDC plus the genomic contribution)
· GMACE – Multiple Trait Across Country Genomic Evaluation
· PA – Parent Average
· D_PGM – De-regressed Predicted Genetic Merit
· DD – Daughter Deviation
· NGEC – National Genetic Evaluation Centre
· λ = (4-h2)/h2
· r2 – Reliability of the bull's evaluation
· R2 – Accuracy of the test model

APPENDIX II - FILE FORMAT 732

File format for genotyping information on test bulls.

| Starting position | Field description | Format | Example |
|---|---|---|---|
| 1 | Record type | Character 3 | 732 |
| 4 | Breed of evaluation | Character 3 | HOL |
| 7 | Breed of bull | Character 3 | HOL |
| 10 | Country of first registration of bull | Character 3 | |
| 13 | Sex | Character 1 | M |
| 14 | ID number of bull | Character 12 | 000000A12345 |
| 26 | Name of bull | Character 30 | |
| 56 | Country sending this information | Character 3 | |
| 59 | Flag if bull has been genotyped | Character 1 | Y=yes; N=no |
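A fixed-width record in format 732 can be read by slicing each field at its starting position. The sketch below is an illustration only: the dictionary keys are invented names rather than official field identifiers, "XXX" is a placeholder for the country codes (the table gives no example for them), and the bull name is made up.

```python
# Field layout copied from the format 732 table: (name, 1-based start, width).
# The key names are illustrative, not official identifiers.
FORMAT_732 = [
    ("record_type", 1, 3),
    ("breed_of_evaluation", 4, 3),
    ("breed_of_bull", 7, 3),
    ("country_of_registration", 10, 3),
    ("sex", 13, 1),
    ("bull_id", 14, 12),
    ("bull_name", 26, 30),
    ("sending_country", 56, 3),
    ("genotyped", 59, 1),
]
RECORD_LENGTH = 59


def parse_732(line):
    """Split one fixed-width 732 record into a dict of stripped strings."""
    padded = line.rstrip("\r\n").ljust(RECORD_LENGTH)
    record = {
        name: padded[start - 1:start - 1 + width].strip()
        for name, start, width in FORMAT_732
    }
    if record["record_type"] != "732":
        raise ValueError(f"unexpected record type: {record['record_type']!r}")
    if record["genotyped"] not in ("Y", "N"):
        raise ValueError(f"genotyped flag must be Y or N, got {record['genotyped']!r}")
    return record


# Hypothetical record: placeholder country codes and an invented bull name.
example = (
    "732" + "HOL" + "HOL" + "XXX" + "M" + "000000A12345"
    + "EXAMPLE BULL".ljust(30) + "XXX" + "Y"
)
bull = parse_732(example)
```

Checking the record type and the Y/N flag at read time catches misaligned lines early, which is the most common failure with fixed-width files.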

APPENDIX III - FILE FORMAT 731

| Starting position | Field description | Format | Example |
|---|---|---|---|
| 1 | Record type | Character 3 | 731 |
| 4 | Country sending this information | Character 3 | |
| 7 | Breed of evaluation | Character 3 | HOL |
| 11 | Date | Integer 8 | 20100215 |
| 20 | Trait | Character 3 | pr |
| 25 | Model* | Integer 2 | 1 |
| 30 | Mean of dependent variable | Real F15.7 | 12.3456789 |
| 45 | SD of dependent variable | Real F15.7 | 4.9876543 |
| 60 | Type of dependent variable used | Character 3 | DD = daughter deviation; GM = de-regressed predicted genetic merit |
| 63 | Mean of independent variable | Real F15.7 | 11.9876543 |
| 78 | SD of independent variable | Real F15.7 | 4.6549871 |
| 93 | b0 | Real F15.7 | 0.1234567 |
| 108 | Standard error of b0 | Real F15.7 | 0.0000123 |
| 123 | b1 | Real F15.7 | 1.1234567 |
| 138 | Standard error of b1 | Real F15.7 | 0.0000987 |
| 153 | Selection intensity (i) | Real F15.7 | 0.4240000 |
| 168 | Expected value of slope (b1) | Real F15.7 | 0.7700000 |
| 183 | Number of test bulls | Integer 6 | 1500 |
| 198 | R2 of the model | Integer 3 | 99 |
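A record in format 731 can be assembled by writing each value at its starting position. The sketch below uses the example values from the table; the key names, the right-alignment of numeric fields, the left-alignment of character fields and the placeholder country code "XXX" are assumptions made for illustration and are not specified in the table.

```python
# Layout copied from the format 731 table: (name, 1-based start, width, format spec).
# Key names and alignment choices are assumptions, not part of the specification.
FORMAT_731 = [
    ("record_type", 1, 3, "{:<3}"),
    ("country", 4, 3, "{:<3}"),
    ("breed", 7, 3, "{:<3}"),
    ("date", 11, 8, "{:8d}"),
    ("trait", 20, 3, "{:<3}"),
    ("model", 25, 2, "{:2d}"),
    ("mean_dependent", 30, 15, "{:15.7f}"),
    ("sd_dependent", 45, 15, "{:15.7f}"),
    ("dependent_type", 60, 3, "{:<3}"),
    ("mean_independent", 63, 15, "{:15.7f}"),
    ("sd_independent", 78, 15, "{:15.7f}"),
    ("b0", 93, 15, "{:15.7f}"),
    ("se_b0", 108, 15, "{:15.7f}"),
    ("b1", 123, 15, "{:15.7f}"),
    ("se_b1", 138, 15, "{:15.7f}"),
    ("selection_intensity", 153, 15, "{:15.7f}"),
    ("expected_b1", 168, 15, "{:15.7f}"),
    ("n_test_bulls", 183, 6, "{:6d}"),
    ("r2", 198, 3, "{:3d}"),
]


def format_731(values):
    """Build one fixed-width 731 record, padding the gaps between fields with spaces."""
    line = ""
    for name, start, width, spec in FORMAT_731:
        text = spec.format(values[name])
        if len(text) != width:
            raise ValueError(f"{name} does not fit in {width} characters: {text!r}")
        line = line.ljust(start - 1) + text
    return line


# Example values taken from the table; "XXX" is a placeholder country code.
example_values = {
    "record_type": "731", "country": "XXX", "breed": "HOL", "date": 20100215,
    "trait": "pr", "model": 1, "mean_dependent": 12.3456789,
    "sd_dependent": 4.9876543, "dependent_type": "DD",
    "mean_independent": 11.9876543, "sd_independent": 4.6549871,
    "b0": 0.1234567, "se_b0": 0.0000123, "b1": 1.1234567, "se_b1": 0.0000987,
    "selection_intensity": 0.424, "expected_b1": 0.77,
    "n_test_bulls": 1500, "r2": 99,
}
record_line = format_731(example_values)
```

The width check raises an error instead of silently shifting later fields when a value is too long for its column.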