
Program gtconvert.py - User Manual

Information about the program

The program gtconvert.py converts the legacy file formats (fileCxxxf, fileCxxxr, fileDxxxf and fileGxxxr, for xxx in 010, 015-020, 115) into the new trait-independent vertical file formats that will be used for submitting EBVs to the IDEA DB in the near future. The program finds every file{A}xxx{b} file in the specified DATADIR and converts them all, creating four files (file300Cf, file300Df, file300Cr and file300Gr). Each output file holds separate bull proof records for every trait found in the xxx files that match the specified breed of evaluation (BRD) and population/country code (POP). The program also converts the legacy parameter file into a trait info file designed specifically for the gebvtest program, and creates a file of birth dates extracted from fileC010f.
Any of the input files may contain data for more than one breed or population. The input files may carry a SUFFIX, such as ".usa", but in that case all files must have the same suffix.
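
Before running the conversion it can help to check which legacy files are actually present in DATADIR. The following is a minimal sketch of such a check, not part of gtconvert.py itself: the helper names expected_names and find_legacy_files are hypothetical, and the list of names is built only from the file kinds and format codes given in this manual.

```python
import os

# Format codes named in this manual: 010, 115 and 015-020.
FORMATS = ["010", "115"] + ["{0:03d}".format(n) for n in range(15, 21)]
# The four legacy file kinds as (letter, full/reduced flag) pairs.
KINDS = [("C", "f"), ("D", "f"), ("C", "r"), ("G", "r")]


def expected_names(suffix=""):
    """Return every legacy file name described in the manual, with the shared suffix."""
    return ["file{0}{1}{2}{3}".format(a, xxx, b, suffix)
            for a, b in KINDS for xxx in FORMATS]


def find_legacy_files(datadir, suffix=""):
    """Hypothetical helper: list the legacy files that exist in datadir."""
    wanted = set(expected_names(suffix))
    return sorted(name for name in os.listdir(datadir) if name in wanted)
```

For example, find_legacy_files("/rawdata/abc/gebvtest1209/HOL/", ".abc") would list only the files that end in .abc, which makes a missing or inconsistent suffix easy to spot.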

Input files

fileCxxxf - national official genetic evaluations sent by the NGEC as input for the most recent Interbull MACE evaluation (formats: 010, 115, 015, 016, 017, 018, 019, 020)
fileDxxxf - daughter deviation file, including either DD or D_PGM for the same animals included in fileCxxxf (same formats as for fileCxxxf)
fileCxxxr - reduced conventional genetic evaluation file, obtained from conventional genetic evaluations using truncated data (same formats as for fileCxxxf)
fileGxxxr - reduced genomic evaluation file, obtained from genomic evaluations using truncated data (same formats as for fileCxxxf)
parameter file - parameters used in most recent Interbull MACE evaluation - one file may contain all trait groups (Format: parameter file)

Running the Program

Usage notes

The program should be run from within the programs directory. Typing

python gtconvert.py --help

will give a summary of the program usage:

usage: gtconvert.py [-h] [-v] [-s SUFFIX] [-p PARFILE] [-e ENCODING]
                    [-d {DD,GM}] [-y YEAR] [-x {Y,N}] [-o OUTDIR]
                    brd pop datadir

positional arguments:
  brd                   evaluation breed code (BSW/GUE/JER/HOL/RDC/SIM)
  pop                   population code (same as country code except for
                        CHR/DEA/DFS/FRR/FRM)
  datadir               absolute or relative path to data files

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  -s SUFFIX, --suffix SUFFIX
                        suffix to add to all input file names, eg. ".usa" if
                        file names are like fileC010f.usa (default=none)
  -p PARFILE, --parfile PARFILE
                        path+name of input "parameter" file
                        (default=DATADIR/parameterSUFFIX)
  -e ENCODING, --encoding ENCODING
                        input file encoding (default=utf-8; try also
                        iso-8859-1 or other values listed at
                        http://docs.python.org/2/library/codecs.html#standard-
                        encodings)
  -d {DD,GM}, --depvar {DD,GM}
                        type of daughter performance on Df file (default=GM)
  -y YEAR, --year YEAR  minimum birth year for test bulls (default is year of
                        EVALDATE on parameter file less 8 years)
  -x {Y,N}, --type2x {Y,N}
                        inclusion of type 21+22 bulls in test group
                        (default=N)
  -o OUTDIR, --outdir OUTDIR
                        directory for output files (default=DATADIR)

Note that the input parameter file may be in a different directory than the other files or have a different name or suffix, in which case the -p option must be specified.
The program adds defaults for several options to the trait info file it creates. This file may need to be edited, manually or programmatically, if some traits need different options from others.
You may also choose to put the output files from this program into a different directory than the input files. In this case, the specified OUTDIR from this program should be used as the DATADIR for the gebvtest.py program.
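
The defaults listed in the help text can be summarised in a few lines of Python. This is only a sketch of the documented behaviour, not the program's actual code: resolve_defaults is a hypothetical helper, and it assumes the year of EVALDATE has already been read from the parameter file as an integer.

```python
import os


def resolve_defaults(datadir, suffix="", parfile=None, outdir=None,
                     year=None, evaldate_year=None):
    """Hypothetical helper mirroring the defaults documented in the help text."""
    if parfile is None:
        # Documented default: DATADIR/parameterSUFFIX
        parfile = os.path.join(datadir, "parameter" + suffix)
    if outdir is None:
        # Documented default: output files go to DATADIR
        outdir = datadir
    if year is None and evaldate_year is not None:
        # Documented default: year of EVALDATE less 8 years
        year = evaldate_year - 8
    return {"parfile": parfile, "outdir": outdir, "year": year}
```

With datadir "/data", suffix ".usa" and an EVALDATE year of 2013, this yields the parameter file /data/parameter.usa, the output directory /data and a minimum birth year of 2005.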

Warning:

If gtconvert.py crashes with a UnicodeEncodeError, the input most likely contains byte values, usually in bull names, that are not valid in the default utf-8 encoding. Try specifying the option '-e iso-8859-1' or another encoding listed at http://docs.python.org/2/library/codecs.html#standard-encodings. If that fails, you could set the name field to blank in all input files, since the bull name field in the 010 files is no longer used at the Interbull Centre. Also make sure that no other field contains binary data, which can result from uninitialized variables in a Fortran program, for example.
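
To find out which value to pass to -e, an input file can be test-decoded outside the program. The sketch below is a standalone check and not part of gtconvert.py; read_with_fallback is a hypothetical helper that tries each candidate encoding in turn.

```python
def read_with_fallback(path, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first encoding that decodes the whole file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the encodings could decode " + path)
```

Note that iso-8859-1 maps every byte value to a character, so it never raises an error. A successful decode therefore does not prove that the bull names are displayed correctly, only that the file can be read.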

Example of command line

python3.2 gtconvert.py hol abc /rawdata/abc/gebvtest1209/HOL/ -p /abc/parameter.abc -e 'iso-8859-1' -s .abc -o ../data/1302/ABCHOL

In this example

python version 3.2 is used
breed of evaluation is HOL
population being evaluated is ABC
data are read from /rawdata/abc/gebvtest1209/HOL/
the parameter file is read from /abc/parameter.abc
'iso-8859-1' is defined as the character encoding instead of the default 'utf-8'
the suffix .abc is added to the input files
the outputs are written to ../data/1302/ABCHOL

Output files

The following files are written to the DATADIR or to OUTDIR, if specified. All files have a _POPBRD suffix, so that multiple sets of output files for different breeds or populations can co-exist in the same output directory, if desired.

traits - GEBV test options file (Format: traits)
file300Cf - national official genetic evaluations written in trait-independent format (Format: File300)
file300Df - daughter deviation file written in trait-independent format
file300Cr - reduced conventional genetic evaluation file written in trait-independent format
file300Gr - reduced genomic evaluation file written in trait-independent format
file736 - file with birth dates (Format: File736)
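
As a rough illustration of the naming scheme, the output file names for one breed and population could be built as follows. The helper output_names is hypothetical, and it assumes that the _POPBRD suffix is written in upper case and appended directly to each base name; check the files actually produced by the program before relying on this.

```python
def output_names(pop, brd):
    """Hypothetical helper: the six output file names for one POP/BRD pair."""
    # Assumption: suffix is "_" + POP + BRD in upper case, e.g. "_ABCHOL".
    tag = "_{0}{1}".format(pop.upper(), brd.upper())
    bases = ["traits", "file300Cf", "file300Df",
             "file300Cr", "file300Gr", "file736"]
    return [base + tag for base in bases]
```

Under these assumptions, output_names("abc", "hol") starts with traits_ABCHOL and file300Cf_ABCHOL.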

The execution log is written to stdout (i.e. the screen), so you should redirect the output to a file if you would like to save it. An example gtconvert.log is available here.
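
If the program is launched from another Python script rather than from a shell, the log can be saved in a similar way. The sketch below uses a hypothetical helper, run_and_log, which sends the stdout of any command to a file; capturing stderr in the same file is a choice made here so that error messages end up next to the log, not something the manual prescribes.

```python
import subprocess
import sys


def run_and_log(cmd, logpath):
    """Run cmd, write its stdout and stderr to logpath, and return the exit code."""
    with open(logpath, "w") as log:
        return subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)


# Example call (paths taken from the command line example in this manual):
# run_and_log([sys.executable, "gtconvert.py", "hol", "abc",
#              "/rawdata/abc/gebvtest1209/HOL/"], "gtconvert.log")
```

A non-zero return value indicates that the program did not finish normally, in which case the saved log is the first place to look.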