2020-Aug-10

GenoEx - GDE User’s manual v.1.0

part 1 - File Preparation Manual (gxprep.py)

The GenoEx-GDE database allows the exchange of large genomic data files through specific file formats that significantly reduce file size, in particular the 706 format (see 1.1.).

The gxprep.py program, maintained by the Interbull Centre, allows easy transformation of typical laboratory output files into formats ready for upload to the GenoEx-GDE database.

This manual describes the usage of gxprep program showing step by step how to:

1. GenoEx-GDE upload file formats

GenoEx-GDE allows upload of two types of files: 706 (section 1.1) and 711 (section 1.2).

Note: Both input files are prepared from the laboratory output files by the gxprep.py program (section 2). The program also assigns a unique UUID to each genotype, which makes it possible to distinguish between several records of one animal. Both files are delimited by semicolons.

1.1. File 706

This file contains the actual genomic data, as well as the information about the animal, genotyping laboratory and chip used for the genotyping.
Because typical genomic data contain a lot of information, data in this file are coded down to a single digit per SNP and a single record per animal. For this coding to be correct, SNPs must be written in a fixed order within the data stream, according to the SNP order list, in which each SNP is recognized by name and given a position in the data stream. This coding makes it easy to exchange very large files but, because it depends on the order, it is also prone to errors. Therefore, to ensure the highest data quality in the GenoEx-GDE database, we provide a program called gxprep that takes raw laboratory files with your data as input, fetches the correct SNP order list from our servers and produces a correctly ordered 706 file ready to be uploaded to the database. See section 2.

706 file format

| Field Description | Format | Example |
|---|---|---|
| Record type (1) | integer 3 | 706 |
| Breed of animal (4) | character 3 | BSW |
| Country of first registration of animal (2) | character 3 | AUS |
| Sex | character 1 | M |
| ID number of animal (5) | alphanumeric 12 | 000000A12345 |
| Organization sending this information | character | ANAFI |
| UUID (6) | alphanumeric 36 | assigned automatically by the program |
| Genotyping laboratory | character | Weatherbys Ireland |
| Sample ID | alphanumeric | S1234WI2001 |
| Additional | for future reference | |
| Array identifier (7) | integer | 54609 |
| AB – Genotype for SNP Index 1 (9) | integer 1 | 0 |
| AB – Genotype for SNP Index 2 (9) | integer 1 | 1 |
| AB – Genotype for SNP Index … (9) | integer 1 | 2 |
| AB – Genotype for SNP Index n (8,9) | integer 1 | 5 |

Numbers in parentheses refer to the notes below.

  1. Record type is always 706 for this File Format
  2. ISO 3166-1 alpha-3 codes (3 characters, capital letters)
  3. Breed of evaluation (3 characters, capital letters, BSW, GUE, HOL, JER, RDC, SIM)
  4. Breed of animal (3 characters, capital letters)
  5. Alpha-numerical, Interbull standard, always 12 characters long
  6. UUID, one for every uploaded genotype sequence.
  7. Number of SNPs on the SNP chip being reported
  8. n is equal to the value in the field 'Array identifier'
  9. Coded SNP values written as a continuous string.
    Acceptable values depend on the Illumina coded allele values, according to the following:

BB→0
AB→1
AA→2
‘unknown’→5
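
The coding above can be illustrated with a short Python sketch. It is not the gxprep.py implementation: the function names are hypothetical, and it assumes that a call reported as "BA" is treated the same as "AB" and that a SNP missing from the laboratory data is written as 5.

```python
# Illustrative sketch only; names are hypothetical, not taken from gxprep.py.
AB_CODES = {"BB": "0", "AB": "1", "AA": "2"}
UNKNOWN = "5"


def encode_call(allele1, allele2):
    """Code one genotype call (two Illumina AB alleles) as a single digit."""
    # Sorting makes the pair order-independent ("BA" becomes "AB"); assumption.
    pair = "".join(sorted((allele1 + allele2).upper()))
    return AB_CODES.get(pair, UNKNOWN)


def genotype_string(calls, snp_order):
    """Build the continuous genotype string.

    calls     -- dict mapping SNP name to an (allele1, allele2) tuple
    snp_order -- list of SNP names in the order required for the array
    """
    return "".join(
        encode_call(*calls[name]) if name in calls else UNKNOWN
        for name in snp_order
    )


if __name__ == "__main__":
    calls = {"SNP_A": ("A", "A"), "SNP_C": ("B", "B")}
    print(genotype_string(calls, ["SNP_C", "SNP_B", "SNP_A"]))
```

The position of each digit is determined only by the SNP order list, which is why the same list must be used by everyone writing or reading the file.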

706 example

706;BSW;AUS;M;000000A12345;09c98b1e-6af8-4254-9768-58d7cd1ddafd;54609;Weatherbys Ireland;S1234WI2001;;021010…

1.2. File 711

711 file format

| Field Name | Format | Example |
|---|---|---|
| Record Type (1) | alphanumeric 3 | 711 |
| Animal ID (2) - Breed Code (3) | character 3 | BSW |
| Animal ID - Nation Code (4) | character 3 (with the exception of 840) | AUS |
| Animal ID - Sex Code | character 1 | M |
| Animal ID - Registration | alphanumeric 12 | 000000A12345 |
| UUID (5) | alphanumeric 36 | assigned automatically by the program |
| Shareable with organization(s) (6) | character, repeatable | BFRO, ITBC |

Numbers in parentheses refer to the notes below.

  1. Record type is always 711 for this File Format
  2. Please see Interbull Bulletin 28. Any given animal can appear in only one row of a file.
  3. Breed of animal (3 characters, capital letters)
  4. ISO 3166-1 alpha-3 codes (3 characters)
  5. UUID, used as reference to every uploaded genotype sequence in the 706 file.
  6. Comma-separated list of zero or more organizations that should be allowed to download the associated genotype

711 example

711;BSW;AUS;M;000000A12345;09c98b1e-6af8-4254-9768-58d7cd1ddafd;54609;ITBC,BFRO
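
As a minimal sketch of how such a record can be read, the Python function below extracts the list of organizations from a 711 line. The function name is hypothetical, and the sketch assumes only what the description above states: fields are separated by semicolons and the sharing list is the last field, with organizations separated by commas.

```python
# Illustrative sketch only; the function name is hypothetical.
def sharing_list(record_711):
    """Return the organizations listed in the last field of a 711 record."""
    fields = record_711.rstrip("\r\n").split(";")
    if fields[0] != "711":
        raise ValueError("not a 711 record: %r" % fields[0])
    # An empty last field means the genotype is shared with nobody yet.
    return [org.strip() for org in fields[-1].split(",") if org.strip()]


if __name__ == "__main__":
    line = "711;BSW;AUS;M;000000A12345;09c98b1e-6af8-4254-9768-58d7cd1ddafd;54609;ITBC,BFRO"
    print(sharing_list(line))
```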

2. Upload preparation program – gxprep.py

gxprep.py, maintained and distributed by Interbull Centre, prepares a set of files: 706 and 711, ready to be uploaded to GenoEx-GDE database.

2.1. gxprep functions

The program has four main commands: parse, sharing, show and zip. parse should always be run first, because the other commands run on the output files produced by this command. sharing and show can be run several times, allowing gradual fine adjustment of the sharing permissions. zip is to be run at the end of the preparation process, when both files are ready for upload.

parse

reads the input files (see section 2.2) and produces the 706 file and an initial 711 file containing all the animal IDs and corresponding UUIDs, but with only the default sharing permissions (if set) assigned. Most standard laboratory output files can be used as input unmodified (see 2.2), but the User has to provide additional information when running this command. Note: All these values can also be provided as defaults by specifying them in an initialization file (see 2.3.).

parse example

python gxprep.py parse -a 7931 -l CIGENE -s SampleMap.txt -C ~/gde/gxprep/cachedir -i "1 2 3 4" Iexample.txt

This command first retrieves the SNP order file from the Interbull server and saves it to the ~/gde/gxprep/cachedir folder. Then it parses the Iexample.txt file, retrieving animal IDs from the SampleMap.txt file (see section 2.2 for a description of the input files). In Iexample.txt, it looks for the SNP name in column 1, the sample ID in column 2, allele 1 in column 3 and allele 2 in column 4. Allele 1 and allele 2 are then coded to one digit according to the formula BB→0, AB→1, AA→2, ‘unknown’→5 and placed in the genotype string according to the SNP order. All samples listed in this file get CIGENE as laboratory. Also, each newly created record is assigned a UUID, and the same identifier, along with the corresponding animal ID, is listed in the 711 file. If you set up any defaults for sharing (see section 2.3), they will also be used in the newly created 711 file; otherwise the last column in this file will remain empty.
The files created in this example will be named Iexample-706.csv and Iexample-711.csv
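
The link between the two output files can be sketched in Python as follows. This is an illustration and not the gxprep.py code: the function names are hypothetical, and the use of random (version 4) UUIDs is an assumption. The output names follow the naming shown in the example above.

```python
# Illustrative sketch only; names are hypothetical, not taken from gxprep.py.
import uuid
from pathlib import Path


def new_genotype_uuid():
    """Return a fresh 36-character UUID string (version 4 is an assumption)."""
    return str(uuid.uuid4())


def output_names(lab_file):
    """Derive the 706 and 711 file names from the laboratory file name."""
    stem = Path(lab_file).stem
    return f"{stem}-706.csv", f"{stem}-711.csv"


def link_records(animal_ids):
    """Pair every genotype with its own UUID.

    The same UUID would be written to the 706 record and to the matching
    711 record, which is what ties sharing permissions to a genotype.
    """
    return [(animal_id, new_genotype_uuid()) for animal_id in animal_ids]


if __name__ == "__main__":
    print(output_names("Iexample.txt"))
    for animal_id, genotype_uuid in link_records(["BSWAUSM000000A12345"]):
        print(animal_id, genotype_uuid)
```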

sharing

is used to add or remove organizations from the list of organizations allowed to download a given genotype. This command operates only on the 711 file and thus ignores the 706 file, if present.

Note: A 711 file newly created by the parse command will normally have the last column (‘Shareable with organizations’) empty, unless the defaults specify otherwise (see 2.3.). Therefore, in most cases it is necessary to run sharing to create the list of organizations each record can be shared with. This can be done by assigning the same permissions to all the data within the file, by adding them according to a pattern defined by breed, sex or country of origin, or by providing a list of specific animals whose sharing permissions should be changed.

This command takes the following arguments:

sharing examples

This command adds sharing permissions for ANAFI and BFRO to all the HOL males originating from France in Iexample-711.csv file

This command adds sharing permissions for ITBC and removes it for ANAFI for all the genotypes in Iexample-711.csv file according to the animal ID list in the file aidlist.txt
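
The selection logic described in this section can be sketched in Python as below. This is not the gxprep.py command line or its code: the function, its parameters and the dictionary keys are hypothetical, and records are assumed to have been read from the 711 file into dictionaries beforehand.

```python
# Illustrative sketch only; function, parameters and keys are hypothetical.
def update_sharing(records, add=(), remove=(), breed=None, country=None,
                   sex=None, animal_ids=None):
    """Add and/or remove organizations on every record matching the filters.

    A filter left as None matches every record. Returns the number of
    records that matched.
    """
    changed = 0
    for rec in records:
        if breed is not None and rec["breed"] != breed:
            continue
        if country is not None and rec["country"] != country:
            continue
        if sex is not None and rec["sex"] != sex:
            continue
        if animal_ids is not None and rec["animal_id"] not in animal_ids:
            continue
        # Drop removed organizations first, then append new ones once each.
        orgs = [org for org in rec["sharing"] if org not in remove]
        for org in add:
            if org not in orgs:
                orgs.append(org)
        rec["sharing"] = orgs
        changed += 1
    return changed


if __name__ == "__main__":
    records = [
        {"animal_id": "HOLFRAM000000000001", "breed": "HOL",
         "country": "FRA", "sex": "M", "sharing": []},
    ]
    update_sharing(records, add=("ANAFI", "BFRO"),
                   breed="HOL", country="FRA", sex="M")
    print(records[0]["sharing"])
```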

show

gives an overview of the sharing patterns in a given file, taking the stem of the input files as its only argument.

show example

This command shows the summary of the sharing settings, giving an output like this:

Content of sharing intermediate file Iexample-711.csv:
11 genotypes (all female) shared with ITBC
9 genotypes (all male) shared with BFRO, ITBC
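
A summary of this kind can be produced by grouping records by their sharing list, as in the Python sketch below. It is an illustration only: the function name and dictionary keys are hypothetical, the wording "mixed sex" and "nobody" is invented for cases not shown in the output above, and the real show command may group records differently.

```python
# Illustrative sketch only; names and the extra wording are hypothetical.
from collections import defaultdict


def summarize(records):
    """Return one summary line per distinct sharing list."""
    groups = defaultdict(list)
    for rec in records:
        # Sort so that "ITBC,BFRO" and "BFRO,ITBC" fall into the same group.
        groups[tuple(sorted(rec["sharing"]))].append(rec["sex"])
    lines = []
    for orgs, sexes in sorted(groups.items()):
        if set(sexes) == {"F"}:
            sex_note = "all female"
        elif set(sexes) == {"M"}:
            sex_note = "all male"
        else:
            sex_note = "mixed sex"
        target = ", ".join(orgs) if orgs else "nobody"
        lines.append(f"{len(sexes)} genotypes ({sex_note}) shared with {target}")
    return lines


if __name__ == "__main__":
    demo = [{"sex": "F", "sharing": ["ITBC"]}] * 2
    print("\n".join(summarize(demo)))
```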

zip

zip example

After running this command both 706 and 711 files will be zipped in two separate files ready for the upload to GenoEx-GDE database.
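
For readers who want to see what "two separate files" means in practice, the Python sketch below compresses each CSV into its own archive with the standard zipfile module. The function name and the archive names (the CSV name with a .zip extension) are assumptions; the names produced by gxprep.py may differ.

```python
# Illustrative sketch only; archive naming is an assumption.
import zipfile
from pathlib import Path


def zip_separately(stem, directory="."):
    """Zip <stem>-706.csv and <stem>-711.csv into two separate archives."""
    archives = []
    for suffix in ("706", "711"):
        src = Path(directory) / f"{stem}-{suffix}.csv"
        dst = src.with_suffix(".zip")
        with zipfile.ZipFile(dst, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            # Store only the file name, not the directory path.
            zf.write(src, arcname=src.name)
        archives.append(dst)
    return archives
```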

2.2. gxprep input files

gxprep is designed to accept most typical laboratory output files and convert them to a 706 data file, adding an initial version of the 711 file for setting data sharing permissions. Since both the 706 and 711 files are based on animal IDs, whereas laboratory output files use sample IDs, the User also needs to provide a reference file mapping each sample ID to the corresponding animal ID.

The laboratory output file and ID reference file are expected to follow the formats as described below:

laboratory file

This file contains actual data as received from the laboratory, with Sample ID as a key. In the examples above this file is named Iexample.txt

[Header]
optional; general information regarding analysis, chip and number of samples

[Data]
Please make sure that this line is present, whether or not the header is included. [Data] marks the place where reading of the data starts.

| Field Name | Description | Allowed Values |
|---|---|---|
| SNP name | Alphanumeric | SNP name in CAPITALS e.g. ARS-BFGL-NGS-64740 |
| Sample ID | Alphanumeric | Laboratory sample ID, has to correspond to animal ID in key file |
| All1 | Alphabetic | 1 character code A or B according to Illumina AB coding |
| All2 | Alphabetic | 1 character code A or B according to Illumina AB coding |

Note: The above columns are required for GenoEx. The laboratory file can, however, contain additional columns, or columns in a different order. As described in section 2.1 under parse, the User can specify which columns contain the required information.
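
The reading rule described above (skip everything up to [Data], then pick the required columns by position) can be sketched in Python as follows. This is not the gxprep.py parser: the function name is hypothetical, and the sketch assumes a TAB-delimited file with one row of column titles directly after [Data]. Column numbers are 1-based, in the order SNP name, sample ID, allele 1, allele 2, as in the parse example.

```python
# Illustrative sketch only; delimiter and title row are assumptions.
def read_lab_rows(lines, columns=(1, 2, 3, 4), delimiter="\t",
                  has_column_titles=True):
    """Yield (snp_name, sample_id, allele1, allele2) tuples.

    lines   -- any iterable of text lines, for example an open file
    columns -- 1-based positions of the four required columns
    """
    in_data = False
    skip_titles = has_column_titles
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not in_data:
            # Everything before the [Data] marker is ignored.
            in_data = line.strip() == "[Data]"
            continue
        if not line.strip():
            continue
        if skip_titles:
            skip_titles = False
            continue
        cells = line.split(delimiter)
        yield tuple(cells[c - 1] for c in columns)


if __name__ == "__main__":
    demo = [
        "[Header]",
        "Num Samples\t1",
        "[Data]",
        "SNP Name\tSample ID\tAllele1 - AB\tAllele2 - AB",
        "ARS-BFGL-NGS-64740\tS1234WI2001\tA\tB",
    ]
    for row in read_lab_rows(demo):
        print(row)
```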

ID reference file

This file contains the key to identify which Sample ID belongs to which animal. In the examples above this file is named SampleMap.txt
Note: The only delimiter allowed in this reference file is TAB.

| Field Name | Description | Allowed Values |
|---|---|---|
| Sample ID | Alphanumeric | Laboratory sample ID, has to correspond to Sample ID in genotyping file |
| Animal ID | Alphanumeric | International Interbull ID* |

* International Interbull Animal ID consists of 19 characters as follows:

3 characters – breed code (capitals, according to ICAR breed coding),
3 characters – country code (capitals, according to Interbull country coding),
1 character – sex code (capital M or F),
12 characters – registration ID (alphanumerical), as in the ID fields described in sections 1.1 and 1.2.
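
A minimal Python sketch of reading the ID reference file and splitting an animal ID into its parts is given below. The function names are hypothetical, and the sketch assumes the file has no title row and that the first two TAB-separated columns are sample ID and animal ID, in that order.

```python
# Illustrative sketch only; names and column order are assumptions.
def read_sample_map(lines):
    """Return a dict mapping sample ID to animal ID from TAB-delimited lines."""
    mapping = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        # Raises ValueError if a line has fewer than two columns.
        sample_id, animal_id = line.split("\t")[:2]
        mapping[sample_id.strip()] = animal_id.strip()
    return mapping


def split_animal_id(animal_id):
    """Split an Interbull animal ID into breed, country, sex and registration."""
    return {
        "breed": animal_id[:3],
        "country": animal_id[3:6],
        "sex": animal_id[6:7],
        "registration": animal_id[7:],
    }


if __name__ == "__main__":
    sample_map = read_sample_map(["S1234WI2001\tBSWAUSM000000A12345\n"])
    print(split_animal_id(sample_map["S1234WI2001"]))
```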

2.3. gxprep default settings

If the User always uses the same array, the same laboratory, the same columns in the laboratory file, or always shares all the data with the same list of organizations, they may want to preset these values as defaults. This can be done by editing the gxprep.ini file. Depending on your local settings, this file is located in the current directory and/or in the user’s home directory; the .gxpreprc file in the user’s home directory is also used.

Recognized configuration options are the following:

In the example above, if not specified otherwise in the command line:

3. gxprep "Tips and Tricks"

All currently allowed values can be viewed via the GenoEx home page, but they can also be listed directly in the terminal using gxprep.py.
Below is a list of commands to retrieve specific lists:

list of supported arrays

list of supported labs

list of supported organizations

list of supported country codes

list of supported breed codes

download specific SNP order file - Windows

download specific SNP order file - Linux

Note: Trailing arguments ‘xxxx xxxx’ can be replaced by any other nonsense words at least 2 characters long.

public/GDE_gxprep_manual (last edited 2020-08-31 15:35:36 by JoannaSendecka)