Michael Mooney, PhD
PolyGA: User Manual

Table of Contents

I. Installation
A. Dependencies
B. PolyGA
II. Quick Start Guide
A. Using the supplied example scripts and data
III. Input
A. PolyGA parameters
1. Command-line parameters
2. Configuration file
B. Genome annotations
C. R script
IV. Output
V. Parallelization
A. Parallel Python
B. R packages for parallel processing


I. Installation

A. Dependencies

The PolyGA program requires a number of Python modules and the R statistical environment to be installed:

1. Argparse
2. PyTables
3. NetworkX
4. Rpy2 (and R)
5. Parallel Python
6. Matplotlib

Note: PolyGA has been developed in Linux and Mac OS X environments. Installing the dependencies is straightforward on Ubuntu Linux, since Ubuntu packages exist for all of them. It should be possible to run PolyGA on Windows, but this has not been tested. In the future I hope to provide executables (using a tool such as cx_Freeze) that will simplify the installation procedure.
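
Before installing PolyGA it can be useful to check which of the Python modules listed above are already available. The following is a minimal sketch written for Python 3, not part of the PolyGA distribution. The import names ("tables" for PyTables, "pp" for Parallel Python) are the names these packages are normally imported under and are assumptions here, as is the helper name missing_modules. The check does not cover R itself.

```python
import importlib.util

# Import names for the dependencies listed above. These are the names the
# packages are normally imported under (PyTables as "tables", Parallel
# Python as "pp"); treat them as assumptions if your installation differs.
REQUIRED_MODULES = ["argparse", "tables", "networkx", "rpy2", "pp", "matplotlib"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required Python modules were found.")
```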

A-1. Argparse
The argparse Python module is included with Python >= 2.7 and Python >= 3.2 (for earlier versions it is available as a separate module).

Argparse documentation: http://docs.python.org/2.7/library/argparse.html

A-2. PyTables
PyTables is a Python module that allows for the creation and manipulation of HDF5 files. The Hierarchical Data Format (HDF) is a set of libraries and file formats designed to efficiently store and organize large data sets. To install PyTables from source you must have Python (>= 2.4), the HDF5 library (>= 1.8.4), and the NumPy (>= 1.4.1) and Numexpr (>= 1.4.1) modules installed. You may also need Cython (>= 0.13). Precompiled binaries for Windows are available, which include the DLLs for HDF5 (you will still need the prerequisite Python, NumPy and Numexpr installed). For more detailed installation instructions see the following links:

PyTables: http://www.pytables.org/moin/PyTables
PyTables documentation: http://pytables.github.com/usersguide/
PyTables downloads: http://sourceforge.net/projects/pytables/files/pytables/
HDF5: http://www.hdfgroup.org/HDF5/

A-3. NetworkX
The NetworkX Python module provides data structures and algorithms for large-scale graph analysis.

NetworkX documentation and downloads: http://networkx.github.io/
NetworkX installation instructions: http://networkx.github.io/documentation/latest/install.html

A-4. Rpy2
The Rpy2 module allows a Python program to communicate with the R statistical environment. Using Rpy2 gives considerable flexibility in the types of statistical analyses that can be performed in conjunction with the feature selection procedure of the PolyGA program. However, it would not be difficult to modify the Python code so that the statistical analyses are done directly within Python.

Rpy2 documentation: http://rpy.sourceforge.net/rpy2.html
Rpy2 downloads: http://sourceforge.net/projects/rpy/files/rpy2/
R project: http://www.r-project.org/

A-5. Parallel Python
Parallel Python (PP) is a module that allows for parallel execution of Python code on both multi-core computers and clusters.

PP documentation: http://www.parallelpython.com/

A-6. Matplotlib
Matplotlib is a Python plotting library.

Matplotlib documentation: http://matplotlib.org/

B. PolyGA

To install the PolyGA program, simply download the program archive and extract the contents. The archive contains the main program and required modules. It also includes example input files, an example R script and example simulated genotype data. The main program is named "polyga.py". The modules "polyga_utils.py" and "polyga_core.py" must be kept in the same directory as the main program. See the Quick Start Guide for instructions about how to run an analysis using the supplied example data.

tar -xzvf polyga-1.1b.tar.gz
cd polyga-1.1b

II. Quick Start Guide

A. Using the supplied example scripts and data

View the Quick Start Guide here.

III. Input

A. PolyGA parameters

1. Command-line parameters

Use the -h option to view all the command-line options available for the PolyGA program and their appropriate usage.

python polyga.py -h

A description of each option is below:

-h      Displays the help message
-a      Specifies the level of analysis. The options are 'gene' and 'feature' (default).
-c      Specifies the path to the program configuration file.
-g      Specifies the path to a tab-delimited gene annotation file.
-i      Specifies the path to a tab-delimited gene-gene interaction file.
-f      Specifies the path to a tab-delimited gene-feature map file.
-d      Specifies the path of an HDF5 data file containing the gene, interaction and feature info.
-o      Specifies the prefix for the output files.
-r      Specifies the path to a results file. This option is used to print and plot results.
-t      Specifies a fitness threshold for printing significant results. Requires the -r option.
-p      A performance plot (-log(fitness) vs. generation) will be created. Requires the -r option.
-cyto   Files that can be input into Cytoscape will be created. This option requires the -r option,
        and either a fitness threshold specified with the -t option, or two integers indicating the
        generation and group number of a specific feature set in the results file.
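
The dependency between options (-t and -p are only meaningful together with -r) can be sketched with argparse. This is an illustrative sketch covering a subset of the options, not the actual PolyGA source. The function names build_parser and parse_args, the metavar labels and the error message are assumptions.

```python
import argparse


def build_parser():
    """Declare a subset of the options described above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="polyga.py")
    parser.add_argument("-a", choices=["gene", "feature"], default="feature",
                        help="level of analysis")
    parser.add_argument("-c", metavar="CONFIG", help="configuration file")
    parser.add_argument("-r", metavar="RESULTS", help="results file")
    parser.add_argument("-t", type=float, metavar="THRESHOLD",
                        help="fitness threshold (requires -r)")
    parser.add_argument("-p", action="store_true",
                        help="create a performance plot (requires -r)")
    return parser


def parse_args(argv):
    """Parse argv and enforce that -t and -p are only used with -r."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if (args.t is not None or args.p) and args.r is None:
        # parser.error prints a usage message and exits with status 2.
        parser.error("-t and -p require a results file given with -r")
    return args
```

For example, parse_args(["-r", "test_out.h5", "-t", "0.05"]) succeeds, while parse_args(["-p"]) exits with a usage error.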

2. Configuration file

The configuration file contains all the required algorithm parameters, including the GA parameters (population size, mutation probability, etc.), the R script used for calculating the association statistics, and parameters for parallelizing the statistical calculations. The configuration file is a simple text file where each line contains a variable=value statement. Empty lines and lines beginning with a hash symbol (#) will be ignored. The configuration file should contain entries for the following parameters (an example configuration file 'polyga_GenABEL.conf' is provided in the program archive).

# Example parameters
pop_size=80
generations=50
min_group_size=2
max_group_size=2
connected=False
select_type=hybrid
hybrid_top=2
migrants=4
cross_type=uniform
cross_prob=0.4
mut_prob=0.4
elite_num=2
node_weight=0.75
edge_weight=0.25
R_script=polyga_Rscript_GenABEL.r
generation_restart=-1
fitness_restart=-1
nprocs=1
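
The file format described above (variable=value statements, with empty lines and lines beginning with # ignored) could be read along the following lines. This is a minimal sketch, not PolyGA's own parser. The function name parse_config is hypothetical, and all values are kept as strings rather than converted to their documented types.

```python
def parse_config(text):
    """Parse variable=value lines, skipping blank lines and '#' comments."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected variable=value, got: %r" % raw_line)
        params[name.strip()] = value.strip()
    return params


example = """
# Example parameters
pop_size=80
connected=False
R_script=polyga_Rscript_GenABEL.r
"""
config = parse_config(example)
print(config["pop_size"])  # values are kept as strings in this sketch
```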

An explanation of each parameter is below:

pop_size             [integer] The GA population size
generations          [integer] The number of generations the GA will run
min_group_size       [integer] The minimum number of features (or genes) allowed in each group
max_group_size       [integer] The maximum number of features (or genes) allowed in each group
connected            [True | False] Require the feature set to be connected in the network?
select_type          [truncate | roulette | hybrid] The GA selection method
hybrid_top           [integer] The number of fittest individuals selected with the hybrid method
migrants             [integer] The number of migrants (new individuals) created at each generation
cross_type           [uniform] Crossover method (only uniform supported at this time)
cross_prob           [float] The crossover probability
mut_prob             [float] The mutation probability
elite_num            [integer] The number of individuals that bypass mutation at each generation
node_weight          [float] The weight applied to the gene score when stepping through the network
edge_weight          [float] The weight applied to the interaction score (must equal 1.0 - node_weight)
R_script             [path] A path to the R script that will be used for statistical calculations
generation_restart   [integer] The search will restart if this number of generations passes with no
                     fitness improvement
fitness_restart      [float] The search will restart after this fitness value is reached
nprocs               [integer] The number of parallel processes to start
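
Some of the constraints in the table above can be checked before starting a run. The sketch below assumes the parameters are available as a dictionary of strings. The function name check_config is hypothetical. The range check on the probabilities and the check that min_group_size does not exceed max_group_size are assumptions, not documented requirements.

```python
def check_config(params):
    """Return a list of problems found in a dict of string parameters."""
    problems = []
    node_w = float(params["node_weight"])
    edge_w = float(params["edge_weight"])
    # Documented constraint: edge_weight must equal 1.0 - node_weight.
    if abs(node_w + edge_w - 1.0) > 1e-9:
        problems.append("edge_weight must equal 1.0 - node_weight")
    # Documented choices for the selection and crossover methods.
    if params["select_type"] not in ("truncate", "roulette", "hybrid"):
        problems.append("select_type must be truncate, roulette or hybrid")
    if params["cross_type"] != "uniform":
        problems.append("cross_type must be uniform")
    # Assumption: probabilities are expected to lie in [0, 1].
    for name in ("cross_prob", "mut_prob"):
        if not 0.0 <= float(params[name]) <= 1.0:
            problems.append(name + " must be between 0 and 1")
    # Assumption: the minimum group size should not exceed the maximum.
    if int(params["min_group_size"]) > int(params["max_group_size"]):
        problems.append("min_group_size must not exceed max_group_size")
    return problems
```

An empty list means that none of these checks failed. It does not guarantee that the configuration is otherwise valid.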

B. Genome annotations

The gene annotation file (specified with the -g option) should contain two tab-delimited columns. The first column is the gene ID and the second is a score indicating the likelihood that the gene affects the trait being studied. An example is below:

gene   score
TP53   1
ACT    1
CD44   50
NF2    100
ERBB2  100

The gene interaction file should contain three columns. The first two columns contain the IDs of two interacting genes, and the third column contains a score indicating the confidence in that interaction. An example is below:

gene1  gene2  score
TP53   ACT    110
CD44   NF2    88
CD44   ERBB2  100
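
To illustrate how the gene and interaction files fit together, the sketch below reads both into plain dictionaries. It is not how PolyGA itself loads the data. It assumes that each file starts with a header row as in the examples above, that interactions are undirected, and that every gene in the interaction file also appears in the gene file. The function names are hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Yield the data rows of a tab-delimited file, skipping the header."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # assumption: the first line is a header row
    for row in reader:
        if row:
            yield row


def build_network(gene_file, interaction_file):
    """Return (gene_scores, adjacency) built from the two annotation files."""
    gene_scores = {gene: float(score) for gene, score in read_rows(gene_file)}
    adjacency = {gene: {} for gene in gene_scores}
    for gene1, gene2, score in read_rows(interaction_file):
        if gene1 not in gene_scores or gene2 not in gene_scores:
            raise ValueError("interaction uses unknown gene: %s-%s" % (gene1, gene2))
        # Assumption: interactions are undirected, so store both directions.
        adjacency[gene1][gene2] = float(score)
        adjacency[gene2][gene1] = float(score)
    return gene_scores, adjacency


genes = io.StringIO("gene\tscore\nTP53\t1\nACT\t1\nCD44\t50\nNF2\t100\nERBB2\t100\n")
interactions = io.StringIO(
    "gene1\tgene2\tscore\nTP53\tACT\t110\nCD44\tNF2\t88\nCD44\tERBB2\t100\n"
)
gene_scores, adjacency = build_network(genes, interactions)
print(sorted(adjacency["CD44"]))  # neighbours of CD44: ERBB2 and NF2
```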

The gene-feature map contains three columns. The first column is the feature ID (in the example below, the features are SNPs), the second column is the gene ID to which that feature is mapped, and the third column is a score for the feature.

snp     gene   score
rs1235  CD44   4
rs1234  ERBB2  1
rs1111  NF2    2
rs1236  TP53   1
rs1121  ACT    2
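
The gene-feature map can be read in a similar way, grouping the features by the gene they are mapped to. As before, this is a sketch under the assumption that the file is tab-delimited with a header row. The function name read_feature_map is hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_feature_map(handle):
    """Group (feature, score) pairs by gene from a tab-delimited map file."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # assumption: the first line is a header row
    features_by_gene = defaultdict(list)
    for row in reader:
        if not row:
            continue
        feature, gene, score = row
        features_by_gene[gene].append((feature, float(score)))
    return dict(features_by_gene)


example = io.StringIO(
    "snp\tgene\tscore\n"
    "rs1235\tCD44\t4\n"
    "rs1234\tERBB2\t1\n"
    "rs1111\tNF2\t2\n"
)
feature_map = read_feature_map(example)
print(feature_map["CD44"])
```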

C. R script

An R script containing two functions is required for calculating the fitness values for each candidate feature set selected by the PolyGA algorithm. The R script will be sourced when the PolyGA program is started, and the fitness function will be called at each generation. Finally, a clean-up function will be called when PolyGA is finished running.

The R script must contain two functions named "r_get_fitness" and "shutdown_r", but it may also contain commands for loading the genetic data, etc. The "r_get_fitness" function simply takes a list of feature sets as a parameter and returns a list of fitness values. The "shutdown_r" function should contain any commands needed to quit R cleanly. For example, if the R calculations are parallelized (see section V below) you may need to ensure that all parallel processes are stopped before quitting R. This function definition is required, but it can be empty. An example R script 'polyga_Rscript_GenABEL.r' is provided in the program archive.

## R script template

r_get_fitness = function(groups) {
   # This function takes a list of lists of features
   # and returns a vector of fitness values (one per group)
   fit_vals = c()
   for (i in seq_along(groups)) {
      # calculate the fitness for groups[[i]] here;
      # NA is only a placeholder in this template
      fitness = NA
      fit_vals = c(fit_vals, fitness)
   }
   return(fit_vals)
}

shutdown_r = function() {
   # This function simply performs any commands needed
   # before stopping the R process
}

IV. Output

To view the results, use the -r option to specify the results file and the -t option to specify the fitness threshold. Any feature sets with fitness values less than or equal to the specified threshold will be printed to a tab-delimited file. The file created will have the same prefix as the results file and '.assoc' as the extension. If the command below is run multiple times (e.g., with different fitness thresholds), numbers will be added to the file prefix (e.g., test_out.1.assoc, test_out.2.assoc, etc.).

python polyga.py -r test_out.h5 -t 0.05
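
The thresholding and file naming described above can be sketched as follows. This is not PolyGA's implementation. The representation of results as (features, fitness) pairs is hypothetical, and the sketch assumes that the first output file carries no number (e.g., test_out.assoc) and that numbering starts at 1 afterwards.

```python
import os


def select_significant(results, threshold):
    """Keep feature sets whose fitness is less than or equal to threshold."""
    return [(features, fitness) for features, fitness in results
            if fitness <= threshold]


def next_assoc_path(results_path, exists=os.path.exists):
    """Pick an unused '.assoc' file name that shares the results file prefix."""
    prefix = os.path.splitext(results_path)[0]
    candidate = prefix + ".assoc"
    counter = 1
    # Assumption: the first file is un-numbered, later ones get .1, .2, ...
    while exists(candidate):
        candidate = "%s.%d.assoc" % (prefix, counter)
        counter += 1
    return candidate
```

The exists argument defaults to os.path.exists and is only a parameter so that the naming logic can be tried without touching the file system.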

Including the -p option will output a PDF plot of the GA performance (-log(fitness) vs. generation), in addition to the tab-delimited results file.

python polyga.py -r test_out.h5 -t 0.05 -p
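
The quantity on the vertical axis of such a plot can be computed as in the sketch below. Several details are assumptions, since they are not stated here: the logarithm is taken to base 10, lower fitness values are treated as better (consistent with the threshold rule above), and the curve tracks the best fitness seen up to each generation. Fitness values must be greater than zero.

```python
import math


def performance_curve(best_fitness_by_generation):
    """Return -log10(fitness) for the best fitness seen up to each generation."""
    curve = []
    best_so_far = float("inf")
    for fitness in best_fitness_by_generation:
        # Assumption: lower fitness is better, so keep the running minimum.
        best_so_far = min(best_so_far, fitness)
        curve.append(-math.log10(best_so_far))
    return curve
```

The resulting list can be plotted against the generation number with any plotting library.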

The -cyto option will output files that can be used as input for Cytoscape, so that the discovered polygenic associations can be visualized within a network context. The -cyto option requires the results file to be specified with the -r option and either a fitness threshold (specified with the -t option) or two integers that specify the generation and group number of a specific feature set in the results file.

python polyga.py -cyto -r test_out.h5 -t 0.05

python polyga.py -cyto -r test_out.h5 100 9

V. Parallelization

A. Parallel Python

If the 'nprocs' option in the PolyGA configuration file is set to a value greater than 1, the Parallel Python (PP) module will be used to parallelize the GA. This can significantly improve runtime, especially when the algorithm is searching within large networks. PP can take advantage of multiple cores on a single machine and can also be used on clusters (although the setup for clusters is slightly more involved).

PP documentation: http://www.parallelpython.com/
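
The general idea of dividing per-individual work among 'nprocs' workers can be sketched with the Python standard library. Note that this sketch does not use Parallel Python: it substitutes a thread pool from concurrent.futures as a stand-in so that the example is self-contained, and threads do not give true parallelism for CPU-bound Python code. The function names and the chunking scheme are assumptions, not a description of how PolyGA distributes its work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, nprocs):
    """Split items into at most nprocs contiguous chunks of near-equal size."""
    nprocs = max(1, min(nprocs, len(items)))
    size, extra = divmod(len(items), nprocs)
    chunks, start = [], 0
    for i in range(nprocs):
        # The first `extra` chunks receive one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def map_in_parallel(items, work_func, nprocs):
    """Apply work_func to every item, one worker per chunk, keeping order."""
    chunks = split_into_chunks(items, nprocs)

    def run_chunk(chunk):
        return [work_func(item) for item in chunk]

    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        chunk_results = list(pool.map(run_chunk, chunks))
    return [value for chunk in chunk_results for value in chunk]
```

For example, map_in_parallel(groups, len, 2) returns the size of each group in the original order.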

B. R packages for parallel processing

Because the most computationally intensive part of the search algorithm is the fitness calculation (the statistical test of association), it is usually a good idea to parallelize these calculations. The SNOW and Rmpi R packages, among others, can be used to parallelize R calculations.

SNOW: http://cran.r-project.org/web/packages/snow/index.html
Rmpi: http://cran.r-project.org/web/packages/Rmpi/index.html
