wgsim.cr
Reimplement wgsim in Crystal and add extra features.
:yarn: :black_cat: Please note that this project was created for personal study and experimentation and is not currently intended for practical use.
mut
: Adding mutations to the reference genome
- SNPs
- Insertion (any length)
- Deletion (any length)
- Fasta Output
seq
: Simulation of short read sequencing
- Uniform substitution sequencing errors
- Fastq Output
gen
: Generate a random genome
- Random genome generation
- Fasta Output
Installation
Compiling from source code
git clone https://github.com/kojix2/wgsim.cr
cd wgsim.cr
shards build --release -Dpreview_mt src/wgsim.cr
Homebrew
brew install kojix2/brew/wgsim
Usage
Program: wgsim (Crystal implementation of wgsim)
Version: 0.0.4.alpha
Source: https://github.com/kojix2/wgsim.cr
mut Add mutations to reference sequences
seq Simulate pair-end sequencing
gen Generate random reference fasta
--debug Show backtrace on error
-v, --version Show version
-h, --help Show this help
About: Add mutations to reference sequences
Usage: wgsim mut [options] -f <in.ref.fa>
-f, --file FILE Input file for the reference sequence (required)
-o, --output FILE Output file for the mutated sequence (required)
-m, --mutation FILE Output file for the mutations (required)
-s, --sub-rate FLOAT Rate of base substitutions [0.001]
-i, --ins-rate FLOAT Rate of insertions [0.0001]
-d, --del-rate FLOAT Rate of deletions [0.0001]
-I, --ins-ext-prob FLOAT Probability an insertion is extended [0.3]
-D, --del-ext-prob FLOAT Probability a deletion is extended [0.3]
-p, --ploidy UINT8 Number of chromosome copies in output fasta [2]
-S, --seed UINT64 Seed for random generator
--debug Show backtrace on error
-h, --help Show this help
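The `--ins-ext-prob` and `--del-ext-prob` options above suggest that indel lengths follow a geometric distribution: an indel starts at length 1 and keeps growing while a random draw falls below the extension probability. The following Python sketch illustrates that reading. It is an assumption about how the options behave, not a transcription of the Crystal source, and the function names are hypothetical.

```python
import random


def sample_indel_length(ext_prob: float, rng: random.Random) -> int:
    """Draw an indel length: start at 1, extend while a random draw succeeds.

    Assumption: each extension is an independent trial with probability
    ext_prob, which gives a geometric length distribution.
    """
    if not 0.0 <= ext_prob < 1.0:
        raise ValueError("ext_prob must be in [0, 1)")
    length = 1
    while rng.random() < ext_prob:
        length += 1
    return length


def expected_indel_length(ext_prob: float) -> float:
    """Mean of the geometric distribution sampled above."""
    return 1.0 / (1.0 - ext_prob)


if __name__ == "__main__":
    rng = random.Random(42)
    lengths = [sample_indel_length(0.3, rng) for _ in range(100_000)]
    print("empirical mean:", sum(lengths) / len(lengths))
    print("expected mean: ", expected_indel_length(0.3))
```

Under this model, the default extension probability of 0.3 gives a mean indel length of 1 / (1 - 0.3), roughly 1.43 bases.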
About: Simulate pair-end sequencing
Usage: wgsim seq [options] -f <in.ref.fa> -1 <out.read1.fq> -2 <out.read2.fq>
-f, --file FILE Input file for the reference sequence (required)
-1, --output1 FILE Output file for the first read (required)
-2, --output2 FILE Output file for the second read (required)
-e, --error-rate FLOAT Base error rate [0.01]
-d, --distance INT Outer distance between the two ends [500]
-s, --std-dev FLOAT Standard deviation of the insert size [50]
-D, --depth FLOAT Average sequencing depth [10.0]
-L, --size-left INT Length of the first read [100]
-R, --size-right INT Length of the second read [100]
-A, --ambiguous-ratio FLOAT Discard if the fraction of N(ambiguous) bases higher than FLOAT [0.05]
-S, --seed UINT64 Seed for random generator
--debug Show backtrace on error
-h, --help Show this help
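The `--depth`, `--size-left`, `--size-right`, `--distance`, and `--std-dev` options above relate to each other through simple arithmetic. The Python sketch below shows one plausible way a target depth could be converted into a number of read pairs, and how a fragment length could be drawn from a normal distribution. The formula and the function names are assumptions for illustration. The actual Crystal implementation may round or filter differently.

```python
import random


def read_pairs_for_depth(ref_length: int, depth: float,
                         size_left: int = 100, size_right: int = 100) -> int:
    """Number of read pairs needed so that sequenced bases / reference bases
    is approximately `depth` (assumed formula)."""
    bases_per_pair = size_left + size_right
    if bases_per_pair <= 0:
        raise ValueError("read lengths must sum to a positive number")
    return round(depth * ref_length / bases_per_pair)


def sample_fragment_length(distance: float, std_dev: float,
                           min_length: int, rng: random.Random) -> int:
    """Draw an outer distance from Normal(distance, std_dev), redrawing
    until the fragment is long enough to hold the longer read."""
    if distance < min_length:
        raise ValueError("distance must be at least min_length")
    while True:
        length = round(rng.gauss(distance, std_dev))
        if length >= min_length:
            return length


if __name__ == "__main__":
    # 1 Mb reference at 10x depth with 100 bp paired reads
    print(read_pairs_for_depth(1_000_000, 10.0))
    rng = random.Random(0)
    print([sample_fragment_length(500, 50, 100, rng) for _ in range(5)])
```

With the defaults shown in the help text, a 1 Mb reference at depth 10.0 works out to 50,000 read pairs under this formula.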
About: Generate random reference fasta
Usage: wgsim gen [options]
-l, --length INT Length of the reference sequence ["1000,700"]
-s, --seed UINT64 Seed for random generator
--debug Show backtrace on error
-h, --help Show this help
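The default value `"1000,700"` for `--length` suggests that a comma-separated list produces one sequence per length. The Python sketch below illustrates seeded random genome generation under that assumption. The record names (`seq1`, `seq2`), the line width, and the function names are hypothetical and are not taken from the Crystal source.

```python
import random
from typing import List, Optional


def parse_lengths(spec: str) -> List[int]:
    """Parse a comma-separated length spec such as "1000,700"."""
    return [int(part) for part in spec.split(",") if part.strip()]


def random_fasta(spec: str, seed: Optional[int] = None, width: int = 60) -> str:
    """Return FASTA text with one uniformly random A/C/G/T record per length.

    A fixed seed makes the output reproducible.
    """
    rng = random.Random(seed)
    lines = []
    for index, length in enumerate(parse_lengths(spec), start=1):
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        lines.append(f">seq{index}")  # hypothetical record name
        lines.extend(seq[i:i + width] for i in range(0, length, width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(random_fasta("20,10", seed=1), end="")
```

Passing the same seed twice yields identical output, which is the property the `--seed` option is meant to provide.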
Idea Notes
- Somatic Mutations
  - Broad Representation: Include SNVs, indels, large insertions, large deletions, and translocations.
  - Complete DNA Sequence in Fasta: Include the entire genome in the Fasta file.
- Haplotypes
  - Ploidy: Include as many Fasta records as there are homologous chromosomes, depending on the cell's ploidy.
- Structural Variations
  - Inversion and Fusion: Accurately represent structural variations like inversions and fusions in the Fasta file.
- Local Amplifications
  - Extrachromosomal DNA: Include additional records for increased chromosome copy number due to extrachromosomal DNA.
- Non-Compressed Genome Representation
  - Data Structures: Use UInt8 or RefBase structures for each nucleotide to keep things simple.
- Addressing Heterogeneity
  - Fasta File per Cell Type: Each cell type has one Fasta file.
  - Cell Type Proportions: Provide the proportion of each cell type.
- VCF files have a dual purpose:
  - They act as snapshots of the current state by capturing differences from the reference genome.
  - They are presumed to be detailed records of genetic variations.
- We attempt to infer mutations by observing individual genomes, but we can never fully reconstruct the events.
  - In simulations, however, we can have a complete list of mutation events.
Development
Dependencies:
- kojix2/nworkers.cr - Set the number of worker threads at runtime.
- kojix2/randn.cr - Normal random number generator.
- kojix2/fastx.cr - Fasta file reader.
Contributing
- Fork it (https://github.com/kojix2/wgsim.cr/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request