extract_scg_bins.py
Usage
The usage and help documentation of extract_scg_bins.py can be seen by running python extract_scg_bins.py -h:
usage: - [-h] --output_folder OUTPUT_FOLDER --scg_tsvs SCG_TSVS [SCG_TSVS ...]
--fasta_files FASTA_FILES [FASTA_FILES ...] --names NAMES [NAMES ...]
[--groups GROUPS [GROUPS ...]] [--max_missing_scg MAX_MISSING_SCG]
[--max_multicopy_scg MAX_MULTICOPY_SCG]
Extract bins with given SCG (Single Copy genes) criteria. Criteria can be set
as a combination of the maximum number of missing SCGs and the maximum number
of multicopy SCGs. By default the script selects from pairs of scg_tsvs and
fasta_files, the pair that has the highest number of approved bins. In case
there are multiple with the max amount of approved bins, it takes the one that
has the highest sum of bases in those bins. If that is the same, it selects the
first one passed as argument.
One can also group the pairs of scg_tsvs and fasta_files with the --groups
option so one can for instance find the best binning per sample.
optional arguments:
-h, --help show this help message and exit
--output_folder OUTPUT_FOLDER
Output folder
--scg_tsvs SCG_TSVS [SCG_TSVS ...]
Single Copy Genes (SCG) tsvs as outpututted by
COG_table.py. Should have the same ordering as
fasta_files.
--fasta_files FASTA_FILES [FASTA_FILES ...]
Fasta files. Should have the same ordering as scg_tsvs
--names NAMES [NAMES ...]
Names for each scg_tsv and fasta_file pair. This is
used as the prefix for the outputted bins.
--groups GROUPS [GROUPS ...]
Select the best candidate for each group of scg_tsv
and fasta_file pairs. Number of group names given
should be equal to the number of scg_tsv and
fasta_file pairs. Identical group names indicate same
groups.
--max_missing_scg MAX_MISSING_SCG
--max_multicopy_scg MAX_MULTICOPY_SCG
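The selection rule described in the help text can be sketched in Python as follows. This is a minimal illustration, not the script's actual implementation: the `Bin` tuple, the `approved` and `best_binning` helpers, and the representation of SCG counts as a plain list are all hypothetical, and treating the two limits as inclusive (`<=`) is an assumption based on the word "maximum".

```python
from collections import namedtuple

# Hypothetical summary of one bin: the copy count of each SCG in the bin,
# plus the total number of bases in the bin's contigs.
Bin = namedtuple("Bin", ["name", "scg_counts", "n_bases"])


def approved(bin_, max_missing, max_multicopy):
    """Return True if the bin meets both SCG criteria (limits assumed inclusive)."""
    missing = sum(1 for count in bin_.scg_counts if count == 0)
    multicopy = sum(1 for count in bin_.scg_counts if count > 1)
    return missing <= max_missing and multicopy <= max_multicopy


def best_binning(binnings, max_missing, max_multicopy):
    """Pick the best (name, approved_bins) pair.

    binnings is a list of (name, [Bin, ...]) in the order the pairs were
    passed on the command line. Ranking: most approved bins first, then the
    largest sum of bases in approved bins, then the earliest argument.
    """
    best, best_key = None, None
    for name, bins in binnings:
        ok = [b for b in bins if approved(b, max_missing, max_multicopy)]
        key = (len(ok), sum(b.n_bases for b in ok))
        # Strict comparison keeps the first candidate when keys are tied.
        if best_key is None or key > best_key:
            best, best_key = (name, ok), key
    return best
```

Comparing `(number of approved bins, sum of bases)` tuples gives the two-level ranking in one step, and the strict `>` preserves the "first one passed as argument" tie-break.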
Example
An example of how to run extract_scg_bins.py on the test data:
cd CONCOCT/scripts
python extract_scg_bins.py \
--output_folder test_extract_scg_bins_out \
--scg_tsvs tests/test_data/scg_bins/sample0_gt300_scg.tsv \
tests/test_data/scg_bins/sample0_gt500_scg.tsv \
--fasta_files tests/test_data/scg_bins/sample0_gt300.fa \
tests/test_data/scg_bins/sample0_gt500.fa \
--names sample0_gt300 sample0_gt500 \
--max_missing_scg 2 --max_multicopy_scg 4 \
--groups gt300 gt500
This results in the following output files in the folder test_extract_scg_bins_out/:
$ ls test_extract_scg_bins_out/
sample0_gt300_bin2.fa sample0_gt500_bin2.fa
In each of the two binnings, only bin2 satisfies the given criteria. Because each pair was given its own group, the approved bin from both binnings is written out. To get only the best of the two binnings, remove the --groups parameter (or give both pairs the same group id). That would output only sample0_gt500_bin2.fa, because the sum of bases in the approved bins of sample0_gt500 is higher than that of sample0_gt300.
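The effect of --groups in this example can be sketched as follows. This is an illustration under stated assumptions rather than the script's own code: the `best_per_group` helper is hypothetical, and the base counts (1000 and 2000) are made-up placeholders chosen only so that sample0_gt500 has the larger sum, as the text states.

```python
from collections import OrderedDict


def best_per_group(candidates, groups):
    """Return the name of the best candidate in each group.

    candidates: list of (name, n_approved_bins, approved_bases) tuples,
                in command-line order.
    groups:     list of group labels, one per candidate.
    """
    if len(candidates) != len(groups):
        raise ValueError("one group label is needed per candidate")
    best = OrderedDict()
    for (name, n_bins, n_bases), group in zip(candidates, groups):
        key = (n_bins, n_bases)
        # Strict comparison keeps the earlier candidate on a full tie.
        if group not in best or key > best[group][1]:
            best[group] = (name, key)
    return [name for name, _ in best.values()]


# One approved bin (bin2) per binning; base counts are hypothetical.
candidates = [("sample0_gt300", 1, 1000), ("sample0_gt500", 1, 2000)]

# Distinct groups: each binning wins its own group.
print(best_per_group(candidates, ["gt300", "gt500"]))

# Same group for both pairs: only the binning with more bases remains.
print(best_per_group(candidates, ["all", "all"]))
```

With distinct group labels both names are returned, which matches the two output files listed above; with a shared label only sample0_gt500 is returned.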