Statistics
Missing values
Julia, like R, has a value to represent missing values:
data = [ 1.0, 2.0, missing, 4.0 ]
4-element Array{Union{Missing, Float64},1}:
1.0
2.0
missing
4.0
This value implements three-valued logic:
false & missing
false
true & missing
missing
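The same rules extend to the other logical operators and to comparisons. The following sketch is not part of the original session; it shows a few more cases using only Base functions:

```julia
# Three-valued logic: a result is `missing` only when the unknown value could change it.
or_true  = true | missing     # true: the result is true whatever the missing value is
or_false = false | missing    # missing: the result depends on the unknown value
negated  = !missing           # missing

# Comparisons with `==` propagate missing; `isequal` always returns a Bool.
cmp  = missing == missing         # missing
same = isequal(missing, missing)  # true

println((or_true, or_false, negated, cmp, same))
```

Use `isequal` or `ismissing`, rather than `==`, whenever you need a definite `true` or `false`, for example in an `if` condition.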
You can use ismissing or skipmissing when necessary:
sum(data)
missing
ismissing.(data)
4-element BitArray{1}:
false
false
true
false
sum(data[.!(ismissing.(data))])
7.0
sum(skipmissing(data))
7.0
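Other reductions work the same way. As a small sketch beyond the original session, `mean` from the Statistics standard library accepts the `skipmissing` iterator, and `coalesce` replaces missing values with a default (the choice of `0.0` here is only an illustration):

```julia
using Statistics  # standard library, provides `mean`

data = [1.0, 2.0, missing, 4.0]

# Average over the non-missing values only: (1 + 2 + 4) / 3
avg = mean(skipmissing(data))

# Replace every missing value with 0.0 (an arbitrary default for this example)
filled = coalesce.(data, 0.0)

# Materialize the non-missing values as a plain Float64 vector
present = collect(skipmissing(data))
```

Note that filling with a default changes the statistics of the data, whereas `skipmissing` simply ignores the missing entries.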
DataFrames
Working with tabular data is very useful. One of the simplest Julia packages for that is DataFrames.
using DataFrames
To read this kind of file, you can use the CSV package.
using CSV
For example, the pdb_chain_taxonomy.tsv.gz file has a summary of the NCBI taxid(s), scientific name(s) and chain type for each PDB chain that has been processed in the SIFTS database. This table should be downloaded from the SIFTS site.
table_path = download(
"ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/tsv/pdb_chain_taxonomy.tsv.gz",
"pdb_chain_taxonomy.tsv.gz")
run(`gunzip $table_path`)
If that fails, you can use the first lines stored in the data folder:
using JuliaForBioinformatics
data_path = abspath(pathof(JuliaForBioinformatics), "..", "..", "data")
table_path = joinpath(data_path, "pdb_chain_taxonomy_head.tsv")
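table_path = "pdb_chain_taxonomy.tsv" ## the decompressed download; skip this line if you are using the fallback above
df = CSV.read(table_path, DataFrame; ## recent CSV.jl versions need the output type as second argument
    header = 2, ## the header is in the second line
    delim = '\t', ## the delimiter is TAB instead of ','
    quotechar = '`' ## the file doesn't use "" for quoting, e.g.: "Bacillus coli" Migula 1895
)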
Examples
Select human PDB chains:
## df.TAX_ID replaces the deprecated df[:TAX_ID] column access
df.TAX_ID .== 9606
df[df.TAX_ID .== 9606, [:PDB, :CHAIN]] |> unique
You can use |> for easy function chaining.
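To see the pieces in isolation, here is a sketch on plain vectors instead of the SIFTS table. The contents of `tax_ids` and `chains` are made up for illustration; only 9606 (the human taxid used above) comes from the text.

```julia
# Toy data: one taxid and one PDB/chain label per row (hypothetical values)
tax_ids = [9606, 562, 9606, 9606, 10090]
chains  = ["1abc_A", "2xyz_A", "1abc_B", "1abc_A", "3def_A"]

# Broadcasting `.==` compares element by element and returns a boolean mask
is_human = tax_ids .== 9606

# Indexing with the mask keeps the matching rows; `|>` passes the result to `unique`
human_chains = chains[is_human] |> unique

# Pipes can be chained: this is the same as length(unique(chains[is_human]))
n_human_chains = chains[is_human] |> unique |> length
```

The pipe reads left to right, which can be easier to follow than nested calls when several steps are applied in sequence.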
Which species have the most PDB chains?
## `by` was removed in DataFrames 1.0; groupby + combine is the equivalent
count_df = combine(groupby(df, :TAX_ID), :TAX_ID => length => :Count)
sort!(count_df, :Count, rev=true)
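Conceptually, this groups the rows by taxid, counts the rows in each group, and sorts by that count. A minimal sketch of the same idea with a plain `Dict`, on made-up taxids rather than the real table (this is an illustration of the logic, not how DataFrames implements it):

```julia
# Toy column of taxids, one entry per PDB chain (hypothetical values)
tax_ids = [9606, 562, 9606, 10090, 9606, 562]

# Count how many times each taxid appears
counts = Dict{Int,Int}()
for id in tax_ids
    counts[id] = get(counts, id, 0) + 1
end

# Sort the (taxid => count) pairs by count, largest first
ranked = sort(collect(counts), by = last, rev = true)
```

The first element of `ranked` is the taxid with the most chains, which corresponds to the first row of the sorted `count_df`.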
Exercise 1
Which species have the most PDBs (not PDB chains)?
Hint: You can use anonymous functions:
f(x) = 2x + 1
g(x) = sin(π*x)
g (generic function with 1 method)
x -> f(g(x))
#1 (generic function with 1 method)
or function composition (using ∘, typed as \circ<TAB>):
f ∘ g
#52 (generic function with 1 method)
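Both forms build a new function that applies g first and then f. A quick check, repeating the f and g from the hint and adding a composition of two Base functions as a further illustration (the names `h1`, `h2` and `count_distinct` are only used for this sketch):

```julia
f(x) = 2x + 1
g(x) = sin(π * x)

h1 = x -> f(g(x))   # anonymous function
h2 = f ∘ g          # composition: (f ∘ g)(x) == f(g(x))

# Both give the same result; for x = 0.5, g(x) = sin(π/2) = 1 and f(1) = 3
same_result = h1(0.5) == h2(0.5)

# Composition also works with existing functions, e.g. counting distinct elements
count_distinct = length ∘ unique
n = count_distinct(["a", "b", "a"])
```

A composed function can be passed anywhere a function is expected, for example as the right-hand side of a `column => function` pair.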
# ...your solution...
Plots
There are multiple plotting packages in Julia. Here I will show StatsPlots, an extension of Plots for statistical plotting. However, if you love the grammar of graphics, you will be more comfortable with Gadfly.
using StatsPlots
@df count_df bar(:TAX_ID, :Count)
@df count_df marginalhist(:TAX_ID, :Count)
@df count_df violin(:Count)
@df count_df boxplot!([1.0], :Count, bar_width=0.1)
Exercise 2
Make a histogram and a density plot of the variable :Count. Hint: Use normalize=true.
This page was generated using Literate.jl.