Statistics

Missing values

Julia, like R, has a value to represent missing values:

data = [ 1.0, 2.0, missing, 4.0 ]
4-element Array{Union{Missing, Float64},1}:
 1.0     
 2.0     
  missing
 4.0     

This value implements three-valued logic:

false & missing
false
true & missing
missing
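
The same rule applies to the other logical operators: the result is only definite when the missing operand could not change it. A minimal sketch using Base Julia only (the variable names are just for illustration):

```julia
# Three-valued logic with `missing`
a = true | missing              # true: the answer is true whatever the missing value is
b = false | missing             # missing: the answer depends on the missing value
c = !missing                    # missing
d = missing == missing          # missing: == propagates missing
e = isequal(missing, missing)   # true: isequal treats missing as an ordinary value
```

Use isequal (or ismissing) rather than == when you need a plain true or false.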

Most operations propagate missing, so use ismissing or skipmissing when you need to ignore those values:

sum(data)
missing
ismissing.(data)
4-element BitArray{1}:
 false
 false
  true
 false
sum(data[.!(ismissing.(data))])
7.0
sum(skipmissing(data))
7.0
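
skipmissing also works with other reductions, and coalesce can replace missing values instead of dropping them. A short sketch, assuming the same data vector as above and the Statistics standard library:

```julia
using Statistics  # standard library, provides mean

data = [1.0, 2.0, missing, 4.0]

observed = collect(skipmissing(data))   # Vector{Float64} without the missing entry
avg = mean(skipmissing(data))           # mean of the three observed values
filled = coalesce.(data, 0.0)           # replace each missing with 0.0
```

Whether dropping or replacing is appropriate depends on the analysis; replacing with 0.0 here is only an illustration.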

DataFrames

Working with tabular data is a very common need. One of the simplest Julia packages for that is DataFrames.

using DataFrames

To read delimited text files into a data frame, you can use the CSV package.

using CSV

As an example, we will use the pdb_chain_taxonomy.tsv.gz file, which summarizes the NCBI taxid(s), scientific name(s) and chain type for each PDB chain that has been processed in the SIFTS database. This table can be downloaded from the SIFTS site.

table_path = download(
    "ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/tsv/pdb_chain_taxonomy.tsv.gz",
    "pdb_chain_taxonomy.tsv.gz")

run(`gunzip $table_path`)

If that fails, you can use the first lines stored in the data folder:

using JuliaForBioinformatics
data_path = abspath(pathof(JuliaForBioinformatics), "..", "..", "data")
table_path = joinpath(data_path, "pdb_chain_taxonomy_head.tsv")

# If the download and gunzip above succeeded, use the decompressed file instead:
table_path = "pdb_chain_taxonomy.tsv"

df = CSV.read(table_path,
    header = 2,  ## the header is in the second line
    delim = '\t',  ## delimiter is TAB instead of ','
    quotechar='`'  ## the file doesn't use "" for quoting, e.g.: "Bacillus coli" Migula 1895
    )
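
To see what these keyword arguments describe, here is a Base-only sketch that parses a made-up miniature of the table by hand. The sample rows, the comment line and the SCIENTIFIC_NAME column name are assumptions for illustration, not real SIFTS data, and this is not how CSV parses the file internally.

```julia
# Hypothetical miniature of the file layout:
# line 1 is a comment, line 2 holds the column names, fields are TAB-separated.
sample = """
# hypothetical comment line
PDB\tCHAIN\tTAX_ID\tSCIENTIFIC_NAME
1abc\tA\t562\t"Bacillus coli" Migula 1895
1abc\tB\t9606\tHomo sapiens
"""

lines = split(chomp(sample), '\n')
header = split(lines[2], '\t')                        # what header = 2 points at
rows = [split(line, '\t') for line in lines[3:end]]   # what delim = '\t' splits on

# The double quotes stay in the field as ordinary characters, which is the
# effect that quotechar = '`' aims for when reading the real file.
name = rows[1][4]
```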

Examples

Select human PDB chains:

df.TAX_ID .== 9606
df[df.TAX_ID .== 9606, [:PDB, :CHAIN]] |> unique

You can use |> for easy function chaining.
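
The selection above combines a Boolean mask with the pipe operator. The same idea on plain vectors, as a sketch in which pdb and tax_id are hypothetical stand-ins for the :PDB and :TAX_ID columns:

```julia
# Hypothetical stand-ins for two columns of the table
pdb    = ["1abc", "1abc", "2xyz", "3def"]
tax_id = [9606, 9606, 562, 9606]

mask = tax_id .== 9606               # element-wise comparison gives a Boolean mask
human = pdb[mask] |> unique          # same as unique(pdb[mask])
n = pdb[mask] |> unique |> length    # same as length(unique(pdb[mask]))
```

x |> f is just f(x), so a chain reads left to right in the order the functions are applied.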

Which species have the most PDB chains?

count_df = by(df, :TAX_ID, Count = :TAX_ID => length)
sort!(count_df, :Count, rev=true)
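
Conceptually, this groups the rows by :TAX_ID, counts the rows in each group and sorts the counts in decreasing order. A Base-only sketch of that idea follows; the tax_id vector is made up, and this is not how DataFrames implements the operation.

```julia
# Hypothetical TAX_ID column
tax_id = [9606, 562, 9606, 10090, 9606, 562]

# Count how many times each taxid appears
counts = Dict{Int,Int}()
for t in tax_id
    counts[t] = get(counts, t, 0) + 1
end

# Sort the taxid => count pairs by count, largest first
ranked = sort(collect(counts), by = last, rev = true)
```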

Exercise 1

Which species have the most PDBs (not PDB chains)?

Hint: You can use anonymous functions:

f(x) = 2x + 1
g(x) = sin(π*x)
g (generic function with 1 method)
x -> f(g(x))
#1 (generic function with 1 method)

or function composition (using ∘, typed as \circ<TAB>):

f ∘ g
#52 (generic function with 1 method)
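
Both forms define the same mapping and can be passed wherever a function is expected. A small sketch reusing f and g from above (h1, h2 and xs are names chosen for illustration):

```julia
f(x) = 2x + 1
g(x) = sin(π * x)

h1 = x -> f(g(x))   # anonymous function
h2 = f ∘ g          # composition: applies g first, then f

xs = [0.0, 0.5, 1.0]
same = map(h1, xs) ≈ map(h2, xs)   # both give the same values
```
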
# ...your solution...

Plots

There are multiple plotting packages in Julia. Here I will show StatsPlots, an extension of Plots for statistical plotting. However, if you love the grammar of graphics, you will be more comfortable with Gadfly.

using StatsPlots
@df count_df bar(:TAX_ID, :Count)
@df count_df marginalhist(:TAX_ID, :Count)
@df count_df violin(:Count)
@df count_df boxplot!([1.0], :Count, bar_width=0.1)

Exercise 2

Make a histogram and a density plot of the variable :Count. Hint: Use normalize=true.

This page was generated using Literate.jl.