Open Sourcing Science

Open source tools for Scientific Research

Amy Boyle @amylouboyle

I am going to talk about making scientific research more open source, both from the angle of using existing open source tools and from making your own code open source.

  • Worked in a neuroscience research lab
  • We study basic auditory neuroscience using electrophysiology

How do I Science?

Open science vs open source

Back in the old days...

  • Back in the days when people had to correspond by postal mail and telephone...
  • The only way to get more information on a project was to contact the paper's author directly
  • If they had time, they responded

"Stimulus generation was controlled by custom-written software on a personal computer..."

"... as measured by custom-designed software performing a fast Fourier transform of the digitized microphone signal."

The internet!

It ushered in a new age of easy, widespread sharing of tools and collaboration! Reproducibility abounds!

image: http://allycatblu.deviantart.com/art/Puppy-Twilight-Sparkle-And-Rainbow-Dash-426306666

Change is hard.

Except that didn't happen. Not quite. Change is hard. Often, the only way to get data or the code associated with a paper is still to contact the author directly and ask for it.

The difference is that you can email someone instead of writing them:
  • you still need to hope they respond in a timely manner
  • and that they still have what you want

image: http://www.reddit.com/r/aww/comments/27t8mk/i_took_my_new_german_shepherd_puppy_to_the_beach/

Why?

Science has a reproducibility problem

  • It's a matter of incentives
  • Scientists are typically evaluated on the number of papers they have published and the quality of the journals those papers appear in
  • Not on whether their findings can be reproduced

There is little perceived incentive to spend much of your valuable time on reproducibility.

One 2012 study found that only 25% of the papers it reviewed were reproducible. Another found 10%.

extra:

Reproducibility Project: a study investigating the replicability of cancer biology studies. The 50 highest-impact cancer biology papers published between 2010 and 2012 are being replicated by the Science Exchange network.

Reproducibility Initiative: http://blogs.plos.org/everyone/2012/08/14/plos-one-launches-reproducibility-initiative/

I care cuz why?

Everyone:

  • The scientific process is ultimately self-correcting
  • With enough testing, incorrect data will eventually be discovered and disregarded
  • We have a responsibility to one another and to society (taxpayers) to make this process more efficient

Scientists:

[Sharing Detailed Research Data Is Associated with Increased Citation Rate](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000308)

Duplication of effort

image : http://www.reddit.com/r/aww/comments/2sbxhe/my_german_shepherd_pup_gets_a_little_confused/

What can we do about it?

Docs, tests, version control, access.

What practices can we promote to get to where we need to be?

This talk is mostly concerned with the source-code piece of the reproducibility puzzle: docs, tests, version control, and access to the code.

Scientist coders are notorious for not including docs, tests, or version control with their code. These are the things that will make your code reusable to future you and to others.

Learning the Hard Way

Docs or it didn't happen.

  • Batlab had no docs, only oral tradition. Don't even attempt to look at the code.

What to include:
  • Instructions -- kept under version control
  • Code docs
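
As a minimal sketch of what code docs can look like (the function here is hypothetical, echoing the spike-binning analysis later in this talk):

def bin_spike_counts(spike_times, bin_size=0.02, duration=0.2):
    """Count spikes in fixed-width time bins.

    spike_times : list of spike times in seconds
    bin_size    : width of each time bin, in seconds
    duration    : total recording window, in seconds

    Returns a list with the number of spikes in each bin.
    """
    nbins = int(duration / bin_size)
    bins = [int(t / bin_size) for t in spike_times]
    return [bins.count(i) for i in range(nbins)]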

No tests?

image : http://giphy.com/gifs/spoilers-neil-degrasse-tyson-cant-tell-yH44qh8DpNyfK

Testing

Just do it.

There are plenty of testing frameworks out there; there is one available for your language/framework.

Tests serve as a form of docs, and they increase confidence and consistency.

Sparkle has tests. They catch a lot of bugs before a release.
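
As a minimal sketch (using pytest, one framework among many), here is what a test for the hypothetical bin_spike_counts function from earlier might look like:

# test_binning.py -- run with: py.test test_binning.py
from binning import bin_spike_counts  # hypothetical module holding the function

def test_counts_fall_in_correct_bins():
    # two spikes in the first 20ms bin, one in the third
    counts = bin_spike_counts([0.001, 0.015, 0.045])
    assert counts[0] == 2
    assert counts[2] == 1
    assert sum(counts) == 3

def test_empty_input_gives_all_zero_bins():
    assert sum(bin_spike_counts([])) == 0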

Version Control All the Things

Git, GitHub, Bitbucket, Mercurial. You NEED to be using version control. Any time you analyze data, you must have an identifiable version of the code associated with that data. I mean a version number or commit id, not some copy of the code saved on some postdoc's laptop. If you publish based on code, get a DOI for it.
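
One low-effort way to tie data to code (a sketch of one possible convention, not a prescribed workflow) is to stamp every analysis output with the commit id of the code that produced it:

import subprocess

def current_commit():
    # ask git for the commit id of the code actually being run
    return subprocess.check_output(['git', 'rev-parse', 'HEAD']).strip()

# save the code version alongside the analysis results
with open('results_metadata.txt', 'w') as meta:
    meta.write('code version: %s\n' % current_commit())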

image : https://education.github.com/

If this is all obvious to you, encourage others by using examples that are relevant to them.

Access

Use GitHub.

You don't HAVE to use GitHub, but use some easy-to-find, public place to host your code.

It solves the "...it's around here somewhere, I think" problem.

Encourage programming literacy

  • Python, R, Octave, Julia
  • Software Carpentry, Coursera

disclaimer: Huge Python fangirl. Use tools that are available to everyone. Open source ensures no hang-ups over licensing or over different purchased versions of platforms/toolboxes.

It matters less what you do your work in than whether you provide docs, tests, and version control with it. That said, using popular languages/frameworks will give you wider reach.

R compares to SAS, SPSS, or Stata. Julia is a new, fast, and beautiful language -- though it still has some bugs.

Why we use Python...

Doing Data Analysis

Python
  • Numpy
  • Scipy
  • Pandas
  • IPython notebook

I'm going to show some examples of how to use these packages to create readable, re-usable code to analyze and visualize data.

There is a Wikipedia list of statistical packages (https://en.wikipedia.org/wiki/List_of_statistical_packages)

Example data file:

0.05946
0.05842,0.1589
0.05632
0.00316,0.04972
0.0593
0.06124,0.07648
0.05784

0.04674,0.0602,0.07572,0.12892,0.1964
0.05548

Using pure Python:

import csv

spike_times = []
with open('spike_times.csv', 'r') as df:
    reader = csv.reader(df)
    for row in reader:
        floatrow = [float(item) for item in row]
        spike_times.append(floatrow)

# flatten the list-of-lists into one list of spike times
all_spike_times = sum(spike_times, [])

# number of spikes per time bin of 20ms
bins = [int(x/0.02) for x in all_spike_times]
bin_counts = [bins.count(i) for i in range(10)]
bin_edges = [i*0.02 for i in range(10)]
print bin_edges, '\n', bin_counts
[0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18]
[9, 8, 67, 23, 8, 5, 4, 7, 10, 5]

Using Pandas and Numpy:

import numpy as np
import pandas as pd

# rows have varying numbers of spikes; naming 5 columns pads short rows with NaN
spike_table = pd.read_csv('spike_times.csv', sep=',',
                          names=range(5))

# flatten to a 1-D array and drop the NaN padding
all_spikes = spike_table.values.flatten()
all_spikes = all_spikes[~np.isnan(all_spikes)]

# 20ms bins from 0 to 0.2s
bin_edges = [i*0.02 for i in range(10)] + [0.2]
spike_bins = pd.cut(all_spikes, bin_edges, labels=False)
bin_counts = np.bincount(spike_bins)
print bin_edges, '\n', bin_counts
[0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18, 0.2]
[9, 8, 67, 23, 8, 5, 4, 7, 10, 5]

Using R:

# pad short rows with NA by naming 5 columns
data = read.table('spike_times.csv', sep=',', header=FALSE,
                  col.names=1:5, fill=TRUE)

# flatten the table and drop the NA padding
all_spikes = unlist(data)
all_spikes = all_spikes[!is.na(all_spikes)]

# hist returns the counts and bin edges (and draws a plot)
results = hist(all_spikes, breaks=20)
print(results$counts)
print(results$breaks)

Data Visualization

Python

  • Matplotlib
  • Seaborn
  • Bokeh
  • pyqtgraph

R

Matplotlib

import csv
import matplotlib.pyplot as plt

spike_times = []
with open('spike_times.csv', 'r') as df:
    reader = csv.reader(df)
    for row in reader:
        floatrow = [float(item) for item in row]
        spike_times.append(floatrow)

# flatten the list-of-lists into one list of spike times
all_spike_times = sum(spike_times, [])

n, bins, patches = plt.hist(all_spike_times, 20, range=(0,0.2))
plt.xlabel("time (s)")
plt.ylabel("no. spikes")
plt.title("Cell Spike Timing");

Seaborn

import seaborn  # just importing seaborn restyles matplotlib's default aesthetics
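
A sketch of what this slide presumably demonstrated: seaborn releases of this era restyled matplotlib on import alone (newer versions need an explicit set_theme() call), so re-running the earlier histogram picks up the new look. This reuses plt and all_spike_times from the matplotlib example:

# same histogram as before, now drawn with seaborn's default styling
n, bins, patches = plt.hist(all_spike_times, 20, range=(0, 0.2))
plt.xlabel("time (s)")
plt.ylabel("no. spikes")
plt.title("Cell Spike Timing");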

Pandas

spike_table = pd.read_csv('spike_times.csv', sep=',', names=range(5))

# note: DataFrame.plot draws one histogram series per column
spike_table.plot(kind='hist', bins=20, range=(0, 0.2));
plt.xlabel("time (s)")
plt.ylabel("no. spikes")
plt.title("Cell Spike Timing");

Bokeh

import csv
from bokeh.charts import Histogram, show, output_notebook
output_notebook()

spike_times = []
with open('spike_times.csv', 'r') as df:
    reader = csv.reader(df)
    for row in reader:
        floatrow = [float(item) for item in row]
        spike_times.append(floatrow)

# flatten the list-of-lists into one list of spike times
all_spike_times = sum(spike_times, [])

hm = Histogram(all_spike_times, bins=20, xlabel='time (s)',
               ylabel='no. spikes', title='Spike timing')
show(hm)

pyqtgraph
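
No example was shown here; below is a minimal sketch of the same histogram in pyqtgraph's standard step-plot idiom, reusing all_spike_times from earlier. pyqtgraph targets fast, interactive Qt GUIs rather than notebooks:

import numpy as np
import pyqtgraph as pg
from pyqtgraph.Qt import QtGui

# compute the histogram ourselves; pyqtgraph just draws it
counts, edges = np.histogram(all_spike_times, bins=20, range=(0, 0.2))

# stepMode=True expects len(x) == len(y) + 1, i.e. bin edges plus counts
pg.plot(edges, counts, stepMode=True, fillLevel=0, brush=(0, 0, 255, 150))

QtGui.QApplication.instance().exec_()  # start the Qt event loop (when run as a script)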

R

data = read.table('spike_times.csv', sep=',', header=FALSE,
                  col.names=1:5, fill=TRUE)
all_spikes = unlist(data)
all_spikes = all_spikes[!is.na(all_spikes)]
results = hist(all_spikes, 20)

ggplot2 is a plotting system for R, based on the Grammar of Graphics

Honorable Mention

Sharing Results

Figshare, Dryad, Dataverse

  • Post your data online when submitting to a journal, or earlier, if possible.
  • Having a system for posting data online also pays off later: when you go looking for old data, it's not ...somewhere... maybe on Jim's laptop?

Open-access journal PLOS ONE now has a policy requiring its authors to submit relevant data during the review process and recommending they do so by posting their datasets to online repositories like Dryad.

Citizen Science

Citizen science is scientific research conducted, in whole or in part, by amateur or nonprofessional scientists.

SETI@Home (1999)

How can we involve volunteer citizens in traditional scientific research?

Zooniverse

https://www.zooniverse.org/

Galaxy Zoo (2007):
  • > 50 peer-reviewed science papers from results
  • > 100,000 volunteers, millions of classifications

Snapshot Serengeti (2010-2013):
  • 225 camera traps across 1,125 km² in Serengeti National Park, Tanzania
  • to study how predators and their prey co-existed across a dynamic landscape
  • > 1.2 million pictures
  • 28,000 users

Gamified!

Foldit:
  • protein-folding game
  • players' solutions help train and improve protein-folding algorithms

Fraxinus:
  • Candy Crush-style game that researches genetic variants that can protect Europe's ash trees from a deadly fungal disease
  • lists "Fraxinus players" as an author on the paper, with player names in the supplemental material

http://www.theguardian.com/technology/2014/jan/25/online-gamers-solving-sciences-biggest-problems

IRL

  • Public Lab

Open source software and hardware kits to monitor air, water, and land (http://publiclab.org/)

e.g. Deepwater Horizon (http://www.aljazeera.com/indepth/inpictures/2015/04/busting-corporate-polluters-diy-tools-150420132053871.html)

In the hands of citizens, these tools are being used to gather a huge range of environmental data: anything from canopy loss in Peru to industrial pollution in Spain.

Let's make science puppies and rainbows

presentation source

slides at http://amyboyle.ninja/open_source_science/#/who-am-i
