Amy Boyle @amylouboyle
I am going to talk about making science research more open source: both using existing open source tools, and making your own code open source.
"Stimulus generation was controlled by custom-written software on a personal computer..."
"... as measured by custom-designed software performing a fast Fourier transform of the digitized microphone signal."
With it came a new age of easy, widespread sharing of tools and collaboration! Reproducibility abounds!
Change is hard.
Except that didn't happen. Not quite. Change is hard. Often, the only way to get the data or code associated with a paper is still to contact an author directly and ask for it.
The difference is that you can email someone instead of writing them a letter:
* you still need to hope they respond in a timely manner
* and that they still have what you want
Science has a reproducibility problem
There is little perceived incentive to spend much of your valuable time on reproducibility.
A 2012 study found that only 25% of the papers they reviewed were reproducible. Another found 10%.
Reproducibility Project: a study investigating the replicability of cancer biology research. The 50 highest-impact cancer biology studies published between 2010 and 2012 are being replicated by the Science Exchange network.
Reproducibility Initiative: http://blogs.plos.org/everyone/2012/08/14/plos-one-launches-reproducibility-initiative/
[Sharing Detailed Research Data Is Associated with Increased Citation Rate](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0000308)
Duplication of effort
Docs, tests, version control, access.
What practices can we promote to get to where we need to be?
This talk is mostly concerned with the source-code piece of the reproducibility puzzle: docs, tests, version control, and access to the code.
Scientist coders are notorious for omitting docs, tests, and version control. These are the things that make your code reusable for future you and for others.
- Instructions -- under version control
- Code docs
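Even a short docstring goes a long way. A minimal sketch of documented analysis code (the function name, parameters, and defaults here are invented for illustration, not taken from the talk's code):

```python
def bin_spike_times(spike_times, bin_width=0.01, t_max=0.2):
    """Count spikes in fixed-width time bins.

    Parameters
    ----------
    spike_times : list of float
        Spike times in seconds.
    bin_width : float
        Width of each time bin in seconds (default 10 ms).
    t_max : float
        End of the last bin in seconds.

    Returns
    -------
    list of int
        Number of spikes in each bin; spikes at or past t_max are ignored.
    """
    n_bins = int(round(t_max / bin_width))
    counts = [0] * n_bins
    for t in spike_times:
        i = int(t / bin_width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts
```

A reader (or future you) can now tell what the units are and what happens at the edges without re-reading the implementation.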
image : http://giphy.com/gifs/spoilers-neil-degrasse-tyson-cant-tell-yH44qh8DpNyfK
Just do it.
There are plenty of testing frameworks out there. There is one available for your language/framework.
Tests serve as a form of documentation, and they increase confidence and consistency.
Sparkle has tests; they catch a lot of bugs before release.
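For example, with pytest a test file is just functions whose names start with `test_`; running `pytest` in the directory finds and runs them. The helper below is hypothetical, but it is the kind of small, pure analysis function that is easiest to test:

```python
# test_spikes.py -- run with: pytest test_spikes.py

def count_spikes_in_window(spike_times, start, end):
    """Count spikes with start <= t < end."""
    return sum(1 for t in spike_times if start <= t < end)

def test_counts_spikes_inside_window():
    assert count_spikes_in_window([0.01, 0.05, 0.15], 0.0, 0.1) == 2

def test_empty_input_gives_zero():
    assert count_spikes_in_window([], 0.0, 0.1) == 0
```

A handful of tests like these double as examples of how the function is meant to be called.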
Git, GitHub, Bitbucket, Mercurial. You NEED to be using version control. Any time you analyze data, you must have an identifiable version of the code associated with that data -- a version number or commit id, not some copy of the code saved on some post-doc's laptop. If you publish results based on code, get a DOI for it.
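One low-effort way to tie results to an identifiable code version is to record the current commit id with every analysis run. A sketch, assuming the analysis script lives inside a git working copy (the function name and the metadata file are invented for illustration):

```python
import subprocess

def current_commit(repo_dir='.'):
    """Return the full git commit id of the code in repo_dir."""
    out = subprocess.check_output(['git', 'rev-parse', 'HEAD'], cwd=repo_dir)
    return out.decode().strip()

# e.g. save it next to your results:
# with open('results_metadata.txt', 'w') as f:
#     f.write('code version: ' + current_commit() + '\n')
```

With the commit id saved alongside the output, anyone can check out exactly the code that produced a given figure or table.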
image : https://education.github.com/
If this is all obvious to you, encourage others by using examples that are relevant to them.
You don't HAVE to use GitHub, but use some easy-to-find public place to host your code.
Solves the "it's around here somewhere, I think" problem.
- Python, R, Octave, Julia
- Software Carpentry, Coursera
Disclaimer: huge Python fangirl. Use tools that are available to everyone. Open source ensures no hangups over licensing or over different purchased versions of platforms/toolboxes.
It matters less what you do your work in than whether you provide docs, tests, and version control with it. That said, using popular languages/frameworks will give you wider reach.
R compares to SAS, SPSS, or Stata. Julia is a new, fast, and beautiful language -- still some bugs, though.
Why we use Python...
I'm going to show some examples of how to use these packages to create readable, reusable code to analyze and visualize data.
There is a wikipedia list (https://en.wikipedia.org/wiki/List_of_statistical_packages)
Example data file:
0.05946
0.05842,0.1589
0.05632
0.00316,0.04972
0.0593
0.06124,0.07648
0.05784
0.04674,0.0602,0.07572,0.12892,0.1964
0.05548
Using pure Python:
import csv

spike_times = []
with open('spike_times.csv', 'r') as df:
    reader = csv.reader(df)
    for row in reader:
        floatrow = [float(item) for item in row]
        spike_times.append(floatrow)
all_spike_times = sum(spike_times, [])
# number of spikes per time bin of 10 ms
bins = [int(x/0.01) for x in all_spike_times]
bin_counts = [bins.count(i) for i in range(20)]
bin_edges = [i*0.01 for i in range(20)]
print(bin_edges, '\n', bin_counts)
[0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18] [9, 8, 67, 23, 8, 5, 4, 7, 10, 5]
Using Pandas and Numpy:
import numpy as np
import pandas as pd

spike_table = pd.read_csv('spike_times.csv', sep=',', names=range(5))
all_spikes = spike_table.values.flatten()
all_spikes = all_spikes[~np.isnan(all_spikes)]
bin_edges = [i*0.01 for i in range(20)] + [0.2]
spike_bins = pd.cut(all_spikes, bin_edges, labels=False)
bin_counts = np.bincount(spike_bins)
print(bin_edges, '\n', bin_counts)
[0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18, 0.2] [9, 8, 67, 23, 8, 5, 4, 7, 10, 5]
Using R:
data = read.table('spike_times.csv', sep=',', header=FALSE, col.names=1:5, fill=TRUE)
all_spikes = unlist(data)
all_spikes = all_spikes[!is.na(all_spikes)]
results = hist(all_spikes, 20)
print(results['counts'])
print(results['breaks'])
import csv
import matplotlib.pyplot as plt

spike_times = []
with open('spike_times.csv', 'r') as df:
    reader = csv.reader(df)
    for row in reader:
        floatrow = [float(item) for item in row]
        spike_times.append(floatrow)
all_spike_times = sum(spike_times, [])
n, bins, patches = plt.hist(all_spike_times, 20, range=(0, 0.2))
plt.xlabel("time (s)")
plt.ylabel("no. spikes")
plt.title("Cell Spike Timing")
import matplotlib.pyplot as plt
import pandas as pd

spike_table = pd.read_csv('spike_times.csv', sep=',', names=range(5))
spike_table.plot(kind='hist', bins=20, range=(0, 0.2))
plt.xlabel("time (s)")
plt.ylabel("no. spikes")
plt.title("Cell Spike Timing")
import csv

from bokeh.charts import Histogram, show, output_notebook

output_notebook()
spike_times = []
with open('spike_times.csv', 'r') as df:
    reader = csv.reader(df)
    for row in reader:
        floatrow = [float(item) for item in row]
        spike_times.append(floatrow)
all_spike_times = sum(spike_times, [])
hm = Histogram(all_spike_times, bins=20, xlabel='time (s)',
               ylabel='no. spikes', title='Spike timing')
show(hm)
data = read.table('spike_times.csv', sep=',', header=FALSE, col.names=1:5, fill=TRUE)
all_spikes = unlist(data)
all_spikes = all_spikes[!is.na(all_spikes)]
results = hist(all_spikes, 20)
ggplot2 is a plotting system for R, based on the grammar of graphics
Figshare, Dryad, Dataverse
Open-access journal PLOS ONE now has a policy requiring its authors to submit relevant data during the review process and recommending they do so by posting their datasets to online repositories like Dryad.
Citizen science is scientific research conducted, in whole or in part, by amateur or nonprofessional scientists.
How can we involve volunteer citizens in traditional scientific research?
Galaxy Zoo (2007):
* > 50 peer-reviewed science papers from results
* > 100,000 volunteers, millions of classifications
Snapshot Serengeti (2010-2013):
* 225 camera traps across 1,125 km² in Serengeti National Park, Tanzania
* studying how predators and their prey coexist across a dynamic landscape
* > 1.2 million pictures
* 28,000 users
Foldit:
* protein-folding game
* player solutions help train and improve protein-folding algorithms
Fraxinus:
* Candy Crush-style game researching genetic variants that can protect Europe's ash trees from a deadly fungal disease
* lists "Fraxinus players" as an author on the paper, with player names in the supplemental material
- Public lab
Open source software and hardware kits to monitor air, water, and land (http://publiclab.org/)
In the hands of citizens, these tools are being used to gather a huge range of environmental data: anything from canopy loss in Peru to industrial pollution in Spain.
slides on amyboyle.ninja<http://amyboyle.ninja/open_source_science/#/who-am-i>