Sunday, December 30, 2018

Truly Reproducible Research Papers

A slide from Prof. Barry Smyth's presentation
If you perform an experiment and get interesting results that nobody else can obtain by repeating it, something is wrong with your finding. The ability of others to repeat an experiment and arrive at the same results is called reproducibility of research. If it is not reproducible, it is not science. You might think that academics and professional scientists who carry out systematic research and publish papers in conferences and journals are doing reproducible research. Not really.

The majority of research papers I've come across in my own domain are bare descriptions and explanations of results, without proper support for anybody interested in reproducing them. Even when a good-quality paper provides a lot of detail about its experimental setup and settings, it is difficult to truly recreate the results based on the details in the paper alone. It is often necessary to contact the authors and correspond back and forth several times to get things clear. Similarly, if I ask myself whether I could reproduce research I published a few years ago solely from the details I put down in my own paper, I have to give a big 'No', unfortunately.

This is a bad way to do science.

It would be unfair to computer scientists to say they are not putting any effort into making their research reproducible. There are two important ways they try to do it these days. The first is giving away the data sets they have collected. This allows third parties to verify their results and also to extend and build upon them. The second is to provide the source code of their experimental implementations. They usually put the code in a GitHub repository and provide the link in the paper so that readers can find the repository and reuse the code.

Another slide from Prof. Barry Smyth's presentation
Recently I attended a talk delivered by Prof. Barry Smyth at UCD, Ireland, where he suggested two interesting ways to make our research papers reproducible. The first is a practice that is much simpler and easier to adopt: produce a Jupyter Notebook alongside the scientific publication that contains the code, data, descriptions and explanations in a well-documented manner, so that a third party can quickly run it and build upon it. If you haven't used or read about Jupyter Notebooks, have a look at the first link in the references section. A notebook is a way to produce well-documented software, with the code, its description and its output together in a report-like format.
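To make the notebook format a little more concrete, here is a minimal sketch that builds a tiny two-cell notebook by hand with Python's standard json module. It assumes version 4 of the notebook file format; the file name demo.ipynb and the cell contents are made up for illustration. In practice you would create notebooks in the Jupyter interface rather than like this.

```python
import json


def build_notebook():
    """Return a dict describing a two-cell notebook (file format version 4)."""
    markdown_cell = {
        "cell_type": "markdown",
        "metadata": {},
        # The description that a reader sees above the code.
        "source": ["# Experiment 1\n", "Mean of three hypothetical measurements."],
    }
    code_cell = {
        "cell_type": "code",
        "execution_count": None,  # the cell has not been run yet
        "metadata": {},
        "outputs": [],  # Jupyter stores the cell's output here after a run
        "source": ["values = [1.0, 2.0, 3.0]\n", "sum(values) / len(values)"],
    }
    return {
        "cells": [markdown_cell, code_cell],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def write_notebook(path):
    """Save the notebook as JSON, which is what an .ipynb file contains."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_notebook(), f, indent=1)


if __name__ == "__main__":
    write_notebook("demo.ipynb")  # hypothetical file name
```

Opening the resulting file in Jupyter should show the description, the code and (after running it) the output together in one document, which is the report-like format described above.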

There's an even more powerful way of making research papers reproducible. Imagine you are producing a research paper that reports a 30% improvement in something. How can the reader verify whether this number is truly 30% using their own experimental data? If I give away the source code of my implementations, does the reader have sufficient information to locate the correct programs and execute them in the correct sequence to get the final result? This is where the tool "Kallysto" comes in. It is a tool developed by Prof. Barry Smyth to make scientific publications fully reproducible and traceable. Kallysto combines LaTeX with Jupyter Notebooks in such a way that your LaTeX manuscript is directly linked to the original data and the code that analyzes it. The typical workflow of writing a research paper is to (1) analyze data, (2) produce graphs as images or PDF files, and finally (3) create a LaTeX manuscript that explicitly includes those graphs. With Kallysto, when you compile your LaTeX source files, the Jupyter Notebooks that analyze the data are run and generate the results in real time, which LaTeX then uses to produce the final PDF document.

Prof. Barry Smyth's idea is to make scientific publications truly reproducible by scripting everything, from the data to the results and finally to the LaTeX document.
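The same idea can be tried by hand on a small scale. The sketch below is not Kallysto and does not use its API; it only illustrates the principle, with made-up numbers and hypothetical names (percent_improvement, results.tex and the \improvement macro are all my own choices). A script computes the result from the data and writes it out as a LaTeX macro, so the manuscript never contains a hand-typed number.

```python
from pathlib import Path


def percent_improvement(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


def latex_macro(name, value):
    """Format a number as a LaTeX macro definition with one decimal place."""
    return "\\newcommand{\\%s}{%.1f}" % (name, value)


def export_result(path, baseline, improved):
    """Compute the improvement and write it to a file that LaTeX can read."""
    line = latex_macro("improvement", percent_improvement(baseline, improved))
    Path(path).write_text(line + "\n", encoding="utf-8")
    return line


if __name__ == "__main__":
    # Hypothetical measurements; replace with values computed from real data.
    print(export_result("results.tex", baseline=50.0, improved=65.0))
```

The manuscript would then load the file with \input{results.tex} in its preamble and write \improvement\% wherever the number is needed, so rerunning the script on new data updates the paper on the next compile.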


[1] Jupyter notebooks

[2] Netflix Papermill tool

[3] The tool made by Prof. Barry Smyth called Kallysto

Friday, December 21, 2018

Google Colaboratory Notebook with Data from Google Drive

Until recently, I was using individual Python scripts, with data files here and there, for data analysis in my research work. After understanding the power of documenting my experiments alongside the code and graphs, I started to use Jupyter Notebooks. However, I still had a limitation. Jupyter Notebook runs on my local computer with the data files. Every time I do some analysis, I have to do it locally and upload the results to Google Drive as a backup. Whenever I want to work on it again, I have to download the Jupyter Notebook and the data files, which is a big hassle.

Today I realized that Google provides an online tool to run Python notebooks while the data and the notebook file stay in Google Drive. There's no need to download my data and Python scripts to my local computer each time I want to do some analysis. Here's how to use Google Colaboratory for this purpose.

(1) Create a directory in Google Drive for the Colab Notebook. Let's say I've created the directory "Google-Colab-Demo" in the following location.

My Drive > UCD > Asanka's PhD > Experiments > Google-Colab-Demo

(2) Right-click inside the created directory and select Colaboratory from the menu. It will open a new web browser tab with a new Notebook. Give the notebook a name; I'll set it to plotting.ipynb.

(3) On the local computer, create a text file named data.csv and add the following content. Then upload it to the directory we created in Google Drive above.


(4) Add a text cell and provide some details about what we are going to do.

(5) Add a code cell and insert the following code into it. Note that the full path to the data.csv file can be obtained by right-clicking the data file in the file browser in the left-hand side pane and then selecting the menu option Copy path.

import matplotlib.pyplot as plt
import numpy as np

# Read the two comma-separated columns of data.csv into x and y.
x, y = np.loadtxt("/content/drive/My Drive/UCD/Asanka's PhD/Experiments/Google-Colab-Demo/data.csv", delimiter=',', unpack=True)
plt.plot(x, y, label='Loaded from file!')
plt.title('Interesting Graph\nCheck it out')
plt.legend()  # show the label passed to plot()
plt.show()

(6) Now, run the code segment by clicking on the Play icon in the left corner of the code cell. The resulting graph appears below the cell.
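To see what np.loadtxt does with delimiter=',' and unpack=True before pointing it at the real file, the sketch below reads the same kind of data from an in-memory string. The three rows are hypothetical stand-ins for data.csv, which I assume holds two comma-separated numeric columns; load_columns is just a helper name chosen for this example.

```python
import io

import numpy as np

# Hypothetical contents standing in for data.csv: one "x,y" pair per line.
SAMPLE_CSV = "1,5\n2,3\n3,4\n"


def load_columns(text):
    """Parse two comma-separated columns and return them as separate arrays."""
    # unpack=True transposes the result, so each column becomes its own array.
    x, y = np.loadtxt(io.StringIO(text), delimiter=",", unpack=True)
    return x, y


if __name__ == "__main__":
    x, y = load_columns(SAMPLE_CSV)
    print(x)  # first column
    print(y)  # second column
```

Without unpack=True, loadtxt would return a single two-dimensional array with one row per line of the file, and the x, y assignment would not split it by column.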

Some extra work...

Sometimes you have a Jupyter Notebook with its data on the local computer, which you have copied to Google Drive, and now you want to run the same Notebook in Google Colab. In that case, first right-click the Jupyter Notebook in Google Drive and open it with Colaboratory. If your Google Drive does not appear to be mounted automatically in the file browser pane, follow the steps given below.

(7) Mount Google Drive into the Notebook by running the following code. It will prompt for an authentication code, which should be typed in. A file browser should now be available in the left-hand side of the screen, with access to the Google Drive files at the mounted location.

from google.colab import drive
drive.mount('/content/drive')  # mount point matches the path used in step (5)

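(8) If we want to import another Python file in Google Drive as a module within our Notebook, first add the directory that contains the file to sys.path. Then import the module as in a normal Python program. Note that a module name cannot contain a hyphen, so the file should be named my_module.py rather than my-module.py.

import sys
# Assumes the module sits in the demo directory from step (1); adjust the path as needed.
sys.path.append("/content/drive/My Drive/UCD/Asanka's PhD/Experiments/Google-Colab-Demo")
import my_module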

Now everything should be good to go like a normal Colab Notebook.
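The import mechanism from step (8) can be tried without Google Drive. In this sketch a temporary directory stands in for the Drive folder, and the module my_module with its greet function is hypothetical, as is the helper import_from_directory. It only shows how adding a directory to sys.path makes the files in it importable.

```python
import importlib
import os
import sys
import tempfile


def import_from_directory(directory, module_name):
    """Make `directory` importable, then import `module_name` from it."""
    if directory not in sys.path:
        sys.path.append(directory)
    importlib.invalidate_caches()  # pick up files created after start-up
    return importlib.import_module(module_name)


def demo():
    # A temporary directory stands in for the folder on Google Drive.
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "my_module.py"), "w") as f:
            f.write("def greet():\n    return 'hello from my_module'\n")
        module = import_from_directory(folder, "my_module")
        return module.greet()


if __name__ == "__main__":
    print(demo())
```

In Colab, the directory argument would be the mounted Drive folder that holds the Python file instead of a temporary directory.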