There are many ways to organise an IPython research project. I manage a team of 5 data scientists and 3 data engineers, and I have found the following tips work well for our use case:
This is a summary of my PyData London talk:
http://www.slideshare.net/vladimirkazantsev/clean-code-in-jupyter-notebook
1. Create a shared (multi-project) utils library
You most likely have to reuse or repeat some code across different research projects. Start refactoring those pieces into a "common utils" package. Create a setup.py file and push the module to GitHub (or similar), so that team members can "pip install" it from VCS.
Examples of functionality to put in there are:
- Data Warehouse or Storage access functions
- common plotting functions
- re-usable math/stats methods
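As an illustration of a re-usable stats method, a helper in such a package might look like the sketch below. The package name team_utils, the module stats.py and the function bootstrap_mean_ci are hypothetical names chosen for this example, and it sticks to the standard library so it does not assume any particular stack.

```python
# team_utils/stats.py  (hypothetical module inside the shared package)
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Return a (low, high) percentile-bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so notebook re-runs give the same numbers
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    low_idx = int((alpha / 2) * n_resamples)
    high_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[low_idx], means[high_idx]
```

Once the package is installed from VCS, every notebook can use the same implementation with `from team_utils.stats import bootstrap_mean_ci` instead of pasting its own copy.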
2. Split your fat master notebook into smaller notebooks
In my experience, a good length for a source file (in any language) is only a few screens (100-400 lines). A Jupyter notebook is still a source file, just with output! Reading a notebook with 20+ cells is very hard. I like my notebooks to have 4-10 cells at most.
Ideally, each notebook should have one "hypothesis-data-conclusions" triplet.
Example of splitting the notebook:
1_data_preparation.ipynb
2_data_validation.ipynb
3_exploratory_plotting.ipynb
4_simple_linear_model.ipynb
5_hierarchical_model.ipynb
playground.ipynb
Save the output of 1_data_preparation.ipynb to a pickle with df.to_pickle('clean_data.pkl') (or to CSV or a fast DB), and load it with pd.read_pickle("clean_data.pkl") at the top of each subsequent notebook.
3. It is not Python - it is IPython Notebook
What makes a notebook unique is its cells. Use them well.
Each cell should be an "idea-execution-output" triplet. If a cell does not output anything, combine it with the following cell. An import cell should output nothing; that is its expected output.
If a cell has several outputs, it may be worth splitting it.
Hiding imports may or may not be a good idea:
from myimports import *
Your reader may want to figure out what exactly you are importing, so that she can use the same things in her own research. So use this with caution. We do use it for pandas, numpy, matplotlib and sql, however.
Hiding "secret sauce" in /helpers/model.py is bad:
myutil.fit_model_and_calculate(df)
This may save you typing and remove duplicate code, but your collaborator will have to open another file to figure out what's going on. Unfortunately, the notebook (Jupyter) is quite an inflexible and basic environment, but you still don't want to force your reader to leave it for every piece of code. I hope that IDEs will improve in the future, but for now, keep the "secret sauce" inside the notebook, and put "boring and obvious utils" wherever you see fit. DRY still applies; you have to find the balance.
This should not stop you from packaging re-usable code into functions or even small classes. But "flat is better than nested".
4. Keep notebooks clean
You should be able to "Restart & Run All" at any point in time.
Each re-run should be fast! This means you may have to invest in writing some caching functions. Maybe you even want to put those into your "common utils" module.
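One possible shape for such a caching function is a small decorator that pickles results to disk. This is only a sketch under simple assumptions: disk_cache, demo_cache_dir and expensive_sum are made-up names, the arguments must be picklable, and nothing invalidates the cache when the function body changes.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_cache(cache_dir):
    """Decorator factory: cache a function's return value as a pickle in cache_dir."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            os.makedirs(cache_dir, exist_ok=True)
            # Key the cache file on the function name and its arguments.
            key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = os.path.join(cache_dir, hashlib.sha256(key).hexdigest() + ".pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)  # cache hit: skip the slow call
            result = func(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator


# Demo only: a throwaway directory. A real project would use a persistent path.
demo_cache_dir = tempfile.mkdtemp()


@disk_cache(demo_cache_dir)
def expensive_sum(n):
    # Stand-in for a slow query or computation.
    return sum(range(n))


first = expensive_sum(1_000_000)   # computed, then written to disk
second = expensive_sum(1_000_000)  # read back from the pickle file
```

With a persistent cache directory, "Run All" after a kernel restart only pays for the slow step the first time.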
Each cell should be executable multiple times, without the need to re-initialise the notebook. This saves you time and keeps the code more robust.
A cell may, however, depend on state created by previous cells. Making each cell completely independent from the cells above is an anti-pattern, IMO.
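To make the difference concrete, here is a toy sketch in which two functions stand in for notebook cells. The names raw_prices, fragile_cell and rerunnable_cell are invented for the example. Both "cells" depend on state from an earlier cell, but only one of them can be re-run safely.

```python
# State created by an earlier cell; later cells are allowed to depend on it.
raw_prices = [10.0, 12.5, 8.0]


def fragile_cell():
    # Mutates the shared state in place: every re-run scales the values again.
    for i, price in enumerate(raw_prices):
        raw_prices[i] = price * 100


def rerunnable_cell():
    # Derives a new value from the earlier state and leaves the input untouched.
    return [price * 100 for price in raw_prices]


prices_in_cents = rerunnable_cell()
prices_in_cents_again = rerunnable_cell()  # same result on every re-run
```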
When you are done with the research, you are not done with the notebook. Refactor.
5. Create a project module, but be very selective
If you keep re-using a plotting or analytics function, do refactor it into this module. But in my experience, people expect to read and understand a notebook without opening multiple util sub-modules. So naming your sub-routines well is even more important here than in normal Python.
"Clean code reads like well written prose" Grady Booch (developer of UML)
6. Host Jupyter server in the cloud for the entire team
You will have one environment, so everyone can quickly review and validate research without the need to match the environment (even though conda makes this pretty easy).
You can also configure defaults, such as the mpl style/colors, and make matplotlib inline by default.
In ~/.ipython/profile_default/ipython_config.py, add the line:
c.InteractiveShellApp.matplotlib = 'inline'
7. (experimental idea) Run a notebook from another notebook, with different parameters
Quite often you may want to re-run the whole notebook, but with different input parameters.
To do this, you can structure your research notebook as follows:
Place a params dictionary in the first cell of the "source notebook".
params = dict(platform='iOS',
              start_date='2016-05-01',
              retention=7)
df = get_data(params ..)
do_analysis(params ..)
And in another (higher logical level) notebook, execute it using this function:
import io

import nbformat
from IPython import get_ipython


def run_notebook(nbfile, **kwargs):
    """
    example:
    run_notebook('report.ipynb', platform='google_play', start_date='2016-06-10')
    """

    def read_notebook(nbfile):
        # Accept the notebook name with or without the .ipynb extension.
        if not nbfile.endswith('.ipynb'):
            nbfile += '.ipynb'

        with io.open(nbfile, encoding='utf-8') as f:
            nb = nbformat.read(f, as_version=4)
        return nb

    ip = get_ipython()
    gl = ip.ns_table['user_global']
    gl['params'] = None
    arguments_in_original_state = True

    for cell in read_notebook(nbfile).cells:
        if cell.cell_type != 'code':
            continue
        ip.run_cell(cell.source)

        # Override params once, right after the cell that defines it has run.
        if arguments_in_original_state and isinstance(gl['params'], dict):
            gl['params'].update(kwargs)
            arguments_in_original_state = False
Whether this "design pattern" proves to be useful is yet to be seen. We had some success with it: at least we stopped duplicating notebooks only to change a few inputs.
Refactoring the notebook into a class or module breaks the quick "idea-execute-output" feedback loop that cells provide. And, IMHO, it is not "ipythonic".
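To see the override mechanism of run_notebook in isolation, the sketch below mimics it with plain exec and a list of strings instead of IPython and a real .ipynb file. It is an illustration only, not a replacement for the function above; run_cells and the two example "cells" are invented for this purpose.

```python
def run_cells(cell_sources, **overrides):
    """Run code cells in one shared namespace, overriding `params` once it is defined."""
    namespace = {"params": None}
    overrides_pending = True
    for source in cell_sources:
        exec(source, namespace)
        # Apply the overrides right after the cell that creates the params dict.
        if overrides_pending and isinstance(namespace["params"], dict):
            namespace["params"].update(overrides)
            overrides_pending = False
    return namespace


cells = [
    "params = dict(platform='iOS', start_date='2016-05-01', retention=7)",
    "summary = '{platform} from {start_date}, day-{retention} retention'.format(**params)",
]
result = run_cells(cells, platform='google_play', start_date='2016-06-10')
print(result["summary"])  # google_play from 2016-06-10, day-7 retention
```

Because the overrides are applied only after the first cell has run, this relies on the source notebook using params exclusively in later cells.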
8. Write (unit) tests for shared library in notebooks and run with py.test
There is a plugin for py.test that can discover and run tests inside notebooks!
https://pypi.python.org/pypi/pytest-ipynb
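How the plugin expects test cells to be laid out is described in its own documentation and is not reproduced here. The sketch below only shows the kind of plain pytest-style test functions one might write for a shared helper; retention_rate is a hypothetical helper defined inline so the example is self-contained.

```python
def retention_rate(active_users, cohort_size):
    """Hypothetical shared-library helper: fraction of a cohort that is still active."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return active_users / cohort_size


def test_retention_rate_is_fraction_of_cohort():
    assert retention_rate(25, 100) == 0.25


def test_retention_rate_rejects_empty_cohort():
    try:
        retention_rate(5, 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an empty cohort")
```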