What are some of the best practices for sharing data and models with colleagues?

Question

As a data scientist who recently joined a new team, I wanted to ask the community how they share data and models with their colleagues. Currently I resort to storing data on a central server or location that all of us can access (which means managing Unix permissions, etc.). For models, I tend to send a weights file to my colleague and share my GitHub repository. I've found both pretty cumbersome. What have some of you done?

xtian · Answer

In the Partially Derivative podcast episode "Data Science On The Silicon Beach", the host interviews Maksim Percherskiy, Chief Data Officer for the City of San Diego.

Talking about the stack he uses for the City of San Diego: (08:50) "The way we move data around ... we use Airflow [...] and Airflow is just Python." Percherskiy goes on to characterize the data-sharing problem in the context of city government.

Emre · Answer

For big files, I use cloud storage (Google, Amazon, Microsoft, or whichever ecosystem your company is on), with folders named after the issue/project ticket name or number. These services support file versioning, by the way. Small files I just attach to the ticket. If I have to share something small and transient with a handful of people, I can use email or our corporate chat application.

Answered by Emre on March 31, 2021
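
The ticket-named folder convention can be kept consistent with a small helper that builds the storage key before upload. This is only a sketch: the function name `ticket_object_key`, the `kind` sub-folder, and the ticket ID `DS-142` are hypothetical, and the actual upload call depends on which cloud provider you use.

```python
from pathlib import PurePosixPath


def ticket_object_key(ticket: str, filename: str, kind: str = "data") -> str:
    """Build a storage key of the form '<ticket>/<kind>/<file name>'."""
    ticket = ticket.strip()
    if not ticket or "/" in ticket:
        raise ValueError("ticket must be a non-empty name without '/'")
    # Keep only the base name so local directory names do not leak into the key.
    return str(PurePosixPath(ticket) / kind / PurePosixPath(filename).name)


print(ticket_object_key("DS-142", "exports/train.csv"))  # DS-142/data/train.csv
```

Everyone who knows the ticket number then knows where to look, without having to agree on paths case by case.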

Rishabh Batra · Answer

To store and share data among colleagues, we use cloud storage (S3, Google Cloud Storage), where a simple folder structure can hold all your datasets. There is no single way to share models, since it depends entirely on the model type. One common approach is to serialize the model to a binary file (pickle in Python) and share that file, which you can also encrypt if you are passing around sensitive data.

If it is an unsupervised learning model, you can share the codebase directly.
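
A minimal sketch of the pickle approach, using only the standard library. The helper names `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for a real model. The SHA-256 digest is an addition not mentioned above: sent alongside the file, it lets the receiver check that the file arrived unchanged. Note that unpickling can execute arbitrary code, so only load pickles from people you trust; a checksum detects corruption but does not make an untrusted file safe.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Pickle `model` to `path` and return the SHA-256 hex digest of the file."""
    data = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
    Path(path).write_bytes(data)
    return hashlib.sha256(data).hexdigest()


def load_model(path, expected_sha256):
    """Unpickle the file at `path` only if it matches the expected digest."""
    data = Path(path).read_bytes()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch: file is corrupted or not the one shared")
    return pickle.loads(data)


# Stand-in "model": any picklable object is handled the same way.
model = {"weights": [0.1, 0.2, 0.3], "bias": 0.5}
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    digest = save_model(model, model_path)     # sender side
    restored = load_model(model_path, digest)  # receiver side
print(restored == model)  # True
```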

Nick Corona · Answer

You can try DVC, which stands for Data Version Control.
https://dvc.org/

Piotr Rarus · Answer

I'd say this question is much broader than simply sharing files. How do you do research as a team in an agile fashion? Sadly, there are only a few solutions on the market. As mentioned before, most people hack it together using services that are already available. Some time ago I stumbled upon neptune.ml. It looks nice, though it is quite pricey. Most of the time I do something similar myself: I try to stick to the git-flow convention and keep a separate folder named research in the repo, next to the vanilla git-flow branches. Also, keep your data in a separate repo, so you know which data version you ran your experiments against.

Answered by Piotr Rarus on March 31, 2021
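
Knowing which data version an experiment ran against comes down to storing a data identifier next to the experiment's parameters. With a separate data repo that identifier would be the commit ID; the sketch below swaps in a content hash of the data folder instead, so it runs without Git. The function names and the record fields are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dataset_fingerprint(data_dir):
    """Hash every file under `data_dir` (relative path plus contents), in sorted order."""
    root = Path(data_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between file name and contents
        digest.update(path.read_bytes())
    return digest.hexdigest()


def experiment_record(name, params, data_dir):
    """Bundle the experiment name, its parameters, and the data fingerprint."""
    return {
        "experiment": name,
        "params": params,
        "data_sha256": dataset_fingerprint(data_dir),
    }


# Example with a throwaway folder standing in for the data repo checkout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text("x,y\n1,2\n")
    record = experiment_record("baseline", {"lr": 0.01}, tmp)
print(json.dumps(record, indent=2))
```

Committing such a record to the research folder means a colleague can later tell whether their copy of the data matches the one used for a given result.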

lcrmorin · Answer

Honestly, once you get to something serious (big or evolving), the main problem is sharing and updating data. Once a solution is devised for data, it is not really hard to adapt it to models.
Depending on the volume and the usage, data can be stored, exchanged, and accessed through a wide range of solutions: an old historical or external system with its associated language (SAS), internal databases on a Linux server with a 'lab' interface, cloud solutions, or just CSV files for smaller data sets. Once this is arranged and the way to access the data is settled, it is rather easy to adapt the solution to models. Sometimes it takes a bit of effort to handle access rights properly, but it shouldn't be too much of a hassle if the architecture is set up with that goal in mind.
So, regarding models, depending on the context, the solution might be sharing SAS programs on a shared workspace, pickled Python models on a Linux server, or simply Excel files through email. More recently, new tools have appeared and led to new solutions: sharing experiments through notebooks (Jupyter notebooks or R Markdown) and using Git for versioning models.
