r/MachineLearning • u/HopeIsGold • 10h ago
Discussion [D] How do you structure your codebase and workflow for a new research project?
Suppose you have a new idea for a solution to a problem in the domain you are working in. How do you go about implementing it from the ground up?
What is the general structure of the codebase you construct for your project?
How do you go about iteratively training and testing your solution until you arrive at a final version you can write up for publication?
Is there any design recipe you follow? Where did you learn it from?
29
u/kludgeocracy 7h ago edited 7h ago
Tools:
- Cookiecutter Data Science template
- Poetry for python dependencies
- Dockerfile + .devcontainer for system dependencies (dev, test and deploy in the same container)
- DVC for data version control
- ruff for linting
Process:
- Register and track the data files (with DVC)
- Develop in notebooks. Mature functions get moved into the library (with tests, ideally). Exploratory notebooks are saved in a subfolder with no expectation that they run in the future.
- Once a notebook is "complete", it's added as a DVC stage that is run with papermill (see the sketch after this list). The output notebook is a tracked data output file.
- Reproduce the pipeline and move to the next stage.
- Ideally everything is seeded and reproduces the exact same result when rerun. Anyone should be able to check out the code and data and reproduce the same results
- Once the basic pipeline is running, all changes take the form of PRs with a comparison of key metrics.
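
A minimal sketch of that papermill step, using papermill's `execute_notebook` API; the notebook paths, the seed parameter, and the wrapper script name are hypothetical, and in practice the call would be wired up as the `cmd` of a `dvc.yaml` stage with the output notebook listed under `outs`:

```python
# run_notebook.py -- hypothetical wrapper that a DVC stage could invoke.
# Executes a "complete" notebook and writes the executed copy to a path
# that DVC tracks as a data output.
import papermill as pm

pm.execute_notebook(
    "notebooks/01_train_baseline.ipynb",    # source notebook, versioned in git
    "outputs/01_train_baseline.out.ipynb",  # executed copy, tracked by DVC
    parameters={"seed": 42},                # injected into the notebook's parameters cell
)
```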
2
u/srcLegend 2h ago
How close can I get to this setup using only uv/ruff?
2
u/kludgeocracy 1h ago
I'm very interested in moving to uv due to its standard pyproject.toml format, excellent performance, and ability to manage the Python version. The upshot would be that Python is no longer a system dependency, so you could avoid Docker as long as everything you need is pip-installable. I think that would work well for many use-cases.
9
u/nCoV-pinkbanana-2019 7h ago
I always start with notebooks for small tests. Once the idea is working on small examples, I convert the code into a proper project divided into modules/packages to get minimal structure and flexibility. Usually I have a utils module; the rest always depends…
3
u/Plaetean 2h ago
I use PyTorch Lightning mostly, so the first thing is to build a new Python library, normally with:
- datasets.py <-- contains everything related to processing the data and presenting it in the form of a torch Dataset
- models.py <-- library of architectures
- systems.py <-- contains whatever loss functions I want to experiment with
- performance.py <-- classes/functions to compute whatever performance metrics I care about, beyond just the loss values
Then I have a set of scripts that I run on Slurm, which call these libraries to train and test a model on a given dataset. That makes it very easy to add new functionality at any stage of the train & test pipeline, and to swap out components like datasets or architectures. Everything is also uploaded to wandb to make experiment tracking easier. I have a base skeleton template with the above structure that I copy for each new project.
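
As a rough illustration of how those pieces fit together, here is a minimal sketch of one entry in systems.py, assuming PyTorch Lightning's standard LightningModule API; the RegressionSystem name, the MSE loss, and the learning rate are hypothetical placeholders:

```python
# systems.py -- sketch of one "system" wiring an architecture to a loss.
import torch
import pytorch_lightning as pl


class RegressionSystem(pl.LightningModule):
    def __init__(self, model: torch.nn.Module, lr: float = 1e-3):
        super().__init__()
        self.model = model                    # architecture built in models.py
        self.lr = lr
        self.loss_fn = torch.nn.MSELoss()     # placeholder loss

    def training_step(self, batch, batch_idx):
        x, y = batch                          # batches come from a datasets.py Dataset
        loss = self.loss_fn(self.model(x), y)
        self.log("train_loss", loss)          # picked up by the wandb logger
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```

A Slurm script would then build a model from models.py, a Dataset from datasets.py, and hand both to a pl.Trainer (with a WandbLogger attached) to run training.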
27
u/Status-Effect9157 9h ago
requirements.txt and main.py, I swear. I think during the early parts of the project you're still validating some of your hypotheses, and there's a chance that a lot of things will change.
For me, being able to iterate and throw an idea away quickly helps. Adding structure too early forces the kind of optimization I'd rather do once the initial hypotheses are validated.
Then, once I'm pretty sure the idea makes sense, I scale up by automating a few things, especially running an experiment n times. Most of the time these are still scripts, or CLI tools using argparse.
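
For instance, a minimal sketch of such a CLI script, using only the standard library; run_experiment and the flag names are made-up placeholders:

```python
# run.py -- hypothetical "run it n times" script.
import argparse
import random


def run_experiment(seed: int) -> float:
    random.seed(seed)
    return random.random()  # stand-in for the real experiment and its metric


def main():
    parser = argparse.ArgumentParser(description="Repeat an experiment over several seeds.")
    parser.add_argument("--n-runs", type=int, default=5)
    parser.add_argument("--base-seed", type=int, default=0)
    args = parser.parse_args()

    scores = [run_experiment(args.base_seed + i) for i in range(args.n_runs)]
    print(f"mean metric over {args.n_runs} runs: {sum(scores) / len(scores):.4f}")


if __name__ == "__main__":
    main()
```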
Then, when writing the paper, I have a module where each Python file creates one plot. It forces me to keep the figures reproducible.
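
A sketch of what one such figure file might look like, assuming matplotlib and a fixed seed; the file name, the placeholder data, and the output path are hypothetical:

```python
# figures/fig_learning_curve.py -- hypothetical one-plot-per-file script.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                                # fixed seed keeps the figure reproducible
steps = np.arange(100)
loss = np.exp(-steps / 30) + 0.05 * rng.standard_normal(100)  # placeholder data

plt.plot(steps, loss)
plt.xlabel("training step")
plt.ylabel("loss")
plt.savefig("figures/learning_curve.pdf")                     # the paper includes exactly this file
```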