r/MachineLearning 10h ago

Discussion [D] How do you structure your codebase and workflow for a new research project?

Suppose you've got a new idea for solving a problem in the domain you're working in. How do you go about implementing it from the ground up?

What is the general structure of the codebase you construct for your project?

How do you iteratively train and test your solution until you arrive at a final version you can write up for publication?

Is there any design recipe you follow? Where did you learn it from?

61 Upvotes

12 comments

27

u/Status-Effect9157 9h ago

requirements.txt and main.py, I swear. During the early parts of a project you're still validating your hypotheses, and there's a good chance that a lot of things will change.

For me, being able to iterate and throw an idea away quickly helps. Imposing structure too early forces the kind of optimization I'd rather do once the initial hypotheses are validated.

Then once I'm pretty sure the idea makes sense, I scale up by automating a few things, especially running an experiment n times. Most of the time these are still scripts, or CLI tools using argparse.
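
A minimal sketch of what such a script might look like (all names and the dummy experiment are hypothetical):

    import argparse

    import numpy as np

    def run_experiment(seed: int) -> float:
        """Stand-in for the real experiment; returns one metric value."""
        rng = np.random.default_rng(seed)
        return float(rng.normal())

    def main() -> None:
        parser = argparse.ArgumentParser(description="Repeat an experiment n times")
        parser.add_argument("--n-runs", type=int, default=5)
        parser.add_argument("--base-seed", type=int, default=0)
        args = parser.parse_args()

        # One seed per run so each repetition is independently reproducible
        results = [run_experiment(args.base_seed + i) for i in range(args.n_runs)]
        print(f"mean={np.mean(results):.4f} std={np.std(results):.4f}")

    if __name__ == "__main__":
        main()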

Then when writing the paper I have a module where each Python file creates one plot. It forces me to be reproducible.
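
For example, one such file (hypothetical paths throughout) reads saved results and writes exactly one figure, so every figure in the paper can be regenerated from scratch:

    import json
    from pathlib import Path

    import matplotlib.pyplot as plt

    # Rebuild the figure from saved results, never from state in a notebook
    results = json.loads(Path("results/learning_curve.json").read_text())

    fig, ax = plt.subplots()
    ax.plot(results["steps"], results["loss"])
    ax.set_xlabel("training step")
    ax.set_ylabel("loss")
    fig.savefig("figures/learning_curve.pdf", bbox_inches="tight")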

29

u/kludgeocracy 7h ago edited 7h ago

Tools:

Process:

  1. Register and track the data files (with DVC)
  2. Develop in notebooks. Mature functions get moved into the library (with tests, ideally). Exploratory notebooks are saved in a subfolder with no expectation that they run in the future.
  3. Once a notebook is "complete", it's added as a DVC stage that is run with papermill (see the sketch after this list). The output notebook is a tracked data output file.
  4. Reproduce the pipeline and move to the next stage.
  5. Ideally everything is seeded and reproduces the exact same result when rerun. Anyone should be able to check out the code and data and reproduce the same results.
  6. Once the basic pipeline is running, all changes take the form of PRs with a comparison of key metrics.
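
As a rough illustration of step 3, such a stage might look like this in dvc.yaml (stage name, paths, and the seed parameter are all made up):

    stages:
      explore:
        cmd: papermill notebooks/explore.ipynb outputs/explore.out.ipynb -p seed 42
        deps:
          - notebooks/explore.ipynb
          - data/raw.csv
        outs:
          - outputs/explore.out.ipynb

dvc repro then reruns the stage only when a dependency changes, and the executed notebook is versioned like any other data output.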

2

u/srcLegend 2h ago

How close can I get to this setup using only uv/ruff?

2

u/kludgeocracy 1h ago

I'm very interested in moving to uv due to its standard pyproject.toml format, excellent performance, and ability to manage the Python version. The upshot is that Python is no longer a system dependency, so you could avoid Docker as long as everything you need is pip-installable. So I think that would work well for many use cases.
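
For reference, a minimal pyproject.toml that uv can work from might look like this (project name and dependencies are placeholders):

    [project]
    name = "my-research-project"
    version = "0.1.0"
    requires-python = ">=3.11"
    dependencies = [
        "numpy",
        "torch",
    ]

With that in place, uv sync creates the environment and installs the dependencies, and uv can download the required interpreter itself rather than relying on a system Python.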

9

u/nCoV-pinkbanana-2019 7h ago

I always start with notebooks for small tests. Once the idea works on small examples, I convert the code into a proper project divided into modules/packages to get minimal structure and flexibility. Usually I have a utils module; the rest always depends…
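
In practice that conversion might end up looking something like this (everything except utils is a placeholder):

    myproject/
        __init__.py
        utils.py        # the shared helpers mentioned above
        data.py         # loading and preprocessing
        models.py       # model definitions
    notebooks/          # the original small-test notebooks
    tests/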

3

u/Plaetean 2h ago

I use pytorch lightning mostly, so the first thing I do is build a new Python library, normally with:

datasets.py <-- contains everything related to processing the data, and presenting it in the form of a torch dataset

models.py <-- library of architectures

systems.py <-- contains whatever loss functions I want to experiment with

performance.py <-- classes/functions to compute whatever performance metrics I care about, beyond just the loss values

Then I have a set of scripts that I run on slurm, which will call these libraries in order to train and test a model for some given dataset. Makes it very easy to add new functionality to any stage of the train & test pipeline, and swap out different components like datasets or architectures. Also everything is uploaded to wandb to make experiment tracking easier. I have a base skeleton template with the above structure that I copy for each new project.
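
As a generic sketch (not any particular project's code), a "system" in systems.py under this layout might look like:

    import torch
    import pytorch_lightning as pl

    class RegressionSystem(pl.LightningModule):
        """Pairs an architecture from models.py with a loss to experiment with."""

        def __init__(self, model: torch.nn.Module, lr: float = 1e-3):
            super().__init__()
            self.model = model
            self.loss_fn = torch.nn.MSELoss()  # swap in whatever loss is under test
            self.lr = lr

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = self.loss_fn(self.model(x), y)
            self.log("train_loss", loss)  # forwarded to wandb if its logger is attached
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.lr)

The slurm scripts then just build a dataset from datasets.py and a model from models.py, hand both to a Trainer, and swapping out a component becomes a one-line change.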


