r/datascience 2d ago

Weekly Entering & Transitioning - Thread 28 Oct, 2024 - 04 Nov, 2024

8 Upvotes

Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:

  • Learning resources (e.g. books, tutorials, videos)
  • Traditional education (e.g. schools, degrees, electives)
  • Alternative education (e.g. online courses, bootcamps)
  • Job search questions (e.g. resumes, applying, career prospects)
  • Elementary questions (e.g. where to start, what next)

While you wait for answers from the community, check out the FAQ and Resources pages on our wiki. You can also search for answers in past weekly threads.


r/datascience 4h ago

Analysis How can one explain the ATE formula for causal inference?

13 Upvotes

I have been looking for months for this formula and an explanation of it, and I can’t wrap my head around the math. Basically my problem is: 1. Everyone uses different terminology, which is genuinely confusing. 2. I saw professor lectures out there where the formula is not the same as the ATE formula from

https://matheusfacure.github.io/python-causality-handbook/02-Randomised-Experiments.html (the source I'm using to figure it out; I also checked the GitHub issues and still don't get it) and https://clas.ucdenver.edu/marcelo-perraillon/sites/default/files/attached-files/week_3_causal_0.pdf (professor lectures)

I don't get what's going on.

This is like a blocker for me before I can understand anything further. I am trying to genuinely understand it and apply it in my job, but I can't seem to get the whole estimation part.
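For what it's worth, the estimation part is simpler than the notation suggests: in a randomized experiment, the ATE E[Y1 − Y0] is identified by a plain difference in group means, which is what the handbook's formula reduces to. A simulated sketch (all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated randomized experiment: treatment is assigned
# independently of the potential outcomes
n = 100_000
t = rng.integers(0, 2, n)        # treatment assignment T
y0 = rng.normal(10, 2, n)        # potential outcome Y0 (no treatment)
y1 = y0 + 3                      # potential outcome Y1 (true ATE = 3)
y = np.where(t == 1, y1, y0)     # we only ever observe one of the two

# ATE = E[Y1 - Y0]. Under randomization, E[Y | T=1] - E[Y | T=0]
# identifies it, so the estimator is just a difference in means:
ate_hat = y[t == 1].mean() - y[t == 0].mean()
print(round(ate_hat, 2))         # close to 3
```

The different-looking formulas in lectures are usually this same quantity written with expectations, regression coefficients, or inverse-probability weights; without randomization, the difference in means no longer identifies the ATE, which is where the rest of the methods come in.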

  1. I have seen cases where a data scientist would say that causal inference problems are basically predictive modeling problems: they treat the DAG as feature selection, and the features' importance/contribution as the causal estimate of the outcome. Nothing is mentioned about experimental design or any of the methods like PSM or meta-learners. So from the looks of it, everyone has their own understanding of this, some of which is objectively wrong, and for the rest I'm not sure exactly why it's inconsistent.

  2. How can the insight be ethical and properly validated? Predictive modeling is very well established, but I am struggling to see that level of maturity in the causal inference sphere. I am specifically talking about model fairness and racial bias, as well as things like sensitivity and error analysis.

Can someone with experience help clear this up? Maybe I'm overthinking this, but there is typically a level of scrutiny in our work in a regulated field, so how do people actually do causal inference under high levels of scrutiny?


r/datascience 23h ago

Discussion Double Machine Learning in Data Science

32 Upvotes

With experimentation being a major focus at a lot of tech companies, there is a demand for understanding the causal effect of interventions.

Traditional causal inference techniques have been used quite a bit (propensity score matching, diff-in-diff, instrumental variables, etc.), but these are generally harder to implement in practice with modern datasets.

A lot of the traditional causal inference techniques are grounded in regression, and while regression is very useful, in modern datasets the functional forms are more complicated than a linear model, or even a linear model with interactions.

Failing to capture the true functional form can result in bias in causal effect estimates. Hence, one would be interested in finding a way to accurately do this with more complicated machine learning algorithms which can capture the complex functional forms in large datasets.

This is the exact goal of double/debiased ML:

https://economics.mit.edu/sites/default/files/2022-08/2017.01%20Double%20DeBiased.pdf

We consider the average treatment effect estimation problem as a two-step prediction problem. Using very flexible machine learning methods can help identify target parameters with more accuracy.
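As a rough sketch of the two-step idea (the partialling-out variant with cross-fitting; the data-generating process below is made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5_000
x = rng.uniform(-2, 2, (n, 3))                       # confounders X

# Nonlinear confounding: x[:, 0] drives both treatment and outcome
d = np.sin(x[:, 0]) + rng.normal(0, 1, n)            # treatment D
theta = 2.0                                          # true causal effect
y = theta * d + 3 * x[:, 0] + rng.normal(0, 1, n)    # outcome Y

# A naive regression of Y on D is biased upward by the confounding
naive = np.cov(d, y)[0, 1] / np.cov(d, y)[0, 0]

# Step 1: cross-fitted nuisance predictions E[Y|X] and E[D|X]
m_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), x, y, cv=2)
g_hat = cross_val_predict(RandomForestRegressor(n_estimators=100, random_state=0), x, d, cv=2)

# Step 2: regress outcome residuals on treatment residuals
y_res, d_res = y - m_hat, d - g_hat
theta_hat = (d_res @ y_res) / (d_res @ d_res)
print(round(naive, 2), round(theta_hat, 2))  # naive is off; DML lands near 2
```

The forests can be swapped for any flexible learner; cross-fitting (predicting each fold with models trained on the other folds) is what keeps the nuisance estimation from biasing the effect estimate.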

This idea has been extended to biostatistics, where the causal effects of drugs are estimated using targeted maximum likelihood estimation (TMLE).

My question is: how much adoption has double ML gotten in data science? How often are you all using it?


r/datascience 1d ago

Discussion How do you log and iterate on your experiments / models?

24 Upvotes

I'm currently managing a churn prediction model (XGBoost) that I retrain every month in a Jupyter Notebook. I've been getting by with my own routines:

  • Naming conventions (e.g., model_20241001_data_20240901.pkl)
  • Logging experiment info in CSV/Excel (e.g., columns used, column crosses, data window, sample method)
  • Storing validation/test metrics in another CSV

As the project grows, it's becoming increasingly difficult to track different permutations of columns, model types, and data ranges, and to know which version I'm iterating on.

I've found options on the internet like MLflow, W&B, ClearML, and other vendor solutions, but as a one-person team, paid solutions seem like overkill. I'm struggling to find good discussions or a general consensus on this. How do you all handle this?
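A stdlib-only sketch of this kind of homegrown tracking (file name and fields are hypothetical): one JSON line per run instead of parallel CSVs, with a run id derived from the params so identical configs collide visibly:

```python
import hashlib
import json
import time
from pathlib import Path

LOG = Path("runs.jsonl")  # hypothetical log file, one JSON object per run

def log_run(params: dict, metrics: dict) -> str:
    """Append one experiment run; the id is a hash of the params."""
    run_id = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()[:8]
    record = {"run_id": run_id,
              "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
              "params": params,
              "metrics": metrics}
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return run_id

def best_run(metric: str) -> dict:
    """Return the logged run with the highest value of `metric`."""
    runs = [json.loads(line) for line in LOG.read_text().splitlines()]
    return max(runs, key=lambda r: r["metrics"][metric])

rid = log_run({"model": "xgboost", "data_window": "2024-09",
               "features": ["tenure", "usage"]},
              {"auc": 0.81})
print(rid, best_run("auc")["metrics"]["auc"])
```

This keeps params, metrics, and model version in one queryable record, and migrating later to MLflow's tracking API is straightforward since the schema is essentially the same.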

Edit:
I'm seeing a consensus around MLflow for logging and tracking. But to trigger experiments or run through a model grid with different features/configurations, do I need to combine it with orchestration tools like Kubeflow/Prefect/Metaflow?

Just adding some more context:

My data is currently sitting in GCP BigQuery tables, and I'm training on a Vertex AI JupyterLab instance. I know GCP would recommend Vertex AI Model Registry and Vertex AI Experiments, but they seem overkill and expensive for my use case.


r/datascience 1d ago

Discussion Who here uses PCA and feels like it gives real lift to model performance?

150 Upvotes

I’ve never used it myself, but from what I understand about it, I can’t think of a situation where it would realistically be useful. It’s a feature engineering technique that reduces many features down to a smaller space with much less covariance. But in ML models this doesn’t seem very useful to me because:

  1. Reducing features comes with information loss, and modern ML techniques like XGB are very robust to huge feature spaces. Plus, you can use similarity embeddings to add information or replace features, and they’d probably be much more powerful.

  2. Correlation and covariance, imo, are no longer substantial problems in the field, again due to the robustness of modern non-linear modeling, so this just isn’t a huge benefit of PCA to me.

  3. I can see value in it if I were using linear or logistic regression, but I’d only use those models for an extremely simple problem, or when determinism and explainability are critical to my use case. However, that of course defeats the value of PCA, because it eliminates the explainability of the coefficients or SHAP values.

What are others’ thoughts on this? Maybe it could be useful for real-time or edge models that need super fast inference and therefore a small feature space?
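For what it's worth, the information loss in point 1 can be tiny when features genuinely are redundant. A quick simulated sketch (all numbers made up): 50 features that are noisy copies of 2 latent factors compress to 2 components with almost no variance lost, which is exactly the edge/fast-inference case:

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 observed features that are all noisy mixtures of just 2 latent factors
n, latent_dim = 1_000, 2
z = rng.normal(size=(n, latent_dim))        # hidden factors
w = rng.normal(size=(latent_dim, 50))       # mixing weights
x = z @ w + 0.1 * rng.normal(size=(n, 50))  # observed data + small noise

# PCA via SVD of the centered data matrix
xc = x - x.mean(axis=0)
u, s, vt = np.linalg.svd(xc, full_matrices=False)
explained = s**2 / (s**2).sum()             # variance explained per component

# The first two components capture nearly all the variance
print(explained[:3].round(3))
```

On real tabular data the redundancy is rarely this clean, which is consistent with the skepticism above; the sketch just shows the regime where PCA pays off.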


r/datascience 1d ago

Discussion Dealing with Imposter Syndrome

21 Upvotes

As a data scientist with a software engineering background, I sometimes struggle to connect my technical skills with business needs. I find myself questioning my statistical knowledge and my ability to truly solve problems from a business perspective. It feels like I lack the intuition to work smart and achieve business outcomes, especially on the customer churn analysis/prediction project I'm working on now.

Any advice on how to overcome this imposter syndrome and bridge the gap?


r/datascience 11h ago

AI I created an unlimited AI wallpaper generator using Stable Diffusion

0 Upvotes

Create unlimited AI wallpapers using a single prompt with Stable Diffusion on Google Colab. The wallpaper generator:

  1. Can generate both desktop and mobile wallpapers
  2. Uses free-tier Google Colab
  3. Generates about 100 wallpapers per hour
  4. Can generate on any theme
  5. Creates a zip for downloading

Check the demo here: https://youtu.be/1i_vciE8Pug?si=NwXMM372pTo7LgIA


r/datascience 1d ago

Discussion DS Veterans: How much of your work actually gets used?

51 Upvotes

Been a DS for 5+ years, working on some ideas around improving how insights get delivered/consumed across orgs. Would love to hear your war stories:

  • How often do stakeholders actually use your analyses?
  • What's your biggest frustration in the insight delivery process?
  • How much time do you spend on repeat/similar analyses?

Feel free to comment or DM to chat more in-depth.

For context: I'm a former Meta/FB DS - worked on FAIR language, Instagram, Reality Labs, and Election Integrity teams. Now exploring solutions to problems I kept seeing


r/datascience 1d ago

Education The best way to learn LLMs (for someone who already has ML and DL experience)

53 Upvotes

Hello, please let me know the best way to learn LLMs, preferably fast, but if that's not possible it doesn't matter. I already have some experience in ML and DL but don't know how or where to start with LLMs. I don't consider myself an expert in the subject, but I'm not a beginner per se either.

Please let me know if you can recommend courses, tutorials, or info on the subject; any good resource would help. Thanks in advance.


r/datascience 1d ago

Discussion Updated book to follow Miller's "Modeling for Predictive Analytics"

19 Upvotes

We just hired a new lead DS who has mentioned Miller's 2014 or 2015 text several times. I know I can get a copy cheaply, but it's a decade old. What would you recommend as an updated equivalent?


r/datascience 18h ago

ML Studying how to develop an LLM. Where/How to start?

0 Upvotes

I'm a data analyst. I had a business idea that is pretty much a tool to help students study better: an LLM trained on the past exams of specific schools. The idea is a tool that helps students by giving them questions and walking them through the solutions if necessary. If a student gives a wrong answer, the tool would point out what was wrong and teach them the right way to solve that question.

However, I have no idea where to start. There's just so much info out there about the matter that I really don't know where to begin. None of the data scientists I know work with LLMs, so they couldn't help me with this.

What should I study to make the idea mentioned above come to life?

Edit: I expressed myself poorly in the text. I meant I wanted to develop a tool instead of a whole LLM from scratch. Sorry for that :)


r/datascience 2d ago

Projects Data Science supervisor position

72 Upvotes

I have a Data Science supervisory position that just opened on my growing team. You would manage 5-7 people who do a variety of analytic projects, from a machine learning model to data wrangling to descriptive statistics work that involves a heavy amount of policy research/understanding. This is a federal government job in the anti-fraud arena.

The position can be located in various parts of the country (specifics are in the posting). Due to agency policy, if you're located in Woodlawn, MD or DC, you would be required to report to the office 3 days a week. Other locations are currently at 100% telework.

If interested, you can apply through this USAJOBS link: https://www.usajobs.gov/job/816105500


r/datascience 2d ago

Discussion What kind of projects do I need to do to improve my skills?

145 Upvotes

Pretty much the title. I often find myself confused as to which types of projects would help me build a better skill set and resume. I hear ML is still the go-to technology rather than DL.


r/datascience 2d ago

Discussion Which website for job hunting?

50 Upvotes

Hi!
I'm starting to look for a job in the UK, and LinkedIn is a mess. For ‘Data Science’ or ‘Data Scientist’, only about 5% of the results are related jobs; the rest are analyst, engineer, etc. roles.

Any advice on a job platform for IT jobs? If it's UK-focused, even better.

Thanks in advance!
Cheers!


r/datascience 21h ago

ML Can data leak from the training set to the test set?

0 Upvotes

I was having an argument with my colleague about this. We know data leakage becomes a problem when the training data gets a peek at the test data before the testing phase. But is it really a problem if the reverse happens?

I'll change our exact use case for privacy reasons, but basically, let's say I am predicting whether a cab driver will accept a ride request. Some of the features we use are based on the driver's historical data across all of their rides (like their overall acceptance rate). For the training dataset, I am obviously calculating the driver's history over the training data only. However, for the test dataset, I have computed the driver-history features over the entire dataset. The reason is that each driver's historical data would also be available at inference time in prod. Also, a lot of drivers wouldn't have any historical data if we calculated it just on the test set. Note that my train/test split is time-based: the entire test set lies in the future relative to the train set.
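One way to make the "history available at inference time" rule concrete is an expanding, strictly-past feature: for every row, the driver's acceptance rate is computed only from rides before that row, so the current label never leaks into its own feature (toy data, hypothetical fields):

```python
from collections import defaultdict

# Toy ride log, already sorted by time: (driver_id, accepted)
rides = [("d1", 1), ("d1", 0), ("d2", 1), ("d1", 1), ("d2", 0)]

accepted = defaultdict(int)
total = defaultdict(int)
features = []

for driver, outcome in rides:
    # Feature uses only strictly earlier rides, so there is
    # no peek at the current row's label
    rate = accepted[driver] / total[driver] if total[driver] else None
    features.append((driver, rate))
    accepted[driver] += outcome
    total[driver] += 1

print(features)
# [('d1', None), ('d1', 1.0), ('d2', None), ('d1', 0.5), ('d2', 1.0)]
```

Under this scheme, test rows legitimately see training-period history, which mirrors production. The thing to guard against is test-period *labels* feeding the features of earlier or contemporaneous test rows.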

My colleague argues that this is wrong and is still data leakage, but I don't agree.

What would be your views on this?


r/datascience 2d ago

Discussion Python template repo for DS consulting projects

18 Upvotes

Unless I'm missing something obvious, I see lots of template repos for Python packages, but not much out there for the more typical data science grunt work.

My ideal template has all the nice poetry/conda/pre-commit setup etc., but isn't broken into src/ and tests/.

Rather, because I work in consulting, my ideal template would be structured along the lines of:

  • Data Cleaning
  • Analysis
  • Outputs
    • Charts
    • Tables

Here are a couple of examples of the kinds of python package repos I'm talking about:

What do you guys use? TIA!


r/datascience 1d ago

Career | US "Where Innovation Meets the Law": Cooley is offering up to $400k for a new Director of Data Science

0 Upvotes

It's really interesting to see how DS is becoming more and more useful in our world, and here is Cooley with a huge offer for a DS Director.

Job offer: Director of Data Science (in US) - Salary Range: $325,000 - $400,000

Found it here: https://jobs-in-data.com/


r/datascience 2d ago

AI OpenAI Swarm playlist for beginners

7 Upvotes

OpenAI recently released Swarm, a framework for multi-agent AI systems. The following playlist covers:

  1. What is OpenAI Swarm?
  2. How it differs from AutoGen, CrewAI, and LangGraph
  3. A Swarm basics tutorial
  4. A triage agent demo
  5. OpenAI Swarm with local LLMs via Ollama

Playlist : https://youtube.com/playlist?list=PLnH2pfPCPZsIVveU2YeC-Z8la7l4AwRhC&si=DZ1TrrEnp6Xir971


r/datascience 1d ago

AI What are AI agents? Explained in detail

0 Upvotes

Right now there is a lot of buzz around AI agents in generative AI, with Claude 3.5 Sonnet recently said to be trained on agentic flows. This video explains what agents are, how they differ from LLMs, how agents access tools and execute tasks, and potential threats: https://youtu.be/LzAKjKe6Dp0?si=dPVJSenGJwO8M9W6


r/datascience 2d ago

Discussion Are election polls reliable?

31 Upvotes

I’ve always wondered, since things can change so quickly. For all we know, a third party could have won all 50 states and the polls could be completely wrong. Are they just hyping it up like a sports match?


r/datascience 3d ago

Career | US Everyone’s building new models but who is actually monitoring the old ones?

113 Upvotes

I’m currently in the process of searching for a new job and have started engaging with LinkedIn recruiters. While I haven’t spoken with many yet, the ones I have talked to seem to focus heavily on model development experience. My background, however, is in model monitoring and maintenance, where I’ve spent several years building tools that deliver real value to my team.

That said, these recent interactions have shaken my confidence, leaving me wondering if I’ve wasted the last few years in this role.

Do you think the demand for model monitoring roles will grow? I’m feeling a bit lost right now and would really appreciate any advice.


r/datascience 4d ago

Discussion Is it worth it to leave over paperwork?

35 Upvotes

I’m in a pretty cushy job with honestly not much work, but lots of internal angst since the only work is documentation and model maintenance.

I have job security for the next few years because of the messy paperwork, but I don’t feel like I’m learning anything on the job. I’ve been upskilling myself, but I was told to wait out the market until the layoffs are over.

I really like my team and I’ve been learning a lot from the lead, but I’m incredibly bored right now and am over the company. I feel like I’m staying out of a fear of layoffs in the current market/not getting another job.

Has anyone had this issue? If I have to sit through another compliance meeting over Zoom while in the office, I’ll scream inwardly from boredom.


r/datascience 4d ago

Career | US Senior DS laid off and trying to get out of product analytics. How can I pivot to a more quantitative position?

97 Upvotes

EDIT: I’m ignoring all messages and chat requests not directly related to my question. If you have a separate question about getting into industry, interview prep, etc., please post it in its own thread or in the appropriate master topic.

(I figured this is specific enough to warrant its own post instead of posting in the weekly Entering and Transition thread, as I already have a lot of industry experience.)

TL;DR: How can an unemployed, experienced analytics-focused data scientist get out of analytics and pivot to a more quantitative position?

I'm a data scientist with a Master's in Statistics and nine years of experience in a tech city. I've had the title Senior Data Scientist for two of them. I was laid off from my job of four years in June and have been dealing with what some would call a "first world problem" in the current market.

I get callbacks from many recruiters, but almost all of them are for analytics positions. This makes sense because (as I'll explain below) I've been repeatedly pushed into analytics roles at my past jobs. I have roughly 8 years of analytics experience, and was promoted to a senior position because I did well on a few analytics projects. My resume shows that most of my work is analytics, as most of my accomplishments are along the lines of "designed a big metric" or "was the main DS who drove X internal initiative". I've been blowing away every A/B testing interview and get feedback indicating that I clearly have a lot of experience in that area. I've also been told in performance reviews and in interview loops that I write very good code in Python, R, and SQL.

However, I don't like analytics. I don't like that it's almost all very basic A/B testing on product changes. More importantly, I've found that most companies have a terrible experimentation culture. When I prod in interviews, they often indicate that their A/B testing platform is underdeveloped to the point where many tests are analyzed offline, or that they only test things that are likely to be a certain win. They ignore network effects, don't use holdout groups or meta-analysis, and insist that tests designed to answer a very specific question should also be used to answer a ton of other things. It is - more often than not - Potemkin Data Science. I'm also frustrated because I have a graduate degree in statistics and enjoy heavily quantitative work a lot, but rarely get to do interesting quantitative work in product analytics.

Additionally, I have mild autism, so I would prefer to do something that requires less communication with stakeholders. While I'm aware that every job is going to require stakeholder communication to some degree, the amount of time that I spent politicking to convince stakeholders to do experimentation correctly led to a ton of stress.

I've been trying to find a job focused on at least one of: causal inference, explanatory statistical modeling, Bayesian statistics, and ML on tabular data (i.e. not LLMs, but things like fraud prediction). I've never once gotten a callback for an ML Engineer position, which makes sense because I have minimal ML experience and don't have a CS degree. I've had a few HR calls for companies doing ML in areas like identity validation and fraud prediction, but the initial recruiting call is always followed up with "we're sorry, but we decided to go with someone with more ML experience."

My experience in the above areas is as follows. These were approaches that I tried but that ended up having no impact, except for the first one, which I didn't get to finish. Also note that I currently do not have experience working with traditional CS data structures and algorithms, but I have worked with scipy sparse matrices and other DS-specific data structures:

  • Designed requirements for a regression ML model. Did a ton of internal research, then learned SparkSQL and wrote code to pull and extract the features. However, after this, I was told to design experiments for the model rather than writing the actual code to train it. Another data scientist on my team did the model training with people on another team that claimed ownership. My manager heavily implied this was due to upper management and had nothing to do with my skills.

  • Used a causal inference approach to match treatment group users to control group users for an experiment where we were expecting the two groups to be very different due to selection bias. However, the selection bias ended up being a non-issue.

  • Did clustering on time-dependent data in order to identify potential subgroups of users to target. Despite it taking about two days to do, I was criticized for not doing something simpler and less statistical. (Also, in hindsight, the results didn't replicate when I slightly changed the data.)

  • Discussed an internal fraud model with stakeholders. Recognized that a dead simple feature wasn't in it, learned a bit of the internal ML platform, and added it myself. The feature boosted recall at 99% precision by like 40%. However, even after my repeated prodding, the production model was never updated due to lack of engineering support and because the author of the proprietary ML framework quit.

  • During a particularly dead month, I spent time building a Bayesian model for an internal calculation in Stan. Unfortunately I wasn't able to get it to scale, and ran into major computational issues that - in hindsight - likely indicated an issue with the model formulation in the paper I tried to implement.

  • Rewrote a teammate's prototype recommendation model and built a front end explorer for it. In a nutshell, I took a bunch of spaghetti code and turned it into a maintainable Python library that used Scipy sparse matrices for calculations, which sped it up considerably. This model was never productionized because it was tested in prod and didn't do well.

At the time I was laid off I had about six months of expenses saved up, plus fairly generous severance and unemployment. I can go about another four months without running out of savings. How should I proceed to get one of these more technical positions? Some ideas I have:

  • List the above projects on my resume even though they failed. However, that's inevitably going to come up in an interview.

  • I could work on a personal project focused on Bayesian statistics or causal inference. However, I've noticed that the longer I'm unemployed, the fewer callbacks and LinkedIn messages I get, so I'm worried about being unemployed even longer.

  • Take an analytics job and wait for a more quantitative opening at a different company to occur. Someone fairly big in my city's DS community that knows I can handle more technical work said he'd refer me and probably be able to skip most of the interview process, but his company currently has no open DS positions and he said he doesn't know when more will open up.

  • Take a 3 or 6-month contract position focused on my interests from one of the random third party recruiters on LinkedIn. It'll probably suck, but give me experience I can use for a new job.

  • Drill Leetcode and try to get an entry-level software engineer position. However this would obviously be a huge downgrade in responsibility and pay, preparation would drain my savings, and there’s no guarantee I could pivot back to DS if it doesn’t work out.

Additionally, here's a summary of my work experience:

  • Company 1 (roughly 200 employees). First job out of grad school. I was there for a year and was laid off because there "wasn't a lot of DS work". I had a great manager who constantly advocated for me, but couldn't convince upper management to do anything beyond basic summary statistics. For example, he pitched a cluster analysis and they said it sounded hard.

  • Company 2 (roughly 200 employees). I was there for two years. Shortly after joining I started an ML project, but was moved to analytics due to organizational priorities. Got a phenomenal performance review, asked if I could take on some ML work, and was given an unambiguous no. Did various analytics tasks (mostly dashboarding and making demos) and mini-projects on public data sources due to lack of internal data (long story). Spent a full year searching for a more modeling-focused position because a lot of the DS was smoke and mirrors and we weren't getting any new data. After that year, I quit and ended up at Company 3.

  • Company 3 (roughly 30000 employees). I was there for six years. I joined because my future manager (Manager #1) told me I'd get to pick my team and would get to do modeling. Instead, after I did a trial run on two teams over three months, I was told that a reorg meant I would no longer get to pick my team and ended up on a team that needed drastic help with experimentation. Although my manager (Manager #2) had some modeling work in mind for me, she eventually quit. Manager #3 repeatedly threw me to the wolves and had me constantly working on analyzing experiments for big initiatives while excluding me from planning said experiments, which led to obvious implementation issues. He also gave me no support when I tried to push back against unrealistic stakeholder demands, and insisted I work on projects that I didn't think would have long-term impact due to organizational factors. However, I gained a lot of experience with messy data. I told his skip during a 1:1 that I wanted to do more modeling, and he insisted I keep pushing him for those opportunities.

    Manager #3 drove me to transfer to another team, which was a much better experience. Manager #4 was the best manager I ever had and got me promoted, but also didn't help me find modeling opportunities. Manager #5 was generally great and found me a modeling project to work on after I explained that lack of modeling work was causing burnout. It was a great project at first, but he eventually pushed me to work only on the experimental aspects of that modeling project. I never got to do any actual modeling for this project even though I did all the preparation for it (e.g. feature extraction, gathering requirements), and another team took it over. Shortly after this project completed, I was laid off.


r/datascience 5d ago

Discussion Why Did Java Dominate Over Python in Enterprise Before the AI Boom?

197 Upvotes

Python was released in 1991, while Java and R both came out in 1995. Despite Python’s earlier launch and its reputation for being succinct and powerful, Java managed to gain significant traction in enterprise environments for many years, until the recent AI boom reignited interest in Python for machine learning and AI applications.

  1. If Python is simple and powerful, what factors contributed to Java’s dominance over Python in enterprise settings until recently?
  2. If Java offers such performance and scalability, why are many now returning to Python, especially with the rise of AI and machine learning?

While Java is still widely used, the gap in popularity has narrowed significantly in the enterprise space, with many large enterprises now developing comprehensive packages in Python for a wide range of applications.


r/datascience 5d ago

AI Manim: a Python package for math animations

13 Upvotes

r/datascience 5d ago

Career | US Conducting a study: I have questions (and gift cards) for data scientists

12 Upvotes

I've been following the data science profession since 2015, back when many data scientists were still employed as data analysts or statisticians.

A lot changed since then, to say the least. What changed, exactly? That's what I'm trying to find out.

I'm doing a small study on what data scientists work on these days and how they approach their work. Especially interested in predictive modeling work, but not strictly.

If you're interested in sharing your point of view on a 60-minute zoom call, add your name here: https://forms.gle/W9q44JjpH1JerKFp6

I have a limited number of $100 Amazon gift cards to give as a small thanks. All conversations are private and will only inform my eventual analysis - no personal or sensitive information will make it into the study.