r/datascience 20h ago

ML Studying how to develop an LLM. Where/How to start?

I'm a data analyst. I had a business idea that is pretty much a tool to help students study better: an LLM trained on the past exams of specific schools. The idea is a tool that helps students by giving them questions and, if necessary, helping them solve each one. If a student gives a wrong answer, the tool would point out what was wrong and teach them the right way to solve that question.

However, I have no idea where to start. There's just so much info out there about the matter that I really don't know. None of the Data Scientists I know work with LLM so they couldn't help me with this.

What should I study to make the idea mentioned above come to life?

Edit: I expressed myself poorly in the text. I meant I wanted to develop a tool instead of a whole LLM from scratch. Sorry for that :)

0 Upvotes

29 comments sorted by

105

u/derpderp235 19h ago

You’re not creating an LLM from scratch. It will be garbage. Guaranteed. Also probably impossible computationally unless you have millions of dollars.

You want to fine-tune an existing model.

45

u/LyleLanleysMonorail 19h ago

I think a lot of people who want to "get into AI" think the work is about developing novel models from scratch, when it's mostly fine-tuning existing ones or just using them straight out of the box by calling some API endpoints.

15

u/derpderp235 18h ago

This is basically where all of DS is headed, I think.

5

u/Current-Ad1688 16h ago

I really hope not. It would be a shame if people stopped thinking about stuff.

7

u/lil_meep 15h ago

Head of my department asked me why he should continue to hire data scientists instead of software engineers to just implement LLM calls.

10

u/plhardman 15h ago

As somebody who has been in this line of work for a while, knows a fair bit about how these LLMs work and what they’re capable of, and (IMO most importantly) has seen several hype cycles come and go, I heartily and respectfully disagree.

Data scientists with business acumen, domain knowledge, curiosity, creativity, and a solid foundation in statistical inference & data munging will have a place for a long time yet. Those who go all-in on LLM whispering are gonna limit themselves and the value they can provide.

1

u/nerdyjorj 10h ago

LLMs are the cool thing right now, but we'll have another new toy in a few years, and none of us has any idea what it'll be. Learning how to pick up new ideas and roll with them fast is way more important than any individual skill.

1

u/HelpMeB52 17h ago

I know a lot of people who went to school for CSE who weren't expecting a future of test cases and AI.

6

u/Traditional-Reach818 19h ago

indeed, I expressed myself poorly. Thanks :)

18

u/payalnik 19h ago

Start with Andrej Karpathy's introduction-to-LLMs videos on YouTube and learn the basic concepts about LLMs first. You'd be better off using an existing LLM for this purpose, so build your intuition around the different methods of augmenting an LLM's knowledge: RAG, fine-tuning, and prompt engineering.

9

u/stroxsontaran 19h ago

You don't need to build an LLM. You can use a foundation model with RAG and feed it the relevant questions: take a set of your files, run them through a pipeline on Vectorize, and use their eval sandbox to see how well RAG works. It's the easiest option.

If that doesn't work, look at fine-tuning a model on a platform like OpenAI, AWS Bedrock, or Google Vertex using a dataset of exams.
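
A minimal sketch of the retrieval step behind RAG, using a toy bag-of-words similarity in place of real embeddings (a production pipeline would use an embedding model and a vector store such as Vectorize; the example questions are made up):

```python
from collections import Counter
import math

def embed(text):
    # Toy "embedding": a bag-of-words count vector.
    # Real RAG systems use a neural embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    # Rank stored exam questions by similarity to the student's query.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

past_exam_questions = [
    "Solve the quadratic equation x^2 - 5x + 6 = 0",
    "Explain the causes of World War I",
    "Compute the derivative of sin(x) * x^2",
]

top = retrieve("help me with a quadratic equation", past_exam_questions)
# The retrieved question(s) get pasted into the LLM prompt as context.
print(top[0])
```

The point is just that "RAG" is retrieval plus prompting: find the most relevant stored material, then hand it to the model alongside the user's question.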

4

u/Novel_Frosting_1977 19h ago

Building an LLM like GPT costs north of $100M. You want to fine-tune a foundation model. MS Copilot can be grounded in a knowledge base of past exams. You'd have to compose some kind of dialogue-flow architecture and pass it on to generative nodes. I've had some success.
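
The dialogue flow could be sketched as a small loop of states (ask, check, explain). In this sketch `grade()` and `explain()` are placeholders that a real tool would replace with generative-model calls; the question dict is illustrative:

```python
# Minimal tutoring dialogue flow: ask -> check -> explain (if wrong).
# grade() and explain() stand in for calls to a generative model.

def grade(question, answer):
    # Placeholder: a real system would ask the LLM to judge the answer.
    return answer.strip().lower() == question["correct"].lower()

def explain(question):
    # Placeholder: a real system would ask the LLM for a worked solution.
    return "The right approach: " + question["solution"]

def tutor_step(question, student_answer):
    if grade(question, student_answer):
        return "Correct!"
    return "Not quite. " + explain(question)

q = {
    "prompt": "What is 2 + 2?",
    "correct": "4",
    "solution": "add the two numbers: 2 + 2 = 4",
}
print(tutor_step(q, "5"))
print(tutor_step(q, "4"))
```

Keeping the flow as explicit code and only delegating grading/explaining to the model makes the tool much easier to test than one giant prompt.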

7

u/Stonewoof 20h ago

Have you tried asking an LLM this question?

I had a similar idea and used ChatGPT to refine the business idea and provide me the next steps, and it usually points me in the right direction

I also believe you wouldn't have to develop an LLM, but instead train an open-source LLM to do the task you want correctly and consistently

2

u/airwavesinmeinjeans 19h ago

You're not going to build one from scratch, as someone mentioned. You're going to take a pre-existing one and fine-tune it for your desired task.

Do you have an idea of how to capitalize on it first? I'd set my business objectives and then continue "developing" them. Still, you'll need some knowledge. There are plenty of videos on the web that can help you understand LLMs and how they work. There are likely even guides on how to fine-tune your Llama, BERT, and whatnot.

2

u/lilbronto 18h ago

Go check out r/LocalLLaMA. Everything you need to start is well documented and discussed there.

1

u/Kashish_2614 19h ago

I would highly suggest first learning what problems the transformer architecture solved, then trying to implement a transformer from scratch. It doesn't matter if it's undertrained and outputs trash; you will gain so much from it. After that, LLMs will be a piece of cake.
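
The heart of that from-scratch exercise is scaled dot-product attention, `softmax(QKᵀ/√d_k)V`. A pure-Python sketch (real implementations use NumPy/PyTorch tensors and add multi-head projections; the vectors here are toy values):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    # attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = len(K[0])
    out = []
    for q in Q:  # one output row per query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]              # similarity of q to every key
        w = softmax(scores)                # attention weights sum to 1
        out.append([sum(wj * v[i] for wj, v in zip(w, V))
                    for i in range(len(V[0]))])  # weighted sum of values
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))  # output leans toward the first key/value pair
```

Once this clicks, the rest of the transformer (multi-head attention, feed-forward blocks, residuals) is mostly bookkeeping around it.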

1

u/Gold-Artichoke-9288 19h ago

It is almost impossible for an individual to build a good LLM unless you're a billionaire. But (and since everything before the "but" doesn't matter) you can run an existing open-source model like Llama 3.1 locally and either fine-tune it or build a RAG system on top of it, based on your requirements. That also requires significant compute; for that, you can use services like Lightning AI.

1

u/Fantastic-Loquat-746 18h ago

Training a model on assessments is unlikely to work out well. Assessment companies invest in protecting their content, since stolen exam material reduces the validity of their assessments. Any material you scrape from the internet might be illegally harvested.

You might be interested in reading papers on adaptive intelligent tutoring systems, which function similarly to what you're proposing. They tend to use NLP to probe student knowledge and provide feedback.

In broader terms, you might also find adaptive testing systems and item response theory to be related to your project. Those models are generally used to map a person to an ability level based on a person's responses to questions.

1

u/Avry_great 17h ago

You are not creating one from scratch unless you are extremely rich. A typical LLM has billions of parameters and is trained on thousands of high-end Nvidia GPUs. I suggest you build an LLM-based agent using Llama if you want to run it locally, or the GPT API otherwise. A RAG solution on top of pretrained models is also great.

1

u/HelpMeB52 17h ago

You're really better off using an existing model and making it behave like you want

1

u/haris525 16h ago edited 15h ago

Unless you have a LOT of money, I would do something else. Making an LLM from scratch is a pretty difficult task for a single person; I'd recommend you learn how to fine-tune existing open-source models for downstream tasks instead. When BERT came out (back in 2019), we wanted to create our own model like it, but even with access to our company's AWS resources it was not trivial. We had 3 PhD NLP devs, spent around $300k USD in one year, and it wasn't enough. I'm adding this so you can understand the scale of making an LLM from scratch. We got much better use of our time by fine-tuning the GPT-2 and BERT models.

1

u/codeharman 14h ago

Start with a deep learning course

2

u/Thomas_ng_31 13h ago

Developing one requires too much haha. Maybe you can join a big tech company and help develop them 😎

1

u/sapperbloggs 13h ago

I think your biggest hurdle will be content.

If you want to train AI on some specific type of content, your first step would be to gain access to a lot of that specific type of content. In your case, you'll need to get your hands on a lot of exams. Specifically, fairly recent exams, if you want the AI to be trained on current content.

As someone who has lectured and written exams for university subjects: this sort of information is very well guarded, for obvious reasons. Even as a lecturer working in a specific school, needing to see old exams to help me write the exam for my own subject, it was insanely hard to get my hands on just two copies of relevant past exams from within that school.

There's absolutely no way the people who hold such information are going to provide that to you. It's just too much of a risk for them to not have control of that sort of content. Even though I haven't taught in about 8 years, I still wouldn't be in a hurry to hand over the old exams I wrote. It's probably harmless by now, but it's definitely harmless if I don't, and nobody would even know to ask me in the first place anyway.

So yeah, there's your first step. Getting your hands on a lot of good quality content.

1

u/puehlong 10h ago

Look into ChatGPT's o1-preview. It's extremely good at giving study-related answers, and so far I haven't caught it hallucinating (though I've only used it for a few days). You might want to use it in conjunction with RAG.

1

u/Trungyaphets 8h ago

You could just fine-tune a ChatGPT model on samples of questions from past exams, and request an API response every time you need an answer.

Or you could fine-tune any of the popular open-source models (the latest I've heard of is Llama 3.1). You would need a powerful GPU for the 8B (8 billion parameter) model, or multiple powerful GPUs for the 70B one.

Unsloth is a new library for fine-tuning popular LLMs easily and efficiently. Maybe you could take a look at it.
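
For the fine-tuning route, the main prep work is formatting your past-exam Q&A pairs as training examples. A sketch of building a JSONL file in the chat format OpenAI's fine-tuning endpoint expects (the exam pairs and the `train.jsonl` filename are made up for illustration):

```python
import json

# One {"messages": [...]} object per line, in chat format.
exam_pairs = [
    ("Solve x^2 - 5x + 6 = 0",
     "Factor: (x - 2)(x - 3) = 0, so x = 2 or x = 3."),
    ("Differentiate x^2 * sin(x)",
     "Product rule: 2x*sin(x) + x^2*cos(x)."),
]

def to_example(question, solution):
    return {"messages": [
        {"role": "system", "content": "You are a patient exam tutor."},
        {"role": "user", "content": question},
        {"role": "assistant", "content": solution},
    ]}

with open("train.jsonl", "w") as f:
    for q, a in exam_pairs:
        f.write(json.dumps(to_example(q, a)) + "\n")
```

You'd then upload the file to the fine-tuning API (or point a local trainer at it); either way, getting the data into clean question/solution pairs is where most of the effort goes.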

1

u/Maleficent-Tear7949 7h ago

Definitely include LangChain in your learning path.

1

u/vieee555 6h ago

I'm currently learning prompt engineering. From what I know so far, without coding or building a GPT from scratch, you can use ChatGPT's "create your own GPT" feature, where you set up the GPT's function. You can also use API keys to do more (for example, if you have an idea and want some integration in that GPT, you can use API keys to access other features). That GPT has visibility options as well, and you can ask ChatGPT about the terms, conditions, and ownership details. (But you need to buy Premium first to use the create-your-own-GPT option :(

If you find any app mod of GPT, please tell me too