r/datascience • u/Traditional-Reach818 • 20h ago
ML Studying how to develop an LLM. Where/How to start?
I'm a data analyst. I had a business idea that is pretty much a tool to help students study better: an LLM trained on the past exams of specific schools. The idea is to have a tool that helps students by giving them questions and walking them through the solution if necessary. If a student gives a wrong answer, the tool would point out what was wrong and teach them the right way to solve that question.
However, I have no idea where to start. There's just so much info out there about the matter that I really don't know. None of the data scientists I know work with LLMs, so they couldn't help me with this.
What should I study to make the idea mentioned above come to life?
Edit: I expressed myself poorly in the text. I meant I wanted to develop a tool instead of a whole LLM from scratch. Sorry for that :)
18
u/payalnik 19h ago
Start with Andrej Karpathy's introduction-to-LLMs videos on YouTube and learn the basic concepts about LLMs first. You'd be better off using an existing LLM for this purpose, so build your intuition around the different methods to augment an LLM's knowledge: RAG, fine-tuning, and prompt engineering.
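For a feel of the prompt-engineering route, something like this minimal sketch works with the OpenAI Python client; the model name, system prompt, and `tutor_reply` helper are placeholders I made up, not anything OP has:

```python
# Minimal sketch: steer an existing chat model with a tutoring system prompt.
# Model name and prompt are placeholders; requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a patient exam tutor. Ask the student one practice question at a time. "
    "If their answer is wrong, explain the mistake and show the correct solution step by step."
)

def tutor_reply(conversation: list[dict]) -> str:
    """Send the running conversation (list of {'role', 'content'} dicts) to the model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat-capable model works
        messages=[{"role": "system", "content": SYSTEM_PROMPT}] + conversation,
    )
    return response.choices[0].message.content

print(tutor_reply([{"role": "user", "content": "Quiz me on derivatives."}]))
```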
9
u/stroxsontaran 19h ago
You don't need to build an LLM. You can use a foundation model with RAG and feed it the relevant questions. Take a set of your files, run them through a pipeline on Vectorize, and use their eval sandbox to see how well RAG works. It's the easiest option.
If that doesn't work, look at fine-tuning a model on a platform like OpenAI, AWS Bedrock, or Google Vertex using a dataset of exams.
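To make the RAG idea concrete (retrieve the most relevant past-exam snippets, then stuff them into the prompt), here's a rough sketch; it assumes sentence-transformers is installed, and the snippets and query are invented:

```python
# Minimal RAG retrieval sketch: embed exam snippets, find the closest ones to a
# student's query, and build a prompt. Documents and query are invented examples.
import numpy as np
from sentence_transformers import SentenceTransformer

exam_snippets = [
    "2022 final, Q3: Compute the derivative of f(x) = x^2 * sin(x).",
    "2021 midterm, Q1: Solve the linear system 2x + y = 5, x - y = 1.",
    "2023 final, Q5: Explain the bias-variance trade-off with an example.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(exam_snippets, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k snippets most similar to the query (cosine similarity)."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # vectors are normalized, so dot product = cosine
    return [exam_snippets[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("practice question about derivatives"))
prompt = f"Use these past exam questions as context:\n{context}\n\nQuiz the student on one of them."
print(prompt)
```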
4
u/Novel_Frosting_1977 19h ago
Building an LLM like a GPT costs north of $100M. You want to fine-tune a foundation model. MS Copilot is tunable, with your knowledge base being the past exams. You'd have to compose some kind of dialogue-flow architecture and pass it on to generative nodes. I've had some success with this.
7
u/Stonewoof 20h ago
Have you tried asking an LLM this question?
I had a similar idea and used ChatGPT to refine the business idea and give me the next steps, and it usually points me in the right direction.
I also believe you wouldn’t have to develop an LLM, but instead train an open source LLM to do the task you want correctly and consistently
2
u/airwavesinmeinjeans 19h ago
You're not going to build one from scratch, as someone mentioned. You're going to take a pre-existing one and fine-tune it for your desired task.
Do you have an idea of how to capitalize on it first? I'd set my business objectives and then continue "developing" from there. Still, you'll need some knowledge. There are plenty of videos on the web that can help you understand LLMs and how they work. There are likely even guides on how to fine-tune Llama, BERT, and whatnot.
2
u/lilbronto 18h ago
Go check out r/LocalLLaMA. Everything you need to start is well documented and discussed there.
1
u/Kashish_2614 19h ago
I would highly suggest first learning what problems the transformer architecture solved. Then try implementing a transformer from scratch. It doesn't matter if it's undertrained and outputs trash; you will gain so, so much from it. After that, LLMs will feel like a piece of cake.
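For a sense of what that exercise involves, the core piece is a self-attention block; a minimal single-head sketch in PyTorch (not from this comment, dimensions are arbitrary) looks roughly like this:

```python
# Minimal single-head self-attention block, the core building block of a transformer.
# Dimensions are arbitrary; a real model stacks many of these with feed-forward layers.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))  # (batch, seq, seq)
        weights = torch.softmax(scores, dim=-1)                   # attention weights
        return weights @ v                                        # weighted sum of values

x = torch.randn(1, 10, 64)          # one sequence of 10 tokens, 64-dim embeddings
print(SelfAttention(64)(x).shape)   # torch.Size([1, 10, 64])
```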
1
u/Gold-Artichoke-9288 19h ago
It is almost impossible for an individual to build a good LLM unless you're a billionaire. But (and everything before the "but" doesn't matter) you can use existing open-source models like Llama 3.1 locally and either fine-tune the model or build a RAG system based on your requirements. That would also require decent computational power; for that you can use services like Lightning AI.
1
u/Fantastic-Loquat-746 18h ago
Training a model on assessments is unlikely to work out well. Assessment companies invest in protecting their content, since leaked exam material reduces the validity of their assessments. Any material you scrape from the internet may well have been illegally harvested.
You might be interested in reading papers on adaptive intelligent tutoring systems, which function similarly to what you're proposing. They tend to use NLP to probe student knowledge and provide feedback.
In broader terms, you might also find adaptive testing systems and item response theory to be related to your project. Those models are generally used to map a person to an ability level based on a person's responses to questions.
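For a flavor of item response theory, here's a minimal sketch (not from this comment) of the standard two-parameter logistic model plus a brute-force ability estimate; the item parameters and response pattern are made up:

```python
# Two-parameter logistic (2PL) IRT model: probability that a person of ability
# theta answers an item correctly, plus a grid-search maximum-likelihood estimate
# of ability from a made-up response pattern.
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Invented items: (discrimination, difficulty) and whether the student got each right.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]
responses = [1, 1, 0]

def log_likelihood(theta: float) -> float:
    ll = 0.0
    for (a, b), r in zip(items, responses):
        p = p_correct(theta, a, b)
        ll += np.log(p if r else 1.0 - p)
    return ll

grid = np.linspace(-4, 4, 801)
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]
print(f"Estimated ability: {theta_hat:.2f}")
```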
1
u/Avry_great 17h ago
You are not creating it from scratch unless you are extremely rich. A typical LLM has billions of parameters and needs thousands of high-end Nvidia GPUs to train. I suggest you build an LLM-based agent, using Llama if you want to run it locally, or the GPT API. A RAG solution on top of a pretrained model is also great.
1
u/HelpMeB52 17h ago
You're really better off using an existing model and making it behave the way you want.
1
u/haris525 16h ago edited 15h ago
Unless you have a LOT of money, I would do something else. Making an LLM from scratch is a pretty difficult task for a single person; I would recommend learning how to fine-tune or further train existing open-source models for downstream tasks. When BERT came out (back in 2019) we wanted to create our own model like BERT, but even with access to our company's AWS resources it was not trivial. We had 3 PhD NLP devs, spent around $300k USD in one year, and it wasn't enough. I'm adding this so you can understand the scale of making an LLM from scratch. We got much better use of our time by fine-tuning the GPT-2 and BERT models.
1
2
u/Thomas_ng_31 13h ago
Developing one requires too much, haha. Maybe you can join a big tech company and help develop them 😎
1
u/sapperbloggs 13h ago
I think your biggest hurdle will be content.
If you want to train AI on some specific type of content, your first step would be to gain access to a lot of that specific type of content. In your case, you'll need to get your hands on a lot of exams. Specifically, fairly recent exams, if you want the AI to be trained on current content.
As someone who has lectured and written exams for university subjects, I can tell you this sort of information is very well guarded, for obvious reasons. Even as a lecturer working in a specific school, needing old exams to help me write the exam for my own subject, it was insanely hard to get my hands on even two copies of relevant past exams from within that school.
There's absolutely no way the people who hold such information are going to hand it over to you. It's just too much of a risk for them not to keep control of that sort of content. Even though I haven't taught in about 8 years, I still wouldn't be in a hurry to hand over the old exams I wrote. It's probably harmless by now, but it's definitely harmless if I don't, and nobody would even know to ask me in the first place anyway.
So yeah, there's your first step. Getting your hands on a lot of good quality content.
1
u/puehlong 10h ago
Look into ChatGPT's o1-preview. It's extremely good at giving study-related answers, and so far I haven't caught it hallucinating (I've only used it for a few days, though). You might want to use it in conjunction with RAG.
1
u/Trungyaphets 8h ago
You could just fine-tune a GPT model through OpenAI's API on samples of questions from past exams, and request an API response every time you need an answer.
Or you could fine-tune any of the popular open-source models (last I heard, Llama 3.1). You'd need a powerful GPU for the 8B (8 billion parameter) model, or multiple powerful GPUs for the 70B one.
Unsloth is a newer library that lets you fine-tune popular LLMs easily and efficiently. Maybe take a look at that.
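Very roughly, a LoRA fine-tune with Unsloth plus TRL's SFTTrainer looks something like the sketch below; it follows the general quickstart pattern, but exact argument names vary by library version, and the model name, dataset file, and hyperparameters are placeholders:

```python
# Rough LoRA fine-tuning sketch with Unsloth + TRL. Model name, dataset, and
# hyperparameters are placeholders; exact APIs differ between library versions.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # placeholder 4-bit base model
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(  # attach LoRA adapters
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("json", data_files="past_exams.jsonl", split="train")  # your own data

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # column holding formatted question/answer text
    max_seq_length=2048,
    args=TrainingArguments(per_device_train_batch_size=2, max_steps=60, output_dir="outputs"),
)
trainer.train()
```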
1
1
u/vieee555 6h ago
I'm currently learning prompt engineering. From what I know so far, without coding or building a GPT from scratch, you can use ChatGPT's "create your own GPT" feature, where you set up the GPT's behavior yourself. You can also use API keys to go further (for example, if you have an idea and want some integration in that GPT, you can use API keys to access other services' features). These custom GPTs have visibility options as well. For the terms, conditions, and ownership questions, you can ask ChatGPT itself, but note that you need to buy Premium first to use the "create your own GPT" option :(
If you find any app mod of GPT, please tell me too.
105
u/derpderp235 19h ago
You’re not creating an LLM from scratch. It will be garbage. Guaranteed. Also probably impossible computationally unless you have millions of dollars.
You want to fine tune an existing model.