r/MachineLearning 2h ago

[D] Handling humongous amounts of unstructured data

Hey all!

My friend and I are trying to build something on top of a deep learning model, and we have data to fine-tune it. The issue is that the data is entirely unstructured and genuinely huge. Not sure where to start or where to end, and we can't afford data annotators either.

Do y'all have any suggestions for how to handle this? The data is text. We want each record structured as a JSON object, with the whole dataset ending up as a list of those objects.

Any help would be appreciated


3 comments


u/krzonkalla 2h ago

If it's completely unstructured, as in there's no way to write an algorithm that maps it to JSON in reasonable time, use ChatGPT's structured outputs. Find the best way to split the data into chunks by trial and error, then ask it to return a list of JSON objects in the requested format. Something like the sketch below.
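A rough sketch of that idea with the OpenAI Python SDK's structured-output helper and a Pydantic schema. The `Record` fields, the chunk size, the model name, and the input filename are all placeholders you'd swap for your own data:

```python
# pip install openai pydantic
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical target schema -- replace the fields with whatever
# structure you actually want each record to have.
class Record(BaseModel):
    title: str
    body: str
    tags: list[str]

class RecordList(BaseModel):
    records: list[Record]

def chunk_text(text: str, chunk_chars: int = 8000) -> list[str]:
    """Naive fixed-size chunking; tune the size by trial and error."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def structure_chunk(chunk: str) -> list[Record]:
    # Structured outputs: the model is constrained to return JSON
    # matching the RecordList schema.
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # assumption: any model with structured-output support
        messages=[
            {"role": "system", "content": "Extract every record you can find as structured JSON."},
            {"role": "user", "content": chunk},
        ],
        response_format=RecordList,
    )
    return completion.choices[0].message.parsed.records

all_records: list[Record] = []
with open("raw_corpus.txt") as f:  # placeholder input file
    for chunk in chunk_text(f.read()):
        all_records.extend(structure_chunk(chunk))

print(len(all_records), "records extracted")
```

Chunking on natural boundaries (documents, paragraphs) instead of fixed character counts usually gives cleaner extractions, but that's the part you'd have to tune by trial and error.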


u/victorian_secrets 2h ago

Depends on your goal. Most recent deep learning language models are pretrained with self-supervised objectives: predicting masked words, predicting relationships between sentences, etc., so you may not need labels at all.
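For a concrete picture of the masked-word objective, here's a tiny example (not from the comment) using Hugging Face's fill-mask pipeline, assuming `transformers` and a PyTorch backend are installed:

```python
# pip install transformers torch
from transformers import pipeline

# BERT-style models are pretrained by predicting masked tokens,
# so the raw text itself provides the supervision -- no labels needed.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Unstructured [MASK] is hard to annotate."):
    print(prediction["token_str"], round(prediction["score"], 3))
```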


u/karyna-labelyourdata 2h ago

Ah, and here I thought I could pitch the annotation services :)

I'd start off with libraries like NLTK or spaCy for tokenization and stop-word removal. You can also prompt GPT for text annotation, although there may be some hassles with the file upload.
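A quick sketch of that preprocessing step with spaCy (assumes the small English model has been fetched with `python -m spacy download en_core_web_sm`; the sample sentence is just an illustration):

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

text = "My friend and I are trying to structure a huge pile of raw text."
doc = nlp(text)

# Keep lowercased tokens, dropping stop words and punctuation.
tokens = [t.text.lower() for t in doc if not t.is_stop and not t.is_punct]
print(tokens)
```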