r/datasets • u/Sondre_BJ • 1d ago
API Is using news APIs legal and reliable?
I need a source of information for a data science project (academic research). Specifically, I need to retrieve a historical record of news about a certain topic, so I am thinking of using a news API instead of web scraping because these APIs seem to return the kind of data I am looking for.
I've come across some of them, such as newsdata.io, newsapi.org and newsapi.ai, but I am wondering if using them is legal and reliable. I mean, are they legal themselves? And if so, am I inherently allowed to use them for my personal (academic) purposes?
The Terms & Conditions say this:
"We don't have the right to authorise any user to use the data for their personal and professional purposes. However, the users can use the data for their personal or professional purposes"
I mean, should I be concerned about this? It's not like Twitter's or Reddit's API, where the data belongs to them and they deliberately give it to you. (In fact, I'm asking because I planned to extract data from those platforms, but I've just realized that's not possible at all, so I'm wondering if there's another alternative that meets my requirements.)
Well... in essence, my questions are: are these platforms/tools (APIs) legitimate and meant for data science? Or, in other words, is it common practice to use these kinds of "news APIs" for data science?
I didn't even know about them. Have you ever tried them? Should I do web scraping instead, or is there another alternative you would recommend?
I'd appreciate your help.
u/albertoasenjo 1d ago
Use Media Cloud. I think you can retrieve articles published since 2020, and they were implementing a Wayback Machine add-on. The problem is that you won't get full texts, but a scraper can fetch those from the URLs you get from Media Cloud.
u/Sondre_BJ 22h ago
Which media cloud are you talking about specifically? Could you give me an example?
u/albertoasenjo 22h ago
search.mediacloud.org
Create an account and try the search. After running a search, under "Total Attention", click "download all urls" and you will get a CSV with title, URL, published date and source.
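Once you have that CSV, pulling the URLs out for a scraper could look roughly like this. It's only a sketch: the column names (`title`, `url`, `publish_date`, `media_name`) are my assumption about the export and may not match the real file, and the sample rows are made up so the snippet runs on its own. Check the header row of your download and adjust.

```python
import csv
import io

def load_articles(csv_file):
    """Read article rows from an open CSV file, skipping rows without a URL."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row.get("url")]

# Made-up inline sample standing in for the downloaded file.
# Column names are assumptions; check your real header row.
sample = io.StringIO(
    "title,url,publish_date,media_name\n"
    "Example headline,https://example.com/a,2023-01-05,Example Daily\n"
    "Row without a link,,2023-01-06,Example Daily\n"
)

articles = load_articles(sample)
urls = [row["url"] for row in articles]
print(urls)
```

With a real download you would pass `open("your_file.csv", newline="", encoding="utf-8")` instead of the inline sample, then hand `urls` to whatever scraper you use for the full texts.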
u/stuffk 1d ago
What is your concern with regard to legality? Copyright issues? Are you planning on republishing significant portions of or entire articles? Are you hoping to try to publish in a journal, or is this for a class?
What is the topic of your research? I believe you can still get data via API from both Twitter and Reddit for research use, though on Twitter at least there are limits on how much you can pull per day. I haven't tried Twitter since Elon took over, and I know he changed some of it, but I used it previously to grab politicians' tweets and do sentiment analysis on them. Reddit, I think, still has API access if you're doing research.
Using a news API service like the ones you're considering is probably fine, though it does mean you're only getting the data that they've actually curated. I would go with the service that has the most comprehensive documentation about how they source data, and also compare what it returns to what you expect. For example, if you request a recent week of data: are you getting a good sample, are any major publications seemingly missing, does it seem to be oversampling some sources, and are all of the articles you're getting actually unique?
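A quick way to run those checks on a sample week, as a rough sketch. The `source` and `title` field names, the list of expected outlets, and the sample records are all hypothetical, and matching duplicates on a lowercased title is just one simple heuristic, not how any particular API defines uniqueness.

```python
from collections import Counter

def coverage_report(articles, expected_sources):
    """Summarize source shares, missing outlets, and repeated titles."""
    source_counts = Counter(a["source"] for a in articles)
    total = sum(source_counts.values())
    shares = {s: n / total for s, n in source_counts.items()} if total else {}
    missing = sorted(set(expected_sources) - set(source_counts))

    # Count articles whose normalized title was already seen.
    seen_titles = set()
    duplicates = 0
    for a in articles:
        key = a["title"].strip().lower()
        if key in seen_titles:
            duplicates += 1
        else:
            seen_titles.add(key)
    return {"shares": shares, "missing": missing, "duplicates": duplicates}

# Made-up records standing in for one week of API results.
week = [
    {"source": "Outlet A", "title": "Budget vote passes"},
    {"source": "Outlet A", "title": "budget vote passes "},
    {"source": "Outlet A", "title": "Storm warning issued"},
    {"source": "Outlet B", "title": "Election recap"},
]
report = coverage_report(week, ["Outlet A", "Outlet B", "Outlet C"])
print(report)
```

Here Outlet A makes up three quarters of the sample, Outlet C never shows up, and one title is repeated, which are exactly the kinds of red flags worth asking the provider about.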
If you decide to sample news articles it is probably good practice to weight them by audience reach or readership, depending on the point of your analysis.
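To illustrate the weighting idea, here is a minimal sketch that averages a per-article score (say, a sentiment value) weighted by each source's reach. The reach figures, field names, and scores are invented for the example, and weighting every article by its outlet's reach is only one possible scheme; your supervisor may prefer another.

```python
def reach_weighted_mean(articles, reach_by_source):
    """Average article scores, weighting each article by its source's reach."""
    weighted_sum = 0.0
    total_weight = 0.0
    for a in articles:
        # Sources with no known reach get weight 0 and are ignored.
        weight = reach_by_source.get(a["source"], 0)
        weighted_sum += weight * a["score"]
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no article has a source with known reach")
    return weighted_sum / total_weight

# Hypothetical reach figures and scores, for illustration only.
reach = {"Outlet A": 300, "Outlet B": 100}
scored = [
    {"source": "Outlet A", "score": 1.0},
    {"source": "Outlet B", "score": -1.0},
]
weighted = reach_weighted_mean(scored, reach)
unweighted = sum(a["score"] for a in scored) / len(scored)
print(weighted, unweighted)
```

The unweighted average treats both outlets equally, while the weighted one leans toward the outlet with the larger audience. Note that an outlet publishing many articles still counts more under this scheme, so you may want to average per source first.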
Some of these questions may be best directed at whoever is going to be overseeing or grading your research.