Monitoring media is a common task. Non-profits like the GDELT project and ACLED provide automated solutions that go way beyond sentiment analysis. They're great, but what if you're tasked with solving the problem by yourself?
Google's RSS feed, Newspaper3k, and a zero-shot model get you surprisingly far in classifying hundreds of articles. Check out the project on GitHub.
I built Iran Media Monitoring as a test to see if you can replicate modern tools with a few basic scripts, and I came to the conclusion that you only need two workflows.
1) Scrape the information environment
The first workflow is a scraper that collects metadata and news articles from RSS feeds. The feed is set to Google's RSS feed, but you can change it in rss.py.
The script below collects links from Google's RSS feed:
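import xml.etree.ElementTree as ET

import requests

# RSS is the project's own feed-item model; import it from wherever it lives in the repo.


def get_rss() -> list[RSS]:
    print("Fetching RSS")
    r = requests.get("https://news.google.com/rss/search?hl=en-US&gl=US&ceid=US:en&q=Iran")
    print(r)  # prints the response status, e.g. <Response [200]>
    r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
    root = ET.fromstring(r.content)
    result = []
    # Each <item> element is one news entry in the feed.
    for i in root.findall('.//item'):
        source = i.find('source')
        url = source.get('url', None)
        rss = {
            "published_date": i.find('pubDate').text,
            "source_name": source.text,
            "domain": url,
            "title": i.find('title').text,
            "link": i.find('link').text,
        }
        rss = RSS(**rss)
        result.append(rss)
    return result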
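To see what the loop extracts without hitting the network, here is a minimal sketch that runs the same ElementTree lookups on a hand-written feed. The sample XML, the parse_items name, and the plain-dict output are assumptions for illustration; the real feed carries more fields and the project wraps each item in its RSS model. The sketch also skips items that lack a source tag, which the function above does not do.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an RSS search response (illustrative only).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Example headline</title>
      <link>https://news.google.com/rss/articles/EXAMPLE</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
      <source url="https://www.example.com">Example News</source>
    </item>
    <item>
      <title>Item without a source tag</title>
      <link>https://news.google.com/rss/articles/EXAMPLE2</link>
      <pubDate>Mon, 01 Jan 2024 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text: str) -> list[dict]:
    """Return one dict per <item>, skipping items that have no <source>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall(".//item"):
        source = item.find("source")
        if source is None:
            continue  # nothing to attribute the article to
        items.append(
            {
                "published_date": item.findtext("pubDate"),
                "source_name": source.text,
                "domain": source.get("url"),
                "title": item.findtext("title"),
                "link": item.findtext("link"),
            }
        )
    return items


for entry in parse_items(SAMPLE_FEED):
    print(entry["source_name"], entry["link"])
```

Whether to skip or keep items without a source depends on how strict the RSS model is; skipping is just the simplest choice for a sketch.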
Scraping newspapers uses a combination of Playwright and Newspaper3k. A simple request is not enough since many sites require JavaScript rendering.
Newspaper3k is an amazing library. You can use it to extract title, author, metadata, and body from common tags. This avoids custom scrapers and saves time in a news cycle where things have to be done yesterday.
One of the quirks of using Google's RSS feed is that the links are hashed and point to Google's servers rather than to the publisher. Opening one in a browser lands on Google's consent page first, so I worked around it with a give_consent() function that closes the consent page. It's not ideal.
Ideally you'd convert the hashes back into the original links and avoid hitting Google's servers, and the risk of getting blocked, altogether. I tried dehashing with a few libraries without success, likely because Google made some breaking changes to the link format.
Hashed links limit the number of requests and articles that the scraping script can handle in one go, but this is a small price to pay since concurrency isn't exactly necessary. Scraping one article at a time is still enough.
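Scraping one article at a time can be sketched as a plain loop that pauses between requests and skips failures. The scrape_sequentially helper, its fetch callback, and the delay value are hypothetical names chosen for illustration; they are not part of the project.

```python
import time
from typing import Callable, Iterable, Optional


def scrape_sequentially(
    links: Iterable[str],
    fetch: Callable[[str], Optional[dict]],
    delay_seconds: float = 2.0,
) -> list[dict]:
    """Fetch links one by one, pausing between requests and skipping failures."""
    results = []
    for index, link in enumerate(links):
        if index > 0:
            time.sleep(delay_seconds)  # space out requests after the first one
        try:
            article = fetch(link)
        except Exception as error:
            # One bad page should not stop the whole run.
            print(f"Skipping {link}: {error}")
            continue
        if article is not None:
            results.append(article)
    return results
```

In the real script, fetch would wrap the page.goto(), give_consent(), and create_article() steps; here it is left as a callback so the loop can be tried with any function.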
The script below walks you through each step of the scraping process.
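from datetime import datetime, timezone

from newspaper import Article as NewsArticle  # assumption: NewsArticle aliases Newspaper3k's Article
from playwright.sync_api import sync_playwright

# get_rss() is defined above; ArticlesStore is the project's own storage class.


def give_consent(page):
    """Dismiss Google's consent page if it shows up."""
    try:
        page.locator('button[jsname="b3VHJd"]').first.click()
        page.get_by_text("Accept all").first.click()
        page.wait_for_load_state("networkidle")
    except Exception:
        # No consent page (or the click failed): carry on with the current page.
        pass


def create_article(page) -> NewsArticle:
    """Download and parse the article at the URL the browser ended up on."""
    try:
        article = NewsArticle(page.url)
        article.download()
        article.parse()
        return article
    except Exception as e:
        print(f"Could not parse {page.url}: {e}")
        return None


def get_publisher(article):
    """Prefer the Open Graph site name, fall back to the publisher meta tag."""
    meta = article.meta_data
    publisher = meta.get('og', {}).get('site_name') or meta.get('publisher')
    return publisher


feed = get_rss()
if not feed:
    print("Feed empty")

store = ArticlesStore()  # create the store once instead of once per article

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    for item in feed:
        print(item.link)
        page.goto(str(item.link))
        give_consent(page)
        article = create_article(page)
        if not article:
            continue
        article.nlp()  # fills in article.summary and article.keywords
        publisher = get_publisher(article)
        article_dict = {
            "collectionDate": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
            "processed": False,
            "title": article.title,
            "author": article.authors,
            "publishedAt": str(article.publish_date),
            "publisher": publisher,
            "language": article.meta_lang,
            "sourceUrl": article.source_url,
            "summary": article.summary,
            "keywords": article.keywords,
            "description": article.meta_description,
            "bodyText": article.text,
        }
        store.insert_article(article_dict)
    browser.close()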
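The fallback in get_publisher() can be tried on plain dictionaries. This is a small sketch, assuming the metadata has the nested shape that function expects: an 'og' dictionary with a 'site_name' key, or a top-level 'publisher' key. The pick_publisher name and the sample values are made up, and the isinstance check is an extra guard for pages where 'og' turns out not to be a dictionary.

```python
from typing import Optional


def pick_publisher(meta: dict) -> Optional[str]:
    """Return the Open Graph site name if present, else the publisher value."""
    og = meta.get("og")
    # Guard against 'og' being missing or not a dictionary.
    site_name = og.get("site_name") if isinstance(og, dict) else None
    return site_name or meta.get("publisher")


print(pick_publisher({"og": {"site_name": "Example News"}, "publisher": "Other"}))
print(pick_publisher({"publisher": "Other"}))
print(pick_publisher({}))
```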
2) Sentiment Analysis
The second workflow performs sentiment analysis on the collected articles and upserts the results to MongoDB.
The media monitoring tool uses what's called a zero-shot model, meaning it performs a task it was never explicitly trained to do. That lets us skip collecting and labeling a bunch of training data.
The zero-shot model gets a set of candidate labels and uses its general knowledge to decide which one fits the text best. In our case the task is classifying each article as positive, factual, or negative.
from transformers import pipeline

# ArticlesStore is the project's own storage class.


class SentimentAnalyzer():
    def __init__(self):
        self.db = ArticlesStore()
        self.sentiment_analyzer = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    def process_articles(self):
        articles = self.db.fetch_articles()
        for article in articles:
            # Only classify articles that have not been processed yet.
            if not article.get("processed", False):
                sentiment_score = self.sentiment_analyzer(
                    article['bodyText'],
                    ["positive", "factual", "negative"]
                )
                # Labels come back sorted by score, so index 0 is the best match.
                result_dict = {
                    "processed": True,
                    "analysis.sentiment": {
                        "tone": sentiment_score['labels'][0],
                        "score": sentiment_score['scores'][0]
                    }
                }
                print(f"Processing {article['_id']}: {result_dict}")
                self.db.upsert_article(article["_id"], result_dict)
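The classifier returns its labels sorted from most to least likely, which is why process_articles() reads index 0. Below is a minimal sketch of turning such a result into the stored fields, with an optional confidence cutoff. The top_label helper, the 0.5 threshold, the "uncertain" fallback, and the sample scores are all assumptions for illustration, not project code or real model output.

```python
def top_label(result: dict, min_score: float = 0.5, fallback: str = "uncertain") -> dict:
    """Pick the highest-scoring label, or the fallback if the score is too low."""
    # Pair labels with scores and take the maximum explicitly,
    # so the helper does not rely on the input being pre-sorted.
    label, score = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    tone = label if score >= min_score else fallback
    return {"tone": tone, "score": score}


# Made-up result in the shape a zero-shot classification pipeline returns.
sample = {
    "sequence": "Example article text",
    "labels": ["factual", "negative", "positive"],
    "scores": [0.72, 0.21, 0.07],
}
print(top_label(sample))
```

A cutoff like this is one way to avoid storing a tone when the three labels score close together; the right threshold would have to be tuned on real articles.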
Fun stats
To be continued... In the meantime, clone the repo and have fun!
https://github.com/AlbinTouma/Iran-War-Media.git