Exploring Crime Discussions on Swedish Forums Using LLMs
Crime Sifter scrapes crime-related discussions from online forums and uses a locally hosted LLM to identify and classify crimes mentioned in the discussions.
The project was inspired by Sweden's growing crime epidemic. Public access to detailed crime data remains limited, and while traditional adverse media screening tools rely on mainstream sources, anonymous forums remain largely untapped for crime intelligence.
Crime Sifter addresses this gap by collecting crime-related threads from Flashback Forum, a popular Swedish discussion board, and identifying crimes in the discussions:
Web Scraping: Utilizing Go Colly to extract thread titles from crime discussion boards and storing them in an SQLite database.
LLM Classification: Passing thread titles through a locally hosted LLM (Llama 3.2 3B Instruct via GPT4All) to determine if a crime was mentioned and categorize it accordingly.
Filtering & Analysis: Storing the LLM’s responses in a crime database for structured analysis of crime trends.
Online Forums
Anonymous forums like 4chan and Flashback Forum are often analyzed for political sentiment, but their value as a source for crime research remains relatively underexplored. These platforms host raw, unfiltered discussions where users openly discuss ongoing criminal cases, share unreported incidents, and sometimes even reveal details before they appear in mainstream media.
Given the potential of these forums, I set out to explore whether they could serve as a useful alternative data source for crime analysis.
Using Crime Sifter, I built a corpus of data from crime-related discussions on Flashback, a well-known Swedish forum.
Building a Crime Data Corpus with Signal Sifter
My goal was to apply Signal Sifter to a popular site with regular traffic and extensive discussions on crime in Sweden. After some research, I settled on Flashback Forum, which contains multiple boards dedicated to crime and court cases. These discussions offer a unique, crowdsourced view of crime trends and incidents.
Flashback, like 4chan, is structured into boards that host discussion threads. Each thread consists of posts and replies, making it a rich dataset for text analysis. By combining web scraping and natural language processing (NLP), I aimed to identify crime mentions in these discussions.
Data Collection and Processing
I used Crime Sifter to scrape crime-related boards on Flashback Forum. The process involved:
Web Scraping: Using Go Colly, I extracted threads from relevant crime boards and stored them in an SQLite database.
LLM Classification: Thread titles were passed through a locally hosted LLM (Llama 3.2 3B Instruct via GPT4All) to identify crimes mentioned in the discussions.
Filtering & Analysis: The model’s responses were stored in a crime database, enabling structured analysis of crime trends.
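The classification and storage steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the `threads` and `crimes` table names, their columns, and the `classify_pending` helper are all assumptions, and `keyword_stub` is only a placeholder standing in for the call to the locally hosted model. The scraping step (done with Go Colly in the project) is not shown; the sketch assumes the titles are already in SQLite.

```python
import sqlite3
from typing import Callable


def classify_pending(conn: sqlite3.Connection, classify: Callable[[str], str]) -> int:
    """Classify every thread title that has no row in `crimes` yet.

    Table and column names are hypothetical, chosen for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crimes ("
        " thread_id INTEGER PRIMARY KEY REFERENCES threads(id),"
        " crime TEXT NOT NULL)"
    )
    pending = conn.execute(
        "SELECT t.id, t.title FROM threads t"
        " LEFT JOIN crimes c ON c.thread_id = t.id"
        " WHERE c.thread_id IS NULL"
    ).fetchall()
    for thread_id, title in pending:
        conn.execute(
            "INSERT INTO crimes (thread_id, crime) VALUES (?, ?)",
            (thread_id, classify(title)),
        )
        # Commit per row so a long run can be interrupted and resumed.
        conn.commit()
    return len(pending)


def keyword_stub(title: str) -> str:
    """Placeholder for the LLM call; NOT the model used in the project."""
    return "Homicide" if "dödad" in title.lower() else "No crime"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO threads (title) VALUES (?)",
    [("Kvinna rånad och dödad i Malmö",), ("Vem är dörrvakten?",)],
)
processed = classify_pending(conn, keyword_stub)
rows = conn.execute("SELECT thread_id, crime FROM crimes ORDER BY thread_id").fetchall()
print(processed, rows)
```

Passing the classifier in as a function keeps the database logic independent of the model, so the stub can be swapped for a real model call without touching the storage code. Because already-classified threads are skipped, re-running the function after an interruption only processes what is left.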
Data Schema and Key Insights
The collected dataset included both metadata and extracted crime details:
Crime-Related Data:
Crime type
Mentioned locations
Mentioned dates
Metadata:
Number of replies and views (proxy for public interest)
Sentiment analysis
I ranked threads by views and replies, on the assumption that higher engagement indicates discussions containing significant crime-related information.
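One way the schema and the engagement ranking could look in SQLite is sketched below. The table name, column names, and types are assumptions made for illustration, and the reply and view counts are invented placeholder values, not figures from the actual dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE thread_crimes (
        thread_id  INTEGER PRIMARY KEY,
        title      TEXT NOT NULL,
        crime_type TEXT,     -- label returned by the model
        location   TEXT,     -- place mentioned in the title, if any
        date_text  TEXT,     -- date as mentioned, kept as raw text
        replies    INTEGER NOT NULL DEFAULT 0,
        views      INTEGER NOT NULL DEFAULT 0,
        sentiment  REAL      -- hypothetical score, e.g. -1.0 to 1.0
    )
    """
)

# Placeholder rows: the engagement numbers are made up for this example.
conn.executemany(
    "INSERT INTO thread_crimes"
    " (thread_id, title, crime_type, location, date_text, replies, views)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "24-åring knivskuren i Lund 11 mars", "Assault", "Lund", "11 mars", 40, 12000),
        (2, "Kvinna rånad och dödad i Malmö", "Homicide", "Malmö", None, 310, 95000),
        (3, "Vem är dörrvakten?", "No crime", None, None, 15, 3000),
    ],
)

# Rank threads that mention a crime by engagement: views first, then replies.
ranked = conn.execute(
    "SELECT title, crime_type FROM thread_crimes"
    " WHERE crime_type <> 'No crime'"
    " ORDER BY views DESC, replies DESC"
).fetchall()
for title, crime in ranked:
    print(f"{crime}: {title}")
```

Keeping the mentioned date as raw text avoids guessing a year for titles such as "11 mars"; it can be normalized later if the thread's posting date is also stored.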
Evaluating LLM Effectiveness for Crime Identification
Once I had a corpus of 66,000 threads, I processed them using Llama 3.2 3B Instruct, running locally to avoid the token costs of cloud-based models. However, hardware limitations were a major bottleneck: parsing 3,700 thread titles on my 8 GB RAM laptop took over eight hours.
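To put that bottleneck in perspective, here is a quick back-of-the-envelope calculation using only the figures above. It treats "over eight hours" as exactly eight, so the results are lower bounds, and it assumes the per-title cost stays constant across the corpus.

```python
titles_done = 3_700
hours_taken = 8        # "over eight hours", so this is a lower bound
corpus_size = 66_000

seconds_per_title = hours_taken * 3600 / titles_done
hours_for_corpus = corpus_size * seconds_per_title / 3600

print(f"{seconds_per_title:.1f} s per title")
print(f"{hours_for_corpus:.0f} h (~{hours_for_corpus / 24:.1f} days) for the full corpus")
# 7.8 s per title
# 143 h (~5.9 days) for the full corpus
```

At this rate, classifying all 66,000 titles on the same laptop would take roughly six days of continuous processing.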
Despite the speed limitations, the model performed well in classifying crime mentions. Notably:
It excelled at identifying when no crime was mentioned, avoiding false positives.
However, it struggled with specificity, often labeling both sexual assault and physical assault as generic "Assault."
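One possible way to push the model toward more specific answers is to give it a fixed list of labels and reject replies that fall outside it. The sketch below is an assumption about how this could be done, not the prompt used in the project: the label list, the prompt wording, and the `build_prompt` and `parse_label` helpers are all hypothetical.

```python
# Hypothetical label set; a real taxonomy would need more categories.
LABELS = [
    "No crime",
    "Homicide",
    "Sexual assault",
    "Physical assault",
    "Robbery",
    "Arson",
    "Bomb threat",
    "Other crime",
]


def build_prompt(title: str) -> str:
    """Build a prompt that asks for exactly one label from LABELS."""
    options = "\n".join(f"- {label}" for label in LABELS)
    return (
        "Classify the crime mentioned in this Swedish forum thread title.\n"
        f"Answer with exactly one label from this list:\n{options}\n\n"
        f"Title: {title}\nLabel:"
    )


def parse_label(response: str) -> str:
    """Map a free-text model reply onto LABELS.

    Longer labels are tried first. A reply that names no known label
    (for example a bare "Assault") is returned as "Unparsed".
    """
    text = response.strip().lower()
    for label in sorted(LABELS, key=len, reverse=True):
        if label.lower() in text:
            return label
    return "Unparsed"


print(parse_label(" Sexual assault.\n"))
print(parse_label("Assault"))
```

With this approach a generic "Assault" reply no longer enters the database silently. It comes back as "Unparsed" and can be retried or reviewed by hand.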
Sample Output
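| Thread Title | English Translation | Identified Crime |
|---|---|---|
| 24-åring knivskuren i Lund 11 mars | 24-year-old slashed with a knife in Lund, 11 March | Assault |
| Gruppvåldtäkt på 13-åring | Gang rape of a 13-year-old | Group sexual assault |
| Kvinna rånad och dödad i Malmö | Woman robbed and killed in Malmö | Homicide |
| Stenkastning i Rinkeby mot polisen | Stone-throwing at police in Rinkeby | Arson |
| Bilbomb i centrala London | Car bomb in central London | Bomb threat |
| Vem är dörrvakten? | Who is the bouncer? | No crime |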
Takeaways and Future Work
This experiment demonstrated that online forums can provide valuable crime-related insights. Using LLMs to classify crime discussions is effective but resource-intensive. Future improvements could include:
Fine-tuning the model for better crime categorization.
Exploring more efficient LLM hosting solutions.
Expanding data collection to include post content beyond just thread titles.
Sweden’s crime data challenges persist, but alternative sources like anonymous forums offer new opportunities for OSINT and risk analysis. By refining these methods, we can improve crime trend monitoring and enhance investigative research.
This work is part of an ongoing effort to explore unconventional data sources for crime intelligence. If you're interested in OSINT, adverse media analysis, or data-driven crime research, feel free to connect!