Identifying Crime-Related Data from Anonymous Social Media with AI

Crime Sifter scrapes crime-related discussions from online forums and uses a locally hosted LLM to identify and classify crimes mentioned in the discussions.

The project was inspired by Sweden's growing crime epidemic. Public access to detailed crime data remains limited, and while traditional adverse media screening tools rely on mainstream sources, anonymous forums remain largely untapped for crime intelligence.

Crime Sifter addresses this gap by collecting crime-related threads from Flashback Forum, a popular Swedish discussion board, and identifying crimes in the discussions:

Web Scraping: Utilizing Go Colly to extract thread titles from crime discussion boards and storing them in an SQLite database.

LLM Classification: Passing thread titles through a locally hosted LLM (Llama 3.2 3B Instruct via GPT4All) to determine whether a crime was mentioned and categorize it accordingly.

Filtering & Analysis: Storing the LLM’s responses in a crime database for structured analysis of crime trends.

Online Forums

Anonymous forums like 4Chan and Flashback Forum are often analyzed for political sentiment, but their value as a source of crime data is relatively underexplored. These platforms host raw, unfiltered discussions where users openly discuss ongoing criminal cases, share unreported incidents, and sometimes reveal details before they appear in mainstream media.

Given the potential of these forums, I set out to explore whether they could serve as a useful alternative data source for crime analysis.

Using Crime Sifter, I built a corpus of data from crime-related discussions on Flashback.

Building a Crime Data Corpus with Signal Sifter

My goal was to apply Signal Sifter to a popular site with regular traffic and extensive discussions on crime in Sweden. After some research, I settled on Flashback Forum, which contains multiple boards dedicated to crime and court cases. These discussions offer a unique, crowdsourced view of crime trends and incidents.

Flashback, like 4Chan, is structured with boards that host various discussion threads. Each thread consists of posts and replies, making it a rich dataset for text analysis. By leveraging web scraping and natural language processing (NLP), I aimed to identify crime mentions in these discussions.

Data Collection and Processing

I used Crime Sifter to scrape crime-related boards on Flashback Forum. The process involved:

Web Scraping: Using Go Colly, I extracted threads from relevant crime boards and stored them in an SQLite database.
LLM Classification: Threads were passed through a locally hosted LLM (Llama 3.2 3B Instruct via GPT4All) to identify crimes mentioned in the discussions.
Filtering & Analysis: The model’s responses were stored in a crime database, enabling structured analysis of crime trends.
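
The classification step above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the label set, the prompt wording, the threads table layout, and the helper names (build_prompt, normalize, classify_pending) are all assumptions made for the example. The model call is passed in as a plain function so the sketch runs without a model; a real run could wrap a locally loaded model there instead, for instance through the GPT4All Python bindings.

```python
import sqlite3
from typing import Callable

# Hypothetical closed label set; the project's real categories are not documented.
LABELS = ["Assault", "Sexual assault", "Homicide", "Robbery", "Arson", "No crime"]


def build_prompt(title: str) -> str:
    """Ask the model to choose exactly one label for a thread title."""
    options = ", ".join(LABELS)
    return (
        "Classify the crime mentioned in this Swedish forum thread title.\n"
        f"Answer with exactly one of: {options}.\n"
        f"Title: {title}\n"
        "Answer:"
    )


def normalize(response: str) -> str:
    """Map free-text model output onto the label set, else 'Unknown'."""
    text = response.strip().lower()
    # Check longer labels first so "sexual assault" is not swallowed by "assault".
    for label in sorted(LABELS, key=len, reverse=True):
        if label.lower() in text:
            return label
    return "Unknown"


def classify_pending(conn: sqlite3.Connection, generate: Callable[[str], str]) -> int:
    """Classify every thread that has no label yet; return how many were processed."""
    rows = conn.execute("SELECT id, title FROM threads WHERE crime IS NULL").fetchall()
    for thread_id, title in rows:
        crime = normalize(generate(build_prompt(title)))
        conn.execute("UPDATE threads SET crime = ? WHERE id = ?", (crime, thread_id))
    conn.commit()
    return len(rows)


# Demo with an in-memory database and a stand-in for the model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE threads (id INTEGER PRIMARY KEY, title TEXT, crime TEXT)")
conn.executemany(
    "INSERT INTO threads (title) VALUES (?)",
    [("Kvinna rånad och dödad i Malmö",), ("Vem är dörrvakten?",)],
)


def fake_generate(prompt: str) -> str:
    # Stand-in for the local LLM call (assumption: the real call returns free text).
    return "Homicide." if "dödad" in prompt else "No crime"


classified = classify_pending(conn, fake_generate)
results = conn.execute("SELECT title, crime FROM threads ORDER BY id").fetchall()
```

Constraining the model to a closed label set and normalizing its free-text answer is one way to keep the stored categories consistent; anything that does not match falls into "Unknown" so it can be reviewed by hand.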

Data Schema and Key Insights

The collected dataset included both metadata and extracted crime details:

Crime-Related Data:

Crime type
Mentioned locations
Mentioned dates

Metadata:

Number of replies and views (proxy for public interest)
Sentiment analysis

I ranked threads by views and replies, on the assumption that higher engagement correlates with discussions containing significant crime-related information.
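
To make the schema and the engagement ranking concrete, here is a small Python sketch using the standard sqlite3 module. The table name, column names, and types are assumptions, since the original database layout is not shown, and the reply and view counts are placeholder numbers invented for the example rather than real Flashback statistics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout covering the fields listed above.
conn.execute(
    """
    CREATE TABLE crimes (
        thread_id      INTEGER PRIMARY KEY,
        title          TEXT NOT NULL,
        crime_type     TEXT,     -- label returned by the LLM
        location       TEXT,     -- location mentioned in the title, if any
        mentioned_date TEXT,     -- date mentioned in the title, if any
        replies        INTEGER,  -- engagement metadata
        views          INTEGER,  -- engagement metadata
        sentiment      REAL      -- sentiment score, if computed
    )
    """
)

# Titles come from the sample output; replies and views are made-up placeholders.
conn.executemany(
    "INSERT INTO crimes VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "24-åring knivskuren i Lund 11 mars", "Assault", "Lund", "11 mars", 40, 9000, None),
        (2, "Kvinna rånad och dödad i Malmö", "Homicide", "Malmö", None, 310, 52000, None),
        (3, "Vem är dörrvakten?", "No crime", None, None, 12, 1500, None),
    ],
)

# Rank by views first, then replies, as a simple proxy for public interest.
ranked = conn.execute(
    "SELECT title, crime_type, views, replies FROM crimes "
    "ORDER BY views DESC, replies DESC"
).fetchall()

for title, crime_type, views, replies in ranked:
    print(f"{views:>6} views  {replies:>4} replies  {crime_type:<10} {title}")
```

Sorting on views and then replies is only one possible ranking; a weighted score combining both would be an equally reasonable choice.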

Evaluating LLM Effectiveness for Crime Identification

Once I had a corpus of 66,000 threads, I processed them using Llama 3.2 3B Instruct, running locally to avoid the token costs associated with cloud-based models. However, hardware limitations were a major bottleneck: parsing 3,700 thread titles on my 8GB RAM laptop took over eight hours.
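
A quick back-of-the-envelope calculation shows what that rate implies. The sketch below takes the figures quoted above (3,700 titles, eight hours, 66,000 threads) and assumes the per-title time stays constant; since the run took "over" eight hours, the results are lower bounds.

```python
titles_processed = 3_700
hours_elapsed = 8.0      # "over eight hours", so treat this as a lower bound
corpus_size = 66_000

seconds_per_title = hours_elapsed * 3600 / titles_processed
corpus_hours = corpus_size * seconds_per_title / 3600
corpus_days = corpus_hours / 24

print(f"{seconds_per_title:.1f} s per title")
print(f"{corpus_hours:.0f} hours ({corpus_days:.1f} days) for the full corpus")
# prints:
# 7.8 s per title
# 143 hours (5.9 days) for the full corpus
```

At roughly 7.8 seconds per title, classifying all 66,000 threads on the same laptop would take about six days of continuous running.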

Despite the speed limitations, the model performed well in classifying crime mentions. Notably:

It excelled at identifying when no crime was mentioned, avoiding false positives.
However, it struggled with specificity, often labeling both sexual assault and physical assault as generic "Assault."
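
One way to put numbers on both observations is to compare the model's labels against a small hand-labeled sample. The Python sketch below is illustrative only: the label pairs are invented to mirror the two behaviors described above and are not measurements from the project, and the function names are hypothetical.

```python
from collections import Counter

# (hand_label, model_label) pairs. Illustrative values only, not project results.
pairs = [
    ("No crime", "No crime"),
    ("No crime", "No crime"),
    ("Sexual assault", "Assault"),
    ("Physical assault", "Assault"),
    ("Homicide", "Homicide"),
]


def false_positive_rate(pairs):
    """Share of hand-labeled 'No crime' titles where the model reported a crime."""
    predictions = [pred for truth, pred in pairs if truth == "No crime"]
    if not predictions:
        return None
    return sum(pred != "No crime" for pred in predictions) / len(predictions)


def confusions(pairs):
    """Count each (hand_label, model_label) pair where the two disagree."""
    return Counter((truth, pred) for truth, pred in pairs if truth != pred)


print("False positive rate:", false_positive_rate(pairs))
for (truth, pred), count in confusions(pairs).items():
    print(f"{truth} labeled as {pred}: {count}")
```

With real hand labels, a low false positive rate would back up the first observation, and the confusion counts would show how often distinct assault types collapse into the generic "Assault" label.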

Sample Output

Thread Title	Identified Crime
24-åring knivskuren i Lund 11 mars	Assault
Gruppvåldtäkt på 13-åring	Group sexual assault
Kvinna rånad och dödad i Malmö	Homicide
Stenkastning i Rinkeby mot polisen	Arson
Bilbomb i centrala London	Bomb threat
Vem är dörrvakten?	No crime

Takeaways and Future Work

This experiment demonstrated that online forums can provide valuable crime-related insights. Using LLMs to classify crime discussions is effective but resource-intensive.

Future improvements could include:

Fine-tuning the model for better crime categorization.
Exploring more efficient LLM hosting solutions.
Expanding data collection to include post content beyond just thread titles.

Sweden’s crime data challenges persist, but alternative sources like anonymous forums offer new opportunities for OSINT and risk analysis. By refining these methods, we can improve crime trend monitoring and enhance investigative research.

This work is part of an ongoing effort to explore unconventional data sources for crime intelligence.

If you're interested in OSINT, adverse media analysis, or data-driven crime research, feel free to connect!