Estimating global news coverage with DBPedia and open sources

When I joined Comply Advantage's Adverse Media team, they explained that they wanted to identify gaps in their data but had no way of knowing which news sources exist in different countries.

Identifying the world’s news sources is not straightforward because there is no global directory of newspapers. I therefore designed a research process for collating a list of the news sources in each country from open sources.

DBPedia & Wikipedia Lists

Wikipedia is a rich source of information for identifying newspapers in a country. Searching for “list of newspapers in [country]” usually leads to a dedicated list page.

Wikipedia’s strengths:

Its lists usually mark each newspaper as national or regional, which helps when labelling sources later.
Most newspapers have their own dedicated Wikipedia page with details like circulation, owners, and publication status.

You can use the DBPedia scraper to extract information about newspapers in a country:

Add the name of the country to the query.
The script iterates through all hyperlinks on the page.
It extracts information from the infoboxes.

Note:
The DBPedia scraper also extracts irrelevant pages (e.g., towns or regions). You can filter the results to the newspaper type if you want to avoid cleaning the data, but filtering reduces the number of hits because DBPedia data isn’t always properly structured. My recommendation is not to filter, as the ratio of irrelevant data to newspapers is low.
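
As a rough illustration of the steps above, the sketch below builds a SPARQL query for the pages linked from a country's list page and sends it to DBpedia's public endpoint. It is not the scraper described here. The function names are hypothetical, the list page is assumed to be titled "List of newspapers in [country]", and the use of dbo:wikiPageWikiLink and dbo:Newspaper is my own choice of terms. The optional type filter mirrors the trade-off in the note.

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia SPARQL endpoint (assumed to be reachable from your network).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_query(country, newspapers_only=False):
    """Build a SPARQL query for every page linked from the country's list page."""
    # Assumption: the list page follows the "List of newspapers in X" pattern.
    page = "List_of_newspapers_in_" + country.strip().replace(" ", "_")
    # Optional filter: keeps only resources typed as newspapers, at the cost
    # of dropping newspapers whose DBpedia entry lacks that type.
    type_filter = "?page a dbo:Newspaper ." if newspapers_only else ""
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?page ?label WHERE {{
  <http://dbpedia.org/resource/{page}> dbo:wikiPageWikiLink ?page .
  {type_filter}
  OPTIONAL {{ ?page rdfs:label ?label . FILTER (lang(?label) = "en") }}
}}
"""


def parse_rows(payload):
    """Flatten a SPARQL JSON result into a list of {"uri", "label"} dicts."""
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append({
            "uri": binding["page"]["value"],
            "label": binding.get("label", {}).get("value", ""),
        })
    return rows


def fetch_linked_pages(country, newspapers_only=False):
    """Run the query over HTTP and return the parsed rows."""
    params = urllib.parse.urlencode({
        "query": build_query(country, newspapers_only),
        "format": "application/sparql-results+json",
    })
    url = f"{DBPEDIA_ENDPOINT}?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_rows(json.load(resp))
```

Calling fetch_linked_pages("France") would return every linked page, newspapers and towns alike, while newspapers_only=True applies the type filter. Infobox details such as circulation or owner would need further properties added to the SELECT clause.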

Official Directories

Official directories list registered national and regional newspapers. They can often be found through a Google search or on the website of a country’s press ombudsman.

While these directories may only provide the name and domain of media sources, their official status makes them valuable sources of truth.

Open-source Projects and Unofficial Directories

Several unofficial directories and open-source projects contain names and domains of thousands of newspapers worldwide:

WorldMap: An open-source project for global newspapers.
Blog posts that list a country’s newspapers for language learners.
Media directories hosted by companies.

These sources can supplement official directories but may require verification before inclusion.

Collating Multiple Sources into a Knowledge Base

To estimate the media sources in a country, we create a country research sheet on Google Drive and store the data from each directory in its own tab:

Wikipedia: Use the DBPedia scraper on the country’s “list of newspapers” page. Clean the data to remove ceased publications.
WorldMap: Run the world newspaper map scraper to collect data.
Official Directories: Add data from press directories, either manually or with a script.
Company Data: Include our existing database in its own tab. Filter by country code and extract domains for that country.
Other Sources: Add any additional identified sources.
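
The cleaning steps in this list can be sketched in a few lines of Python. This is a minimal illustration, not our actual tooling: the field names ("url", "country_code", "ceased") and the function names are assumptions made for the example.

```python
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the lower-cased host of a URL without a leading "www."."""
    url = url.strip()
    # urlsplit only finds the host when a scheme is present, so add one.
    if "://" not in url:
        url = "http://" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domains_for_country(records, country_code):
    """Collect the distinct domains of records tagged with a country code."""
    wanted = country_code.upper()
    domains = {
        extract_domain(record["url"])
        for record in records
        if record.get("url")
        and record.get("country_code", "").upper() == wanted
    }
    domains.discard("")  # drop rows whose URL had no usable host
    return sorted(domains)


def drop_ceased(rows):
    """Keep only rows that are not marked as ceased publications."""
    return [row for row in rows if not row.get("ceased")]
```

Normalising every source to a bare domain before it goes into a tab makes the later merge much simpler, because the same newspaper then has the same key regardless of which directory it came from.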

Labelling and Collating Sources

The next step is to label each source manually according to our taxonomy.

Process:

Merge all research tabs into a master list that serves as our knowledge base of media sources in a country.
Filter out records that don’t fit our taxonomy.
Label sources by type (e.g., national, regional, tabloid).
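
The merge itself is mechanical and can be scripted, while the labels still come from a human reviewer. The sketch below assumes each tab has been exported as a list of rows with a "domain" key. The function names, the "exclude" marker, and the set of allowed labels are hypothetical and only loosely follow the examples given above.

```python
# Example label set; the real taxonomy may contain more types.
ALLOWED_LABELS = {"national", "regional", "tabloid"}


def merge_tabs(tabs):
    """Merge {tab_name: [rows]} into one record per domain."""
    master = {}
    for tab_name, rows in tabs.items():
        for row in rows:
            domain = row["domain"].lower()
            record = master.setdefault(
                domain,
                {"domain": domain, "name": "", "found_in": [], "label": None},
            )
            # Keep the first non-empty name seen for this domain.
            if not record["name"]:
                record["name"] = row.get("name", "")
            # Remember every directory that lists the source.
            if tab_name not in record["found_in"]:
                record["found_in"].append(tab_name)
    return sorted(master.values(), key=lambda record: record["domain"])


def apply_labels(master, labels):
    """Attach manual labels and drop records marked as outside the taxonomy."""
    kept = []
    for record in master:
        label = labels.get(record["domain"])
        if label == "exclude":
            continue
        if label is not None and label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label {label!r} for {record['domain']}")
        kept.append({**record, "label": label})
    return kept
```

Keeping the found_in list on each record is a design choice: a source that appears in several independent directories is more likely to be a real, active publication than one that appears in a single unofficial list.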

The Result

The result is a comprehensive list of:

The newspapers the Adverse Media team currently covers.
The sources we believe exist in each country.

With this collated list:

Our sales team can identify the number and types of media sources we cover.
We can spot data gaps and review new sources for inclusion.
Sources rejected in the review are flagged with the reason for exclusion, which keeps the database clear and up to date.
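
Once both lists are keyed by domain, spotting gaps comes down to a set difference. The following is a minimal sketch under that assumption. The function name and the shape of the rejected mapping (domain to reason) are illustrative rather than a description of our database.

```python
def coverage_report(master_domains, covered_domains, rejected):
    """Compare the sources believed to exist with the sources already covered.

    rejected maps a domain to the reason it was excluded in review.
    """
    known = set(master_domains)
    covered = set(covered_domains)
    rejected_known = known & set(rejected)
    # A gap is a known source that is neither covered nor already rejected.
    gaps = known - covered - rejected_known
    return {
        "known": len(known),
        "covered": len(known & covered),
        "to_review": sorted(gaps),
        "rejected": {domain: rejected[domain] for domain in sorted(rejected_known)},
    }
```

The known and covered counts give the sales team its coverage figures, and to_review is the queue of candidate sources for the next review.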