Identifiers and linking
These notes explore ways to enrich the existing Full Fact Profile of Schema.org with additional links and identifiers.
They cover:
- a brief summary of how Schema.org handles identifiers and links
- linking to future API endpoints
- linking to external datasets
A quick detour on RDF / Linked Data
You can probably skip this, but it might be useful context for what follows!
In RDF, resources either:
- have a unique URI
- don’t have a unique URI, but can (sometimes) be uniquely identified via their other properties. Resources without a unique URI are usually called “blank nodes”.
Having a unique URI is usually better than using blank nodes. We can use that URI to publish more data about the same resource (e.g. as Annotations) and link to the resource from other data. URIs let us build a web of data.
If we don’t have a unique identifier for a resource then we can’t do those things. So we usually need to identify the resource in other ways. This can be done by providing other identifiers. Or by providing equivalence links that explicitly say that two things are the same.
Creating URIs for things and using them to link to other data on the web is what Linked Data is all about.
Many organisations have struggled to adopt Linked Data. There are many reasons for this, but understanding how to assign and manage unique URIs for things is a common problem.
Schema.org was designed to allow people to publish data on the web without having to create URIs for things. It encourages a pattern of uniquely identifying things based on their properties.
You can publish Linked Data using Schema.org as your schema, but you don’t have to.
A quick tour of identifiers and links in Schema.org
There are three core properties that Schema.org uses to help identify things:
- url – provides a URL for a resource. In some cases this might also uniquely identify the resource
- identifier – refers to any form of non-URI identifier, e.g. a string, number or GUID. There are some sub-properties for well-known types of identifier, like isbn
- sameAs – refers to a web page that unambiguously identifies a resource. For example a company homepage or a Wikipedia or Wikidata entry
These three properties help to identify and link together resources without imposing requirements about what form of identifiers to use. Or getting too hung up on which web pages uniquely identify something.
It’s fuzzy and a bit scruffy, but pragmatic when your goal is to let anyone contribute to the web of data.
But, like everything with Schema.org, we can take a more principled approach:
- when publishing data as JSON-LD, we can assign URIs to key resources using the @id property
- ensuring that resources always have some form of identifier, whether expressed as an @id, url or identifier, to help consumers to manage data they have consumed from a Full Fact webpage or via an API, e.g. to index or use it locally
- including useful public identifier values to facilitate indirectly matching data against other datasets
- including sameAs references to directly link to useful third-party datasets
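The point above about ensuring every resource has some form of identifier can be sketched in code. This is a minimal Python illustration of how a data consumer might choose a key for indexing a resource locally. The function name best_identifier and the preference order (@id, then url, then identifier) are assumptions made for this example and are not part of the Full Fact profile.

```python
from typing import Optional


def best_identifier(resource: dict) -> Optional[str]:
    """Return the first usable identifier for a JSON-LD resource, or None.

    Assumed preference order (illustrative only): @id, url, identifier.
    """
    for key in ("@id", "url", "identifier"):
        value = resource.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    # No identifier of any kind: the resource is effectively a blank node.
    return None


review = {
    "@type": "ClaimReview",
    "identifier": "56f991a5-fa3b-4333-8425-ead632fd2024",
    "url": "https://fullfact.org/health/german-astrazeneca-8-percent-handelsblatt/",
}
author = {"@type": "Organization", "name": "Public"}

print(best_identifier(review))  # falls back to url because there is no @id
print(best_identifier(author))  # None: nothing to index this resource by
```

A resource for which this returns None can only be matched by comparing its other properties, which is the situation the recommendations here try to avoid.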
Using @id to link to API endpoints
Here’s an example of assigning URIs to a Full Fact ClaimReview and Claim by adding an @id property:
{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "@id": "https://api.fullfact.org/reviews/56f991a5-fa3b-4333-8425-ead632fd2024",
  "identifier": "56f991a5-fa3b-4333-8425-ead632fd2024",
  "url": "https://fullfact.org/health/german-astrazeneca-8-percent-handelsblatt/",
  "dateModified": "2021-01-26",
  "description": "A report in German newspaper Handelsblatt claimed that...",
  "itemReviewed": {
    "@type": "Claim",
    "@id": "https://api.fullfact.org/reviews/56f991a5-fa3b-4333-8425-ead632fd2024/claims/1",
    ...
  },
  ...
}
In JSON-LD, the @id property is used to assign a globally unique URI to a resource. That URI would (ideally) return the data about the ClaimReview (or Claim) and possibly more data than might be embedded by default in the website.
The @id property is ignored by Google. Adding it to the Rich Results Tester does not generate any errors or warnings.
Using the @id attribute in this way allows us to:
- assign a unique URI to the ClaimReview (or a Claim, or any other resource in the data) in a way that is useful for consumers that want to parse and use the JSON-LD as Linked Data
- provide a link between data embedded in web pages and data available from the API
- use JSON-LD as the basis for building a RESTful API for the data, using @id as the means for internal linking between resources. This gives flexibility in how the URI scheme is designed
- progressively enhance the data with more identifiers and internal links, as the API is developed
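To illustrate how @id supports internal linking, here is a small Python sketch that walks a JSON-LD document and builds a lookup table from each @id to the resource that carries it. The helper name index_by_id is hypothetical, and the api.fullfact.org URIs are simply the illustrative ones from the example above, not live endpoints.

```python
import json


def index_by_id(node, index=None):
    """Recursively collect every object that carries an @id, keyed by that URI."""
    if index is None:
        index = {}
    if isinstance(node, dict):
        node_id = node.get("@id")
        if isinstance(node_id, str):
            index[node_id] = node
        for value in node.values():
            index_by_id(value, index)
    elif isinstance(node, list):
        for item in node:
            index_by_id(item, index)
    return index


doc = json.loads("""
{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "@id": "https://api.fullfact.org/reviews/56f991a5-fa3b-4333-8425-ead632fd2024",
  "itemReviewed": {
    "@type": "Claim",
    "@id": "https://api.fullfact.org/reviews/56f991a5-fa3b-4333-8425-ead632fd2024/claims/1"
  }
}
""")

index = index_by_id(doc)
for uri, resource in index.items():
    print(resource["@type"], uri)
```

A consumer could use an index like this to merge data embedded in a web page with richer data later returned by an API for the same URI.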
My suggestion would be to only assign @id to those resources which are created by Full Fact (e.g. ClaimReview, Claim), or where you are collecting or publishing additional data about external resources, or exposing API endpoints to access data about those resources.
Using sameAs to link to resources in third-party datasets
Linking to third-party datasets makes it easier to access and use data from them to add context.
The sameAs property in Schema.org is an example of an equivalence link. It is used to state that “this resource in our data is the same as this resource in their data”.
Where Full Fact has a URI for a resource (or a web page that can serve as a unique identifier) then I recommend always including that in the data via a sameAs property with that URI.
In general I would also recommend adding external identifiers to your data for any person, place, thing or topic that appears in your data. For example:
...
"appearance": [{
  "@type": "CreativeWork",
  "url": "https://www.telegraph.co.uk/news/2021/04/28/teenagers-depression-rates-double-generation-z-spurns-drink/",
  "datePublished": "2021-04-28",
  "author": {
    "@type": "Organization",
    "sameAs": "https://www.wikidata.org/wiki/Q192621",
    "name": "The Telegraph"
  }
}],
...
I would not recommend using these Wikidata URIs within an @id property. Keeping them in sameAs allows you to reserve @id for internal dataset linking.
Collecting and publishing equivalence links provides a useful way to help disambiguate your data, and allows you and your users to fetch additional contextual data from other datasets.
These links can also provide the basis for useful API features. For example by allowing users to search for sightings of claims from “Q192621” and not just “The Telegraph”.
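As a rough sketch of that kind of search feature, the Python below filters appearance entries by the Wikidata identifier found in the author's sameAs link. The function names and the regular expression are assumptions for illustration. The pattern accepts both the /wiki/ page form used in the example above and Wikidata's /entity/ URI form.

```python
import re

# Matches the Q-identifier at the end of a Wikidata page or entity URI.
WIKIDATA_QID = re.compile(r"^https?://www\.wikidata\.org/(?:wiki|entity)/(Q[1-9]\d*)$")


def wikidata_qids(resource: dict) -> set:
    """Return the Wikidata QIDs found in a resource's sameAs value(s)."""
    same_as = resource.get("sameAs", [])
    if isinstance(same_as, str):  # sameAs may be a single string or a list
        same_as = [same_as]
    qids = set()
    for uri in same_as:
        match = WIKIDATA_QID.match(uri) if isinstance(uri, str) else None
        if match:
            qids.add(match.group(1))
    return qids


def sightings_by_author(appearances: list, qid: str) -> list:
    """Filter appearance entries to those whose author links to the given QID."""
    return [a for a in appearances if qid in wikidata_qids(a.get("author", {}))]


appearances = [
    {
        "@type": "CreativeWork",
        "url": "https://www.telegraph.co.uk/news/2021/04/28/teenagers-depression-rates-double-generation-z-spurns-drink/",
        "author": {
            "@type": "Organization",
            "sameAs": "https://www.wikidata.org/wiki/Q192621",
            "name": "The Telegraph",
        },
    },
    {
        "@type": "CreativeWork",
        "datePublished": "2021-03-26",
        "author": {"@type": "Organization", "name": "Public"},
    },
]

matches = sightings_by_author(appearances, "Q192621")
print(len(matches), matches[0]["author"]["name"])
```

Matching on the identifier rather than the name means that variations in how a publisher is labelled would not affect the results.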
Issues with current use of url and name in social media sightings
In some Full Fact fact checks, e.g. see face-masks.jsonld, the appearance data is expressed as follows:
"appearance": [{
  "@type": "CreativeWork",
  "url": "https://twitter.com/jennyrickson",
  "datePublished": "2021-04-20",
  "author": {
    "@type": "Organization",
    "name": "Twitter user"
  }
}]
Whereas in others (see homeoffice-data.jsonld) it’s expressed as follows:
"appearance": [{
  "@type": "CreativeWork",
  "url": "https://www.facebook.com/permalink.php?story_fbid=453177899251788&id=100036787459884",
  "datePublished": "2021-03-26",
  "author": {
    "@type": "Organization",
    "name": "Public"
  }
}]
The use of Organization when actually a Person is making a claim is a known issue that Full Fact plan to address. But there are a couple of other problems here.
Firstly, the name property is given either as a generic “Twitter user” or “Public”. I gather this is an editorial decision to not name people in the data. But it seems incorrect to include a name here when you are not providing an actual user name. It would be better to either not include the name or use a generic label, for example using the description property.
Secondly, the url should be a link to the sighted claim. There seems to be an inconsistency across Twitter (the first example) and Facebook (the second) about whether direct links are made to posts. Both sightings are posts from individuals.
Based on this, a more correct expression of the first example would be:
"appearance": [{
  "@type": "CreativeWork",
  "url": "https://twitter.com/jennyrickson/status/1384403401908314112",
  "datePublished": "2021-04-20",
  "author": {
    "@type": "Person",
    "sameAs": "https://twitter.com/jennyrickson",
    "description": "Twitter user"
  }
}]
If it is editorial policy not to link to tweets then the following alternative captures the current state more accurately, i.e. “a Twitter user made this claim on 20th April 2021”:
"appearance": [{
  "@type": "CreativeWork",
  "datePublished": "2021-04-20",
  "author": {
    "@type": "Person",
    "sameAs": "https://twitter.com/jennyrickson",
    "description": "Twitter user"
  }
}]
These changes ensure that the url and sameAs properties are used in a consistent way: url is always a link to the specific sighting, and sameAs is always an identifier or a web page that serves that purpose, like a social media profile page.
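The conventions described above could be checked automatically. The Python sketch below flags the two problems discussed in this section. The function name lint_appearance is hypothetical, the set of generic labels is taken only from the two examples shown here, and the rule that a twitter.com URL with a single path segment is a profile page is a heuristic assumption rather than a documented guarantee.

```python
from urllib.parse import urlparse

# Assumed placeholder labels, taken from the two examples in this section.
GENERIC_NAMES = {"Twitter user", "Public"}


def lint_appearance(appearance: dict) -> list:
    """Return human-readable warnings for one appearance entry."""
    warnings = []
    author = appearance.get("author", {})

    if author.get("name") in GENERIC_NAMES:
        warnings.append("author.name is a generic label; use description instead")

    url = appearance.get("url")
    if url is not None:
        parsed = urlparse(url)
        segments = [s for s in parsed.path.split("/") if s]
        # Heuristic (assumption): a twitter.com URL with a single path segment
        # is a profile page rather than a specific tweet.
        if parsed.netloc.lower() in {"twitter.com", "www.twitter.com"} and len(segments) == 1:
            warnings.append("url looks like a profile page; move it to author.sameAs")
        if url == author.get("sameAs"):
            warnings.append("url and author.sameAs are identical")

    return warnings


before = {
    "@type": "CreativeWork",
    "url": "https://twitter.com/jennyrickson",
    "datePublished": "2021-04-20",
    "author": {"@type": "Organization", "name": "Twitter user"},
}
after = {
    "@type": "CreativeWork",
    "url": "https://twitter.com/jennyrickson/status/1384403401908314112",
    "datePublished": "2021-04-20",
    "author": {
        "@type": "Person",
        "sameAs": "https://twitter.com/jennyrickson",
        "description": "Twitter user",
    },
}

for warning in lint_appearance(before):
    print(warning)
print(lint_appearance(after))
```

A check like this would need extending for other platforms, since Facebook and other sites use different URL patterns for posts and profiles.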
Where editorial policy affects what data might be included in the Full Fact profile, I also suggest that the policy is documented so that any variations across the data are understood by data users.
Which third-party datasets should be used as targets for links?
Linking data takes time and resources. And, while there are many potential datasets to which data can be linked, not all of them are in widescale use or easily accessible.
It’s reasonable to instead link to a smaller number of datasets that give good coverage for the relevant topic area, and which are supported by the necessary licensing, tools and infrastructure that make linking feasible.
Wikidata has rapidly become a centralised hub for identifiers from many different types of dataset across a range of domains. By linking to Wikidata, it becomes much easier to find links to a wide variety of other datasets.
The project is also actively supported and maintained by the Wikimedia Foundation, making it more sustainable than previously available alternatives like DBpedia.
In the short term, my suggestion would be to focus on linking the Full Fact data to Wikidata. It minimises the integration and maintenance effort whilst maximising impact from creating those links.
Even large specialised datasets like Geonames, which might be a useful alternative target for geographic identifiers, are being catalogued within Wikidata. At the time of writing about 25% of Geonames is already referenced from Wikidata entities.
In the short term, it may be less costly for the Full Fact team to simply edit Wikidata to add resources, where gaps are highlighted, rather than attempt to integrate with a larger number of different datasets.
As Wikidata is an open, collaborative project with public domain licensing, it also provides a useful platform around which to collaborate with other fact checkers, organisations and volunteers, improving Wikidata to incrementally improve the Full Fact data.
My suggestion here is specifically about providing sameAs links. In other areas of the data it might be useful to link to other government or academic research, but these citations would be modelled differently. We’ll cover enriching the data with additional entities and links in the notes to follow.
Summary of Recommendations
- Add @id attributes to resources to link to API endpoints containing richer data
- Assign URIs (via @id) to all resources that are created by Full Fact (e.g. ClaimReview, Claim) as well as any resources where you are collecting useful additional data (e.g. the sightings from specific Organisations, etc)
- Collect external identifiers for key entities, like people, organisations, places, etc
- Use Wikidata as the primary dataset for obtaining these identifiers
- Add sameAs links to all resources where you have external identifiers, e.g. as Wikidata links or other suitable URLs
- Keep sameAs and @id distinct, with the latter being for internal identifiers
- Revise use of url, name and sameAs links in social media sightings