Identifiers and linking

These notes explore ways to enrich the existing Full Fact Profile of Schema.org with additional links and identifiers.

They cover:

  • a brief summary of how Schema.org handles identifiers and links
  • linking to future API endpoints
  • linking to external datasets

A quick detour on RDF / Linked Data

You can probably skip this, but it might be useful context for what follows!

In RDF, resources either:

  • have a unique URI
  • don’t have a unique URI, but can (sometimes) be uniquely identified via their other properties. Resources without a unique URI are usually called “blank nodes”.

Having a unique URI is usually better than using blank nodes. We can use that URI to publish more data about the same resource (e.g. as Annotations) and to link to the resource from other data. URIs let us build a web of data.

If we don’t have a unique identifier for a resource then we can’t do those things. Instead, we usually need to identify the resource in other ways: by providing other identifiers, or by providing equivalence links that explicitly say that two things are the same.

Creating URIs for things and using them to link to other data on the web is what Linked Data is all about.

Many organisations have struggled to adopt Linked Data. There are many reasons for this, but understanding how to assign and manage unique URIs for things is a common problem.

Schema.org was designed to allow people to publish data on the web without having to create URIs for things. It encourages a pattern of uniquely identifying things based on their properties.

You can publish Linked Data using Schema.org as your schema, but you don’t have to.

There are three core properties that Schema.org uses to help identify things:

  • url – provides a URL for a resource. In some cases this might also uniquely identify the resource
  • identifier – refers to any form of non-URI identifier, e.g. a string, number or GUID. There are some sub-properties for well-known types of identifier, like isbn
  • sameAs – refers to a web page that unambiguously identifies a resource, for example a company homepage or a Wikipedia or Wikidata entry

These three properties help to identify and link together resources without imposing requirements about what form of identifiers to use, and without getting too hung up on which web pages uniquely identify something.

It’s fuzzy and a bit scruffy, but pragmatic when your goal is to let anyone contribute to the web of data.

But, like everything with Schema.org, we can take a more principled approach:

  • when publishing data as JSON-LD, we can assign URIs to key resources using the @id property
  • we can ensure that resources always have some form of identifier, whether expressed as an @id, url or identifier. This helps consumers to manage data they have collected from a Full Fact web page or via an API, e.g. to index or use it locally
  • we can include useful public identifiers to make it easier to indirectly match data against other datasets
  • including sameAs references to directly link to useful third-party datasets
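As a rough sketch of the second point, a consumer might choose a local key for each resource along these lines. The function name resource_key and the order of preference (@id, then url, then identifier) are my own assumptions, not something defined by Schema.org or the Full Fact profile.

```python
from typing import Optional

# Assumed order of preference: a URI in @id first, then url, then identifier.
KEY_PROPERTIES = ("@id", "url", "identifier")


def resource_key(resource: dict) -> Optional[str]:
    """Return the first usable identifier found on a JSON-LD resource, or None."""
    for prop in KEY_PROPERTIES:
        value = resource.get(prop)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


review = {
    "@type": "ClaimReview",
    "identifier": "56f991a5-fa3b-4333-8425-ead632fd2024",
    "url": "https://fullfact.org/health/german-astrazeneca-8-percent-handelsblatt/",
}
anonymous_author = {"@type": "Organization", "name": "Public"}

print(resource_key(review))            # falls back to the url, as there is no @id
print(resource_key(anonymous_author))  # None: nothing to index this resource by
```

A resource that returns None here is one that a consumer cannot reliably index, update or de-duplicate, which is the gap the approach above is meant to close.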

Here’s an example of assigning URIs to a Full Fact ClaimReview and Claim by adding an @id property:

{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "@id": "https://api.fullfact.org/reviews/56f991a5-fa3b-4333-8425-ead632fd2024",
  "identifier": "56f991a5-fa3b-4333-8425-ead632fd2024",
  "url": "https://fullfact.org/health/german-astrazeneca-8-percent-handelsblatt/",
  "dateModified": "2021-01-26",
  "description": "A report in German newspaper Handelsblatt claimed that...",
  "itemReviewed": {
    "@type": "Claim",
    "@id": "https://api.fullfact.org/reviews/56f991a5-fa3b-4333-8425-ead632fd2024/claims/1",
    ...
  },
  ...
}

In JSON-LD, the @id property is used to assign a globally unique URI to a resource. That URI would (ideally) return the data about the ClaimReview (or Claim), and possibly more data than is embedded by default in the website.

The @id property is ignored by Google. Markup that includes it passes Google’s Rich Results Test without generating any errors or warnings.

Using the @id attribute in this way allows us to:

  • assign a unique URI to the ClaimReview (or a Claim, or any other resource in the data) in a way that is useful for consumers that want to parse and use the JSON-LD as Linked Data
  • provide a link between data embedded in web pages and data available from the API
  • use JSON-LD as the basis for building a RESTful API for the data, using @id as the means for internal linking between resources. This gives flexibility in how the URI scheme is designed
  • progressively enhance the data with more identifiers and internal links, as the API is developed
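To illustrate how a consumer might follow those links, here is a minimal sketch that walks a JSON-LD document and lists every resource carrying an @id. The function name iter_identified is hypothetical, and the api.fullfact.org URIs are the illustrative ones from the example above rather than live endpoints.

```python
import json
from typing import Any, Iterator, Tuple


def iter_identified(node: Any) -> Iterator[Tuple[Any, str]]:
    """Yield (@type, @id) for every nested resource that carries an @id."""
    if isinstance(node, dict):
        if isinstance(node.get("@id"), str):
            yield (node.get("@type"), node["@id"])
        for value in node.values():
            yield from iter_identified(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_identified(item)


embedded = json.loads("""
{
  "@context": "http://schema.org",
  "@type": "ClaimReview",
  "@id": "https://api.fullfact.org/reviews/56f991a5-fa3b-4333-8425-ead632fd2024",
  "itemReviewed": {
    "@type": "Claim",
    "@id": "https://api.fullfact.org/reviews/56f991a5-fa3b-4333-8425-ead632fd2024/claims/1"
  }
}
""")

links = list(iter_identified(embedded))
for resource_type, uri in links:
    print(resource_type, uri)
```

Each URI found this way is a candidate for a follow-up API request, so the data embedded in a web page becomes an entry point into the richer dataset.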

My suggestion would be to only assign @id to those resources which are created by Full Fact (e.g. ClaimReview, Claim), or to external resources where you are collecting or publishing additional data about them or exposing API endpoints to access that data.

Linking to third-party datasets makes it easier to access and use data from them to add context.

The sameAs property in Schema.org is an example of an equivalence link. It is used to state that “this resource in our data is the same as this resource in their data”.

Where Full Fact has a URI for a resource (or a web page that can serve as a unique identifier), I recommend always including it in the data as a sameAs property.

In general I would also recommend adding external identifiers to your data for any person, place, thing or topic that appears in your data. For example:

...
"appearance": [{
  "@type": "CreativeWork",
  "url": "https://www.telegraph.co.uk/news/2021/04/28/teenagers-depression-rates-double-generation-z-spurns-drink/",
  "datePublished": "2021-04-28",
  "author": {
    "@type": "Organization",
    "sameAs": "https://www.wikidata.org/wiki/Q192621",
    "name": "The Telegraph"
  }
}],
...

I would not recommend using these Wikidata URIs within an @id property. Keeping them in sameAs allows you to reserve @id for internal dataset linking.

Collecting and publishing equivalence links helps to disambiguate your data, and allows you and your users to fetch additional contextual data from other datasets.

These links can also provide the basis for useful API features, for example by allowing users to search for sightings of claims from “Q192621” and not just “The Telegraph”.
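A minimal sketch of that kind of lookup is shown below. The helper names (wikidata_qid, same_as_values, appearances_by_author) are hypothetical, and the code assumes that sameAs holds either a single URL or a list of URLs, and that Wikidata links use the /wiki/ or /entity/ URL forms.

```python
import re
from typing import Iterable, List, Optional

# Matches both the page form (/wiki/Q...) and the entity form (/entity/Q...).
WIKIDATA_URL = re.compile(r"^https?://www\.wikidata\.org/(?:wiki|entity)/(Q\d+)$")


def wikidata_qid(url: str) -> Optional[str]:
    """Extract the Wikidata item ID from a sameAs URL, or return None."""
    match = WIKIDATA_URL.match(url)
    return match.group(1) if match else None


def same_as_values(resource: dict) -> List[str]:
    """Return sameAs as a list, whether it was given as a string or a list."""
    value = resource.get("sameAs", [])
    return [value] if isinstance(value, str) else list(value)


def appearances_by_author(appearances: Iterable[dict], qid: str) -> List[dict]:
    """Keep only the appearances whose author is linked to the given item."""
    return [
        appearance
        for appearance in appearances
        if any(
            wikidata_qid(url) == qid
            for url in same_as_values(appearance.get("author", {}))
        )
    ]


appearances = [
    {
        "@type": "CreativeWork",
        "url": "https://www.telegraph.co.uk/news/2021/04/28/teenagers-depression-rates-double-generation-z-spurns-drink/",
        "author": {
            "@type": "Organization",
            "sameAs": "https://www.wikidata.org/wiki/Q192621",
            "name": "The Telegraph",
        },
    },
    {
        "@type": "CreativeWork",
        "datePublished": "2021-04-20",
        "author": {"@type": "Organization", "name": "Twitter user"},
    },
]

matches = appearances_by_author(appearances, "Q192621")
print(len(matches))
```

Matching on the identifier rather than the name means the search still works if the name is spelled differently, abbreviated or translated.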

Issues with current use of url and name in social media sightings

In some Full Fact fact checks, e.g. see face-masks.jsonld, the appearance data is expressed as follows:

"appearance": [{
  "@type": "CreativeWork",
  "url": "https://twitter.com/jennyrickson",
  "datePublished": "2021-04-20",
  "author": {
    "@type": "Organization",
    "name": "Twitter user"
  }
}]

Whereas in others (see homeoffice-data.jsonld) it’s expressed as follows:

"appearance": [{
  "@type": "CreativeWork",
  "url": "https://www.facebook.com/permalink.php?story_fbid=453177899251788&id=100036787459884",
  "datePublished": "2021-03-26",
  "author": {
    "@type": "Organization",
    "name": "Public"
  }
}]

The use of Organization when a Person is actually making the claim is a known issue that Full Fact plan to address. But there are a couple of other problems here.

Firstly, the name property is given as a generic label, either “Twitter user” or “Public”. I gather this is an editorial decision to not name people in the data. But it seems incorrect to include a name here when you are not providing an actual user name. It would be better to either omit the name or move the generic label to another property, for example description.

Secondly, the url should be a link to the sighted claim. There seems to be an inconsistency across Twitter (the first example) and Facebook (the second) about whether direct links are made to posts. Both sightings are posts from individuals.

Based on this, a more correct expression of the first example would be:

"appearance": [{
  "@type": "CreativeWork",
  "url": "https://twitter.com/jennyrickson/status/1384403401908314112",
  "datePublished": "2021-04-20",
  "author": {
    "@type": "Person",
    "sameAs": "https://twitter.com/jennyrickson",
    "description": "Twitter user"
  }
}]

If it is editorial policy not to link to tweets then the following alternative captures the current state more accurately, i.e. “a Twitter user made this claim on 20 April 2021”:

"appearance": [{
  "@type": "CreativeWork",
  "datePublished": "2021-04-20",
  "author": {
    "@type": "Person",
    "sameAs": "https://twitter.com/jennyrickson",
    "description": "Twitter user"
  }
}]

These changes ensure that the url and sameAs properties are used in a consistent way: url is always a link to the specific sighting, and sameAs is always an identifier or a web page that serves that purpose, like a social media profile page.
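These conventions could be checked automatically. The sketch below is only one possible approach: the function check_appearance, the list of generic labels and the rule that a Twitter URL without “/status/” in its path is a profile rather than a post are all my own assumptions.

```python
from typing import List
from urllib.parse import urlparse

# Generic labels seen in the examples above; an assumed, incomplete list.
GENERIC_NAMES = {"Twitter user", "Public"}


def check_appearance(appearance: dict) -> List[str]:
    """Return a list of problems found in one appearance entry."""
    problems = []
    author = appearance.get("author", {})
    url = appearance.get("url")

    if author.get("name") in GENERIC_NAMES:
        problems.append("author.name is a generic label; use description instead")

    if url is not None:
        if url == author.get("sameAs"):
            problems.append("url duplicates author.sameAs; it should link to the sighting")
        parsed = urlparse(url)
        # Heuristic: tweets have /status/ in their path, profile pages do not.
        if parsed.hostname in ("twitter.com", "www.twitter.com") and "/status/" not in parsed.path:
            problems.append("url looks like a profile page, not a specific post")

    return problems


current = {
    "@type": "CreativeWork",
    "url": "https://twitter.com/jennyrickson",
    "datePublished": "2021-04-20",
    "author": {"@type": "Organization", "name": "Twitter user"},
}
revised = {
    "@type": "CreativeWork",
    "url": "https://twitter.com/jennyrickson/status/1384403401908314112",
    "datePublished": "2021-04-20",
    "author": {
        "@type": "Person",
        "sameAs": "https://twitter.com/jennyrickson",
        "description": "Twitter user",
    },
}

for problem in check_appearance(current):
    print(problem)
print(check_appearance(revised))
```

A check like this could be run over existing fact checks to find entries that need revising, although the profile-versus-post rule would need extending for other platforms.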

Where there is editorial policy that affects what data might be included in the Full Fact profile, I also suggest that it is documented so that any variations across the data are understood by data users.

Linking data takes time and resources. And, while there are many potential datasets to which data can be linked, not all of them are in widespread use or easily accessible.

It’s reasonable to instead link to a smaller number of datasets that give good coverage for the relevant topic area, and which are supported by the necessary licensing, tools and infrastructure that makes linking feasible.

Wikidata has rapidly become a centralised hub for identifiers from many different types of dataset across a range of domains. By linking to Wikidata, it becomes much easier to find links to a wide variety of other datasets.

The project is also actively supported and maintained by the Wikimedia Foundation, making it more sustainable than previously available alternatives like DBpedia.

In the short term, my suggestion would be to focus on linking the Full Fact data to Wikidata. It minimises the integration and maintenance effort whilst maximising impact from creating those links.

Even large specialised datasets like GeoNames, which might be a useful alternative target for geographic identifiers, are being catalogued within Wikidata. At the time of writing about 25% of GeoNames is already referenced from Wikidata entities.
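As an illustration of using Wikidata as a hub, the sketch below builds a SPARQL query that asks for the GeoNames ID recorded against a Wikidata item. Wikidata stores GeoNames IDs under property P1566, and the wd: and wdt: prefixes are predefined by the Wikidata Query Service. The function name external_id_query is hypothetical, and the example item Q84 (London) is just for demonstration. The code only builds the query text and does not send it anywhere.

```python
import re

ITEM_ID = re.compile(r"^Q\d+$")
PROPERTY_ID = re.compile(r"^P\d+$")

GEONAMES_ID = "P1566"  # Wikidata property holding the GeoNames ID


def external_id_query(qid: str, property_id: str = GEONAMES_ID) -> str:
    """Build a SPARQL query for one external identifier of a Wikidata item."""
    if not ITEM_ID.match(qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    if not PROPERTY_ID.match(property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return f"SELECT ?value WHERE {{ wd:{qid} wdt:{property_id} ?value . }}"


query = external_id_query("Q84")
print(query)
```

Swapping the property ID gives the same lookup for other identifier schemes, which is why a single sameAs link to Wikidata can stand in for links to several datasets.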

In the short term, it may be less costly for the Full Fact team to simply edit Wikidata to add resources, where gaps are highlighted, rather than attempt to integrate with a larger number of different datasets.

As Wikidata is an open, collaborative project with public domain licensing, it also provides a useful platform around which to collaborate with other fact checkers, organisations and volunteers. Improvements made to Wikidata would incrementally improve the Full Fact data too.

My suggestion here is specifically about providing sameAs links. In other areas of the data it might be useful to link to other government or academic research, but these citations would be modelled differently. We’ll cover enriching the data with additional entities and links in the notes to follow.

Summary of Recommendations

  • Add @id attributes to resources to link to API endpoints containing richer data
  • Assign URIs (via @id) to all resources that are created by Full Fact (e.g. ClaimReview, Claim) as well as any resources where you are collecting useful additional data (e.g. the sightings from specific Organisations, etc)
  • Collect external identifiers for key entities, like people, organisations, places, etc
  • Use Wikidata as the primary dataset for obtaining these identifiers
  • Add sameAs links to all resources where you have external identifiers, e.g. as Wikidata links or other suitable URLs
  • Keep sameAs and @id distinct, with the latter being for internal identifiers
  • Revise use of url, name and sameAs links in social media sightings