An API based on Claim Review

Currently, Claim Review data is primarily published and consumed via JSON-LD embedded in web pages. While this helps to expose structured data to search engines and browsers, this approach doesn’t support other possible use cases.

One way to support those use cases is through an API. So what might an API for Claim Review data look like?

These are some initial thoughts and suggestions that could inform the design of such an API. They’re not intended to provide a complete design.

There is plenty of good guidance around designing RESTful APIs. I won’t rehash that here, but will instead focus on suggestions more specifically tailored to exposing Claim Review metadata.

Offering an API versus a data dump?

Before jumping into implementing an API, we should consider whether this is the right solution. An API undoubtedly provides a flexible way to find and extract data, e.g. to support finding recent fact checks, or claims made by a specific person or organisation.

But some use cases need access to the full dataset, for example aggregating the data and combining it with other sources. In these cases a data dump providing a regularly updated snapshot of the data would be more useful than requiring users to harvest it via an API. It may also be more efficient to produce than scaling an API to support multiple applications harvesting the data.

Offering a data dump alongside an API would provide a useful additional access method.

Some basic design principles

The following suggestions build on the recommendations found in the rest of this series of notes.

Use JSON-LD as the API response format

Offering a consistent data format across all methods of accessing data (e.g. embedded in web pages, exposed via an API or included in a data dump) will:

  • avoid the need to document, standardise and support multiple formats
  • allow data to be interpreted consistently, regardless of which access method it was obtained through
  • allow consumers to easily move between methods of consuming data without rewriting code to parse and interpret it
  • offer a simple JSON interface for applications that need it, whilst allowing others to interpret the data as a graph
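
To illustrate, a response from an individual claim review endpoint might look something like the following. The example.org URLs, identifiers and selection of properties are purely illustrative:

    {
      "@context": "https://schema.org",
      "@type": "ClaimReview",
      "@id": "https://example.org/api/claim-reviews/1234",
      "datePublished": "2024-03-01",
      "claimReviewed": "An example claim as worded in the fact check",
      "reviewRating": {
        "@type": "Rating",
        "alternateName": "Incorrect"
      },
      "itemReviewed": {
        "@type": "Claim",
        "@id": "https://example.org/api/claims/5678"
      }
    }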

Use the @id attribute in JSON-LD data to refer to API endpoints for specific resources. This will:

  • support linking across API endpoints, by allowing users to follow links to obtain the detailed representation of a resource (e.g. a ClaimReview or Claim)
  • provide a means of connecting data embedded in web pages with the API endpoints
  • support a Linked Data style presentation of the data, with unique URIs for key resources
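
For example, the JSON-LD block embedded in a fact check page could carry the same @id as the API representation, allowing a consumer that finds the markup in a page to dereference it and fetch the fuller record (again, the URLs are illustrative):

    {
      "@context": "https://schema.org",
      "@type": "ClaimReview",
      "@id": "https://example.org/api/claim-reviews/1234",
      "url": "https://example.org/news/example-fact-check"
    }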

Use list endpoints to expose collections of resources

The API will need to provide a means for clients to page through lists of results, e.g. after querying for recent fact checks.

The Hydra specification offers some vocabulary and a model for providing access to collections of resources.

This builds on the common pattern of using “next” and “previous” links between pages of results.
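
A paged response following Hydra’s PartialCollectionView pattern might look something like this (the URLs, counts and paging are illustrative):

    {
      "@context": "http://www.w3.org/ns/hydra/context.jsonld",
      "@id": "https://example.org/api/claim-reviews?page=2",
      "@type": "Collection",
      "totalItems": 4980,
      "member": [
        { "@id": "https://example.org/api/claim-reviews/1234" }
      ],
      "view": {
        "@id": "https://example.org/api/claim-reviews?page=2",
        "@type": "PartialCollectionView",
        "first": "https://example.org/api/claim-reviews?page=1",
        "previous": "https://example.org/api/claim-reviews?page=1",
        "next": "https://example.org/api/claim-reviews?page=3",
        "last": "https://example.org/api/claim-reviews?page=166"
      }
    }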

Use the API as a means of exposing richer data

The data embedded in web pages will need to continue to conform to the expectations of Google and other search engines. While search engines will likely ignore unknown properties, there is more freedom to explore exposing richer data via an API.

I would suggest including richer data and non-standard elements via the API (at least initially), only adding them to the embedded JSON-LD blocks if and when they are required by search engines.

Some elements of the dataset, e.g. fact checking interventions, may only ever be exposed via the API.

Constraining the amount of data embedded in pages will also help reduce overall page weight.

As noted in the discussion on reusing fact checks, some additional properties will be needed to support traversal across the graph of relationships defined in the Claim Review model. Ideally these would be defined by Schema.org to avoid the need for custom properties.
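
To sketch what that might look like, a Claim resource returned by the API could include the standard appearance links alongside a back-link to its reviews. The claimReviews property here is a hypothetical custom extension of the kind that would ideally be standardised, not an existing Schema.org term:

    {
      "@context": "https://schema.org",
      "@type": "Claim",
      "@id": "https://example.org/api/claims/5678",
      "text": "An example claim",
      "appearance": [
        { "@id": "https://example.org/api/appearances/901" }
      ],
      "claimReviews": [
        { "@id": "https://example.org/api/claim-reviews/1234" }
      ]
    }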

Expose all of the key resources in the Claim Review model

Reviewing the current model and proposed ways of enriching the core data, I would suggest exposing endpoints for all of the following:

  • Fact checking articles (e.g. FactCheckingArticle as noted here) – to allow users to find lists of articles published by Full Fact
  • Claim Reviews (ClaimReview, e.g. /claim-reviews) – to support users in finding individual reviews, separately to the blog posts they are contained in
  • Claims (Claim, e.g. /claims) – to allow querying for individual claims
  • Appearances (CreativeWork) – to list the posts, articles, etc that have been referenced as sightings of a claim

With the addition of suitable filters these entry points would provide useful slices of the content.
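
Pulling these together, the surface of the API might look something like the following (the paths are illustrative):

    GET /articles            list fact checking articles
    GET /articles/{id}       an individual article
    GET /claim-reviews       list claim reviews
    GET /claim-reviews/{id}  an individual review
    GET /claims              list claims
    GET /claims/{id}         an individual claim
    GET /appearances         list appearances
    GET /appearances/{id}    an individual appearance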

Support filtering based on property values and external identifiers

Allow lists of articles, claims, etc to be filtered using a range of URL parameters. I would suggest supporting the following:

  • by date – to filter articles, reviews, claims, appearances based on when they were published and/or modified
  • by author – to filter claims and appearances based on author, to allow users to find claims made by a specific person
  • by topic – to filter claims based on the subject (DefinedTerm) used to categorise them
  • by keyword – a free text search over textual properties, e.g. descriptions of claims or the body of a fact checking article

Additional filters could allow further restrictions, e.g. appearances that have been corrected, or from specific platforms (based on their URL). But the initial suggestions provide useful ways of refining a list to items of interest.
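
To make that concrete, filtered queries might look something like the following; the parameter names are illustrative rather than a proposal:

    GET /claim-reviews?published-after=2024-01-01
    GET /claims?author=Jane+Doe
    GET /claims?topic=economy&q=inflation
    GET /appearances?corrected=true&url-contains=twitter.com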

Where resources link to external Wikidata URLs, e.g. for people, organisations or topics, I recommend allowing filtering based on both the Wikidata identifiers (e.g. the Q identifier of the relevant item) and their labels.
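
For example, using Q42 as a stand-in for the Wikidata identifier of a person (parameter names again illustrative):

    GET /claims?author-id=Q42
    GET /claims?author-label=Douglas+Adams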

API Documentation

There are many useful tools for generating interactive and navigable API documentation. These help users quickly and easily explore an API to test queries and view results.
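
Tools built around machine-readable API descriptions, such as OpenAPI, are a common way to do this. A fragment of such a description might look like the following sketch (paths and parameters are illustrative):

    {
      "openapi": "3.0.3",
      "info": { "title": "Claim Review API", "version": "0.1.0" },
      "paths": {
        "/claim-reviews": {
          "get": {
            "summary": "List claim reviews, optionally filtered",
            "parameters": [
              { "name": "author", "in": "query", "schema": { "type": "string" } },
              { "name": "published-after", "in": "query", "schema": { "type": "string", "format": "date" } }
            ],
            "responses": {
              "200": { "description": "A paged collection of claim reviews" }
            }
          }
        }
      }
    }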

But some additional reference documentation would help users understand the overall shape of the data, how it is created and how it might be revised. This would include things like:

  • a summary of the editorial policy that informs how, or whether, individuals posting material to social media are named in the data
  • a summary of editorial policy that informs linking to citations and datasets
  • an indication of when richer information, like sameAs links, can be expected to be available for different types of resources
  • any patterns of applying Claim Review that are specific to the “Full Fact profile”, e.g. whether fields like firstAppearance are used, or how and when appearances might be revised
  • areas of the model that may change, e.g. based on further standardisation work

Licensing

Finally, the API ought to have:

  • a clear set of terms and conditions describing use of the API, relevant disclaimers, etc
  • separately, a licence that applies to data available from the API

The licence for the data may be different to the licence that applies to the content. For example, the data might be available under a CC-BY licence, but Full Fact may prefer to apply a more restrictive licence to its content.

The licence statement should also be clear that it applies to the data retrieved via the API, and not to all of the content or images linked from it.