File formats

Different file formats are used for different purposes. Some, such as Word documents or PDFs, are more suitable for reports. Some are good for graphs or images. Others, such as spreadsheets, are more suitable for storing data.

The file format that data is published in affects what a data user can and cannot do with that data.

For example:

  • If tables of data are within PDFs, users can’t quickly do analyses on these numbers
  • If a graph is within an image, people don’t have access to the raw numbers
  • If data is in an Excel file, it can carry more information than a CSV file
  • A CSV file can be used with many more types of software

Data can also be published in multiple formats at the same time to give a full “package” for the data. For instance, you could publish a report as a PDF and as an HTML page on the website, with links to the data as downloadable spreadsheets and to image files of the graphs.

Formats for data tables

Data tables (tabular data) are a fundamental part of publishing useful data. Providing well organised, well-labelled, well-explained tables can be of huge benefit to fact checkers and other data users in doing their work.

NSIs can consider which format is best to publish their data tables. When possible, they can also publish the data in multiple formats to cover a wide range of users.

Comma-separated-value (CSV) files

  • One of the most popular ways to publish data tables on the web
  • Each row in the table is represented by a line in the text document, with each cell on the row separated by a comma.
  • Very simple and basic file structure, works with any text editor
  • However, lacks any built-in way to include metadata, so context may get lost as it’s shared
  • You can (and probably should) include metadata about the contents of the CSV file in an accompanying file: either an unstructured text document or a structured document like a JSON file.
  • CSV files also can’t enforce value types on columns (unlike an Excel file)
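The points above about metadata and value types can be sketched in a few lines of Python. This is a minimal illustration, not a recommended standard: the table, its made-up values, the metadata layout and the helper names (to_csv_text, read_with_types) are all assumptions invented for this example. It shows that every value read back from a CSV is plain text, and that an accompanying JSON document can carry the column types and units the CSV itself can’t.

```python
import csv
import io
import json

# Hypothetical table with made-up illustrative values.
rows = [
    {"region": "North", "year": 2020, "population": 1200},
    {"region": "South", "year": 2020, "population": 3400},
]

# Hypothetical metadata layout; a real publisher would choose its own fields.
metadata = {
    "title": "Example population table",
    "columns": {
        "region": {"type": "string"},
        "year": {"type": "integer"},
        "population": {"type": "integer", "unit": "persons"},
    },
}


def to_csv_text(table):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(table[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


def read_with_types(csv_text, meta):
    """Read CSV text and convert each column using the types in the metadata."""
    casts = {"string": str, "integer": int}
    typed_rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        typed_rows.append(
            {
                name: casts[meta["columns"][name]["type"]](value)
                for name, value in record.items()
            }
        )
    return typed_rows


csv_text = to_csv_text(rows)
metadata_text = json.dumps(metadata, indent=2)  # would be saved next to the CSV

# Without the metadata, every cell comes back as a string.
raw = list(csv.DictReader(io.StringIO(csv_text)))

# With the metadata, the original value types can be restored.
typed = read_with_types(csv_text, json.loads(metadata_text))
```

In practice the two strings would be written to separate files published side by side (for example a .csv file and a .json file with the same base name), so that the context travels with the data.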

Spreadsheet files

  • Most often seen as a Microsoft Excel file (.xls or .xlsx)
  • Can bundle extra information, visualisations, formulae in along with the data
  • Can ensure value types
  • Much larger file size than storing the data in a CSV file
  • Will only work with specific spreadsheet software, such as Excel, OpenOffice or Apple Numbers

Formats for reports

Analysis of data is obviously essential to the work of an NSI: giving context to the numbers, pulling out the main analysis and developing the narrative of how this data impacts the country.

Different formats for publishing these reports allow for different things. So it’s worth considering whether to publish these as a document or as a web page, or both.

Document file (PDFs)

  • From our research, the most common format for publishing reports
  • They allow free text, which helps with presenting analysis and context about the data. This is obvious, but other formats such as CSV make this harder and can’t hold accompanying metadata, while Excel files can hold metadata but are poorly suited to free-text analysis.
  • PDF is an open, non-proprietary standard, which is a positive.
  • PDFs nearly always have text that can be copied and pasted elsewhere.
    • Very old, scanned documents may not have this feature. Some NSIs we spoke to said some old (10+ years) documents may have this problem.
  • Data in PDFs is often stored as tables
    • Useful for reading and discussing.
    • But makes it hard for others to do analysis on this data.
  • Examples:

Page on website

  • A dedicated page on the NSI’s website with the full analysis
  • Has a unique URL which can be shared and linked to
  • No software needed to read this other than a standard web browser
  • Can allow for clear download links to datasets
  • Can hold richer media than PDFs such as interactive visualisations, video or audio.
  • Examples: