HOW WE CURATE DATA

OUR QUALITY CONTROL

The Archive is at the forefront of developing international standards for data processing, for both quantitative and qualitative data. We use different levels of quality control depending on how much ‘additional value’ is to be added to the data.

We assign one of four levels of data processing to each incoming study, depending on its anticipated future usage: A*, A, B or C. Processing activities are then carried out in accordance with the assigned level, as set out below.

For the majority of data, processing activities fall into two groups: validation and content checks, and format translation checks. Both vary by level of processing.

Validation and content checks

The main validation and content checks for data and documentation are listed below. Further details may be found in the UK Data Archive Data Processing Standards document.


Qualitative data

Most qualitative data collections are processed to A standard, with a select few being nominated for enhancement to A*. B and C are seldom used but apply when handling older paper-based studies.

Level A*
  • data are fully digitised and anonymised
  • data are marked up in XML
  • data are additionally made accessible through the UK Data Service
  • metadata and documentation are fully digitised and anonymised
  • metadata and documentation are accessible through the UK Data Service
  • enhanced user guide is prepared for Qualidata Online

Level A
  • data are fully digitised and anonymised
  • data are made accessible via the UK Data Service
  • metadata and documentation are fully digitised and anonymised
  • metadata and documentation are accessible through the UK Data Service

Level B
  • data are digitised at least to the level of scanned images and anonymised
  • data are made accessible via the UK Data Service
  • metadata and documentation are digitised at least to the level of scanned images and anonymised
  • only major problems with data are resolved
  • metadata and documentation are accessible through the UK Data Service

Level C
  • no checks are made
  • data remain in the format in which they were received
  • non-digital collections are not anonymised or digitised and are transferred to another repository
  • only a basic catalogue record is created

Quantitative data

Dataset dimension checks

Level A*
  • the number of cases and variables are checked against the documentation
Level A
  • as for A*
Level B
  • as for A*

Metadata checks

Level A*
  • the dataset must be comprehensible in itself - i.e. all variables should have variable labels and all categorical variables should have value labels
Level A
  • the dataset must be comprehensible in association with the documentation given to users
Level B
  • visual checks on quality are undertaken
  • action is taken for systematic problems

Data validity checks

Level A*
  • all categorical variables must be checked for out-of-range values/wild codes
  • where possible, interval variables must be checked for improbable or impossible values
Level A
  • as for A*
Level B
  • a sample of 30 + 10 per cent of the remaining categorical variables must be checked for out-of-range values/wild codes
  • a sample of 30 + 10 per cent of the remaining suitable interval variables must be checked for improbable or impossible values

Confidentiality checks

Level A*
  • always undertaken
Level A
  • always undertaken
Level B
  • always undertaken

Metadata enhancements

Level A*
  • the following are added: literal question text; routing information and interviewers' instructions; frequencies and summary statistics; variable groups
  • extensively bookmarked PDF user guides are produced
  • additional related resources are provided on a dedicated web page
  • additional notes to users are given in the 'Read file'
Level A
  • extensively bookmarked PDF user guides are produced
  • additional related resources are provided on a dedicated web page
  • additional notes to users are given in the 'Read file'
Level B
  • a bookmarked PDF user guide is produced
  • additional notes to users are given in the 'Read file'

For level C studies, a minimum of dataset dimension checks and confidentiality checks is carried out, with metadata enhancements as for B studies.
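
The Level B sampling rule and the out-of-range checks described above can be sketched in Python. This is a minimal illustration rather than the Archive's own tooling: the function names are hypothetical, and it assumes that '30 + 10 per cent of the remaining' means 30 variables plus 10 per cent of any variables beyond the first 30, rounded up.

```python
def level_b_sample_size(n_variables, base=30, percent=10):
    """Number of variables to check under the assumed Level B rule.

    Assumption: check `base` variables plus `percent` per cent of the
    rest, rounded up. Integer arithmetic avoids floating-point surprises.
    """
    if n_variables <= base:
        return n_variables
    remaining = n_variables - base
    return base + (remaining * percent + 99) // 100


def find_wild_codes(values, valid_codes):
    """Return observed codes of a categorical variable that are not documented."""
    return sorted(set(values) - set(valid_codes))


def find_out_of_range(values, low, high):
    """Return interval values outside the plausible range [low, high]."""
    return [v for v in values if not (low <= v <= high)]


# Example: a categorical variable documented with codes 1-3, and an age variable.
print(level_b_sample_size(130))
print(find_wild_codes([1, 2, 2, 9, 3], {1, 2, 3}))
print(find_out_of_range([25, 140, -1, 60], 0, 120))
```

The plausible range for an interval variable (0 to 120 for age here) is itself a judgement that would come from the study documentation.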

Format translation checks by level of processing

These checks are carried out on conversion from the ingest format (the format in which the data arrive) to the preservation format (tagged or delimited text in a defined character set). They are also carried out on conversion from the preservation format to the dissemination formats: typically Stata and tab-delimited text, but sometimes also MS Excel, MS Access, SIR and SAS.

At the Archive we have in-house programs to automate most data format conversions for all levels of processing. These ensure that no data or 'internal metadata' (variable and value labels, missing value definitions, variable format information, etc.) are lost beyond any that would occur because of differential data handling limits in specific software formats.


The checks below are performed manually for the few types of data conversion that do not have a quality-checked automated conversion program.

Data processing format conversion checks:

Numbers of rows and cases the same
  • Level A*: R + C
  • Level A: R + C
  • Level B: R + C
  • Level C: R + C
Number of decimal places the same for numeric formats
  • Level A*: R + C
  • Level A: R + C
  • Level B: R + C
String variables not truncated
  • Level A*: R + C
  • Level A: R + C
  • Level B: R + C
Date/time variables correctly formatted
  • Level A*: R + C
  • Level A: R + C
  • Level B: R + N
Internal metadata (variable names, variable labels, value labels and definition of missing values) not lost or altered
  • Level A*: R + C where possible
  • Level A: R + C where possible
  • Level B: R + N

R = relevant checks must be made
C = problems encountered must be corrected
N = problems encountered need not be corrected but must be noted in the 'Read file' supplied to users with each order
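
As an illustration of the first three checks, a comparison of a table before and after conversion might look like the following Python sketch. It assumes both versions have been read as rows of text values (for example from tab-delimited files). The function names are hypothetical and do not describe the Archive's in-house programs.

```python
def is_number(text):
    """True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def decimal_places(text):
    """Count digits after the decimal point in a numeric string, e.g. '3.50' -> 2."""
    return len(text.partition(".")[2])


def conversion_problems(before, after):
    """List discrepancies between a table before and after format conversion.

    Both tables are lists of rows; each row is a list of strings.
    """
    problems = []
    if len(before) != len(after):
        problems.append(f"row count differs: {len(before)} vs {len(after)}")
    for i, (old_row, new_row) in enumerate(zip(before, after)):
        if len(old_row) != len(new_row):
            problems.append(f"row {i}: column count differs")
            continue
        for j, (old, new) in enumerate(zip(old_row, new_row)):
            if is_number(old) and is_number(new):
                # Numeric cell: the number of decimal places should be unchanged.
                if decimal_places(old) != decimal_places(new):
                    problems.append(f"row {i}, column {j}: decimal places differ")
            elif old != new and old.startswith(new):
                # Text cell: the new value is a shortened prefix of the old one.
                problems.append(f"row {i}, column {j}: string truncated")
    return problems


before = [["1", "3.50", "Colchester"], ["2", "4.25", "Essex"]]
after = [["1", "3.5", "Colchest"], ["2", "4.25", "Essex"]]
for problem in conversion_problems(before, after):
    print(problem)
```

Under the scheme above, whether each reported problem must be corrected (C) or only noted in the 'Read file' (N) depends on the check and the processing level.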

Data download validation

For data available via the UK Data Service download system, the names of the zip files include an MD5 checksum. This 32-character string can be used to verify that the file we make available is identical to that which the user downloads.
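
A user could carry out this verification with a short Python script such as the sketch below. The exact file naming pattern is not specified here, so the sketch assumes only that the 32-character hexadecimal checksum appears somewhere in the zip file's name. The function names and the example file name are hypothetical.

```python
import hashlib
import re
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_in_name(path):
    """Return the 32-character hex string embedded in the file name, or None."""
    pattern = r"(?<![0-9a-fA-F])[0-9a-fA-F]{32}(?![0-9a-fA-F])"
    match = re.search(pattern, Path(path).name)
    return match.group(0).lower() if match else None


def download_is_intact(path):
    """True if the file's MD5 digest matches the checksum in its name."""
    expected = checksum_in_name(path)
    return expected is not None and md5_of_file(path) == expected


# Usage, with a made-up file name:
# download_is_intact("1234_0123456789abcdef0123456789abcdef.zip")
```

A mismatch indicates that the downloaded file differs from the one made available, for example because the download was interrupted.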

