Quality control
The Archive is at the forefront of developing international standards for processing both quantitative and qualitative data.
We use different levels of quality control depending on how much ‘additional value’ is to be added to the data.
We assign one of four levels of data processing (A*, A, B or C) to each incoming study, depending on anticipated future usage. Data review, privacy assessment, and content checks and validation are critical.
This is a summary of our content and validation checks. See UK Data Archive Data Processing Standards (PDF) for full details.
Quantitative data
Dataset dimension checks
- Level A*: the number of cases and variables are checked against the documentation
- Level A: as for A*
- Level B: as for A*
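As an illustration only, a dimension check of this kind could be sketched as follows. The sketch assumes the data file is tab-delimited text with a single header row, and that the documented counts are supplied by hand; the function names are hypothetical and do not describe the Archive's in-house tools.

```python
import csv
import io


def dataset_dimensions(tab_delimited_text):
    """Return (cases, variables) for tab-delimited text with one header row."""
    reader = csv.reader(io.StringIO(tab_delimited_text), delimiter="\t")
    header = next(reader)
    # Count non-empty data rows; each one is treated as a case.
    cases = sum(1 for row in reader if row)
    return cases, len(header)


def check_dimensions(tab_delimited_text, documented_cases, documented_variables):
    """Compare the file's dimensions with the figures given in the documentation."""
    cases, variables = dataset_dimensions(tab_delimited_text)
    problems = []
    if cases != documented_cases:
        problems.append(
            f"cases: found {cases}, documentation states {documented_cases}"
        )
    if variables != documented_variables:
        problems.append(
            f"variables: found {variables}, documentation states {documented_variables}"
        )
    return problems


# Hypothetical three-variable file with two cases.
sample = "id\tage\tsex\n1\t34\t1\n2\t51\t2\n"
print(check_dimensions(sample, documented_cases=2, documented_variables=3))
```

An empty list means the file matches the documentation; any entries describe the mismatch to be followed up.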
Metadata checks
- Level A*: the dataset must be comprehensible in itself - i.e. all variables should have variable labels and all categorical variables should have value labels
- Level A: the dataset must be comprehensible in association with the documentation given to users
- Level B: visual checks on quality are undertaken; action is taken for systematic problems
Data validity checks
- Level A*: all categorical variables checked for out-of-range values/wild codes; where possible, interval variables checked for improbable or impossible values; variable and value labels need not be present in the data file as long as they can be found in the documentation
- Level A: as for A*
- Level B: visual checks on quality are undertaken; action is taken for systematic problems
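The two A* validity checks can be sketched roughly as below. This is a minimal example under stated assumptions: the set of valid codes and the plausible range for an interval variable are taken from the study documentation, and the function names are hypothetical.

```python
def find_wild_codes(values, valid_codes):
    """Return the distinct values of a categorical variable that are not valid codes."""
    return sorted(set(values) - set(valid_codes))


def find_improbable_values(values, minimum, maximum):
    """Return the values of an interval variable that fall outside a plausible range.

    Missing values (None) are skipped rather than reported.
    """
    return [v for v in values if v is not None and not (minimum <= v <= maximum)]


# Hypothetical variable 'sex' documented with codes 1, 2 and 9 (not answered).
print(find_wild_codes([1, 2, 2, 9, 7, 1], valid_codes={1, 2, 9}))

# Hypothetical variable 'age', assumed plausible between 0 and 120.
print(find_improbable_values([34, 51, 190, -3, None], minimum=0, maximum=120))
```

What counts as improbable depends on the variable and the study population, so the range is a judgement for the person processing the data rather than a fixed rule.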
Confidentiality checks
- Level A*, A and B: always undertaken
Metadata enhancements
- Level A*: for online browsing, the following may be added: literal question text, routing information and interviewers' instructions, frequencies and summary statistics, variable groups; bookmarked PDF user guides are produced; additional related resources may be provided on a dedicated web page; additional notes to users are given in the 'Read file'
- Level A: bookmarked PDF user guides are produced; additional related resources may be provided on a dedicated web page; additional notes to users are given in the 'Read file'
- Level B: bookmarked PDF user guides are produced; additional notes to users are given in the 'Read file'
ReShare
- Level B: sample of 30 + 10 per cent of the remaining categorical variables must be checked for out-of-range values/wild codes; sample of 30 + 10 per cent of the remaining suitable interval variables must be checked for improbable or impossible values
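To make the sampling rule concrete, the sketch below works out how many variables would be checked. It rests on assumptions not stated above: that every variable is checked when there are 30 or fewer, and that the 10 per cent figure is rounded up to a whole variable. The function name is hypothetical.

```python
import math


def reshare_sample_size(total_variables, base=30, percent_of_remaining=10):
    """Number of variables to check under a '30 + 10 per cent of the remaining' rule."""
    if total_variables <= base:
        # Assumption: small datasets are checked in full.
        return total_variables
    remaining = total_variables - base
    # Assumption: a fractional variable is rounded up.
    return base + math.ceil(remaining * percent_of_remaining / 100)


# A dataset with 200 categorical variables: 30 + 10 per cent of the other 170.
print(reshare_sample_size(200))  # 47
```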
Qualitative data
The same four levels apply to qualitative data. Most qualitative data collections are processed to A standard, with a select few nominated for enhancement to A*. B and C are seldom used, but apply when handling older paper-based studies.
Level A*
- data are fully digitised and anonymised
- metadata and documentation are fully digitised and anonymised
- for online browsing, data are marked up in XML
- enhanced user guide is prepared for QualiBank
Level A
- the dataset must be comprehensible in association with the documentation given to users
- data are fully digitised and anonymised
- metadata and documentation are fully digitised and anonymised
Level B
- data are digitised at least to the level of scanned images and anonymised
- metadata and documentation are digitised at least to the level of scanned images and anonymised
- only major problems with data are resolved
Level C
- no checks are made
- data remain in the format in which they were received
- non-digital collections are not anonymised or digitised and are transferred to another repository
- only a basic catalogue record is created
For level C studies, a minimum of dataset dimension checks and confidentiality checks is carried out, with metadata enhancements as for B studies.
Format translation checks
Checks are carried out when converting from:
- the ingested format to our preservation format (tagged or delimited text of defined character set)
- the preservation format to the dissemination formats (Stata and tab delimited text; or MS Excel, MS Access, SIR and SAS)
We use in-house programs to automate most data format conversions for all levels of processing. These make sure no data or 'internal metadata' (variable and value labels, missing value definitions, variable format information, etc.) are lost beyond any that would occur because of differential data handling limits in specific software formats.
For data formats, the following checks are currently performed manually, but will be replaced by automated checking using the QAMyData tool.
Numbers of rows and cases the same
- Level A* and A: Relevant checks made, problems corrected
- Level B: Relevant checks made, problems corrected
- Level C: Format conversion is not usually undertaken for C standard datasets. C standard is rare, but one of the reasons for it is that the data file cannot be converted from its original format, so normal processing cannot be undertaken. Where conversion is undertaken, relevant checks must be made and problems corrected
Number of decimal places the same for numeric formats
- Level A*, A and B: Relevant checks made, problems corrected
String variables not truncated
- Level A*, A and B: Relevant checks made, problems corrected
Date/time variables correctly formatted
- Level A* and A: Relevant checks made, problems corrected
- Level B: Relevant checks made, problems noted in the user 'Read file'
Internal metadata (variable names, variable labels, value labels and definition of missing values) not lost or altered
- Level A* and A: Relevant checks made, problems corrected where possible
- Level B: Relevant checks made, problems noted in the user 'Read file'
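Three of the checks listed above (row counts, decimal places and string truncation) can be illustrated with a short comparison of a file before and after conversion. This is a sketch only, assuming both versions have been read into lists of rows of text values; it does not cover date/time or internal metadata checks, and it is not a description of QAMyData or of the in-house conversion programs. All names are hypothetical.

```python
def is_number(text):
    """True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def decimal_places(text):
    """Count the digits after the decimal point in a numeric string."""
    return len(text.partition(".")[2])


def compare_conversion(source_rows, converted_rows):
    """Return a list of problems found when comparing two versions of a table."""
    problems = []
    if len(source_rows) != len(converted_rows):
        problems.append(
            f"row count differs: {len(source_rows)} vs {len(converted_rows)}"
        )
    for row_number, (src, out) in enumerate(zip(source_rows, converted_rows), start=1):
        if len(src) != len(out):
            problems.append(f"row {row_number}: column count differs")
            continue
        for column, (before, after) in enumerate(zip(src, out), start=1):
            where = f"row {row_number}, column {column}"
            if before == after:
                continue
            if is_number(before) and is_number(after):
                if decimal_places(before) != decimal_places(after):
                    problems.append(f"{where}: decimal places differ")
                else:
                    problems.append(f"{where}: numeric value changed")
            elif before.startswith(after):
                problems.append(f"{where}: string truncated")
            else:
                problems.append(f"{where}: value changed")
    return problems


# Hypothetical example: one decimal place lost and one string cut short.
source = [["1", "3.140", "Manchester"], ["2", "2.500", "Leeds"]]
converted = [["1", "3.14", "Manchest"], ["2", "2.500", "Leeds"]]
for problem in compare_conversion(source, converted):
    print(problem)
```

Whether a reported problem is then corrected or noted in the 'Read file' would follow the level assigned to the study.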
Data download validation
For data available via the UK Data Service download system, the names of the zip files include an MD5 checksum. This 32-character string can be used to verify that the file we make available is identical to the one the user downloads.
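A user could carry out this verification along the following lines. The sketch assumes only that the checksum appears somewhere in the zip file name as a run of 32 hexadecimal characters; the exact naming pattern is not described here, and the function names and the example path are hypothetical.

```python
import hashlib
import os
import re


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_from_name(filename):
    """Extract a standalone 32-character hexadecimal string from a file name, if any."""
    match = re.search(r"(?<![0-9a-fA-F])[0-9a-fA-F]{32}(?![0-9a-fA-F])", filename)
    return match.group(0).lower() if match else None


def download_is_intact(path):
    """True if the file's MD5 checksum matches the one embedded in its name."""
    expected = checksum_from_name(os.path.basename(path))
    if expected is None:
        raise ValueError("no 32-character checksum found in the file name")
    return md5_of_file(path) == expected


# Usage (the path is a placeholder for a downloaded zip file):
# print(download_is_intact("/path/to/downloaded_file.zip"))
```

A result of False suggests the download was incomplete or altered, and the file should be downloaded again.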