Validating a sample manifest

You can use this form to upload a sample manifest and validate it against the HICF checklist.

Note: your uploaded file is stored only if validation errors are found. In that case, we cache an edited copy of the file, with error messages added to each invalid row of the CSV. You can then download this validated file using a link that will appear below. Validated files are cached for one hour and then deleted.

Validating a manifest locally

This section explains how to use the validate_manifest Perl script to check the contents of a sample manifest before depositing it with the HICF sample data repository.

Install the Perl module and script

The Perl module is available from GitHub. You can either clone the repository or download it as a tarball. Either way, the module needs to be built using Dist::Zilla before it can be installed.

The easiest way to install Perl modules is probably to use cpanm (see https://github.com/miyagawa/cpanminus). Once cpanm is installed, you can use it to install Dist::Zilla. With Dist::Zilla available, you can unpack and build the validator distribution:

shell% tar zxf sanger-pathogens-Bio-Metadata-Validator-xxxxxxx.tar.gz
shell% cd sanger-pathogens-Bio-Metadata-Validator-xxxxxxx
shell% dzil build
[DZ] beginning to build Bio-Metadata-Validator
[DZ] guessing dist's main_module is lib/Bio/Metadata/Validator.pm
[DZ] writing Bio-Metadata-Validator in Bio-Metadata-Validator-x.xxxxxx
[DZ] building archive with Archive::Tar::Wrapper
[DZ] writing archive to Bio-Metadata-Validator-x.xxxxxx.tar.gz
[DZ] built in Bio-Metadata-Validator-x.xxxxxx

You can now install the Bio-Metadata-Validator tar file:

shell% cpanm Bio-Metadata-Validator-x.xxxxxx.tar.gz
...

After installation you should be able to run the script. Called without arguments, it prints a usage summary like this:

shell% validate_manifest
validate_manifest [-chiov] [long options...] 
      -c --config           path to the configuration file that defines the
                            checklist
      -o --output           write the validated CSV file to this file
      -i --write-invalid    write invalid rows only
      -v --verbose-errors   show full field descriptions in validation
                            error messages
      -h --help             print usage message
shell% 

You can use validate_manifest -h to see more detailed documentation.

Download the ontologies and taxonomy data

The checklist requires that certain fields (e.g. location) contain ontology terms and others (e.g. scientific_name) contain valid scientific names or tax IDs for organisms. The validation script can check that values in your manifest are found in the required ontologies or the taxonomy tree, but you need to download the data files and store them locally first.

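Three ontology files are needed: envo-basic.obo (Environment ontology), BrendaTissueOBO (BRENDA tissue ontology), and gaz.obo (Gazetteer ontology). Their download URLs are listed under Resources.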

There is one taxonomy file, the NCBI taxonomy dump taxdump.tar.gz, which is only available as a tar archive. Its download URL is listed under Resources.

After downloading the tar file, you need to extract the "names.dmp" file:

shell% tar zxf taxdump.tar.gz names.dmp

If you change the names of the files for any reason, you will need to edit the checklist configuration file accordingly.
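
The download and extraction steps can be scripted. The following is a minimal sketch rather than part of the validator distribution: fetch_hicf_data is a hypothetical function name, the URLs are the ones listed under Resources (they may have moved since this was written), and it assumes a curl build with FTP support.

```shell
# fetch_hicf_data is a hypothetical helper, not part of validate_manifest.
# It downloads the three ontology files and the taxonomy archive into the
# current directory, then extracts names.dmp from the archive.
fetch_hicf_data() {
  for url in \
    http://purl.obolibrary.org/obo/subsets/envo-basic.obo \
    http://www.brenda-enzymes.info/ontology/tissue/tree/update/update_files/BrendaTissueOBO \
    http://purl.obolibrary.org/obo/gaz.obo \
    ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
  do
    # -f: fail on HTTP errors instead of saving an error page
    # -L: follow redirects (the purl.obolibrary.org URLs redirect)
    # -O: save under the file name given at the end of the URL
    curl -f -L -O "$url" || return 1
  done

  # Only names.dmp is needed from the taxonomy archive.
  tar zxf taxdump.tar.gz names.dmp
}
```

Running fetch_hicf_data in the directory where you plan to validate keeps the file names that the checklist configuration expects.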

Build your manifest

Your sample data must be formatted as a "comma-separated values" (CSV) file. If you have sample data in Excel, you can export a CSV file using:

File > Save As... > Format: Windows Comma Separated (.csv)

If you are creating the CSV file using a script or similar, you can see an example file in the bundle containing the Perl module and README. You can download a tar archive containing the manifest template and example manifests in Excel and CSV formats.
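
If you generate the CSV from a shell script, the main pitfall is quoting: fields containing commas or double quotes must be wrapped in quotes, with embedded quotes doubled. The sketch below shows one way to do that. The names csv_field and csv_row are hypothetical helpers, and it assumes that the validator accepts a file in which every field is quoted, which is standard CSV but has not been checked against validate_manifest. The two example rows are truncated to three columns for brevity; a real manifest needs every column from the manifest template.

```shell
# csv_field and csv_row are hypothetical helpers, not part of the validator.

# Print one field wrapped in double quotes, doubling any embedded quotes.
csv_field() {
  printf '"%s"' "$(printf '%s' "$1" | sed 's/"/""/g')"
}

# Print all arguments as one comma-separated, fully quoted CSV row.
csv_row() {
  sep=''
  for field in "$@"; do
    printf '%s' "$sep"
    csv_field "$field"
    sep=','
  done
  printf '\n'
}

# Header and data row, truncated to the first three columns.
csv_row "raw data accession" "sample accession" "sample description"
csv_row ERR000001 ERS000001 'Isolate from ward 3, "index" case'
```

Redirecting the output of the csv_row calls to a file gives you a manifest that can be passed to validate_manifest.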

Validate the manifest

Download the checklist configuration file. It's easiest to run validate_manifest in the directory containing the config file, your manifest, and the ontology files ("envo-basic.obo", "BrendaTissueOBO", "gaz.obo").

shell% validate_manifest -c HICF_checklist.conf example_manifest.csv
input data are valid
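
Before running the validator it can help to confirm that the supporting files are in place. The following is a minimal sketch, not part of the validator: check_validator_files is a hypothetical function name, and it assumes that names.dmp sits in the same directory as the ontology files and that none of the files has been renamed.

```shell
# check_validator_files is a hypothetical helper, not part of validate_manifest.
# It reports any supporting file that is missing from the given directory
# (default: the current directory) and returns non-zero if any is absent.
check_validator_files() {
  dir=${1:-.}
  missing=0
  for f in HICF_checklist.conf envo-basic.obo BrendaTissueOBO gaz.obo names.dmp; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, "check_validator_files && validate_manifest -c HICF_checklist.conf example_manifest.csv" runs the validator only when all five files are present.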

Error messages are appended to invalid rows as an extra column, so that you can re-import the file into Excel and view the error messages along with the data. If you find problems with your manifest, you can write out the invalid rows and check the embedded error messages:

shell% validate_manifest -c HICF_checklist.conf -o invalid_rows.csv -i broken_manifest.csv
input data are invalid. Found 1 invalid row.
wrote only invalid rows from validated file to 'invalid_rows.csv'.
shell% less invalid_rows.csv
"raw data accession","sample accession","sample description","collected at","tax ID","scientific name","collected by",source,"collection date",location,"host associated","specific host","host disease status","host isolation source","patient location","isolation source",serovar,"other classification",strain,isolate,"antimicrobial resistance"
ERR000001,ERS000001,"Example description",CAMBRIDGE,703339,"Staphylococcus aureus 04-02981","Tate JG, Keane J",123,05/10/2013,GAZ:00444180,yes,"Homo sapiens",healthy,BTO:0000645,inpatient,,I,,630,,"tetracyclin;S;40,erythromycin;R;50;Peru","[errors found on row 1] [value in field 'collection_date' is not valid]"

Scrolling to the end of the line, you can see that there was a single error on one row of the manifest. In this case the date was specified in an invalid format, using the default Excel format (05/10/2013) rather than the required ISO format (e.g. 2013-10-05).
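
A date in the rejected form can be rewritten mechanically. The function below is a minimal sketch, not part of the validator: to_iso_date is a hypothetical helper name, and it assumes that the day comes first, as in the example row (05/10/2013 meaning 5 October 2013). Excel installations with US regional settings write the month first, so check which order your file uses before converting.

```shell
# to_iso_date is a hypothetical helper, not part of validate_manifest.
# It assumes day-first input (DD/MM/YYYY) and prints YYYY-MM-DD.
# It returns non-zero if the input does not have three "/"-separated parts.
to_iso_date() {
  printf '%s\n' "$1" | awk -F/ '
    NF == 3 { printf "%04d-%02d-%02d\n", $3, $2, $1; ok = 1 }
    END     { exit ok ? 0 : 1 }'
}

to_iso_date 05/10/2013   # prints 2013-10-05
```

After correcting the dates in your manifest, run validate_manifest again to confirm that the row now passes.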

Resources

These GitHub repositories contain the files that you will need to run the validator:

Bio-Metadata-Validator
    Perl module
HICF_checklist
    Checklist and examples

You can download the contents of the checklist tar file individually too:

00README.txt
    README containing this documentation
broken_manifest.csv
    example of an invalid manifest in CSV format
example_manifest.csv
    example of a valid manifest in CSV format
example_manifest.xlsx
    example of a valid manifest in Excel format
HICF_checklist.conf
    checklist configuration file
invalid_rows.csv
    invalid rows found when validating broken_manifest.csv
midas_manifest_v5.xlsx
    manifest template in Excel format

These are the ontology and taxonomy files that are required for the HICF checklist:

http://purl.obolibrary.org/obo/subsets/envo-basic.obo
    Environment ontology
http://www.brenda-enzymes.info/ontology/tissue/tree/update/update_files/BrendaTissueOBO (1.8 MB)
    BRENDA tissue ontology
http://purl.obolibrary.org/obo/gaz.obo
    Gazetteer ontology
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
    NCBI taxonomy