The XChemExplorer graphical workflow tool for routine or large-scale protein–ligand structure determination

XChemExplorer is a graphical workflow and data-management tool for the parallel determination of protein–ligand complexes. Its implementation, usage and application are described here.


S1. Project directory structure and file name conventions
The project directory structure of XChemExplorer (XCE) is as follows:
<project_directory>/<sample_id>
Each sample folder contains an MTZ file and the corresponding AIMLESS logfile, e.g.:
PHIPA-x001.mtz
PHIPA-x001.log
XCE uses a file called <sample_id>.free.mtz as input for refinement. This file is either generated automatically by the DIMPLE difference map pipeline, or users can choose to append an existing Rfree set from a reference file by providing it in the reference folder.
MTZ column labels must have the CCP4 default names (IMEAN, SIGIMEAN, F, SIGF, FreeR_flag); otherwise XCE may show unexpected behaviour. At present, the program can only parse AIMLESS logfiles.
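The label requirement can be checked before files are handed to XCE. The following is a minimal sketch, not part of XCE: the function name missing_labels is hypothetical, and the labels are assumed to have already been read from the MTZ header by whatever tool is available. The text does not state whether every file must carry all five labels, so the sketch simply reports which of them are absent.

```python
# Column labels named in the text as the CCP4 defaults expected by XCE.
REQUIRED_LABELS = ("IMEAN", "SIGIMEAN", "F", "SIGF", "FreeR_flag")


def missing_labels(labels, required=REQUIRED_LABELS):
    """Return the required labels that are absent from `labels`.

    `labels` is any iterable of column label strings taken from an MTZ
    header. The comparison is case-sensitive, and the result keeps the
    order of `required`.
    """
    present = set(labels)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Hypothetical header of a file that has no free-R column yet.
    header = ["H", "K", "L", "IMEAN", "SIGIMEAN", "F", "SIGF"]
    print(missing_labels(header))
```

An empty list means that all five default labels are present.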
After DIMPLE is run successfully, the resulting PDB and MTZ files will be linked as dimple.pdb and dimple.mtz into the respective sample directory.
Once the refinement stage is reached, a subfolder is created for each refinement cycle: Refine_<cycle number>. This subfolder contains the modified PDB file, an executable shell script and the output. The script contains the complete refinement and validation schedule. After successful refinement, the resulting PDB and MTZ files will be linked as refine.pdb and refine.mtz into the sample directory.
It is possible to create this folder structure manually and then choose Data Source -> Update Data Source from filesystem from the XCE menu to import all the information into the database.
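Before importing a manually created folder structure, it can help to check that it follows the conventions described in this section. The sketch below is an illustration only and does not reproduce how XCE reads the filesystem: the function names scan_sample and scan_project are hypothetical, and every subdirectory of the project directory is assumed to be a sample folder.

```python
import re
from pathlib import Path

# Refinement cycle folders are named Refine_<cycle number>.
REFINE_DIR = re.compile(r"^Refine_(\d+)$")


def scan_sample(sample_dir):
    """Report which conventional files and refinement cycles a sample has."""
    sample_dir = Path(sample_dir)
    sample_id = sample_dir.name

    cycles = []
    for entry in sample_dir.iterdir():
        match = REFINE_DIR.match(entry.name)
        if match and entry.is_dir():
            cycles.append(int(match.group(1)))

    expected = [
        f"{sample_id}.mtz",       # merged data
        f"{sample_id}.log",       # AIMLESS logfile
        f"{sample_id}.free.mtz",  # input for refinement
        "dimple.pdb",
        "dimple.mtz",
        "refine.pdb",
        "refine.mtz",
    ]
    # Path.exists() follows symbolic links, so a broken link counts as missing.
    files = {name: (sample_dir / name).exists() for name in expected}
    return {"sample_id": sample_id, "files": files, "refine_cycles": sorted(cycles)}


def scan_project(project_dir):
    """Scan every subdirectory of the project directory as a sample folder."""
    folders = sorted(p for p in Path(project_dir).iterdir() if p.is_dir())
    return [scan_sample(p) for p in folders]
```

Calling scan_project on the project directory returns one dictionary per sample folder, which makes missing files easy to spot before the data source is updated.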

S2. Dataset selection of different auto-processing pipelines
When data were processed with different settings and the resulting data collection statistics appear similar, it is difficult to know which output is best. XCE offers a default selection mechanism, described below, but also allows selection by obvious criteria such as highest resolution. Crystal systems used for SBLD are often well characterised; hence the default selection mechanism tries to pick only data processing results that have the same point group and a similar unit cell volume as one of the provided reference PDB files. It also tries to eliminate datasets with suspiciously high low-resolution Rmerge values and assigns an empirical score to each outcome, which serves as the final discriminator. The score is defined as:
where N(reflections) is the number of unique reflections, Completeness is the overall completeness, Mn(I/sig(I)) is the overall signal-to-noise ratio and N(ASU) is the number of asymmetric units per unit cell for a given point group. Details of the selection mechanism are given in Figure S1.
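The order of the filters described in this section can be sketched as follows. This is an illustration under stated assumptions, not the XCE implementation: the class and function names are hypothetical, the two thresholds are placeholders rather than the values used by XCE, the empirical score is taken as a precomputed input, and returning None when nothing passes is a simplification that may differ from the behaviour shown in Figure S1.

```python
from dataclasses import dataclass


@dataclass
class ProcessingResult:
    name: str
    point_group: str
    cell_volume: float   # unit cell volume
    rmerge_low: float    # Rmerge in the low-resolution shell
    score: float         # empirical score, computed elsewhere


@dataclass
class ReferenceModel:
    point_group: str
    cell_volume: float


def matches_reference(result, reference, max_volume_diff):
    """Same point group and a relative cell volume difference within the limit."""
    if result.point_group != reference.point_group:
        return False
    diff = abs(result.cell_volume - reference.cell_volume) / reference.cell_volume
    return diff <= max_volume_diff


def select_result(results, references, max_volume_diff=0.1, max_rmerge_low=0.1):
    """Pick the highest-scoring result that passes both filters.

    Both default thresholds are assumed placeholder values.
    """
    candidates = [
        r for r in results
        if any(matches_reference(r, ref, max_volume_diff) for ref in references)
    ]
    candidates = [r for r in candidates if r.rmerge_low <= max_rmerge_low]
    return max(candidates, key=lambda r: r.score, default=None)
```

Applying the reference and Rmerge filters before the score means that a result with a high score but the wrong point group or a suspicious low-resolution Rmerge is never selected.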

Table S1
The following pre-defined categories are used to annotate the overall data collection outcome.
Some of the categories are only relevant when data were collected in automatic mode.
success
Data collection was successful.

centring failed
The crystal was not correctly centred in one or several orientations.

no diffraction
The crystal was correctly centred, but does not diffract.

processing failed
The crystal showed satisfactory diffraction, but none of the data processing pipelines was able to process it automatically.

1 - Analysis pending
Initial maps were calculated, but not analysed.

-PANDDA model
A protein-ligand structure has been built with pandda.inspect, but has not yet been refined.

-In Refinement
The dataset is currently being refined.

-CompChem ready
The structure is ready for analysis because all regions of interest, e.g. the ligand binding site, are modelled with confidence and the overall quality indicators are satisfactory. There may still be local errors in other parts of the model.

-High Confidence
The respective 2mFo-DFc, mFo-DFc or PanDDA event maps are very well defined and agree with the