Coiled-coils dataset description

Validating and comparing the performance of different methods requires a reliable dataset. A safe choice is the intersection of the SCOP protein classification and the output of the SOCKET program.
In this perspective, we selected protein structures on the basis of the following steps:
1) We downloaded SCOP (release 1.69) and selected all the structures classified as coiled-coil proteins (SCOP class "h").
2) Each structure selected in step 1 was then processed with the SOCKET program, which automatically identifies coiled-coil motifs.
A packing cutoff of 7.4 Å was used to identify the contacts. If SOCKET detected no coiled-coil segments, the structure was discarded.
The annotation is thus given by the sequence segments that SOCKET labels as coiled-coil regions.
When SOCKET detected overlapping segments at a given position of a sequence (because coiled-coils make multiple contacts in the three-dimensional structure), we defined the coiled-coil domain as the union of all the overlapping segments.
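The union of overlapping segments described above can be sketched as a standard interval merge. This is a minimal illustration, not the code used to build the dataset: the function name, the (start, end) tuple representation, and the 1-based inclusive residue numbering are all assumptions made for the example.

```python
def merge_segments(segments):
    """Return the union of possibly overlapping (start, end) segments.

    Positions are assumed to be 1-based and inclusive. Two segments are
    merged only when they share at least one residue.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous domain: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical SOCKET segments for one chain.
example_segments = [(20, 40), (10, 25), (60, 70)]
merged_domains = merge_segments(example_segments)
print(merged_domains)
```

Sorting first means each segment only has to be compared with the most recently built domain, so the merge is a single pass over the list.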
Furthermore, we also excluded:
- protein structures with gaps in the coordinates
- protein structures with sequence length shorter than 30 residues
- protein structures with coiled-coil domains shorter than 9 residues.
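The exclusion rules above can be expressed as a single predicate. The sketch below is illustrative only: the function and parameter names are hypothetical, the gap check is assumed to have been computed elsewhere, and the third rule is read as "exclude the chain if any of its coiled-coil domains is shorter than 9 residues", which is one possible interpretation of the text.

```python
def passes_filters(sequence, has_coordinate_gaps, cc_domains,
                   min_length=30, min_domain_length=9):
    """Return True if a chain survives the exclusion rules.

    sequence            -- amino-acid sequence of the chain
    has_coordinate_gaps -- True if residues are missing from the coordinates
    cc_domains          -- list of (start, end) domains, 1-based inclusive
    """
    if has_coordinate_gaps:
        return False
    if len(sequence) < min_length:
        return False
    if not cc_domains:
        # No coiled-coil segment detected: the structure is discarded.
        return False
    # Assumed reading: every domain must reach the minimum length.
    return all(end - start + 1 >= min_domain_length
               for start, end in cc_domains)


# Hypothetical chains: (sequence, has gaps, coiled-coil domains).
kept = passes_filters("A" * 40, False, [(5, 20)])
too_short_domain = passes_filters("A" * 40, False, [(5, 12)])
print(kept, too_short_domain)
```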
Following this procedure we ended up with a structurally annotated dataset of 104 protein chains (S104).
For training and benchmarking prediction methods, we also make available sets of sequences clustered by sequence identity. In addition, we provide a set of 50 proteins (S50), none of which shares more than 30% sequence identity with any sequence of the Marcoil dataset. This set is useful for blind testing of methods developed on the Marcoil dataset or on older ones.
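The S50 selection criterion can be illustrated with a small filter over precomputed identities. This is only a sketch: how the identities were actually computed is not stated here, so the example assumes a hypothetical table mapping each candidate chain to its percent identity against each Marcoil sequence. All identifiers and values below are invented.

```python
def select_independent(identities, max_identity=30.0):
    """Keep candidates with no hit above max_identity.

    identities -- dict: candidate id -> {reference id: percent identity}
    """
    return [
        candidate
        for candidate, hits in identities.items()
        if all(pct <= max_identity for pct in hits.values())
    ]


# Hypothetical identities (percent) against two Marcoil sequences.
toy_identities = {
    "chainA": {"marcoil1": 12.0, "marcoil2": 28.5},
    "chainB": {"marcoil1": 45.0, "marcoil2": 10.0},
    "chainC": {"marcoil1": 30.0, "marcoil2": 5.0},
}
independent = select_independent(toy_identities)
print(independent)
```

A candidate is rejected as soon as a single reference sequence exceeds the threshold, which matches the requirement that no S50 protein is similar to any Marcoil sequence.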

Negative dataset description

We downloaded the Astral SCOP (release 1.69) subset, which contains sequences sharing less than 40% identity. The selected sequences were processed with SOCKET (7.4 Å packing cutoff), and all sequences for which the program detected at least one coiled-coil residue were removed from the dataset.
We also filtered out all sequences similar (>25% identity) to any sequence of the Marcoil negative set. We then checked the corresponding PDB entries and removed all structures annotated as coiled-coil or coiled-coil related.
Finally, we clustered the remaining sequences at a sequence identity threshold of 25% and chose one representative for each cluster.
The negative dataset consists of 1139 protein sequences.
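The final clustering step can be sketched with a simple greedy procedure. The clustering algorithm and the rule for choosing representatives are not specified above, so everything here is an assumption made for illustration: sequences are visited from longest to shortest, each one joins the first representative it matches above the threshold, and the first member of a cluster serves as its representative. The identity function is a hypothetical stand-in for a real pairwise alignment.

```python
def greedy_cluster(seqs, identity, threshold=25.0):
    """Greedy clustering by sequence identity (illustrative only).

    seqs     -- dict: sequence id -> sequence
    identity -- callable (id_a, id_b) -> percent identity
    Returns a dict: representative id -> list of member ids.
    """
    # Longest sequences first; ties broken by id for reproducibility.
    order = sorted(seqs, key=lambda sid: (-len(seqs[sid]), sid))
    clusters = {}
    for sid in order:
        for rep in clusters:
            if identity(sid, rep) > threshold:
                clusters[rep].append(sid)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Hypothetical sequences and pairwise identities (percent).
toy_seqs = {"a": "M" * 50, "b": "M" * 40, "c": "M" * 30}
toy_pct = {
    frozenset(("a", "b")): 60.0,
    frozenset(("a", "c")): 10.0,
    frozenset(("b", "c")): 12.0,
}


def toy_identity(x, y):
    return toy_pct[frozenset((x, y))]


toy_clusters = greedy_cluster(toy_seqs, toy_identity)
representatives = sorted(toy_clusters)
print(representatives)
```

With this scheme every pair of representatives is at or below the threshold, although non-representative members are only guaranteed to be similar to their own representative.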

Download datasets