Improving the Database Schema


With a subject as complicated and varied as fluorescent proteins (FPs), it is inevitable that some data presented in the literature will not easily “fit” into one of the available fields in the current database schema. The schema reflects many of the most commonly reported attributes (and the way in which they are usually reported), but there is much more information about FPs that one might want to capture; photochromic proteins in particular have a number of quantifiable characteristics that are not currently in the database. Additionally, many of the fields that do exist are oversimplified. Bleach curves and maturation rates are currently stored as half-lives, which do not fully capture complex non-monoexponential dynamics. State transitions are represented by single wavelength values rather than an absorption spectrum, and non-light-based state transitions (such as those of “timer” proteins) are not currently easy to represent. Aggregation state is reported categorically as “monomeric” or “dimeric” when in reality there may be a continuum of oligomerization tendency.
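To illustrate why a single half-life can be a lossy summary, the following sketch compares a biexponential bleach curve with the monoexponential curve that would be reconstructed from its half-life alone. The function names, the fast/slow fractions, and the time constants are hypothetical values chosen for illustration; they are not taken from the database or from any published measurement.

```python
import math

# Hypothetical biexponential bleach curve: a fast and a slow component.
# All parameter values here are illustrative assumptions, not real data.
def biexponential(t, fast_fraction=0.7, tau_fast=5.0, tau_slow=100.0):
    """Normalized intensity at time t (arbitrary time units)."""
    return (fast_fraction * math.exp(-t / tau_fast)
            + (1.0 - fast_fraction) * math.exp(-t / tau_slow))

def half_life(curve, t_max=1000.0, tol=1e-9):
    """Find t where curve(t) == 0.5 by bisection.

    Assumes the curve starts at 1, decreases monotonically,
    and has dropped below 0.5 by t_max.
    """
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if curve(mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# The single number that a half-life field would store.
t_half = half_life(biexponential)

def monoexponential(t):
    """Curve reconstructed from the half-life alone."""
    return 0.5 ** (t / t_half)

# Both curves agree at t_half by construction, but diverge at later times
# because the slow component is invisible in the half-life.
for t in (t_half, 20.0, 50.0):
    print(f"t={t:6.2f}  biexp={biexponential(t):.4f}  mono={monoexponential(t):.4f}")
```

Under these assumed parameters, the two curves share the same half-life yet differ substantially at later time points, which is the kind of information a stored curve (or a multi-component fit) would preserve and a single half-life would not.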

In some cases, these oversimplifications merely reflect the way the data is reported in the literature. In other cases, detailed datasets have been published that exceed the capacity of the current database schema. While reference “excerpts” were specifically introduced as a partial solution for storing hard-to-categorize information, it is an ongoing goal to modify the database schema to accommodate these valuable datasets, particularly in cases where a similar type of measurement has been made across multiple references.

Conflicting data in the literature presents another challenge. For example, a protein introduced as “monomeric” in the original publication may later be identified as a weak dimer, or extinction coefficient measurements may vary between publications. The ability to store multiple measurements of the same parameter from different references, along with metadata describing the way the measurement was performed, is a high priority and is currently under development.
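As a rough illustration of what storing multiple measurements per parameter could look like, here is a minimal sketch in plain Python. It is not the actual database schema: the class names, field names, protein name, reference labels, method descriptions, and numeric values are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Measurement:
    """One reported value of one parameter, tied to its source (hypothetical model)."""
    parameter: str                 # e.g. "extinction_coefficient"
    value: float
    units: str
    reference: str                 # placeholder label for the source publication
    method: Optional[str] = None   # free-text description of how it was measured

@dataclass
class ProteinRecord:
    """A protein with any number of measurements per parameter."""
    name: str
    measurements: List[Measurement] = field(default_factory=list)

    def add(self, measurement: Measurement) -> None:
        self.measurements.append(measurement)

    def values_for(self, parameter: str) -> List[Measurement]:
        """Return every stored measurement of the given parameter."""
        return [m for m in self.measurements if m.parameter == parameter]

# Placeholder data: the numbers below are made up for illustration only.
protein = ProteinRecord(name="ExampleFP")
protein.add(Measurement("extinction_coefficient", 55000.0, "M^-1 cm^-1",
                        reference="ref-A", method="hypothetical method 1"))
protein.add(Measurement("extinction_coefficient", 48000.0, "M^-1 cm^-1",
                        reference="ref-B", method="hypothetical method 2"))

# Conflicting values are kept side by side rather than overwritten.
for m in protein.values_for("extinction_coefficient"):
    print(m.reference, m.value, m.units, m.method)
```

The design choice sketched here is that a parameter is a list of sourced measurements rather than a single column on the protein, so a later report does not replace an earlier one and the measurement method travels with each value.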

I strongly encourage site visitors (particularly fluorescent protein developers) to contact me with ideas on how to improve the database schema in order to accurately represent FP measurements.