Working with Sensitive Data
Establishing a Moderately Sensitive (P2/P3) Project
Researchers can use Savio to work with moderately sensitive data, which campus classifies as P2/P3 data. To do so, you must create a special project, which uses memory, storage, and scratch resources separate from those used by other Savio jobs. You must set up your project before you upload your data.
Savio has been architected to comply with the requirements of many data use agreements. We can work with you when setting up your account to ensure the environment fits the requirements of your project’s data use agreement. Please note that the user is ultimately responsible for complying with those requirements.
Highly Sensitive Data
Savio is not appropriate for highly sensitive (P4) data. Please see the Secure Research Data and Compute web page for information on the campus initiative to support researchers working with highly sensitive data.
Getting an Account to Work with Sensitive Data
To start the process of creating a P2/P3 project, see Accounts for Sensitive Data. Once you have created your project, you’re ready to move your data into Savio.
Storing Sensitive Data
Sensitive data, including NIH dbGaP data, must be stored in the shared group directory assigned to your P2/P3 project. Each user of a project is also allotted a secure scratch directory. All covered data must be stored within these locations. Researchers should not store sensitive data, whether source or derived, in their user home directories.
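As an illustration of the storage rule above, a workflow script could check an output path before writing covered data to it. The following is a minimal sketch, not a BRC-provided tool: the directory names in `ALLOWED_ROOTS` and the function name `is_allowed_location` are hypothetical placeholders, and you would substitute the group and scratch directories actually assigned to your P2/P3 project. The check is purely textual; it normalizes `..` components but does not resolve symbolic links.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical placeholder locations (assumption, not real Savio paths).
# Replace with the group and scratch directories assigned to your project.
ALLOWED_ROOTS = (
    "/example/p2p3/groups/my_project",
    "/example/p2p3/scratch/my_user",
)


def is_allowed_location(path, allowed_roots=ALLOWED_ROOTS):
    """Return True if `path` lies inside one of the allowed directories.

    The comparison is made on whole path components, so a sibling such as
    ".../my_project2" does not match ".../my_project".
    """
    parts = PurePosixPath(posixpath.normpath(path)).parts
    for root in allowed_roots:
        root_parts = PurePosixPath(posixpath.normpath(root)).parts
        if parts[:len(root_parts)] == root_parts:
            return True
    return False


if __name__ == "__main__":
    for candidate in (
        "/example/p2p3/groups/my_project/data/sample.gpg",
        "/home/my_user/sample.csv",
    ):
        print(candidate, is_allowed_location(candidate))
```

A script might call `is_allowed_location` on every output path and stop if the result is false. This is only a convenience check and does not replace the requirements of your data use agreement.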
It is the responsibility of the user to remove data from the secure scratch filesystem at the end of a research workflow. The BRC team will implement tools to assess when data in a P2/P3 scratch area have not been used within the specified time frame, and will notify the researcher that they must remove the data.
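Researchers who want to review their own scratch area before being notified could use something like the sketch below. It is not the BRC tool mentioned above. The function name `find_stale_files` and the `max_idle_days` parameter are assumptions made for illustration, and the actual time frame is whatever your project's agreement specifies. The sketch treats a file as unused if neither its access time nor its modification time falls within the window; access times may be unreliable on filesystems that do not update them.

```python
import os
import time

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_idle_days, now=None):
    """List files under `root` not accessed or modified in `max_idle_days` days.

    `max_idle_days` is an illustrative parameter; use the time frame that
    applies to your project. `now` can be supplied to make results repeatable.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                info = os.lstat(full_path)  # do not follow symbolic links
            except OSError:
                continue  # file vanished or is unreadable; skip it
            last_used = max(info.st_atime, info.st_mtime)
            if last_used < cutoff:
                stale.append(full_path)
    return sorted(stale)


if __name__ == "__main__":
    import sys

    # Usage: python find_stale.py <scratch_directory> <days>
    for path in find_stale_files(sys.argv[1], float(sys.argv[2])):
        print(path)
```

The script only reports candidates; deciding what to delete, and deleting it, remains the researcher's responsibility.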
For data covered under the NIH guidelines for cloud providers, see also the document "NIH Active Research workflow for NIH data", which describes removing unencrypted data from scratch.
Users are discouraged from setting up backups of any sensitive data. If a user chooses to arrange for backups or other copies outside the Savio environment, it is the sole responsibility of the user to comply with all requirements related to data security, encryption, and deletion. Research IT and the campus Information Security Office can help assess compliance with your granting agency requirements.
Encrypting Sensitive Data
The Researcher Use Agreement (RUA) for using P2/P3 data in the Savio HPC environment includes notes that define the researcher as responsible for encryption of hard copies.
No covered data in unencrypted form may ever be stored (even temporarily) in project folders.
Software (file-level) encryption or equivalent is required for all covered data in a long-term storage environment, and whenever the data is not being actively used in a research workflow. Researchers must agree that, to prevent unauthorized access to covered data, unencrypted data must not be left in the system during vacations, significant absences, or other periods of inactivity.
As soon as the research workflow comes to an end, or if the researcher pauses or suspends the workflow for any substantive period (as determined by the PI), all unencrypted data must be deleted from the scratch filesystem.
Transferring Sensitive Data
Encrypted data will generally be transferred into Savio scratch either from the P2/P3 project space in the Savio environment or from an external data source. The data must be transferred into the P2/P3 designated scratch space for a user approved on the associated P2/P3 project.
Computing with Sensitive Data
Decryption of encrypted data must be done as a normal compute job, run according to the RUA (in particular, not on a login node). Decrypted data must be stored in the P2/P3 designated scratch space for a user approved on the associated P2/P3 project.
Analysis of data and derivatives is the active phase of research and may last for days, weeks, or even longer. It is not necessarily a single job in the cluster, and may represent a series of jobs in a workflow. The scheduling of these jobs is subject to scheduling constraints, resource availability, and other factors that the researcher cannot control. Certain steps in a research workflow may involve review of intermediate results, and subsequent resumption of analytic or other processing steps. So long as a researcher is in an active phase of ongoing work, data may remain in an unencrypted form in the P2/P3 scratch area.
For P2/P3 data sets in which the overhead for decryption would amount to a small percentage of the total compute time (e.g., less than 5% of the total), the data should be decrypted at the beginning of the computational workflow, and then any unencrypted data should be deleted as a clean-up task as soon as the associated computational workflow completes (noting that a computational workflow may consist of a series of compute jobs).
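The overhead comparison above is simple arithmetic, sketched here for concreteness. The function names are hypothetical, the 5% default mirrors the example figure in the text rather than a fixed rule, and the sketch assumes that the total compute time includes the time spent decrypting.

```python
def decryption_overhead(decrypt_seconds, total_seconds):
    """Return decryption time as a fraction of total compute time.

    Assumes `total_seconds` already includes the decryption step.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return decrypt_seconds / total_seconds


def should_decrypt_up_front(decrypt_seconds, total_seconds, threshold=0.05):
    """True if the overhead is below `threshold` (5% is only the example value)."""
    return decryption_overhead(decrypt_seconds, total_seconds) < threshold


if __name__ == "__main__":
    # Example: 10 minutes of decryption in a 10-hour workflow.
    fraction = decryption_overhead(600, 36000)
    print(f"overhead: {fraction:.1%}")
    print("decrypt up front:", should_decrypt_up_front(600, 36000))
```

In this example the overhead is 600 / 36000, or about 1.7% of the total, which falls under the example 5% figure, so the data would be decrypted once at the start of the workflow and deleted when it completes.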
The Visualization node, which is used to host interactive RealVNC sessions for any user needing more than a command line interface, has not been assessed for compliance and is not permitted for P2/P3 data.
The JupyterHub node, which is used to host web-based IPython notebook sessions for any user, has also not been assessed for MSSEI P2/P3 compliance and is currently not permitted for researchers working with restricted data.