LibGuides: Research Data Management (RDM): Data Protection, Security, Storage & Backup

Data Formats and Files

File Formats
File Naming Conventions

A sustainable digital format is one that is compatible, for the foreseeable future, with software needed to open and read it.

Unfortunately, as software applications change or disappear over time, data file formats can become obsolete. If you are using a proprietary and/or obscure file format, there is a risk that the format will become obsolete, making your data unusable.

If you are working in a proprietary/less-sustainable format, consider converting your data to an open, widely-used format when you preserve and share your data. Many software programs allow for saving/converting datasets into more open formats (e.g. save SPSS dataset as CSV). This will better ensure that your data is usable by others and into the future.

Wherever possible, select data formats that have the following sustainability attributes:

Adheres to publicly documented specifications rather than proprietary ones (for example, the TIFF format for images)
Is in widespread use and readable with available software (for example, HTML for hypertext and CSV for tabular data)
Is self-describing, i.e., contains embedded metadata that help interpret the context and structure of the data file (for example, XML files contain headers and tags describing the file's content)
Contains as much of the original information as possible (for example, Motion JPEG 2000, a "lossless" format for digital video)

If you are uncertain of which file formats to select for long-term preservation of your research data, here are some tips to help you decide:

Select formats that ensure the best chance for long-term access to data
Favor commonly used and non-proprietary formats
Consider longevity, popularity, and potential for migration
Investigate detailed technical information about file formats using the UK National Archives' PRONOM Registry
Consider requirements of selected data repository: If you intend to deposit your data in a data repository, this repository may have guidelines on how data should be structured and what file formats it will accept.
Many institutions also provide file format recommendations and preferences based on content type.

To emphasize: The most appropriate file format should be selected for the long-term preservation of, and continued access to, research data.

The following should be taken into account when selecting an appropriate format:

future accessibility
non-proprietary
open, documented standard
common, used by the research community
standard representation (ASCII, Unicode)
preferably not software specific.

Best file formats include:

unencrypted
uncompressed
PDF, not Word
ASCII, not Excel
MPEG-4, not Quicktime
TIFF or JPEG2000, not GIF or JPG
XML or RDF, not RDBMS
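
The preferences above could be applied to a project folder with a simple lookup. The following is a minimal sketch, not an authoritative tool: the extension-to-format mapping and the function names (suggest_format, review_files) are hypothetical and chosen only for illustration, and a file extension is only a rough indicator of the actual format.

```python
from pathlib import PurePath

# Hypothetical mapping from file extensions to the preferred formats listed
# above. The extensions are assumptions made for this sketch.
PREFERRED_ALTERNATIVE = {
    ".doc": "PDF",
    ".docx": "PDF",
    ".xls": "ASCII text (e.g. CSV)",
    ".xlsx": "ASCII text (e.g. CSV)",
    ".mov": "MPEG-4",
    ".gif": "TIFF or JPEG2000",
    ".jpg": "TIFF or JPEG2000",
    ".jpeg": "TIFF or JPEG2000",
}


def suggest_format(filename):
    """Return a suggested preservation format, or None if no rule matches."""
    suffix = PurePath(filename).suffix.lower()
    return PREFERRED_ALTERNATIVE.get(suffix)


def review_files(filenames):
    """Map each flagged file name to its suggested preservation format."""
    flagged = {}
    for name in filenames:
        suggestion = suggest_format(name)
        if suggestion is not None:
            flagged[name] = suggestion
    return flagged
```

Calling review_files on a list of file names returns only the files that match one of the rules, so files already in a preferred format are left out of the report.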

A comprehensive guideline on various aspects of file formats has been compiled by the Digital Curation Centre (DCC).

A file naming convention is a set of rules you define for naming data files and the folders you keep them in, and for saving multiple versions of files. Using naming/versioning conventions will:

Prevent accidental overwrites or deletion
Make it easier to locate specific data files
Preserve differences in the information content or functionality of different file versions
Prevent confusion if multiple people are working on shared files

Below are some general guidelines for naming files and folders. While it is recommended that these guidelines are followed, it is most important you ensure that:

conventions are defined and documented for your research project,
all members of the research team are aware of these conventions, and
conventions are followed consistently by all team members for the duration of the project.

General naming recommendations

Define a naming convention and use it consistently, especially if multiple people are sharing files
Avoid the characters / \ : * ? " < > [ ] & $ in names. These characters have specific meanings in your computer's operating system and could cause files to be misread or deleted
Use underscores (_) not spaces to separate terms
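
The general recommendations above can be checked automatically. The sketch below is one possible approach, assuming that discouraged characters should simply be dropped and that runs of spaces become a single underscore; the function names find_forbidden and clean_name are hypothetical.

```python
import re

# Characters the guideline above says to avoid in names.
FORBIDDEN = set('/\\:*?"<>[]&$')


def find_forbidden(name):
    """Return the sorted list of discouraged characters found in a name."""
    return sorted(set(name) & FORBIDDEN)


def clean_name(name):
    """Drop discouraged characters and join terms with underscores."""
    kept = "".join(ch for ch in name if ch not in FORBIDDEN)
    # Collapse any run of whitespace into a single underscore.
    return re.sub(r"\s+", "_", kept.strip())
```

Dropping characters is a design choice for this sketch; a project might instead prefer to reject such names and ask the author to rename the file.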

Folder names

Keep names short, 15-20 characters or fewer
Use names that describe the general category of files the folder contains

File names

Keep names short, 25 characters or fewer
Use names that describe the contents of the file
Include a date using the format recommended by ISO 8601: YYYY-MM-DD
Do not include the folder name in the file name unless you are sharing files and there may be confusion about which folder a file belongs in

File versions

Include a version number at the end of the file name such as v01. Change this version number each time the file is saved.
For the final version, substitute the word FINAL for the version number.
Turn on versioning or tracking in collaborative workspaces or storage spaces such as wikis, SharePoint, Google Docs, or MyWebSpace.
Use a version control system such as Apache Subversion or Git to track versions of files, especially computer code.
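
The file name and version recommendations above can be combined in a small helper. This is a sketch under assumptions: the pattern description_YYYY-MM-DD_vNN.ext is one possible layout rather than a required one, and the names build_name and next_version are hypothetical.

```python
import re
from datetime import date

# Matches a trailing version suffix such as _v01 at the end of the file stem.
VERSION_PATTERN = re.compile(r"_v(\d+)$")


def build_name(description, day, version, extension):
    """Compose description_YYYY-MM-DD_vNN.ext from its parts."""
    stem = f"{description}_{day.isoformat()}_v{version:02d}"
    return f"{stem}.{extension}"


def next_version(filename):
    """Return the file name with its trailing version number incremented."""
    stem, dot, extension = filename.rpartition(".")
    match = VERSION_PATTERN.search(stem)
    if not dot or match is None:
        raise ValueError(f"no version suffix found in {filename!r}")
    bumped = int(match.group(1)) + 1
    width = len(match.group(1))  # keep the original zero padding
    new_stem = f"{stem[:match.start()]}_v{bumped:0{width}d}"
    return f"{new_stem}.{extension}"
```

For example, build_name("survey", date(2024, 3, 5), 1, "csv") gives survey_2024-03-05_v01.csv, which stays within the suggested 25 characters. For computer code, a version control system remains the better tool than version suffixes.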

Security, Backup, Storage, Tools

Security refers to three main areas:

Data integrity ensures data has not been manipulated in an unauthorized way. Data integrity techniques include:
- Encryption: Encoding information so that it cannot be read or deciphered without the decoding key
- Electronic signature: A coded message that is unique to both the document and the signer
- Watermarking: Embedding a digital marker that verifies authorship and can alert someone to alterations
Access security involves restricting access to data. Access to sensitive data (e.g. geographic, administrative, or medical information about individuals) is often regulated by applicable laws.
Systems security is the protection of hardware and software related or connected to research data. This includes updating anti-virus software, applications, and operating systems; utilizing a firewall and intrusion detection software; and controlling access to hardware (e.g. keeping doors to office/server room locked).
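
To illustrate the data integrity idea above, the sketch below computes a code that depends on both the document and a secret key held by the signer. It uses an HMAC (a keyed hash) from the Python standard library as a simplified stand-in; this is an assumption made for illustration and is not a true public-key electronic signature. The function names sign_document and verify_document are hypothetical.

```python
import hashlib
import hmac


def sign_document(content, signer_key):
    """Return a hex code that depends on both the content and the key."""
    return hmac.new(signer_key, content, hashlib.sha256).hexdigest()


def verify_document(content, signer_key, signature):
    """Return True only if the content and key reproduce the signature."""
    expected = sign_document(content, signer_key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)
```

Any change to the content or the key produces a different code, so verify_document returns False for an altered file. Unlike a real electronic signature, anyone who can verify an HMAC also holds the key needed to create one.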

Everyone should keep their digital data safe from attack by computer malware.

If you work with sensitive data, you may need to comply with requirements and policies of the university, government, funding agencies and the like.

Backup refers to preserving additional copies of your data in a separate physical location from data files in storage. Backup preserves older copies so you can restore your data if accidental deletion/alteration or a disaster such as fire, flood, or hardware malfunction damages your data in storage.

To safeguard your important data assets, remember both storage and backup are essential.

A granting agency may require you to retain data for a given period and may ask you to explain in a data plan how you will store and back it up.
Storing and backing up your data ensures that it will be there when you need to use it for publications, theses, or grant proposals.
Good preservation practices help make your data available to researchers in your lab/research group, department, or discipline in the future.

What should you back up?
Everything that would be required to restore data in event of loss (data/software/scripts/documentation)

How many copies?
Follow the Rule of 3:

Original copy
Second local copy
Remote copy

How often?

Backup frequency is dependent on the project and the data. Consider how much data you would be willing to lose.

What type?
Full: Back up all files
Incremental: Back up only files that have changed since the last backup (either full or incremental)
Differential: Back up only files that have changed since the last full backup
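
The difference between the three backup types above comes down to which cutoff time is used when selecting files. The sketch below is illustrative only: the function name select_for_backup and its parameters are hypothetical, and timestamps are plain numbers rather than real file system times.

```python
def select_for_backup(modified_times, kind, last_full, last_backup):
    """Return the sorted file names a backup of the given kind would copy.

    modified_times maps file name -> last-modified timestamp.
    last_full is the time of the last full backup.
    last_backup is the time of the most recent backup of any kind.
    """
    if kind == "full":
        return sorted(modified_times)
    if kind == "incremental":
        cutoff = last_backup  # changes since the last backup of any kind
    elif kind == "differential":
        cutoff = last_full  # changes since the last full backup
    else:
        raise ValueError(f"unknown backup kind: {kind!r}")
    return sorted(
        name for name, changed in modified_times.items() if changed > cutoff
    )
```

With a full backup at time 10 and an incremental backup at time 20, a file changed at time 15 is skipped by the next incremental backup but still included in a differential one.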

For more details go to: Backup your files - Windows 7

Test your system: Practice accessing your backup to confirm that the procedure works and that you can fully restore your data.
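
One way to test a restore is to compare checksums of the original files against the restored copies. The sketch below assumes both sets of files are available as folders on disk; the function names checksum and compare_restore are hypothetical, and SHA-256 is used here as one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def checksum(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(original_dir, restored_dir):
    """Return relative paths that are missing or differ after a restore."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or checksum(source) != checksum(restored):
            problems.append(relative.as_posix())
    return problems
```

An empty list means every original file was found in the restored folder with identical content. The check does not report extra files that exist only in the restored folder.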

Data must be archived in a controlled, secure environment in a way that safeguards the primary data, observations, or recordings. The archive must be accessible by scholars analyzing the data, and available to collaborators or others who have rights of access.

Storage refers to preserving your data files in a secure location you can access readily. Storage systems often provide mirroring, in which data is written simultaneously to two drives. This is not the same thing as backup, since alterations in the primary files will be mirrored in the second copy. There are several options for data storage, each with its own pros and cons:

PC/Laptop
- Pros: Convenient for active data
- Cons: Easily lost or stolen; can fail; requires manual backup
Network
- Pros: Automatic backup and security
- Cons: Access/capacity limitations
External devices
- Pros: Low cost; portable; easy use
- Cons: Easily lost or stolen; can fail
Remote/Cloud
- Pros: Global access; collaboration
- Cons: Security/privacy limitations
Physical storage (e.g. notebooks)
- Pros: Convenient; tangible
- Cons: Manual backup

Primary research data should be stored securely for sufficient time following publication, analysis, or termination of the project.

The number of years that data should be retained varies from field to field and may depend on the nature of the data and the research.

Sustainable data management is crucial to the value of research and to ensuring continued scholarship. Typically, in data storage, there is an access copy for use and an archival copy for preservation and backup purposes. The importance of backing up data cannot be overemphasized, because natural disasters and breakdowns in systems and software cannot be predicted.

"Bags" are ideal for digital content normally kept as a collection of files. They are also well-suited to the export, for archival purposes, of content normally kept in database structures that receiving parties are unlikely to support. Relying on cross-platform (Windows and Unix) filesystem naming conventions, a bag's payload may include any number of directories and sub-directories (folders and sub-folders). A bag can specify payload content indirectly via a "fetch.txt" file that lists URLs for content that can be fetched over the network to complete the bag; simple parallelization (e.g. running 10 instances of Wget) can exploit this feature to transfer large bags very quickly.

Take a look at BagIt
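
To make the structure of a bag concrete, the sketch below writes a minimal one: a bagit.txt declaration, a data/ folder holding the payload, and a SHA-256 manifest listing a checksum for each payload file. It is an illustration of the layout only, not a complete implementation of the BagIt specification (for instance, it does not handle fetch.txt). The function names make_bag and validate_bag are hypothetical; for real deposits, established BagIt tooling is the safer choice.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write a minimal bag; payload maps relative paths to bytes content."""
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for relative, content in sorted(payload.items()):
        target = data_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{relative}\n")
    # The bag declaration names the specification version and tag encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    return bag_dir


def validate_bag(bag_dir):
    """Return True if every manifest entry exists and matches its checksum."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, relative = line.split(maxsplit=1)
        target = bag_dir / relative
        if not target.is_file():
            return False
        if hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            return False
    return True
```

Because the manifest travels with the payload, the receiving party can run a check like validate_bag after transfer to confirm that nothing was lost or altered.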
