What is Metadata?
Although metadata may seem like a foreign concept, we are surrounded by it every day. Walking through a grocery store, we encounter thousands of pieces of metadata.
Every can, box, bag, and bottle contains a label. The labels are examples of metadata. They describe the contents of the product, its ingredients, nutritional information, calories and fat grams, processing, volume, and manufacturer information.
This is a real-world example of how we use metadata to acquire information about data. The product labels provide information about a package's contents.
Here is an example metadata record from a study conducted at DISL. The title is "Strandings of Marine Mammals in Alabama 1978-2015".
Why is Metadata important?
Metadata is important for helping other scientists, managers, policy makers, and the general public find, understand, and use your data. Well-written metadata allows others to use the data easily and effectively without having to ask the original PIs any questions - which becomes important when the original PIs eventually stop being available.
The traditional way of doing science was to collect data, analyze it, write peer-reviewed papers about it, and gain recognition in the scientific field based on citations of those papers. The data, meanwhile, was relegated to the back of a file cabinet or a personal hard drive, where it was eventually lost.
In 1994 the federal government mandated that "... all new geospatial data collected or produced, either directly or indirectly … [must be accompanied by] standardized documentation electronically accessible to the Clearinghouse network" (Executive Order 12906). In the past decade, the value of the actual data and the importance of open data sharing have become well-established concepts. Many funding agencies and journals now require PIs to share their data.
- One of the main principles of the scientific method is reproducibility - where another scientist working independently should be able to reproduce your study and arrive at the same results. Dataset availability is required for this.
- Another foundation of science is the ability to build upon previous research. This also requires availability of previously collected datasets.
- Datasets collected for one purpose in one study can often be used for other purposes in other studies, sometimes in unforeseeable ways.
- Datasets collected, processed, and analyzed with public funds should be made available to the public.
On a practical level, the process of writing detailed metadata for a dataset often causes the dataset to become well-organized and presentable to people other than the original scientist. A thorough documentation effort shortly after the end of the study will save countless hours years later, for both the original scientist and others trying to use the data.
Well-organized datasets with complete metadata can be submitted to regional and national data repositories such as GoMRI GRIIDC, NOAA NCEI, and NSF's BCO-DMO. Many repositories feature search functions by keyword, location, and originator, to promote data discovery. Once published, datasets can be cited in the same way as peer-reviewed papers, bringing career-building recognition to data providers.
Metadata for Science Datasets
A fully developed metadata record has two major components.
A. Information about the dataset as a whole, including:
- What - an abstract that describes what kind of data the dataset includes
- Why - a purpose that explains what the research project was about and why it was important
- Who - contact information for the PIs, students, or staff who collected, processed, or distribute the dataset, or are otherwise responsible for it; funding agencies; citation information and use constraints
- How - methods used to collect, process, and analyze the data, plus data quality information (data gaps, outliers, etc.)
- When and Where - the time period and location of data collection. Ideally this includes timezone information, latitude and longitude, and water column depth when appropriate.
B. Detailed information about each piece of data in the dataset, such as:
- Explanations for all labels, column headers (spreadsheets, tables, etc.), or variables (Matlab files, etc.)
- Full spelling for abbreviations and other shorthands, including units
- Units for every number. Ideally, all numerical values of a given variable or table column will have the same units. If this is not the case (for example in citizen science projects), it should be noted in the metadata.
- Definitions for every categorical item. For example, if "season" includes spring, summer, fall, and winter, please specify which months constitute each of these seasons.
- Scientific names of all biota
- Full citations for any published protocols and methodologies, reports and peer-reviewed publications relevant to the dataset, etc.
- Definitions or citations for codes and published code lists
In the ISO 19139 metadata standard that DISL presently uses, the first component goes in a "main" document (ISO 19115-2) and the second component is a separate document called a Feature Catalogue (ISO 19110).
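As a rough illustration of the second component, the sketch below checks a small data dictionary for the gaps listed above: missing definitions, numeric variables without units, and categorical variables without category definitions. The class, the function, and the variable names (`length_cm`, `depth`, `season`) are hypothetical and are not part of any ISO standard or DISL tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VariableDescription:
    """One entry of a hypothetical data dictionary (not an ISO 19110 structure)."""
    name: str
    definition: str
    kind: str  # "numeric", "categorical", or "text"
    units: str = ""
    categories: Dict[str, str] = field(default_factory=dict)


def find_gaps(variables: List[VariableDescription]) -> List[str]:
    """Return a readable list of documentation gaps, in input order."""
    gaps = []
    for variable in variables:
        if not variable.definition.strip():
            gaps.append(f"{variable.name}: missing definition")
        if variable.kind == "numeric" and not variable.units.strip():
            gaps.append(f"{variable.name}: missing units")
        if variable.kind == "categorical" and not variable.categories:
            gaps.append(f"{variable.name}: missing category definitions")
    return gaps


# Illustrative entries only; the column names are made up for this sketch.
variables = [
    VariableDescription("length_cm", "Total body length", "numeric", units="centimeters"),
    VariableDescription("depth", "Water column depth at the site", "numeric"),
    VariableDescription("season", "Season of observation", "categorical"),
]
gaps = find_gaps(variables)
```

Here `gaps` flags `depth` for missing units and `season` for missing category definitions, which are the kinds of omissions that would otherwise force a later user to contact the original PI.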
Metadata at DISL
In 2007, DISL Data Management began establishing standard data archiving and documentation practices for all geospatial datasets generated at DISL. Between 2007 and 2015, data documentation at DISL used the FGDC CSDGM metadata standard and the MERMAid metadata writing tool developed by NOAA's National Coastal Data Development Center (NCDDC, now NCEI). MERMAid is being phased out of service in September 2016.
In January 2015, to follow changing practices at all federal agencies including NOAA, DISL began using the ISO 19139 metadata standard for data documentation. Although there is no direct equivalent to the MERMAid tool for writing ISO metadata, several options are available:
- GeoNetwork - a dataset catalog application that includes an excellent metadata writing tool, supporting both ISO and FGDC standards. Although it is intended to be server software, it can be downloaded and installed on an individual computer solely for metadata editing.
- NOAA NCEI ATRAC System - a dataset submission system for NCEI that includes an ISO metadata editor. Registration is free, and you can register just to use the editor. It does not include an option to create Feature Catalogues.
- GoMRI GRIIDC - an online metadata editor hosted by the Gulf of Mexico Research Initiative. It is good for quick, basic metadata. It does not include fully developed Feature Catalogue or Data Quality sections; instead, it provides a single "Supplemental Information" text box for each.
- DISL ISO Metadata Generator - a basic metadata editor hosted by DISL Data Management and customized for the types of datasets generated at DISL. It can generate completed metadata for simple datasets, or a starting template for more complex datasets. Reading the finished metadata requires some basic knowledge of XML. The generator can be used in conjunction with a basic text editor, XML Notepad, an expanded color-coded tutorial template also customized for DISL, and NOAA's ISO workbooks or Wiki. Mimi Tzeng, the metadata specialist who made this editor, can provide as much help as you need via email, phone (x2129), or office visit (MSH 216).
Guidelines for Creating Metadata Records at DISL
ISO 19139 metadata records generated at DISL for DISL-associated projects should include the following features:
1. File Identifier
The file identifier should be in the following format:
DISL - PI - Student or Project - ### - year
PI is the last name of the faculty member in charge of the lab that produced the dataset.
Student or Project: Student is the name of the student, postdoc, or research staff who collected the dataset or oversaw its collection. Project is the overall research program that includes the dataset.
### is the metadata record number for the particular student or project. If Maury Estes has two datasets documented, the metadata records would be 001 and 002.
Year is the year that the metadata record was published.
The file identifier will also be the eventual filename of the combined metadata record as it appears in the DISL Metadata Archive.
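A minimal sketch of assembling and checking a file identifier in this format follows. It assumes the parts are joined by plain hyphens with no spaces and that the year has four digits; the guideline does not spell out either detail. The function names and the PI name "Smith" are hypothetical.

```python
import re

# Assumed layout: DISL-<PI>-<StudentOrProject>-<3-digit number>-<4-digit year>.
# The bare-hyphen separator is an assumption made for this sketch.
FILE_ID_PATTERN = re.compile(r"^DISL-[A-Za-z]+-[A-Za-z0-9]+-\d{3}-\d{4}$")


def build_file_identifier(pi: str, student_or_project: str,
                          record_number: int, year: int) -> str:
    """Join the parts, zero-padding the record number to three digits."""
    if not 1 <= record_number <= 999:
        raise ValueError("record_number must be between 1 and 999")
    return f"DISL-{pi}-{student_or_project}-{record_number:03d}-{year}"


def is_valid_file_identifier(identifier: str) -> bool:
    """Check an identifier against the assumed pattern."""
    return FILE_ID_PATTERN.match(identifier) is not None


# "Smith" is a made-up PI name used only for illustration.
second_record = build_file_identifier("Smith", "Estes", 2, 2016)
```

Under these assumptions, the second record for a student would carry the number 002, matching the numbering example above.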
2. Abstract vs Purpose
Your abstract should be a description of the dataset itself: what kind of data is included, where it was collected, over what timeframe, etc.
Your purpose is where you describe the study for which you collected the dataset. You do not need to justify the existence of the dataset in your purpose. It already exists, and you are now documenting it.
3. Date, Time, and Location
Date, time, and location information in the dataset may be in any format you like. Please specify the format in the Feature Catalogue. Metadata about time should include the timezone and whether Daylight Saving Time was accounted for.
Date, time, and location information in an ISO 19115-2 metadata record has specific formatting requirements. All dates and times in the metadata must be YYYY-MM-DD HH:MM:SS. Latitude and longitude must be decimal degrees.
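The two formatting requirements can be sketched with the Python standard library as follows. The function names are hypothetical, the sample values are illustrative rather than taken from a real record, and the sketch does not handle time zones.

```python
from datetime import datetime


def format_metadata_timestamp(moment: datetime) -> str:
    """Render a datetime as YYYY-MM-DD HH:MM:SS."""
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def dms_to_decimal_degrees(degrees: int, minutes: int, seconds: float,
                           hemisphere: str) -> float:
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South and West are returned as negative values.
    """
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    decimal = abs(degrees) + minutes / 60 + seconds / 3600
    if hemisphere in ("S", "W"):
        decimal = -decimal
    return round(decimal, 6)


# Illustrative values only.
timestamp = format_metadata_timestamp(datetime(2015, 1, 15, 9, 5, 0))
latitude = dms_to_decimal_degrees(30, 15, 0, "N")
longitude = dms_to_decimal_degrees(88, 4, 30, "W")
```

Converting once at documentation time, rather than leaving degrees, minutes, and seconds in the metadata, keeps the record consistent with the decimal-degree requirement regardless of how the dataset itself stores locations.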
4. Place Keywords
Please include "DISL" and "Dauphin Island Sea Lab" in the place keywords. This will allow DISL Data Management to more easily locate metadata records that originated at DISL after the metadata records have been submitted to other repositories.
5. Citations
When citing other publications, please include the full citation. "Carmichael et al (2016)" alone is insufficient for locating the publication being cited.
6. Dataset Items
In scientific papers, every method must have a result, and every result must have a method. For metadata records, any data you mention in the Abstract, Purpose, Data Quality/Methodology, etc. should be an available item in the Feature Catalogue.
In ISO 19139, the Feature Catalogue is a separate document (ISO 19110) that is cited in the main metadata record. Long-established labs with many datasets organized in the same way regardless of project may opt to create a single Feature Catalogue that includes everything. The Feature Catalogue in this case may include items that are not part of a specific dataset.
7. Browse Graphics
Browse Graphics are highly recommended when information about your dataset would best be presented visually, e.g. a study area map for field-collected datasets, or a diagram of experimental design for data from complex lab or mesocosm experiments.
8. Use Constraints
This space is intended for copyright licensing, legal restrictions, etc. for how the data may be used by others. It may include a formal data sharing plan. DISL Data Management has a default template that requires other data users to acknowledge the funding agencies.
9. Contact Information
In the completed metadata record, the first Responsible Party at the top is the author or custodian of the metadata record.
Responsible Parties for the dataset are listed below the abstract and purpose. There may be as many different individuals or organizations listed as needed, each with a different role (e.g. principal investigator, originator, author, custodian, point of contact, etc.).
10. Miscellaneous Credits
ISO 19115-2 has a general purpose credit tag that may be used to acknowledge any additional personnel or organizations involved in collecting, processing, analyzing, or funding the data. Multiple credit tags are allowed.
DISL Data Management recommends including the name of the lab where the dataset originated, in the format "DISL: Name of Lab."
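As an illustration, the sketch below builds a credit element with Python's standard `xml.etree.ElementTree`. It assumes the credit tag is encoded as `gmd:credit` wrapping a `gco:CharacterString`, using the usual ISO 19139 `gmd` and `gco` namespace URIs; check this against a record produced by your metadata editor before relying on it. The function name `make_credit` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs commonly used by ISO 19139 records (assumed here).
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def make_credit(text: str) -> ET.Element:
    """Build one credit element holding a single character string."""
    credit = ET.Element(f"{{{GMD}}}credit")
    value = ET.SubElement(credit, f"{{{GCO}}}CharacterString")
    value.text = text
    return credit


# Placeholder lab name in the recommended "DISL: Name of Lab" format.
xml_text = ET.tostring(make_credit("DISL: Name of Lab"), encoding="unicode")
```

Because multiple credit tags are allowed, `make_credit` can be called once per person, lab, or funding organization and each result appended to the identification section of the record.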
Metadata for Science Software
One of the main principles of the scientific method is reproducibility - where another scientist working independently should be able to reproduce your study and arrive at the same results.
Although the concept of open sharing for datasets has become well established in the past decade, datasets alone will not always allow full science reproducibility. It is often necessary to also share the computational methods used to process data and models (computer scientists refer to these processing scripts as "software"), along with the sequence of processing steps (computer scientists call this a "workflow").
Software sharing is similar to dataset sharing. The NSF EarthCube OntoSoft project has developed a metadata standard for software, along with an online editor and metadata repository system, to begin enabling widespread software documentation. In collaboration with OntoSoft, DISL has its own dedicated portal for all of our software:
If you have data processing scripts that you would like to share, such as those written in R, Matlab, or Python, please try the portal out. It is still under development, and the OntoSoft team welcomes all feedback.
Metadata Training Videos