Due to the interdisciplinary nature of PISCO research, there is no "one-size-fits-all" technology to accommodate all data management needs. PISCO researchers work with large oceanographic datasets, small but complex biological datasets, genetics data, physiology data, Geographic Information Systems (GIS) layers, and more. Given the challenges presented by these different types of data, PISCO adopted a disciplinary team approach in which data managers work with researchers to produce well-documented datasets and tailor storage to their needs. The details of each dataset (i.e., its metadata) are structured in a way that allows a user to search, parse, and manipulate the data contents, so that the PISCO databases remain independent of the structure of any individual dataset. This gives our researchers flexibility in defining what is collected, without requiring us to rebuild our central databases for each new data type.
PISCO uses two main metadata-driven technologies to serve and manage our complex data sources: the software tools from the Knowledge Network for Biocomplexity (KNB) project, and the tools used by the Integrated Ocean Observing System (IOOS) that are based on the Open-source Project for a Network Data Access Protocol (OPeNDAP). Both systems rely on detailed descriptions of the data structures and formats to allow researchers to download, subset, or transform data. All data are archived as ASCII text files to promote the longevity and interoperability of the research datasets. Common binary file formats are also used for ease of use and performance on a day-to-day basis.
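As a rough illustration of how a client can subset data through an OPeNDAP server, the sketch below builds a DAP2 request URL with an array-index constraint expression. The server address, file name, and variable name are hypothetical placeholders, and `build_subset_url` is an illustrative helper, not part of any PISCO or OPeNDAP library.

```python
from urllib.parse import quote


def build_subset_url(dataset_url, variable, index_ranges, response="ascii"):
    """Build a DAP2 request URL that subsets one array variable by index.

    index_ranges holds one tuple per dimension, either (start, stop) or
    (start, stride, stop), using inclusive zero-based indices.
    """
    hyperslab = ""
    for r in index_ranges:
        if len(r) not in (2, 3):
            raise ValueError("each range must be (start, stop) or (start, stride, stop)")
        hyperslab += "[" + ":".join(str(int(v)) for v in r) + "]"
    constraint = variable + hyperslab
    # Keep the bracket and colon syntax of the constraint expression readable.
    return f"{dataset_url}.{response}?{quote(constraint, safe='[]:,.')}"


# Hypothetical server, file, and variable names, for illustration only.
url = build_subset_url(
    "http://example.org/opendap/moorings/deployment_001.nc",
    "temperature",
    [(0, 99), (0, 2, 10)],
)
print(url)
```

The `.ascii` suffix asks the server for a plain-text response; the same constraint can be appended to other response suffixes such as `.dds` or `.dods`.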
For day-to-day operations, the PISCO servers run Red Hat Enterprise Linux (RHEL 5), the Community Enterprise Operating System (CentOS 5), or Mac OS X Leopard (Mac OS X 10.5). These systems accommodate a mixed environment of Macs and PCs, and were chosen for their reliability and ease of administration. PISCO data managers run standard Windows file sharing services using Samba and rely on the Apache Software Foundation’s Apache HTTP Server and Tomcat software for web services. The open source directory server OpenLDAP provides an industry-standard personnel and authentication database. Each server is backed up on a consistent nightly, weekly, and monthly schedule so that any archived data or analysis work performed by PISCO researchers can be easily restored in the event of file deletion or hardware failure.
For web-based application programming, data managers use open source software tools and have written the most recent client and server code in Java and PHP. For instance, the Data Catalog Access Portal was written using the Google Web Toolkit: database specialists write the application logic in Java using object-oriented techniques, and then ‘compile’ the code into an interactive HTML and JavaScript application that is both fast and cross-browser compatible. This allows staff to work easily with the Java-based server technologies that are deployed, specifically the Metacat data and metadata storage database developed at NCEAS.
To fully document our datasets, PISCO uses the Ecological Metadata Language (EML), a flexible XML-based encoding syntax that provides containers for describing both geospatial and non-geospatial datasets. It addresses some of the drawbacks of other metadata standards, such as the FGDC Content Standard for Digital Geospatial Metadata, but can easily be transformed into this and other encodings to satisfy federal compliance standards without sacrificing the features provided by EML.
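To make the idea of metadata-driven parsing concrete, the sketch below reads column names and storage types from a simplified, EML-like fragment and uses them to parse an ASCII data file. The fragment, the sample data, and the helper names are hypothetical; the XML borrows a few EML element names but is not a complete or schema-valid EML record.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, EML-like description of one data table (hypothetical example).
METADATA = """
<dataTable>
  <entityName>transect_counts.csv</entityName>
  <attributeList>
    <attribute><attributeName>site</attributeName><storageType>string</storageType></attribute>
    <attribute><attributeName>depth_m</attributeName><storageType>float</storageType></attribute>
    <attribute><attributeName>count</attributeName><storageType>integer</storageType></attribute>
  </attributeList>
</dataTable>
"""

# Hypothetical ASCII data file contents matching the description above.
DATA = "SITE_A,5.5,12\nSITE_B,10.0,3\n"

# Map the storage types used in this sketch to Python converters.
CASTS = {"string": str, "float": float, "integer": int}


def read_columns(metadata_xml):
    """Return a list of (column name, converter) pairs from the metadata."""
    root = ET.fromstring(metadata_xml)
    return [
        (attr.findtext("attributeName"), CASTS[attr.findtext("storageType")])
        for attr in root.iter("attribute")
    ]


def parse_rows(data_text, columns):
    """Parse comma-separated text into typed dictionaries, one per row."""
    rows = []
    for record in csv.reader(io.StringIO(data_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(columns, record)})
    return rows


columns = read_columns(METADATA)
rows = parse_rows(DATA, columns)
print(rows[0])
```

Because the parser takes its column layout from the metadata, a dataset with different columns needs only a different description, not different code.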
To create EML files for each of our datasets, we use a combination of automated programs and manual data entry. Since our oceanographic data consist largely of large but uniform data streams coming from in situ sensors, PISCO database specialists developed a toolbox in MATLAB that uses a template system to document each deployment’s data file. Staff use this toolbox to upload new deployment files and their EML descriptions to the PISCO Data Catalog. For the more complex biological survey data, we use a software tool called Morpho, developed at NCEAS, to manually create the EML documentation.
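The template approach can be sketched in a few lines. The example below is written in Python rather than MATLAB and is not the PISCO toolbox itself: the template, field names, and deployment values are hypothetical, and the XML is a simplified EML-like fragment rather than a schema-valid record.

```python
from string import Template
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Simplified, EML-like template with placeholders for per-deployment values.
EML_TEMPLATE = Template("""<dataset>
  <title>$title</title>
  <coverage>
    <temporalCoverage>
      <rangeOfDates>
        <beginDate><calendarDate>$begin</calendarDate></beginDate>
        <endDate><calendarDate>$end</calendarDate></endDate>
      </rangeOfDates>
    </temporalCoverage>
  </coverage>
  <dataTable>
    <entityName>$filename</entityName>
  </dataTable>
</dataset>""")


def render_deployment(deployment):
    """Fill the template for one deployment, escaping XML special characters."""
    safe = {key: escape(str(value)) for key, value in deployment.items()}
    # substitute() raises KeyError if a required field is missing.
    return EML_TEMPLATE.substitute(safe)


# Hypothetical deployment record, for illustration only.
deployment = {
    "title": "Temperature & salinity, mooring deployment 001",
    "begin": "2008-06-01",
    "end": "2008-09-15",
    "filename": "deployment_001.txt",
}

xml_text = render_deployment(deployment)
document = ET.fromstring(xml_text)  # raises ParseError if not well-formed
print(document.findtext("title"))
```

Since every deployment shares the same structure, only the small dictionary of values changes from one data file to the next.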
Once a data file is fully documented and uploaded to the Data Catalog, we rely on Metacat’s built-in replication feature to mirror the data and metadata files to the other PISCO campuses. This feature provides a level of redundancy that safeguards the data archive, and will allow us to take advantage of faster, local network speeds when querying the Data Catalog holdings. We also use this replication feature to mirror the PISCO data collections to other national networks. The files are pushed to the KNB servers at NCEAS, and are replicated to the LTER Network Office in Albuquerque, NM. The metadata are then harvested by the National Biological Information Infrastructure (NBII) clearinghouse, where they are stored as part of the national archive of the National Spatial Data Infrastructure (NSDI).
The hardware systems installed to support PISCO’s research programs consist largely of multi-core Intel-based servers that run either the Linux or Mac OS X operating system. Due to the fairly large storage and backup needs of the consortium, the program focuses on RAID-based storage subsystems that use robust but commodity-level hard disk drives (SATA II) and connect to the servers using SCSI- or SAS-based controllers. As PISCO’s networked systems have grown, a Storage Area Network (SAN) solution was recently installed at UC Santa Barbara to allow multiple server systems to simultaneously access the back-end storage disk enclosures. This reduces the complexity of the server and network systems, and also allows researchers to access any of their files from any of the servers at UCSB. For those interested in PISCO’s computing hardware details, the following table highlights the current server and storage systems at each university.
| Description | Server System | Server Memory | Storage System |
|---|---|---|---|
| OSU File server | HP ProLiant DL380 G3, 2x1.8 GHz Quad Core Xeon | 16 GB | 1.8 TB SATAII, U320 SCSI RAID |
| OSU Database server | HP ProLiant DL360 G4, 2x3.4 GHz Xeon | 8 GB | 1.6 TB SATAII, U320 SCSI RAID |
| OSU Application server | HP ProLiant DL380 G5 dual Quad Core Xeon | 16 GB | 1.6 TB SATAII, U320 SCSI RAID |
| UCSC File server | IBM E326m, 2x2.4 GHz Dual Core Opteron | 7 GB | 5.6 TB SATAII, U320 SCSI RAID |
| UCSC Database server | HP ProLiant DL380, 2x3.0 GHz Quad Core Xeon | 12 GB | 0.58 TB SAS + 12 TB SATAII, SAS RAID |
| UCSB File server | IBM x336 2x3.6 GHz Xeon | 7 GB | 2.4 TB SATAI, U320 SCSI RAID + SAN |
| UCSB Database server | Dell 2950 2x3.0 GHz Dual Core Xeon | 16 GB | 0.43 TB 15K SAS RAID |
| UCSB Application server | Penguin Relion 1650 2x3.0 GHz Quad Core Xeon | 16 GB | 2.0 TB SATAII, U320 SCSI RAID + SAN |
| UCSB Web server | Penguin Relion 1650 2x3.0 GHz Quad Core Xeon | 16 GB | 1.2 TB 15K SAS RAID + SAN |
| UCSB Storage server | Apple Xserve 1x2.8 GHz Quad Core Xeon | 16 GB | 3.0 TB SATAII, U320 SCSI RAID + SAN |
| UCSB Storage server | Apple Xserve 2x2.8 GHz Quad Core Xeon | 16 GB | 3.0 TB SATAII, U320 SCSI RAID + SAN |
| UCSB Storage Area Network | QLogic SANbox 5602 Fibre Channel switch | - | - |
| UCSB Storage Area Network | Dell PowerConnect 5450 Gigabit Ethernet switch | - | - |
| UCSB Storage Area Network | Promise VTrak 16x750GB Fibre Channel RAID | - | 12 TB SATAII, 4 Gbps Fibre Channel |
| UCSB Storage Area Network | Promise VTrak 8x750GB Fibre Channel RAID | - | 6.0 TB SATAII, 4 Gbps Fibre Channel |
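
The SAN capacities in the table follow from drive count times drive size. The short sketch below reproduces that arithmetic in decimal terabytes. The RAID 5 figure is an assumption added only for comparison, since the RAID level of these enclosures is not stated here, and both function names are illustrative.

```python
def raw_capacity_tb(drive_count, drive_size_gb):
    """Raw capacity in decimal terabytes (1 TB = 1000 GB)."""
    return drive_count * drive_size_gb / 1000


def raid5_usable_tb(drive_count, drive_size_gb):
    """Usable capacity if the drives formed one RAID 5 set (assumption).

    RAID 5 spends the equivalent of one drive on parity.
    """
    return (drive_count - 1) * drive_size_gb / 1000


# Drive counts and sizes taken from the two Promise VTrak rows in the table.
for drives in (16, 8):
    raw = raw_capacity_tb(drives, 750)
    usable = raid5_usable_tb(drives, 750)
    print(f"{drives} x 750 GB: raw {raw} TB, RAID 5 usable {usable} TB")
```

The raw figures match the 12 TB and 6.0 TB listed for the two enclosures.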