Data Services for ALPS: Challenges and Opportunities
Victor Zykov and Allison Miller
Thousands of autonomous and Lagrangian platforms and sensors (ALPS) operating throughout most of the global ocean (Roemmich et al., 2009; Rodero Castro and Parashar, 2016) have transformed ocean science in the years following the original ALPS workshop (Rudnick and Perry, 2003). New ALPS technologies have enabled persistent in situ observation of many important ocean properties. More complex data are being captured, of greater variety, and at a faster rate today than ever before. The growing flow of ALPS data offers unprecedented opportunities to advance ocean sciences. It also creates challenges with the storage, transmission, processing, and analysis of the data. Such challenges are not unique to ALPS, as the rise of Big Data (Marr, 2015) has affected many areas of human endeavor. Created to address the problems of Big Data, global networks of interconnected data centers provide critical support infrastructure for the scalable storage, transmission, and analysis of large, dynamic, and distributed data sets (Yang et al., 2017); these services are collectively referred to as cloud computing. To support the development of effective data services for ALPS applications, here we review the challenges of Big Data in ALPS and the new technologies that are becoming available to help address them.
Sources of Big Data in ALPS
Thanks to improved reliability, energy efficiency, and endurance, modern marine robotic platforms are becoming capable of persistent high-resolution ocean observing across a wide range of spatiotemporal scales (Figure 1). ALPS sensors have diversified to enable autonomous in situ measurement of ocean properties that previously required manual characterization. These include, for example, concentrations of dissolved oxygen (Martz et al., 2008), nitrates and pH (Wanninkhof et al., 2016), chlorophyll fluorescence, downward irradiance, and optical backscattering (Claustre et al., 2010), and a proxy for colored dissolved organic matter (Cyr et al., 2017). The rate of oceanographic data collection has been amplified by the increasing spatial, temporal, and spectral resolutions of new sensors, such as synthetic aperture (Hayes and Gough, 2009) and imaging (Langkau et al., 2012) sonars, laser-based three-dimensional mapping systems (Duda et al., 2016), cameras (Roman et al., 2011), imaging spectrometers (Lucieer et al., 2014; Ekehaug et al., 2015), and digital holographic imaging systems (Talapatra et al., 2013). Spatiotemporal analysis of oceanic phenomena via numeric modeling of acquired sensor data produces even more Big Data (Alvarez and Mourre, 2012; Sabo et al., 2014; Chen and Summers, 2016). Because of this accelerating influx, data analysis and interpretation can no longer keep pace with the rate of data accumulation. Fortunately, powerful information technologies have been developed to help bridge this gap, as we discuss in the following sections.
Storing data on local personal computers or hard disk drives is risky and inefficient: disks fail with age, and RAID arrays cannot scale as fast as incoming Big Data. Distributed file systems (DFS; Silberschatz et al., 1998) have been developed for scalable, fault-tolerant storage of large volumes of data spread across many networked servers for speed and redundancy. Some of them are proprietary, such as IBM’s GPFS (Schmuck and Haskin, 2002) or Google’s GFS (Ghemawat et al., 2003), while others are open, such as Hadoop DFS, an open source clone of GFS. Distributed data storage is available as a service from many cloud service providers, such as Amazon, Google, and Microsoft, at costs often lower than those of on-premises hardware, maintenance, and operational staff, yet with far superior reliability. Metadata are essential for cross-domain collaborations that require data integration, such as record linkage, schema mapping, and data fusion (Dong and Divesh, 2015). Metadata help to automatically resolve diverse data sources and facilitate large-scale interoperability and analysis across data sets (Agrawal et al., 2011). Automation of metadata creation and stewardship is an open issue that requires focused coordination among ALPS and data services developers and operators.
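To make the idea of automated metadata creation concrete, the sketch below generates a minimal metadata record at ingest time and serializes it as JSON for cross-data-set search. The field names and schema version are invented for illustration and do not represent any community standard.

```python
# Illustrative sketch: a minimal, hypothetical metadata record created
# automatically when a new ALPS data set is ingested. All field names
# here are invented, not drawn from an existing metadata convention.
import json
from datetime import datetime, timezone

def make_metadata(platform_id, variables, lat, lon):
    """Build a small machine-readable metadata record for one data set."""
    return {
        "platform_id": platform_id,
        "variables": sorted(variables),          # normalized for search
        "location": {"lat": lat, "lon": lon},    # deployment position
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "schema_version": "0.1",                 # hypothetical schema tag
    }

record = make_metadata("float-042", ["temp_c", "salinity_psu"], 35.2, -140.1)
print(json.dumps(record, indent=2))
```

Because the record is plain JSON, it can be indexed and queried across data sets without opening the underlying files, which is the interoperability benefit the text describes.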
New ALPS data are often transmitted to storage via satellite communications (Bishop and Wood, 2009; http://www.argo.ucsd.edu/How_Argo_floats.html). While the associated costs can be a concern, they have been declining for decades due to the expanding bandwidth capacity of new satellites (Williams, 2017). Some ALPS gather volumes of data that are too large for satellite or acoustic transmission, or may even create operational bottlenecks with offline upload (Holland et al., 2016). In these cases, in situ data processing and/or reduction can be advisable, such as pre-classification of observations by Ocean Carbon Explorers (Bishop, 2009) or sonar data processing on mapping autonomous underwater vehicles (Roman et al., 2011). By reducing data on board (with remote monitoring, where possible), transmission delays can be mitigated and inferences can be made available for automatic (or interactive) decision support in near-real time. Data may need to be reassembled from several storage locations for processing or analysis. For best efficiency, manual data transfers (such as file download or upload) should be minimized or eliminated to avoid bottlenecks in scaling up the performance of data services in step with the growth of Big Data. All major commercial clouds already come with a high degree of automation for in-cloud data transfers, for example, from low- to high-availability storage, between storage and compute engines, automatic rebalancing within and across geographic regions, support for cross-cloud data transfer, and APIs for further workflow automation. If the volume of ALPS data is prohibitively large for ingress over the network, upload from physical media is also supported by the major cloud service providers.
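A minimal sketch of onboard data reduction, assuming a hypothetical dense temperature profile that is averaged into fixed-width depth bins before satellite transmission; the bin width, sample counts, and synthetic thermocline are illustrative, not taken from any specific ALPS platform:

```python
# Sketch of onboard data reduction: average a dense, synthetic CTD
# temperature profile into fixed-width depth bins, shrinking the
# payload before satellite transmission. All values are illustrative.
import math
from collections import defaultdict

def bin_average(samples, bin_m=10.0):
    """samples: list of (depth_m, temp_c) pairs.
    Returns (bin-center depth, mean temperature) per occupied bin."""
    bins = defaultdict(list)
    for depth, temp in samples:
        bins[int(depth // bin_m)].append(temp)
    return [((k + 0.5) * bin_m, sum(v) / len(v))
            for k, v in sorted(bins.items())]

# 4,000 raw samples over 0-1,000 m with a synthetic thermocline
raw = [(d, 20 * math.exp(-d / 300) + 2)
       for d in (i * 0.25 for i in range(4000))]
reduced = bin_average(raw)
print(f"{len(raw)} samples -> {len(reduced)} bins")
```

Here a 40x reduction preserves the profile's vertical structure; in practice, the reduction method (averaging, classification, feature extraction) would be chosen per sensor and science goal.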
For data to be useful, they need to be easy to search, subset, query, annotate, clean, and append, and they should accept these and other transactions on arbitrary numbers of their elements, rows, or tables. It can be challenging, however, to guarantee accurate execution of these tasks, particularly if the data are voluminous, dynamic, and/or distributed across multiple servers, and relations among their components need to be preserved. These challenges are typically addressed using relational database management systems (RDBMS) controlled with structured query languages (SQL). Traditional SQL RDBMS solutions, such as MySQL (http://www.mysql.com) or IBM DB2 (http://www.ibm.com/analytics/us/en/technology/db2), use centralized software architectures, making them incompatible with the distributed storage and processing needs of Big Data. The alternative NoSQL architecture (Pokorny, 2013) was developed to scale with the needs of Big Data, but it offers no transactional consistency guarantees.
The latest NewSQL tools combine the benefits of Big Data scalability and SQL transactional consistency. Examples include MemSQL (http://www.memsql.com), VoltDB (http://www.voltdb.com), Google Spanner (Corbett et al., 2013), SAP HANA (https://en.wikipedia.org/wiki/SAP_HANA), and the open source Apache Trafodion (http://trafodion.incubator.apache.org). NewSQL RDBMS can be complicated to run on premises; however, they are available cost-efficiently as a service from several cloud providers. To optimize geospatial data analyses, some RDBMS have introduced spatial data indexing, for example, SQL Server (https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview) and H2GIS (http://www.h2gis.org). However, geographic information system software interfacing with RDBMS can define and maintain its own spatial indices, as is the case with ArcGIS (http://desktop.arcgis.com/en/arcmap/10.3/manage-data/geodatabases/an-overview-of-spatial-indexes-in-the-geodatabase.htm). Application of these and other advanced information technologies is the focus of EarthCube (Peckham et al., 2014), a US National Science Foundation-funded program to transform geoscience research (including ocean sciences) by developing cyberinfrastructure to improve access, sharing, visualization, and analysis of all geosciences data and related resources (https://www.earthcube.org).
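To make the discussion of relational queries concrete, the sketch below uses Python's built-in SQLite (a centralized RDBMS, standing in here only to illustrate the kind of SQL subset query that scalable NewSQL systems also support). The table schema, float identifiers, and measurements are invented for illustration.

```python
# Minimal SQL illustration with Python's bundled SQLite: subset a
# hypothetical table of ALPS profile measurements by depth and by a
# geographic bounding box. Schema and values are invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE profiles (
    float_id TEXT, lat REAL, lon REAL, depth_m REAL, temp_c REAL)""")
rows = [("F001", 35.2, -140.1, 10.0, 18.4),
        ("F001", 35.2, -140.1, 500.0, 5.1),
        ("F002", -60.7, 20.3, 10.0, 2.9)]
db.executemany("INSERT INTO profiles VALUES (?, ?, ?, ?, ?)", rows)

# Subset: near-surface temperatures inside a North Pacific bounding box
result = db.execute("""SELECT float_id, temp_c FROM profiles
    WHERE depth_m < 100
      AND lat BETWEEN 20 AND 50
      AND lon BETWEEN -180 AND -120""").fetchall()
print(result)
```

A spatially indexed RDBMS would answer the same bounding-box query without scanning every row, which is the optimization the spatial indexing discussion above refers to.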
Most ocean scientists still analyze data by running custom scripts (often in MATLAB) on data sets stored on their local computers (Thomson and Emery, 2014). The growing volume, velocity, and variety of Big Data are making such approaches inadequate. Greater scalability can be achieved by analyzing large data sets with high-performance parallel cloud computing as a service. This approach offers many benefits, including the following:
• Analytical scripts and methods can be openly shared in the cloud and collaboratively developed as open source software. Persistent improvement and open availability of the analytical methods will stimulate their broader use, reduce barriers to entry into marine data analysis, and minimize the duplication of software development efforts.
• Hosting oceanographic data in the cloud ensures their safety and security and is an effective approach to maximizing their value for the scientific community. Sharing a data set with other cloud users makes it discoverable, searchable, and available for analysis, for example, with open source tools co-developed within the user/developer community.
• Cloud data services and access to cloud-hosted data can be automated with APIs.
• By running analytical software in the cloud, all the advantages of on-demand cloud computing can be leveraged. For example, compute resources can be allocated and paid for only when the scripts are running, and the amount of resources can be fine-tuned, often automatically, to match actual needs. Analyses will complete faster thanks to elastic on-demand parallel computing, with no need to buy or manage servers.
• With algorithms and data co-hosted in the cloud, there is no need to download or upload data sets, which eliminates a key logistical bottleneck. Data transfers within the cloud are cheap or free and optimized for performance.
• The market for cloud computing is very competitive, pushing companies to improve the quality and expand the scope of their services while reducing prices. This favorable dynamic is driven by much greater economic incentives than those available for technology development within the oceanographic community, offering scientists a rare opportunity to benefit from well-funded, rapid technical innovation.
• NewSQL RDBMS have been engineered from the ground up to support a high volume of globally distributed data transactions with precision while simultaneously analyzing dynamic data and using inferences to automatically adjust various business processes in real time, for example, web content/traffic control. This infrastructure offers exciting opportunities for further automation (Stammer et al., 2016) of ALPS-based ocean research and data analysis.
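The elastic parallel analysis pattern described in the bullets above can be sketched as a simple parallel map over data shards. Here a local thread pool stands in for on-demand cloud workers, and a per-chunk mean is the illustrative statistic; in a real deployment, each chunk would be a separate object in cloud storage.

```python
# Sketch of "analysis as a parallel map": independent workers each
# summarize one shard of a large data set, and the partial results are
# combined. A local thread pool stands in for elastic cloud workers.
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def chunk_mean(chunk):
    """Per-worker task: summarize one shard of a larger data set."""
    return mean(chunk)

# Synthetic stand-in for eight shards held in cloud storage
chunks = [[float(i + j) for j in range(1000)] for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(chunk_mean, chunks))

overall = mean(partial)  # exact here because all chunks are equal-sized
print(overall)
```

Because each shard is processed independently, the same code scales by adding workers, which is exactly what on-demand cloud compute makes cheap and automatic.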
Established analytical and modeling tools in marine sciences (Glover et al., 2011; Thomson and Emery, 2014) range from methods for initial data QA/QC and statistical error handling to principal component, factor, and frequency domain decompositions, spatiotemporal and dynamic analyses, and many modeling and visualization techniques. In deciding which tools should be implemented as cloud services first, one could consider which implementations of these established tools, or of emerging ones (e.g., deep learning, clustering, semantic analysis, data annotation), already exist, enjoy high demand in the community, and could be moved into the cloud with incremental effort.
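As one example of how an established method could be packaged as a shared service, the sketch below applies principal component analysis to two synthetic, correlated variables (labeled temperature and salinity purely for illustration), using the closed-form eigenvalues of their 2x2 covariance matrix rather than a numerical library:

```python
# Sketch: PCA on two correlated synthetic variables via the closed-form
# eigenvalues of a 2x2 covariance matrix. Data are randomly generated
# stand-ins, not real oceanographic measurements.
import math
import random

random.seed(0)
t = [15 + random.gauss(0, 1) for _ in range(500)]            # "temperature"
s = [35 + 0.3 * (ti - 15) + random.gauss(0, 0.1) for ti in t]  # "salinity"

def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

cxx, cyy, cxy = cov(t, t), cov(s, s), cov(t, s)
# Eigenvalues of the covariance matrix [[cxx, cxy], [cxy, cyy]]
tr, det = cxx + cyy, cxx * cyy - cxy ** 2
lam1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)
lam2 = tr / 2 - math.sqrt(tr ** 2 / 4 - det)
explained = lam1 / (lam1 + lam2)
print(f"leading PC explains {explained:.1%} of variance")
```

Hosted in the cloud next to the data, a routine like this could run in parallel over thousands of profiles without any download step.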
Agrawal, D., S. Das, and A. El Abbadi. 2011. Big data and cloud computing: Current state and future opportunities. Pp. 530–533 in Proceedings of the 14th International Conference on Extending Database Technology, March 21–24, 2011, Uppsala, Sweden, ACM, https://doi.org/10.1145/1951365.1951432.
Alvarez, A., and B. Mourre. 2012. Oceanographic field estimates from remote sensing and glider fleets. Journal of Atmospheric and Oceanic Technology 29(11):1,657–1,662, https://doi.org/10.1175/JTECH-D-12-00015.1.
Bishop, J.K.B. 2009. Autonomous observations of the ocean biological carbon pump. Oceanography 22(1):182–193, https://doi.org/10.5670/oceanog.2009.48.
Bishop, J.K.B., and T.J. Wood. 2009. Year-round observations of carbon biomass and flux variability in the Southern Ocean. Global Biogeochemical Cycles 23, GB2019, https://doi.org/10.1029/2008GB003206.
Chen, J.L., and J.E. Summers. 2016. Deep neural networks for learning classification features and generative models from synthetic aperture sonar big data. Proceedings of Meetings on Acoustics 29:032001, https://doi.org/10.1121/2.0000458.
Claustre, H., J. Bishop, E. Boss, S. Bernard, J.-F. Berthon, C. Coatanoan, K. Johnson, A. Lotiker, O. Ulloa, M.J. Perry, and others. 2010. Bio-optical profiling floats as new observational tools for biogeochemical and ecosystem studies. In Proceedings of the OceanObs’09: Sustained Ocean Observations and Information for Society Conference. J. Hall, D.E. Harrison, and D. Stammer, eds, Venice, Italy, September 21–25, 2009, ESA Publication WPP-306, https://doi.org/10.5270/OceanObs09.cwp.17.
Corbett, J.C., J. Dean, M. Epstein, A. Fikes, C. Frost, J.J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, and others. 2013. Spanner: Google’s globally distributed database. ACM Transactions on Computer Systems (TOCS) 31(3):8.
Cyr, F., M. Tedetti, F. Besson, L. Beguery, A.M. Doglioli, A.A. Petrenko, and M. Goutx. 2017. A new glider-compatible optical sensor for dissolved organic matter measurements: Test case from the NW Mediterranean Sea. Frontiers in Marine Science 4:89, https://doi.org/10.3389/fmars.2017.00089.
Dong, X.H., and S. Divesh. 2015. Big data integration. Synthesis Lectures on Data Management 7(1):1–198, https://doi.org/10.2200/S00578ED1V01Y201404DTM040.
Duda, A., T. Kwasnitschka, J. Albiez, and F. Kirchner. 2016. Self-referenced laser system for optical 3D seafloor mapping. In OCEANS 2016 MTS/IEEE Monterey, September 19–23, 2016, Monterey, CA, IEEE, https://doi.org/10.1109/OCEANS.2016.7761203.
Ekehaug, S., I.M. Hansen, L.M.S. Aas, K.J. Steen, R. Pettersen, F. Beuchel, and L. Camus. 2015. Underwater hyperspectral imaging for environmental mapping and monitoring of seabed habitats. In OCEANS 2015-Genova, May 18–21, 2015, Genoa, Italy, IEEE, https://doi.org/10.1109/OCEANS-Genova.2015.7271703.
Ghemawat, S., H. Gobioff, and S.T. Leung. 2003. The Google file system. ACM SIGOPS Operating Systems Review 37(5):29–43, https://doi.org/10.1145/945445.945450.
Glover, D.M., W.J. Jenkins, and S.C. Doney. 2011. Modeling Methods for Marine Science. Cambridge University Press, 588 pp.
Hayes, M.P., and P.T. Gough. 2009. Synthetic aperture sonar: A review of current status. IEEE Journal of Oceanic Engineering 34(3):207–224, https://doi.org/10.1109/JOE.2009.2020853.
Holland, M., A. Hoggarth, and J. Nicholson. 2016. Hydrographic processing considerations in the “Big Data” age: An overview of technology trends in ocean and coastal surveys. IOP Conference Series: Earth and Environmental Science 34(1):012016, https://doi.org/10.1088/1755-1315/34/1/012016.
Lampitt, R.S., P. Favali, C.R. Barnes, M.J. Church, M.F. Cronin, K.L. Hill, Y. Kaneda, D.M. Karl, A.H. Knap, M.J. McPhaden, and others. 2010. In situ sustained Eulerian observatories. In Proceedings of OceanObs’09: Sustained Ocean Observations and Information for Society (Vol. 1). September 21–25, 2009, Venice, Italy, J. Hall, D.E. Harrison, and D. Stammer, eds, ESA Publication WPP-306, https://doi.org/10.5270/OceanObs09.pp.27.
Langkau, M.C., H. Balk, M.B. Schmidt, and J. Borcherding. 2012. Can acoustic shadows identify fish species? A novel application of imaging sonar data. Fisheries Management and Ecology 19(4):313–322, https://doi.org/10.1111/j.1365-2400.2011.00843.x.
Lucieer, A., Z. Malenovský, T. Veness, and L. Wallace. 2014. HyperUAS—Imaging spectroscopy from a multirotor unmanned aircraft system. Journal of Field Robotics 31(4):571–590, https://doi.org/10.1002/rob.21508.
Marr, B. 2015. Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance. Wiley, 256 pp.
Martz, T.R., K.S. Johnson and S.C. Riser. 2008. Ocean metabolism observed with oxygen sensors on profiling floats in the South Pacific. Limnology and Oceanography 53, https://doi.org/10.4319/lo.2008.53.5_part_2.2094.
Peckham, S.D., C. DeLuca, D.J. Gochis, J. Arrigo, A. Kelbert, E. Choi, and R. Dunlap. 2014. EarthCube-Earth System Bridge: Spanning scientific communities with interoperable modeling frameworks. Paper presented at the Fall Meeting of the American Geophysical Union, San Francisco, CA, abstract #IN31D-3754.
Pokorny, J. 2013. NoSQL databases: A step to database scalability in web environment. International Journal of Web Information Systems 9(1):69–82, https://doi.org/10.1108/17440081311316398.
Roemmich, D., G.C. Johnson, S.C. Riser, R.E. Davis, J. Gilson, W.B. Owens, S.L. Garzoli, C. Schmid, and M. Ignaszewski. 2009. The Argo Program: Observing the global ocean with profiling floats. Oceanography 22(2):34–43, https://doi.org/10.5670/oceanog.2009.36.
Rodero Castro, I., and M. Parashar. 2016. Architecting the cyberinfrastructure for National Science Foundation Ocean Observatories Initiative (OOI). Instrumentation Viewpoint 19(48): 99–101. Lecture given at the 7th International Workshop on Marine Technology: MARTECH 2016, SARTI.
Roman, C., G. Inglis, and B. McGilvray. 2011. Lagrangian floats as sea floor imaging platforms. Continental Shelf Research 31(15):1,592–1,598, https://doi.org/10.1016/j.csr.2011.06.019.
Rudnick, D.L., and M.J. Perry, eds. 2003. ALPS: Autonomous and Lagrangian Platforms and Sensors, Workshop Report. 64 pp, https://geo-prose.com/pdfs/alps_report.pdf.
Sabo, T.O., R.E. Hansen, and A. Austeng. 2014. Synthetic aperture sonar tomography: A preliminary study. In Proceedings of EUSAR 2014; 10th European Conference on Synthetic Aperture Radar, June 3–5, 2014, Berlin, Germany, VDE.
Schmuck, F.B., and R.L. Haskin. 2002. GPFS: A shared-disk file system for large computing clusters. Pp. 231–244 in FAST ‘02: Proceedings of the 1st USENIX Conference on File and Storage Technologies, January 28–30, 2002, Monterey, CA.
Silberschatz, A., P.B. Galvin, and G. Gagne. 1998. Distributed file systems. Chapter 17 in Operating System Concepts, vol. 4. Addison-Wesley, Reading.
Stammer, D., M. Balmaseda, P. Heimbach, A. Köhl, and A. Weaver. 2016. Ocean data assimilation in support of climate applications: Status and perspectives. Annual Review of Marine Science 8:491–518, https://doi.org/10.1146/annurev-marine-122414-034113.
Talapatra, S., J. Hong, M. McFarland, A.R. Nayak, C. Zhang, J. Katz, J. Sullivan, M. Twardowski, J. Rines, and P. Donaghay. 2013. Characterization of biophysical interactions in the water column using in situ digital holography. Marine Ecology Progress Series 473:29–51, https://doi.org/10.3354/meps10049.
Thomson, R.E., and W.J. Emery. 2014. Data Analysis Methods in Physical Oceanography, 3rd ed. Elsevier Science, 728 pp.
Wanninkhof, R., K. Johnson, N. Williams, J. Sarmiento, S. Riser, E. Briggs, S. Bushinsky, B. Carter, A. Dickson, R. Feely, and others. 2016. An evaluation of pH and NO3 sensor data from SOCCOM floats and their utilization to develop ocean inorganic carbon products: A summary of discussions and recommendations of the Carbon Working Group (CWG) of the Southern Ocean Carbon and Climate Observations and Modeling project (SOCCOM). SOCCOM Carbon System Working Group white paper, 30 pp.
Williams, M. 2017. SpaceX details plans to launch thousands of internet satellites. Phys.org, May, 8, 2017, https://phys.org/news/2017-05-spacex-thousands-internet-satellites.html.
Yang, C., Q. Huang, Z. Li, K. Liu, and F. Hu. 2017. Big Data and cloud computing: Innovation opportunities and challenges. International Journal of Digital Earth 10(1):13–53, https://doi.org/10.1080/17538947.2016.1239771.
Victor Zykov, Schmidt Ocean Institute, Palo Alto, CA, USA, firstname.lastname@example.org
Allison Miller, Schmidt Ocean Institute, Palo Alto, CA, USA, email@example.com