Publishing Big Data

Thursday, April 3rd, 2014

Science lives and dies by data. Today, the field of water resources is in the era of Big Data. Most published research papers contribute a small amount of data in the form of field observations, etc., but many also rely on other large, usually public datasets and data services. What regional watershed study, for example, does not use standard GIS datasets of basin boundaries, elevations, and streams distributed by data services, i.e. Big Data? Ongoing initiatives may produce Big Data “apps” to examine critical issues such as climate change. How good is this Big Data? Should we trust it? Can we make it better? This is where peer-reviewed journals come in.

Critical though Big Data may be, its creators are not well recognized in published journals. Important insights learned in building Big Data often remain in the minds of its creators, and users are left to discover strengths and weaknesses for themselves. With peer-reviewed journal articles, the developers of Big Data have a chance to explain the role of their data services in furthering science. A journal also provides the opportunity for discussions and replies, which contribute to greater understanding of Big Data.

Writing about Big Data

So, how does one write about Big Data? The typical research article format of hypothesis-testing-conclusion is not a good fit for describing data services. Moreover, most Big Data already is documented with user guides and with metadata, such as the Federal Geographic Data Committee’s “Content Standard for Digital Geospatial Metadata (CSDGM).” User guides primarily are “how to” instructions with little nuance. The CSDGM is intended more for description and documentation than for evaluation and discussion. If you need to know the coding for the attributes of a data set, the CSDGM metadata is the place to look. If you need to see how and why these codes were used, it can be less helpful.
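
To illustrate what “the place to look” means in practice, here is a minimal sketch of pulling attribute codes out of a CSDGM metadata record with Python's standard library. The element names (attr, attrlabl, edom, edomv, edomvd) follow the CSDGM short names; the sample record, the FLOWTYPE attribute, and the attribute_codes helper are hypothetical, made up for this example rather than taken from any real dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical CSDGM fragment for illustration only; not from a real dataset.
SAMPLE_CSDGM = """
<metadata>
  <eainfo>
    <detailed>
      <enttyp><enttypl>streams</enttypl></enttyp>
      <attr>
        <attrlabl>FLOWTYPE</attrlabl>
        <attrdef>Flow regime of the stream segment</attrdef>
        <attrdomv>
          <edom><edomv>1</edomv><edomvd>Perennial</edomvd></edom>
        </attrdomv>
        <attrdomv>
          <edom><edomv>2</edomv><edomvd>Intermittent</edomvd></edom>
        </attrdomv>
      </attr>
    </detailed>
  </eainfo>
</metadata>
"""


def attribute_codes(csdgm_xml):
    """Return {attribute label: {coded value: description}} from CSDGM XML text."""
    root = ET.fromstring(csdgm_xml.strip())
    codes = {}
    for attr in root.iter("attr"):
        label = attr.findtext("attrlabl", default="").strip()
        domain = {}
        # Each enumerated domain entry pairs a coded value with its meaning.
        for edom in attr.iter("edom"):
            value = edom.findtext("edomv", default="").strip()
            meaning = edom.findtext("edomvd", default="").strip()
            domain[value] = meaning
        codes[label] = domain
    return codes


codes = attribute_codes(SAMPLE_CSDGM)
print(codes)
```

The lookup tells you what a code means. It does not tell you why the developers drew the line between categories where they did, which is the gap a journal article can fill.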

In thinking about how to write a journal article about your Big Data, I take it for granted you already have user guides and CSDGM metadata. These should be referenced, but need not be repeated except as an introduction. What more do you need to say to help researchers understand your Big Data? I see these four questions as critical:

  1. What was original about constructing your Big Data, and what lessons were learned?
  2. What assumptions did you make, and how might they affect using Big Data?
  3. How did you test Big Data?
  4. What are the known strengths and limitations of Big Data?

Every Big Data effort is original. If it were easy, someone already would have done it! The tipping point for proceeding typically is a new technology (e.g. LiDAR) or a pressing need that finally makes available the necessary resources. The mechanics of construction usually are described elsewhere. But all Big Data efforts involve design choices. Why did you choose the resolution you did? Why was a particular range of attributes chosen? Why does the interface work this way and not that way? The reasoning behind these decisions may help users a little, but it could be invaluable for later developers working on the next generation of Big Data.

I might call assumptions “where the bodies are buried.” All data systems are compromises. Time and resources always force developers to accept some things as “givens.” To use the National Hydrography Dataset (NHD) as an example, one big assumption of the early versions was that streams began with the “blue lines” of USGS maps. This was not a very good assumption, but there was nothing better available. Much of the subsequent criticism and misunderstanding of the NHD might have been softened had this point been made clearly in a journal article.

Most Big Data is tested extensively in the planning and development stages. The journal article should reference pilot studies or anything that gives an insight into how the Big Data can be used in practical situations. Keep in mind, testing information often resides in the “gray literature” of contractor reports; the journal article can be key in finding this information.

Finally, you need to be honest about the strengths and limitations of your Big Data. Nobody knows these better than the developers! What uses did you have in mind for Big Data, and how does it fulfill these hopes? Give potential users a reasonable expectation of how Big Data can help them. Don’t be afraid to advise on what should be made better in future versions of Big Data.

Technology for Technology

The text of a journal article may not be a very good venue for demonstrating a data system. The best user manuals today incorporate video. Somebody sits down in front of a computer and, with the video camera running, goes through an example application. Please note the online version of JAWRA articles can link to such demonstrations.

The Role of Peer Review

The role of the journal article is to describe and critically examine Big Data. Once Big Data reaches the milestone of journal article preparation, the data system pretty much is what it is. Recommending major changes may not be helpful, with the exception being where Big Data simply fails to do what it is claimed to do. The focus of peer review, therefore, should be on how well the article answers the four questions I raised earlier.

Concluding Thoughts

Big Data has many elements that require the explanation and reasoned discussion a peer-reviewed journal provides. The format may differ from that of traditional articles, but it is science nonetheless. JAWRA will welcome articles on Big Data.

Wednesday, January 1st, 2014

AWRA celebrates its golden anniversary in 2014. What was the world like fifty years ago?

The US population was 192 million (vs. 314 million today). Ford introduced the Mustang, which cost about $2,400. There were no mileage stickers, but who cared, when you could fill up that baby for 30 cents per gallon? No wonder Lyndon Johnson swept to reelection over Barry Goldwater in November!

In popular culture, Elizabeth Taylor married Richard Burton, for the first time. The Gilligan’s Island series began its fabled run (“a three hour tour”). The Beatles released “I Want to Hold Your Hand.” Muhammad Ali defeated Sonny Liston to become World Heavyweight Champ. There was no Super Bowl yet, but the Boston Celtics were the NBA champions and the St. Louis Cardinals won the World Series.

The Soviet Union and “Red China” were our mortal enemies, the communist menace. In August, fearful of allies falling like dominoes, Congress passed the Gulf of Tonkin resolution, effectively declaring the Viet Nam War. When in Washington, D.C., please visit the memorial to 58,195 brave Americans who would give their lives.

On a more positive note, the Civil Rights Act became law, and Rev. Martin Luther King, Jr. won the Nobel Peace Prize. Typical of the times, nearly all AWRA charter members were white males. Many engineering schools, even those accepting all races, still did not admit women. When you go to an AWRA conference this year, look around at the wonderful diversity; such a meeting would have been almost impossible in the segregated South of 1964.

IBM introduced the System/360 computer in 1964, the first with a true operating system. You programmed it with 80-character punch cards, carrying your boxes of cards to the “computer room,” where operators accepted your offering and, eventually, gave you back a printout. Engineers and scientists did routine calculations on slide rules; there were no pocket calculators. Most phones had dials, all were tethered to cords, and “Ma Bell” owned every one of them.

Flying in the new Boeing 727, introduced the year before, was fun, an upscale activity. Many travelers wore a jacket and tie onto the airplane. No problem getting through security: there wasn’t any. In flight, a pretty stewardess served you a meal on real china.

The environmental movement was in its infancy: Silent Spring had only been published in 1962. The Wilderness Act was signed in 1964. The Surgeon General declared smoking hazardous to one’s health, but the Marlboro Man remained the image of manly ruggedness, and people still smoked almost everywhere. Yet over the horizon were: the Cuyahoga River fire (1969), NEPA (1969), EPA (1970), and the Clean Water Act (1972).

This was a time for starting journals. Some other journals that started in this period include: ASCE Journal of Sanitary Engineering (1956), Journal of Hydrology (1963), and Water Resources Research (1965).

In preparing a journal paper, you wrote your first draft in cursive on a pad of paper. Then a “secretary” – remember them? – typed it. Minor corrections were made with a gooey substance called “whiteout.” If you were lucky enough to have your paper accepted, the journal sent you a template on which you glued “camera-ready copy.”

Water Resources Bulletin, Volume 1, published in 1965, wasn’t much to look at, mostly a newsletter. (Hydata would later serve this purpose.) The first original technical paper, “Water Dynamics in the Soil-Plant Ecosystem,” by M. B. Russell, was published in 1966 under our first Editor, Randy Boggess. Three more original papers were published that year. Today, I am the eleventh in a distinguished line of Editors, and JAWRA publishes about 115 papers annually.

Government publication websites

Wednesday, October 2nd, 2013

If you need to access any US Government publication outlets today, don’t even bother. They’re shut down. This is causing some serious consternation in the scholarly publishing world. The government-funded open-access publishing model is proving unreliable. And, this is before one even considers the chilling possibility of any government censorship.

The JAWRA website, based upon a subscription model — thank you, members and subscribers! — is up and open for business as usual.

Climate change and NYC watershed

Friday, September 6th, 2013

Early View article: “Streamflow Responses to Climate Change: Analysis of Hydrologic Indicators in a New York City Water Supply Watershed,” by Soni M. Pradhanang, Rajith Mukundan, Elliot M. Schneiderman, Mark S. Zion, Aavudai Anandhi, Donald C. Pierson, Allan Frei, Zachary M. Easton, Daniel Fuka, and Tammo S. Steenhuis.

Established with great foresight over a century ago, the New York City watershed represents not only a critical source of drinking water, but also some of the most pristine aquatic habitat in the eastern U.S. I’ll let a good abstract describe this important study.

(Abstract) Recent works have indicated that climate change in the northeastern United States is already being observed in the form of shorter winters, higher annual average air temperature, and more frequent extreme heat and precipitation events. These changes could have profound effects on aquatic ecosystems, and the implications of such changes are less understood. The objective of this study was to examine how future changes in precipitation and temperature translate into changes in streamflow using a physically based semidistributed model, and subsequently how changes in streamflow could potentially impact stream ecology. Streamflow parameters were examined in a New York City water supply watershed for changes from model-simulated baseline conditions to future climate scenarios (2081-2100) for ecologically relevant factors of streamflow using the Indicators of Hydrologic Alterations tool. Results indicate that earlier snowmelt and reduced snowpack advance the timing and increase the magnitude of discharge in the winter and early spring (November-March) and greatly decrease monthly streamflow later in the spring in April. Both the rise and fall rates of the hydrograph will increase resulting in increased flashiness and flow reversals primarily due to increased pulses during winter seasons. These shifts in timing of peak flows, changes in seasonal flow regimes, and changes in the magnitudes of low flow can all influence aquatic organisms and have the potential to impact stream ecology.

[Please note: I have quoted and paraphrased freely from the article, but the interpretation is my own.]
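
For readers unfamiliar with the rise rates, fall rates, and flow reversals mentioned in the abstract, the following is a simplified sketch of how such indicators can be computed from a daily streamflow series. It is not the algorithm of the Indicators of Hydrologic Alteration tool, whose exact definitions (for example, means versus medians, or the handling of days with no change) may differ. The function name flow_change_indicators and the sample flows are hypothetical.

```python
def flow_change_indicators(daily_flow):
    """Return (mean rise rate, mean fall rate, reversal count) for a daily series.

    Simplifying assumptions: rates are means of day-to-day differences, and
    days with no change are skipped when counting reversals.
    """
    diffs = [b - a for a, b in zip(daily_flow, daily_flow[1:])]
    rises = [d for d in diffs if d > 0]
    falls = [d for d in diffs if d < 0]
    rise_rate = sum(rises) / len(rises) if rises else 0.0
    fall_rate = sum(falls) / len(falls) if falls else 0.0

    reversals = 0
    previous_sign = 0
    for d in diffs:
        sign = (d > 0) - (d < 0)  # +1 rising, -1 falling, 0 unchanged
        if sign == 0:
            continue
        # A reversal is a switch from rising to falling or the other way round.
        if previous_sign != 0 and sign != previous_sign:
            reversals += 1
        previous_sign = sign
    return rise_rate, fall_rate, reversals


# Made-up daily flows for illustration.
flows = [10.0, 12.0, 16.0, 12.0, 10.0, 16.0, 16.0, 13.0]
print(flow_change_indicators(flows))  # → (4.0, -3.0, 3)
```

Under these definitions, a flashier hydrograph shows up as larger rise and fall rates and a higher reversal count over the same period.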

Hetch Hetchy fire

Tuesday, August 27th, 2013

While the possible burning of the Hetch Hetchy watershed would be a tragedy, it would not necessarily have a great impact on San Francisco’s water supply. In a 2007 JAWRA paper, “REASSEMBLING HETCH HETCHY: WATER SUPPLY WITHOUT O’SHAUGHNESSY DAM,” Sarah Null and Jay Lund showed the Hetch Hetchy Reservoir no longer is critical, as other, newer reservoirs downstream would take up the slack.

