Blockchain insights with Data61 researcher Dr Mark Staples

Dr Mark Staples is a blockchain researcher at Data61, which is part of Australia’s federal science organisation, CSIRO. Being both a scientist and a blockchain expert, he has rare insights into how blockchain can propel research.

On my last trip to Brisbane I caught up with Mark for a drink at the Plough Inn and asked him to answer some of science’s most burning blockchain questions.

In this interview, we take a look at the challenges scientists face in managing their data, how blockchain can help, and where we’re at when it comes to issues of confidentiality, scalability, cybersecurity and policy.

First of all Mark, could you tell us a bit about your background?

My background is in computer science, cognitive science, and then eventually I got into formal methods and software engineering. But these days, I’m mostly looking at a lot of work around blockchain. I do blockchain research at Data61 — mainly around software architectures for blockchain-based applications.

And can you tell us what’s happening in Australia on the blockchain front?

Australia is doing quite a lot of work around blockchain. The Commonwealth Bank has had some world firsts around the use of blockchain for trade and for bond issuance. Companies like AgriDigital have also had world firsts in using blockchain to track agricultural supply chains. And Australia is leading the international standardisation work on blockchain and distributed ledger technology. So Australia is quite present in blockchain internationally and leading in some areas.

What areas of blockchain research are Data61 focused on?

The area where we’ve been leading in research has been using blockchain as a way of executing business processes. So, taking business process models and turning them into smart contracts to execute multi-party business processes on blockchain.

We’ve also been thinking about ways to take legal logics to represent contracts or regulation, and turning those into smart contracts. We do some work in the Internet of Things for blockchain as well. And supply chain integrity.

So, there’s a variety of different pieces of research, and then we work with companies; we develop technology, and we participate in the international standardisation of blockchain.

Being a scientist yourself, how do you see blockchain propelling science?

The key thing that blockchain supports is data sharing and data integrity. Both of those are critical for science.

Normal blockchains are not so good for confidentiality, but they're great for publishing; they're great for publicity. One of the barriers to the adoption of blockchain in enterprises is the challenge of managing commercial confidentiality. But a lot of scientific outputs, both low-risk datasets and papers, are meant to be public, and blockchain is good for that.

Not only will it be public, but you also get this trail of what’s happened to the data. You get some sort of evidence about the integrity or authenticity of the records that are being created as well, by relying on the cryptographic techniques inherent in blockchain.

So I think that’s the key potential for blockchain for science — better publishing of scientific datasets and publications with better support for integrity.

Is data management a big issue?

Yes, we're not very good yet at managing data integrity, sharing datasets, or getting recognition or citation for datasets we've collected or used. That matters not only from a professional point of view, for scientific impact analysis and the like, but also for understanding data integrity from a scientific validity standpoint.

We need to be able to answer questions like: What operations have been done to your dataset before you start doing your own operations on it? How was the data that you’re working with collected? Has it been cleaned or not? All those questions are important when you’re doing an analysis of the data.

What issues have you observed in your time as a scientist in terms of how scientific data is managed and applied?

There are a lot of data description challenges. Have you described all the important characteristics of a dataset? How do you describe them? There's a variety of standards for dataset metadata.

How do you describe the provenance history of data? What steps were taken in the collection or analysis of a dataset and its derived datasets? None of those are completely solved problems. We don't have standard solutions for a lot of them, so that's one challenge.
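Those provenance questions can be made concrete. As a rough sketch (not a description of any Data61 system; the record format and field names here are invented for illustration), each processing step applied to a dataset can be logged as a hash-linked entry, so any later tampering with the recorded history becomes detectable:

```python
import hashlib
import json

def digest(data: bytes) -> str:
    """SHA-256 fingerprint of a blob of data."""
    return hashlib.sha256(data).hexdigest()

def step_record(prev_hash: str, operation: str, output_digest: str) -> dict:
    """One provenance entry, linked to the previous entry by its hash."""
    body = {"prev": prev_hash, "op": operation, "output": output_digest}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return body

def verify(chain) -> bool:
    """Recompute every entry's hash and check the links between entries."""
    prev = "genesis"
    for rec in chain:
        body = {k: rec[k] for k in ("prev", "op", "output")}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if rec["prev"] != prev or rec["hash"] != expected:
            return False
        prev = rec["hash"]
    return True

# Raw data collection, then a cleaning step, each recorded in the chain.
raw = b"temp_c,salinity\n21.4,35.1\n21.9,35.0\n"
chain = [step_record("genesis", "collect", digest(raw))]

cleaned = raw.replace(b"21.9", b"21.5")  # e.g. an outlier correction
chain.append(step_record(chain[-1]["hash"], "clean", digest(cleaned)))

print(verify(chain))  # True for an untampered record
```

Anyone holding the data and the chain can answer "has it been cleaned, and from what?" by recomputing the digests; publishing the entries on a blockchain is what would make the log hard to rewrite after the fact.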

Do you think blockchain’s support for data integrity might actually help reinstate or build better trust in scientific evidence?

Yes, potentially. It could create more evidence for the trustworthiness of data, and more evidence about how data has been analysed, if we use it in the right way.

And what particular difficulties are there in actually getting these systems adopted by universities or research institutes?

Blockchain is good if you want to make datasets public. But there are certainly a lot of datasets in science that are not public for various reasons — especially in the medical research area. So, they present much more of a challenge; you can’t necessarily just publish those datasets through a blockchain.

You might still be able to use a blockchain, together with other kinds of digital fingerprinting techniques, to provide evidence about the integrity of the data you're using without compromising privacy, but it gets complicated to manage that kind of thing. So that would be one of the main challenges.

Could you put metadata or de-identified data on the blockchain as a solution to the confidentiality problem?

If you just have high-level metadata on the blockchain, that can be okay. If you have aggregate statistics in there, then you need to start worrying about what you're releasing as well. But very high-level, purely descriptive metadata is less likely to be a problem.

De-identified datasets are difficult. It's a real challenge to effectively de-identify data. We've seen so-called de-identified datasets that have been susceptible to re-identification attacks, so that's a difficult problem. We have a couple of teams in Data61 looking at private data release and private data analytics.

Are there any particular challenges for scientists on an individual level, when it comes to using blockchain-based systems?

One practical challenge is that all the public blockchains and most of the private blockchains rely on public-key cryptography, which means public and private key management. To create a transaction that records some data on a public blockchain, you need to manage a private key so you can digitally sign the data you're transacting with.

There are various bits of software that can help manage that; wallet software, for example. But it's still a new responsibility for scientists: managing keys and maintaining good cybersecurity around key management. Because these blockchains allow people to enact things themselves, to interact with the blockchain directly.

Blockchain creates a responsibility for people to be able to manage their cryptographic identities with integrity as well. The integrity of your data can come down to how good you are at cybersecurity and how you protect yourself against cyber-attacks. It requires effective cryptographic key management by people who are not used to doing it. So, that becomes another barrier to using blockchain.
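The key-custody burden Mark describes can be illustrated with a toy signature scheme. The sketch below is a Lamport one-time signature built purely from hashing; production blockchains use ECDSA or Ed25519 instead, but the responsibility is the same: lose the secret material and you can no longer sign, leak it and anyone can sign as you.

```python
import hashlib
import secrets

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def keygen():
    """256 pairs of random secrets; the public key is their hashes."""
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    pk = [(H(a), H(b)) for a, b in sk]
    return sk, pk

def sign(sk, message: bytes):
    """Reveal one secret per bit of the message hash. One-time use only!"""
    bits = int.from_bytes(H(message), "big")
    return [sk[i][(bits >> i) & 1] for i in range(256)]

def verify(pk, message: bytes, sig) -> bool:
    bits = int.from_bytes(H(message), "big")
    return all(H(sig[i]) == pk[i][(bits >> i) & 1] for i in range(256))

sk, pk = keygen()
sig = sign(sk, b"station 42: 21.4 C")
print(verify(pk, b"station 42: 21.4 C", sig))   # True
print(verify(pk, b"station 42: 99.9 C", sig))   # False: message altered
```

The "station 42" message is made up for the example. Note that if `sk` is lost, no further data can be signed under that public key; wallet software exists to automate exactly this kind of bookkeeping for scientists and other users.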

Is scalability still a problem?

That's an inherent problem with blockchain. Blockchain is meant to be a distributed database where you might have thousands of copies of the data all around the world. Big data, in terms of sheer volume, is inherently hard to move over the network, so it's inherently hard to replicate across blockchain nodes all around the world.

But I think we already know the solution from a big data point of view. A blockchain-based system is never implemented with blockchain alone. It's always implemented with a variety of auxiliary systems, whether that's just key management, or user interfaces, or off-chain databases for private data or big data. So I think that's the solution: just kick the big data off-chain.
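The "kick the big data off-chain" pattern can be sketched in a few lines: the bulky dataset lives in ordinary storage, and only a small fingerprint record goes on the ledger. Both stores are simulated here with in-memory Python objects, and all names are illustrative rather than any real system's API:

```python
import hashlib

off_chain = {}   # simulated off-chain store, e.g. a lab file server
ledger = []      # simulated on-chain ledger: small records only

def publish(dataset: bytes, description: str) -> str:
    """Keep the payload off-chain; record only its 32-byte digest on-chain."""
    d = hashlib.sha256(dataset).hexdigest()
    off_chain[d] = dataset
    ledger.append({"desc": description, "sha256": d})
    return d

def fetch_and_verify(entry: dict) -> bytes:
    """Retrieve the payload and check it against the on-chain fingerprint."""
    data = off_chain[entry["sha256"]]
    if hashlib.sha256(data).hexdigest() != entry["sha256"]:
        raise ValueError("off-chain data does not match on-chain record")
    return data

publish(b"\x00" * 1_000_000, "1 MB of float profiles")
print(len(fetch_and_verify(ledger[-1])))  # 1000000
```

However large the dataset, every blockchain node only has to replicate the fixed-size digest, while the integrity guarantee still covers the full off-chain payload.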

Apart from big data, there are other scalability challenges for blockchain in terms of transaction latency. These things are being worked on, so I don’t see them being a huge problem in the medium term.

Are there other ways that blockchain will need to develop before it can support large-scale research?

Another big challenge is governance of blockchain-based systems. Normal IT governance assumes there's a single source of authority in control of an IT system, so the operation and evolution of that system can be controlled from the top through that source of authority.

But with many blockchain-based systems there’s no single source of authority. It might be a collective that’s operating it, or the collective might be random groups in the public.

So, how to control the evolution and management of the blockchain-based system can be a difficult problem. Some blockchains are implementing governance features directly on the blockchain, but it’s not clear yet what the best way to go is, and it’s still an active area of innovation.

Is there scope for greater funnelling of clinical data and consumer data back into research?

I think the biggest challenges there are policy challenges, not so much technical challenges.

What does a good security policy model involve for clinical information sharing? Who should be allowed to see what data, for what purpose and when, under what consent model? Even that is not very clear at a policy level at the moment.

So in terms of research ethics applications, there’s a huge variety of different consent models that are supported by specific ethics approvals. You can implement technical controls for any of those, but knowing what you should be implementing is, I think, the hardest part of the challenge. There’s a lot of variability, especially for clinical information.

Have you seen movement towards giving the individual control over their data over the long term?

There are some interesting things happening in that space. Are you familiar with the Consumer Data Right that the government recently announced? The first incarnation of it is something called open banking, where the government creates a right for consumers to direct their bank to share information about their personal accounts with a third party. The individual has to give consent to the third party to use their data for a particular purpose, and then gives an authorisation and direction to the bank to let the third party access their data.

That’s an interesting model for giving consumers more right to direct where their data goes on a case by case basis. It’s quite different to most of the other models I’ve seen for giving consent to share data. Normally, an organisation holds data about a consumer, and the consumer is trying to keep up with all the various consents to access it—derived and delegated consent, emergency accesses and whatever other accesses are made to their information.

But when it comes to clinical information, I think policy is complicated by a lot of different interests. I don’t know if we have a good answer to that.

It’ll be interesting to see how it all unfolds Mark. Stellar insights. Thanks for your time!

To find out more about blockchain research at Data61 and to read their reports on how it can be applied to government and industry, click here.

– Elise Roberts

This article was originally published on Frankl Open Science via Medium. Frankl works on solving issues around data sharing and data integrity in science, using blockchain and other technologies.

Data sharing out of the blue

In Greek mythology, the Argo was a ship sailed by Jason and his Argonauts in their quest for the legendary Golden Fleece. Today, marine industries and modern-day sailors on a different kind of quest also make use of an Argo. For almost 15 years, the Argo international ocean-monitoring project has been collecting and sharing climate and oceanography data through sensor-equipped floats.

Australian researchers play a key role in this latter-day Argo, jointly led by the national science agency CSIRO and the University of California and involving 31 countries. Dr Peter Oke, a CSIRO Ocean Modelling and Data Assimilation Research Scientist for Australia’s own ocean forecasting system BLUElink, says Argo has changed the way researchers do business, encouraging data sharing and reuse, and spawning new systems like BLUElink.

“The Argo community has really led the way in creating a data sharing culture. By making data access free and open, it’s breaking down the silos once set up to collect and protect observations,” he says.

BLUElink is an ocean forecasting system built on Argo data and the Ocean Forecasting Australia Model. CSIRO, the Bureau of Meteorology and the Royal Australian Navy collectively run the project, established in 2001. Forecasts are based on data including temperature, salinity, sea levels and currents, all measured in real time at different locations and depths by the autonomous Argo floats.

In 2014, Oke and his colleagues used BLUElink to provide intelligence in the search for Malaysia Airlines Flight MH370, which vanished with 239 people on board.

“We dropped all tools in our support for the search. We made educated guesses about the splash points and used BLUElink to look at where the debris was most likely to go,” says Oke.

Then, in July this year, a wing part washed up on Réunion Island east of Madagascar. “We did historical runs to backtrack with BLUElink to see if the debris could be from the plane. Argo data was a key component.”

Argo float being deployed. Image source: ARGO.

The current locations of over 3800 Argo floats appear on a map like confetti across the oceans. Each float’s sensors collect temperature and salinity data profiles at depths of up to 2000 m. Every 10 days the floats come to the surface to relay data to satellites. More than 10,000 profiles per month provide oceanographers, climate scientists and others with comprehensive subsurface ocean data, accessible via the IMOS (Integrated Marine Observing System) web portal. Since 1998, Argo data sharing has generated dynamic maps of ocean currents and resulted in over 2000 scientific research papers.

As for BLUElink, its users range from marine industries such as shipping companies to individuals like surfers and sailors. Each year, using forecasts from BLUElink, sailors in the famous Sydney to Hobart yacht race are briefed on the conditions they’ll encounter on some of the world’s roughest seas.

Argo data improve safety for oil and gas workers and help analyse the risks of oil spills on sensitive coastlines. Data sharing also informs decisions about fishing area boundaries and catch limits.

Oke is particularly excited about the field of operational oceanography, which aims to make ocean monitoring and prediction routine. He sees improved ocean forecasts resulting from the Global Ocean Data Assimilation Experiment (GODAE), which is exploring how BLUElink can be used more efficiently. Data from sensors on marine gliders closer to shore will also be integrated with Argo data to create new coastal models.

“One day, thanks to Argo, we’ll see ocean forecasts as reliable as the weather forecasts that we check in on every day,” Oke says.

Story provided by Refraction Media.

Originally published in Share, the newsletter magazine of the Australian National Data Service (ANDS).

Featured image source (above): ARGO.

Argo is a major contributor to the WCRP's Climate Variability and Predictability Experiment (CLIVAR) project and to the Global Ocean Data Assimilation Experiment (GODAE). The Argo array is part of the Global Climate Observing System/Global Ocean Observing System (GCOS/GOOS). Discover more about data sharing and Argo here.