High Energy Physics (HEP) scientists use particle accelerators to collide particles at high energies in the search for new particles, which involves separating events of interest from background processes.
The Large Hadron Collider (LHC) is the world’s largest particle accelerator.
Experiments installed at the LHC use detectors to analyse the myriad of particles produced by collisions in the accelerator. These experiments are run by collaborations of scientists from institutes all over the world.

Field of study

High Energy Physics

The Business challenge

Jupyter Notebook has become the de facto platform that data scientists use to build interactive applications and to tackle big data and AI problems. Hybrid cloud resources, combined with modern big data tools such as Spark, can let physicists work interactively on much larger datasets.
For effective collaboration, scientists need easy access to resources at different institutions.
JupyterLab, a next-generation web-based user interface for Project Jupyter, provides flexible building blocks for interactive, exploratory computing and assembles these components together to provide a full, IDE-like experience. 
Tools for accessing and sharing remote resources should be provided inside the JupyterLab environment, as part of this comprehensive, IDE-like experience. This is a key factor for effective collaboration in Data Science tasks.

The Scientific challenge

Particles collide at high energies inside CERN's detectors, creating new particles that decay in complex ways as they move through layers of subdetectors. The subdetectors register each particle's passage and microprocessors convert the particles' paths and energies into electrical signals, combining the information to create a digital summary of the "collision event". 
Analysing the ever-growing data streams from the detectors requires the collaboration of many geographically distributed science teams.

The Technical challenge

The Large Hadron Collider produces unprecedented volumes of data. The raw data per event is about 1 MB, produced at a rate of about 600 million events per second (600 TB per second, or about 50 000 PB per day). Data streams from the LHC increase with each upgrade, which requires constant innovation in tools and methods for stream analytics.
Data analytics tasks and innovations are performed by distributed teams of scientists from institutes all over the world, using a variety of storage systems and processing tools. Because of this, one of the most important challenges in HEP research is providing tools for effective Data Science collaboration in this distributed environment.
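The figures quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes decimal units (1 MB = 10^6 bytes, 1 PB = 1000 TB) and continuous running for a full day, and the constant and function names are made up for this example.

```python
# Back-of-the-envelope check of the raw data rate quoted in the text.
# Assumptions: decimal units and 24 hours of continuous data taking.
EVENT_SIZE_BYTES = 1_000_000       # about 1 MB per event
EVENTS_PER_SECOND = 600_000_000    # about 600 million events per second
SECONDS_PER_DAY = 24 * 60 * 60


def raw_rate_tb_per_second(event_size_bytes=EVENT_SIZE_BYTES,
                           events_per_second=EVENTS_PER_SECOND):
    """Raw data rate in terabytes per second."""
    return event_size_bytes * events_per_second / 1e12


def raw_volume_pb_per_day(event_size_bytes=EVENT_SIZE_BYTES,
                          events_per_second=EVENTS_PER_SECOND):
    """Raw data volume in petabytes per day."""
    tb_per_day = raw_rate_tb_per_second(event_size_bytes, events_per_second) * SECONDS_PER_DAY
    return tb_per_day / 1000


if __name__ == "__main__":
    print(f"{raw_rate_tb_per_second():.0f} TB per second")
    print(f"{raw_volume_pb_per_day():.0f} PB per day")
```

Under these assumptions the rate is 600 TB per second and the daily volume is 51 840 PB, which the text rounds to 50 000 PB per day.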


The Solution

Task T4.1 integrates data science environments into the federated Science Mesh in order to facilitate collaborative research and enable cross-federation sharing of computational tools, algorithms and resources. Users will be able to access remote execution environments to replay (and modify) analysis algorithms without first setting up accounts in the remote system.
For distributed data science environments we are developing the Cs3api4Lab JupyterLab extension, which integrates with ScienceMesh and provides sharing and collaboration functionality (a CS3 client in JupyterLab).

The scientific, societal, economic and policy impact

The distributed Data Science environment (the Cs3api4Lab JupyterLab extension) is generic and can be deployed on any mesh node. Users will be able to access remote execution environments to replay (and modify) analysis algorithms without first setting up accounts in the remote system. This lets us replace multiple tools, such as email, file-sharing services and various cloud storage interfaces, with a single integrated science tool shared by all users.
There will no longer be a need to install and configure standalone applications; instead, a simple browser is enough to use a web application exposed to the science environment.

Without CS3MESH4EOSC

Without CS3MESH4EOSC, data science environments will be disconnected.
Users will be required to create guest accounts or perform other administrative actions. Data Science tasks over shared data will be affected, and collaboration will be more difficult.

With CS3MESH4EOSC

Through the web interface, a user will be able to easily access Data Science environments at a remote site and work interactively on the algorithms and data processing programs shared with them. This will “bring algorithms and interactivity close to data” in cases where it does not make sense to transfer data.


WHO BENEFITS & HOW?

End-users and research communities

A federation between CS3 systems used by the involved partners allows sharing input and output data in an efficient way. Collaborative editing of Jupyter notebooks allows co-development of the analysis code and description of the methods. 

Institutional operators and services

Services deployed in external clouds are federated in the ScienceMesh, so users can seamlessly access the functionality or data they require, fully integrated with the in-house services.
Additional storage and computing capacity available from commercial cloud providers can be accessed and integrated transparently.
This additional capacity will be provisioned elastically according to demand and treated as an extension of on-premise storage and computing resources.

Commercial software developers

Commercial software developers and data scientists can use distributed Data Science environments and data analytics patterns to develop new products based on complex, large-scale data in virtually all sectors (finance, IoT, smart cities, energy and many others).

Policy makers & citizens

Standardised tools and distributed Data Science environments will also facilitate the collaboration of citizen scientists and Citizen Observatories, who can more easily engage in Earth Observation research.

Others

Distributed Data Science environments can support various educational activities, enhance the learning process and promote collaboration among young science enthusiasts.

Services used

Data science environments

Technical implementation

The JupyterLab extension integrates with ScienceMesh and provides sharing and collaboration functionality (a full CS3 API client) directly in the JupyterLab environment. The sharing features are exposed in the file browser: the extension replaces the default JupyterLab file manager, adding new UI elements (a shared by/with tab, sharing buttons and context menu entries) and new modal windows that display file information and sharing status.
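To illustrate the idea behind a "shared by/with" view, the following minimal sketch groups a list of share records by the role the current user plays in each. It is not the extension's actual code, and it does not use the CS3 APIs: the `Share` class, its fields and the `split_shares` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Share:
    """Hypothetical share record; not a CS3 API type."""
    path: str        # file or notebook being shared
    owner: str       # user who created the share
    grantee: str     # user who received the share
    permission: str  # for example "viewer" or "editor"


def split_shares(shares, user):
    """Group shares into the two lists a 'shared by/with' tab would show."""
    return {
        "shared_by_me": [s for s in shares if s.owner == user],
        "shared_with_me": [s for s in shares if s.grantee == user],
    }


if __name__ == "__main__":
    shares = [
        Share("/analysis/fit.ipynb", owner="alice", grantee="bob", permission="editor"),
        Share("/data/run1.csv", owner="bob", grantee="alice", permission="viewer"),
    ]
    for tab, entries in split_shares(shares, "alice").items():
        print(tab, [s.path for s in entries])
```

In a real deployment the share records would come from the storage provider through the CS3 APIs rather than from an in-memory list.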

Contribution to the EOSC

Distributed Data Science environments will support research and development in all disciplines, facilitating the collaboration of distributed science teams.

Future developments

Concurrent collaboration on notebooks and an enhanced user interface for the JupyterLab extension.