DataXplorers Hackathon

October 23rd – December 6th 2023

Organisation: NFDI4Biodiversity, NFDI4Earth & NFDI4Microbiota

DataXplorers unite! We warmly invite you to our Cross-Community Hackathon – an exciting opportunity for data enthusiasts of all disciplines, experienced or inexperienced!

What exciting ideas and results can emerge when curious minds from the earth system sciences, biodiversity and microbiology research come together and work on data collaboratively?

That is what we are going to find out with you in our NFDI Cross-Community Hackathon, which begins on October 23, 2023, with an online kick-off event.

We will offer a range of different challenges, which focus on data and tools that build bridges between our communities.

Who is organizing the hackathon?

The Hackathon is jointly organized by the three NFDI consortia: NFDI4Biodiversity, NFDI4Earth and NFDI4Microbiota. In the NFDI, the National Research Data Infrastructure Germany, valuable data from science and research are systematically accessed, networked and made usable in a sustainable and qualitative manner for the entire German science system.

What are the challenges?

1. BIIGLE-Challenge

Challenge Provider:

Prof. Dr.-Ing. Tim Nattkemper & Dr. Martin Zurowietz

Bielefeld University

Programming language: Python

Max. number of Teams: 2

Challenge Description:

Within biodiversity research, a fully automated annotation of image data by machine learning methods is in most cases not possible. The BIIGLE software for image annotation [1] includes the “MAIA” method [2] for semi-automated annotation of images, where automatically generated results have to be reviewed and evaluated by a human in a multi-step procedure. Efficient human-machine interaction is essential for semi-automated annotation to be faster than a purely manual annotation. In addition, the MAIA method was developed with a focus on deep-sea images. In other image domains, it often offers no advantage over purely manual annotation.

In this Challenge, the human-machine interaction should be improved for a method like MAIA for semi-automated image annotation. A procedure with several different steps could become a procedure with only one (possibly repetitive) step. Furthermore, by using modern machine learning methods, it is possible to develop a procedure that works in many different image domains and can adapt to the given data.

[1] https://doi.org/10.3389/fmars.2017.00083

[2] https://doi.org/10.1371/journal.pone.0207498

Goal:

A proof of concept of a method that is suitable for the efficient semi-automated annotation of image data in different domains.

Implemented in BIIGLE with a specially tailored user interface (not part of the challenge), this method would work better than the existing MAIA method.

Data:

Image data with annotations
Data can be provided via BIIGLE and downloaded from de.NBI Cloud

In BIIGLE, “reports” can be generated that contain information about annotations

2. Inspection of colour information in scanned raster maps

Challenge Provider:

Anna Lisa Schwartz & Peter Valena

Generaldirektion Staatliche Archive Bayerns

Leibniz Institute of Ecological Urban and Regional Development (IÖR)

Programming language: Free choice, preferably Python

Max. number of Teams: 3

Challenge Description:

Data on historical land use are essential for Earth Sciences and biodiversity research. Scanned topographic or cadastral maps are a valuable source of information on past land use. These scans usually come from libraries or archives and vary in quality, scan resolution, format and colour representation. When confronted with a collection of scanned maps, it is useful to know whether they contain colour information (e.g. water bodies in blue, isohypses in orange, vegetation in green, etc.) or, in summary, just grey value information. Since visual inspection is time-consuming, an automated method is sought.

It is important to note that even if the image has colour bands, the map information itself might be represented in black and white only (imagine the scan of a black-and-white map on old yellowish paper).
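The distinction described above can be illustrated with a minimal sketch. It is not the organisers' method: it assumes the pixels of a scan are already available as (r, g, b) tuples in the range 0–255, and the function name `classify_pixels` and all thresholds are hypothetical choices. A scan only counts as coloured if clearly saturated pixels fall into at least two distinct hue bins, so a uniform paper tint alone is not treated as colour information.

```python
import colorsys
from collections import Counter


def classify_pixels(pixels, sat_threshold=0.25, min_value=0.2,
                    hue_bins=12, min_share=0.02):
    """Classify (r, g, b) pixels (0-255) as 'colour' or 'grey'.

    All thresholds are illustrative assumptions, not calibrated values.
    """
    counts = Counter()
    total = 0
    for r, g, b in pixels:
        total += 1
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        # Count only pixels that are clearly saturated and not too dark.
        if s >= sat_threshold and v >= min_value:
            counts[int(h * hue_bins) % hue_bins] += 1
    if total == 0:
        raise ValueError("no pixels given")
    # Hue bins that cover a relevant share of the whole image.
    significant = [b for b, n in counts.items() if n / total >= min_share]
    # A single dominant hue (e.g. yellowed paper) is still treated as grey.
    return "colour" if len(significant) >= 2 else "grey"
```

A tint whose hue happens to lie on a bin boundary could be split across two bins, so a real tool would likely need a more robust measure and thresholds tuned on the provided scans.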

Goal:

The goal is to implement a tool that allows for the automatic colour inspection of directories or collections of scanned maps and their further sorting by colour information or grey value information. Ideally the tool also gives a hint on which land use classes might be represented in a map and extracts information on colour, grey value as well as ideally land use class in a documented XML-structure.
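As a rough illustration of what a documented XML output could look like, the following sketch writes one entry per scanned map. The element and attribute names (`mapCollection`, `map`, `colourInformation`, `landUseClasses`, `class`) and the input dictionary keys are hypothetical; the challenge does not prescribe a schema.

```python
import xml.etree.ElementTree as ET


def build_report(results):
    """Serialise per-map inspection results to an XML string.

    `results` is a list of dicts with the assumed keys 'file',
    'colour_information' and, optionally, 'land_use_hints'.
    """
    root = ET.Element("mapCollection")
    for item in results:
        map_el = ET.SubElement(root, "map", file=item["file"])
        ET.SubElement(map_el, "colourInformation").text = item["colour_information"]
        classes_el = ET.SubElement(map_el, "landUseClasses")
        for name in item.get("land_use_hints", []):
            ET.SubElement(classes_el, "class").text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_report([
        {"file": "sheet_01.tif", "colour_information": "colour",
         "land_use_hints": ["water", "vegetation"]},
        {"file": "sheet_02.tif", "colour_information": "grey"},
    ]))
```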

Data:

Scanned historical maps.

3. Question answering about Research Data Management

Challenge Provider:

Dr. Auriol Degbelo & Dr.-Ing. Christin Henzen

TU Dresden

Programming language: Python

Max. number of Teams: 5

Challenge Description:

Researchers face various challenges at each of the phases of the research data management lifecycle (data creation, data processing, data analysis, archiving, sharing and re-use). Depending on the nature of the challenge, they may turn to online platforms (e.g., Stack Overflow, GitHub, Quora, …) or peers to seek answers. While online platforms and existing search engines may provide help at times, they fall short because they are (i) general-purpose and/or (ii) return websites that researchers need to go through manually rather than actual answers. A web-based system where researchers can easily ask questions related to research data management and receive meaningful answers could yield productivity gains while also contributing to knowledge exchange among peers.

Goal:

The goal of the Hackathon is to build a web-based question-answering system to assist researchers in finding answers related to research data management. Researchers ask the system their questions in natural language and receive answers in natural language as well.

Data:

A series of questions and answers related to RDM following the format Q/A/S (Question/Answer/Source). The dataset will be provided as a .txt file.
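A very simple retrieval baseline over such a Q/A/S file could look like the sketch below. The exact file layout is not specified here, so the sketch assumes one field per line with the prefixes `Q:`, `A:` and `S:`; the function names, the sample entries and the example.org sources are hypothetical. It returns the stored answer whose question shares the most words with the user's question (Jaccard similarity), which is only a starting point compared with a full natural-language system.

```python
import re


def parse_qas(text):
    """Parse lines prefixed with 'Q:', 'A:' or 'S:' (assumed layout)."""
    entries = []
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep or key not in ("Q", "A", "S"):
            continue
        if key == "Q":
            # A new question starts a new entry.
            entries.append({"Q": value.strip(), "A": "", "S": ""})
        elif entries:
            entries[-1][key] = value.strip()
    return entries


def tokens(sentence):
    """Lower-cased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def answer(question, entries):
    """Return the entry with the highest word overlap, or None."""
    query = tokens(question)
    best, best_score = None, 0.0
    for entry in entries:
        candidate = tokens(entry["Q"])
        union = query | candidate
        score = len(query & candidate) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = entry, score
    return best


SAMPLE = """\
Q: What does FAIR stand for?
A: Findable, accessible, interoperable and reusable.
S: https://example.org/fair

Q: What is a data management plan?
A: A document describing how data are handled during and after a project.
S: https://example.org/dmp
"""

if __name__ == "__main__":
    hit = answer("What does FAIR mean?", parse_qas(SAMPLE))
    print(hit["A"], "| source:", hit["S"])
```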

4. Complement the Research Data Commons Mediation Layer

Challenge Provider:

Sebastian Beyvers, Jannis Hochmuth, Lukas Brehm & Dr. Frank Förster

Justus Liebig University Giessen

Programming language: no specific programming language is required

Max. number of Teams: 3

Challenge Description:

The Research Data Commons (RDC) is a cloud-based research infrastructure that supports researchers, data providers, and data users [1]. It enables the creation of FAIR data products and collaborative data sharing within the NFDI and beyond. It supports biodiversity researchers and facilitates data reuse, correlation, and complex analysis to generate new insights, while adhering to FAIR principles with a computational API.

The RDC is composed of four different layers. Its foundation is a data-agnostic Cloud Layer. It serves as the technical backbone, offering scalable distributed computing and ample cloud storage for user-friendly data analysis. It is overlaid by the Mediation Layer, a domain-agnostic layer that facilitates the creation of FAIR data products and manages metadata and data transformation for improved data quality. On top of it is the Semantic Layer. It provides domain-specific data products that adhere to FAIR principles and offers tools for easy access to and understanding of data schemas. Finally, built on top of the three other layers is the Application Layer. This layer consists of concrete applications and services for end users, including domain-specific and domain-agnostic tools, accessing data from various layers.

Aruna Object Storage (AOS) [2,3] is used as the primary storage within the cloud layer. It provides an application programming interface (API) and an S3-compatible interface for accessing its stored data and metadata.

Our challenge is to develop demonstrators for validation and transformation of data stored in AOS to create a template set for further completion of our mediation layer.

[1] https://doi.org/10.5281/zenodo.3943645

[2] https://aruna-storage.org/

[3] https://github.com/ArunaStorage/

Goal:

Attempts will be made to achieve the following goals:

Development of a demonstrator for data validation. This demonstrator must report back to AOS at the end of execution whether the standards for the data type are met.
Development of a demonstrator for data transformation. This demonstrator shall support the transparent conversion of a valid data set to another format.
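The two demonstrator types can be pictured with a small sketch that uses only the Python standard library. It does not talk to AOS: how a result is reported back or how objects are read from storage is deliberately left out, and the function names, the required-column rule and the CSV-to-JSON conversion are hypothetical examples of a "standard" and a "transformation".

```python
import csv
import io
import json


def validate_csv(text, required_columns):
    """Check that required columns exist and contain no empty values.

    Returns a report dict; sending it to a storage system is out of scope.
    """
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    errors = [f"missing column: {c}" for c in required_columns if c not in header]
    if not errors:
        for row in reader:
            for column in required_columns:
                if not (row.get(column) or "").strip():
                    errors.append(f"line {reader.line_num}: empty value in {column}")
    return {"valid": not errors, "errors": errors}


def csv_to_json(text):
    """Convert CSV text to a JSON array of objects (all values stay strings)."""
    return json.dumps(list(csv.DictReader(io.StringIO(text))), indent=2)


if __name__ == "__main__":
    data = "species,lat,lon\nParus major,50.1,8.6\n"
    report = validate_csv(data, ["species", "lat", "lon"])
    print(report)
    if report["valid"]:
        # Only valid data sets are transformed.
        print(csv_to_json(data))
```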

5. DWD Data Cube challenge

Challenge Provider:

Dr. Felix Cremer & Fabian Gans

Max Planck Institute for Biogeochemistry

Programming language: Julia or Python

Max. number of Teams: 3

Challenge Description:

Join us to learn about data handling using geospatial data cubes in Julia or Python.

Biodiversity and species distribution are commonly assessed in relation to different climate variables. This relationship is generally researched using well-established datasets like WorldClim or the CHELSA climatologies, which provide global interpolated features derived from long time series.

During this hackathon we will derive bio-climatological features from a regional high-resolution weather reanalysis product and compare the results to the same regional cutout from the WorldClim dataset. We will then investigate the influence of the different climatologies when trying to link them to species distribution maps. Last but not least, participants can try to abandon the view of a “static” climatology, estimate trends in the derived features and assess the effect these might have on future species distributions.
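To make "bio-climatological features" and "trends" a little more concrete, here is a minimal sketch for a single grid cell in plain Python. It follows the WorldClim naming for four of the bioclimatic variables (BIO1 annual mean temperature, BIO12 annual precipitation, BIO13 and BIO14 precipitation of the wettest and driest month). The function names and input layout are assumptions; in the hackathon the same reductions would be applied along the time axis of a data cube with a Julia or Python data cube library rather than to lists.

```python
from statistics import mean


def bioclim_features(tmean_monthly, precip_monthly):
    """Derive four bioclimatic variables from 12 monthly values of one cell."""
    if len(tmean_monthly) != 12 or len(precip_monthly) != 12:
        raise ValueError("expected 12 monthly values per variable")
    return {
        "BIO1": mean(tmean_monthly),    # annual mean temperature
        "BIO12": sum(precip_monthly),   # annual precipitation
        "BIO13": max(precip_monthly),   # precipitation of wettest month
        "BIO14": min(precip_monthly),   # precipitation of driest month
    }


def linear_trend(values):
    """Least-squares slope of equally spaced values (change per time step)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator
```

Applying `linear_trend` to a yearly series of one feature (for example BIO1 per year) gives a first impression of how far a cell departs from a "static" climatology.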

Goal:

Introduction into the work with data cubes
Understanding how to access data from different sources
Combining vector data and data cubes
Chunking and access patterns for different data
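Why chunking matters for different access patterns can be seen from simple arithmetic. The sketch below counts how many chunks a rectangular selection touches; the cube size, the chunk shapes and the function name are made-up examples, not properties of the DWD data cube.

```python
from math import prod


def chunks_touched(selection, chunk_shape):
    """Number of chunks intersected by a selection.

    `selection` holds one half-open (start, stop) index range per dimension,
    `chunk_shape` the chunk length per dimension.
    """
    counts = []
    for (start, stop), size in zip(selection, chunk_shape):
        if stop <= start:
            raise ValueError("empty selection")
        first_chunk = start // size
        last_chunk = (stop - 1) // size
        counts.append(last_chunk - first_chunk + 1)
    return prod(counts)


if __name__ == "__main__":
    # Hypothetical cube with dimensions (time, lat, lon) = (3650, 1000, 1000).
    time_series = [(0, 3650), (500, 501), (500, 501)]  # one pixel, all times
    single_map = [(0, 1), (0, 1000), (0, 1000)]        # one time step, all pixels
    for name, chunks in [("map-oriented", (1, 1000, 1000)),
                         ("time-oriented", (3650, 10, 10))]:
        print(name, chunks_touched(time_series, chunks),
              chunks_touched(single_map, chunks))
```

With map-oriented chunks, reading one pixel's full time series has to open every time step's chunk, while time-oriented chunks make the opposite trade-off; the layout should follow the dominant access pattern.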

Data:

DWD data cube with reanalysis data for Europe
Vector data for biodiversity indicators
WorldClim (<https://www.worldclim.org/>) or CHELSA (<https://chelsa-climate.org/>) data

Data Cube logo: modified version of Lucasbosch, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=99759940

6. Develop a user-friendly and feature-rich metadata annotation tool for metaproteomics experiments

Challenge Provider:

Kay Schallert & Jun.-Prof. Dr. Robert Heyer

Leibniz-Institut für Analytische Wissenschaften (ISAS)

Programming language: TypeScript/Angular or JavaScript.

Max. number of Teams: 4

Challenge Description:

LC-MS/MS-based metaproteomics characterizes taxonomic and functional microbiome composition in health, environment, and biotechnology. In general, metaproteomics abundance correlates quite well with the actual activity. Furthermore, it detects proteins from the environment (e.g., host protein in the human gut or substrate proteins in biogas plants).

The results of metaproteomics searches may be in different file formats, making them difficult to combine and analyse together. To allow data to be FAIR (findable, accessible, interoperable and reusable), it is important to add metadata. A web service shall be developed (starting from a minimal prototype) to add metadata to metaproteomics data and convert various data files into annotated standard data files.

Goal:

Starting from a minimal prototype, a user-friendly web service will be set up to convert metaproteomics data files into standard formats (mzML and mzID) and enrich them with further information and metadata (SDRF). The focus of the challenge will be on a rich and user-friendly web interface that seamlessly integrates standards and controlled vocabularies from various public sources.
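The metadata side of this goal can be pictured with a language-agnostic sketch, shown here in Python even though the challenge itself targets TypeScript/Angular or JavaScript. It writes a few sample annotations as a tab-separated table using a small subset of SDRF-Proteomics column names; the function name, the input keys and the sample values are hypothetical, and a valid SDRF file requires further columns defined by the specification.

```python
import csv
import io

# Small subset of SDRF-Proteomics columns, chosen for illustration only.
COLUMNS = [
    "source name",
    "characteristics[organism]",
    "assay name",
    "comment[data file]",
]


def to_sdrf(samples):
    """Serialise sample annotations (assumed dict keys) to SDRF-style TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for sample in samples:
        writer.writerow([
            sample["sample"],
            sample["organism"],
            sample["run"],
            sample["data_file"],
        ])
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_sdrf([
        {"sample": "sample 1", "organism": "human gut metagenome",
         "run": "run 1", "data_file": "sample1.mzML"},
    ]))
```

In the web service, the values for such columns would come from user input and from controlled vocabularies rather than from free text.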

Data:

Metaproteomics spectrum data files and search result files (multiple formats), user input through the interface.

Who can participate?

The hackathon mainly addresses early career researchers (doctoral and postdoctoral) from the fields of earth system sciences, biodiversity and microbiology, but we also strongly encourage participation from other fields, such as informatics, computer sciences and bio-/media-informatics. More experienced individuals are also welcome to join.

How does the registration work?

Please use this link to register for one challenge and/or for participation in our in-person event in Frankfurt am Main on Dec 5th – 6th. You can register as an individual or as a team of up to 5 people. Each person needs to register individually, and there are two different forms, one for the challenge and one for the in-person event. Please note that spots in the challenges and in Frankfurt are limited. The maximum number of teams per challenge can be found in the respective challenge description. For the in-person event in Frankfurt, priority will be given to people participating in a challenge. Registration is open until October 1st, 23:59.

What is the timeline?

The hackathon starts with a virtual kick-off event on October 23rd 2023. Afterwards, teams will have until December 5th to work on their challenges. Finally, the project results will be presented on December 5th – 6th 2023 in Frankfurt, with an option for online participation. Details for online participation will be published here later. We welcome all interested people to join us online for the kick-off and the project presentation. Spots for the in-person event are limited and require registration. Priority will be given to participating teams.

Kick-Off Event

The virtual kick-off will take place on October 23rd 2023 (10:30 – 15:00).

Venue: We will meet via Zoom. Please register via this link: Registration Kick-Off Event

Preliminary Agenda (10:30 a.m. – 3:00 p.m.)

Welcome
Overview Challenges
Introduction of participants
Hackathon schedule & Technical introduction
Lunch
Challenge specific breakout rooms and discussion with challenge provider

Final Presentations

The project results will be presented on December 5th – 6th 2023 in Frankfurt. Further, there will be a networking event and a supporting program. Online participation on December 5th is possible.

Venue: Flemings Hotel Frankfurt-Central, Poststrasse 8, 60329 Frankfurt am Main.

For online participation please register via this link: Registration final event online participation

Preliminary Agenda

Day 1 – 05.12
12:00 – 13:00	Arrival/Registration
13:00 – 13:10	Welcome
13:10 – 13:45	Keynote
13:45 – 15:15	Project presentations Ch. 2 & Ch. 3
15:15 – 15:45	Break
15:45 – 17:15	Project presentations Ch. 5 & Ch. 6
17:15 – 17:30	Break
17:30 – 18:30	Wrap-Up/Outlook
From 18:30	Dinner
From 20:00	Torch Museum tour at Senckenberg Natural History Museum

Day 2 – 06.12
09:00 – 12:00	Networking Session
12:00 – 13:00	Lunch/Good-bye

Further information: Room contingents at Flemings Hotel are reserved for participants. You will receive further information when your registration is confirmed.

Note: The costs of coffee breaks, dinner (5th Dec.), and lunch (6th Dec.) are covered. Any additional consumption needs to be covered by the participants.

Contact and information possibilities

If you have any questions, please contact us via e-mail: nfdi4earth-academy@groups.tu-dresden.de