Copyright law often permits scientific analyses of large, contemporary text collections, but it blocks many of the open‑science practices that make research transparent, reproducible, and reusable. Derived Text Formats (DTF) solve this problem: they are automatically transformed versions of the original texts in which copyright‑protected material has been removed, yet the parts needed for, e.g., Digital Humanities (DH) and Natural Language Processing (NLP) remain intact. The transformed texts can then be shared freely with other scientists.
How DTF are built
Four basic operations are used, each of which can be applied at different levels of granularity and scope (e.g. a single word, a whole sentence, a paragraph, an entire work, or an entire corpus):
| Operation | What it does | Example use |
| --- | --- | --- |
| Delete | Removes selected pieces of text. | Delete all spoken lines from a play; the remaining text is no longer protected, but can still be used for network analysis or language modelling. |
| Substitute | Replaces selected pieces with something else (e.g., a placeholder). | Replace every proper name with “NAME” for anonymisation. |
| Retain | Keeps only the parts that are needed and discards everything else. | Keep only the count of each word (“bag‑of‑words”); this list of frequencies can be used for authorship attribution. |
| Randomise | Changes the order of larger units such as sentences. | Randomly shuffle the sentences of a large corpus; if the corpus is big enough and the individual sentences are not themselves protected, the resulting text is considered copyright‑free. |
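The four operations can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a reference implementation of any DTF tool: the function names, the "SPEAKER: speech" play layout, and the hand-supplied list of names are all hypothetical, and a real pipeline would need proper tokenisation and named-entity recognition. Whether a given output is actually free of copyright protection is a legal question the code cannot answer.

```python
import random
import re
from collections import Counter


def delete_spoken_lines(play):
    """Delete: drop the speech after each 'SPEAKER:' label, keep the label.

    Assumes (hypothetically) one speech per line, formatted 'SPEAKER: text'.
    """
    return re.sub(r"^([A-Z]+):.*$", r"\1:", play, flags=re.MULTILINE)


def substitute_names(text, names, placeholder="NAME"):
    """Substitute: replace each name from a given list with a placeholder."""
    for name in names:
        text = re.sub(r"\b" + re.escape(name) + r"\b", placeholder, text)
    return text


def retain_word_counts(text):
    """Retain: keep only the word frequencies (bag-of-words)."""
    return Counter(re.findall(r"\w+", text.lower()))


def randomise_sentences(sentences, seed=None):
    """Randomise: return a shuffled copy of a list of sentences."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    play = "ANNA: Hello Bob.\nBOB: Hello Anna."
    print(delete_spoken_lines(play))
    print(substitute_names("Anna met Bob.", ["Anna", "Bob"]))
    print(retain_word_counts("the cat and the dog"))
    print(randomise_sentences(["One.", "Two.", "Three."], seed=0))
```

Each function corresponds to one row of the table and works at a single level of granularity (line, word, or sentence); applying the same idea to paragraphs, works, or whole corpora would only change what is passed in.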
Why DTF matter
By providing a structured, legally safe representation of texts, DTF allow researchers from linguistics, digital humanities, language technology, and any other field to work with the data they need without violating copyright. In short, DTF make it possible to share and reuse text‑based research material while staying within the law.