Lecturer: Peeter Tinits (University of Tartu)
Date: 23 August
Room: Small Conference Hall
Description
R is a scripting language often used for data processing in the humanities and social sciences. It provides the means to produce analyses as a reproducible workflow that is transparent to readers and easy to update. We will start with the very basics of R and RStudio, and quickly work our way through to simple data processing via Tidyverse packages. Tidyverse is a set of packages that aims to make R easy to use, especially for beginners. We will learn basic R syntax, data manipulation and overviews in Tidyverse style.
We will rely on personal laptops in this tutorial, so you will need to install R (https://www.r-project.org) and RStudio (https://www.rstudio.com) a few days beforehand. Short instructions will be shared.
If you do not have previous experience in R, this workshop is a requirement for attending other workshops using R at this summer school.
References:
Grolemund, Garrett, and Wickham, Hadley (2017) R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media.
About the instructor
Peeter Tinits is a digital humanities specialist at the University of Tartu, and teaches various digital humanities courses. His own research has been on spelling standardization of Estonian, the rise of environmentalism in the 20th century, and structural changes in film production crews. He is a firm believer that anyone can learn to code, and that the humanities have a lot to gain from adopting reproducible research practices.
Lecturer: Kristiina Vaik (University of Tartu)
Date: 23 August
Room: Corner Hall
Description
This workshop introduces Python, an alternative programming language used in natural language processing. Python has a simple syntax and transparent semantics and is widely used for analyzing, understanding, and deriving information from structured and unstructured data. This course will start with a basic introduction to Python: we will quickly go through topics such as syntax, variables, data structures, conditionals, loops, and IO. We will continue with an introduction to Pandas, a powerful Python data analysis toolkit used for data exploration and manipulation. Finally, I will introduce spaCy and Stanza. Both are free open-source libraries with many built-in capabilities for text processing, e.g. noise removal and tokenization. Additionally, we shall see how to apply spaCy's and Stanza's pre-built models for different downstream tasks, e.g. morphological and syntactic parsing and named entity recognition.
I will provide the students with Jupyter notebooks containing the code used in this tutorial. This course will assume basic knowledge of Python (syntax, data types, and variables) and build on that. I recommend using your own laptop; instructions on what packages to download will be shared beforehand.
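As a rough indication of the starting level, the sketch below touches the basic topics listed above (variables, data structures, conditionals, loops) using only the Python standard library. The sample sentence and all variable names are invented for illustration and are not taken from the workshop notebooks.

```python
from collections import Counter

# Hypothetical sample text; not from the workshop materials.
text = "the cat sat on the mat and the dog sat too"

# Data structure: split the string into a list of tokens.
tokens = text.split()

# Loop and conditional: keep only tokens longer than two characters.
long_tokens = []
for token in tokens:
    if len(token) > 2:
        long_tokens.append(token)

# Counter is a dictionary subclass mapping each token to its frequency.
frequencies = Counter(long_tokens)

# Print the two most frequent tokens with their counts.
for word, count in frequencies.most_common(2):
    print(word, count)
```

Libraries such as Pandas, spaCy and Stanza build on the same ideas but are not needed to run this sketch.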
About the instructor
Kristiina Vaik is a Ph.D. student at the University of Tartu. She has worked as a programmer in the Natural Language Processing Research Group at the University of Tartu and as a data analyst at TEXTA.
Lecturer: Niko Partanen (University of Helsinki)
Date: 23 August
Room: Auditorium 2.0
Description
The publication of research data has become increasingly commonplace and required in recent years, but many of the related practices are still evolving, and there is a great deal of variation between scientific fields. There are, however, some basic principles we can aim to follow in order to ensure that our materials are organized in a way that is suitable for our research use and for further distribution. In this workshop, we will use FAIR principles as a general guideline, but we will focus primarily on practical and tested solutions that are available at the moment, while reflecting on them in the wider context of FAIR.
In the workshop, we will go through the basic premises of dataset creation and documentation, with a focus on general project management and organization. We will also discuss issues related to good documentation and reproducibility in a wider context. We will study version control, and use GitHub's Zenodo integration as one publishing mechanism. We will also discuss the possibilities for storing closed datasets, and how we generally have to approach the licensing and reuse of different materials.
At the end of the workshop, each participant will be familiar with essential data preparation and publication methods that can be used at the moment. We will also reserve time to discuss the nuances of individual datasets the students have been working with.
About the instructor
Niko Partanen is a PhD student at the University of Helsinki. He has been working for the last ten years with documentation and description of endangered Uralic languages spoken in Russia, especially the varieties of Komi. He has also worked extensively with language technology and natural language processing, with a particular focus on integrating these methods into language documentation workflows. Partanen collaborates with a variety of archives in the digital preservation of legacy data.
Lecturer: Andres Karjus (University of Edinburgh)
Date: 24 August
Room: Small Conference Hall
Description
Big data is everywhere and holds unprecedented potential for humanities and social science research, and more generally for understanding our complex, ever-changing world. But understanding big data is hard unless you have the right tools. In this workshop, we’ll be exploring and dissecting various real-world datasets using R, an excellent programming language for doing anything related to stats and data science. If you've never written code before in your life, this is your opportunity to learn (this superpower) through practical exercises with clear outcomes, primarily in the form of visualizations. We will mostly be using the ggplot2 R package and its add-ons, starting out with basic examples like scatterplots and time series. We will also look into a few other packages for creating things such as networks and maps, as well as interactive and animated plots. But with great power comes great responsibility, so we will also spend some time discussing the ethics of data visualization, and approaches to making sure your graphs don't mislead your audience.
About the instructor
Andres Karjus is a research fellow at the ERA Chair for Cultural Data Analytics (CUDAN) at Tallinn University. He obtained his PhD in evolutionary linguistics from the University of Edinburgh in 2020, and holds degrees in linguistics (BA, MA) and computer science (MSc). He uses R daily in his research and has been teaching R workshops and courses since 2015.
Twitter: https://twitter.com/AndresKarjus
Personal website: https://andreskarjus.github.io
Lecturer: Tuomo Hiippala (University of Helsinki)
Date: 24 August
Room: Auditorium 2.0
Description
This workshop introduces key issues in doing research using social media data. We will discuss the kinds of research questions that may be pursued using social media data; map access to data as of 2021; consider ethical questions related to social media research; and experiment with applicable computational methods from the fields of natural language processing and computer vision. The workshop involves some programming in Python using Jupyter Notebooks, an interactive environment running in a web browser.
Recommended readings
- Toivonen, T. et al. (2019) Social media data for conservation science: A methodological overview. Biological Conservation 233, 298–315. Openly available at: https://doi.org/10.1016/j.biocon.2019.01.023
- Hiippala, T. et al. (2020) Mapping the languages of Twitter in Finland: richness and diversity in space and time. Neuphilologische Mitteilungen 121, 12–44. Openly available at: https://doi.org/10.51814/nm.99996
About the instructor
Tuomo Hiippala is Assistant Professor in English Language and Digital Humanities at the University of Helsinki, Finland. His research interests include multimodal communication and urban multilingualism.
Lecturers: Leo Lahti (University of Turku), Iiro Tiihonen (University of Helsinki)
Date: 24 August
Room: Corner Hall (online only)
Description
The workshop is an introduction to the computational harmonisation of bibliographic metadata.
Bibliographic metadata is a valuable source of historical and cultural information. However, it is often the product of a long and shifting process, which leaves the data coded with various differing notations and conventions. Its full utilisation as cultural heritage or research material is often impossible without harmonisation - the process of standardising, converting and cleaning the data. Using R and focusing on Estonian and Finnish bibliographic metadata, the workshop aims to motivate the importance of harmonisation and to demonstrate its application in practice.
As harmonisation is a vast and often content-specific topic, we aim to combine general-level motivation about the often fundamental role of harmonisation with practical examples based on real bibliographic data and workflows used in practice. We provide an overview of the process used to harmonise the historical bibliographic metadata of the Finnish National Library (Fennica) and a more focused hands-on example using Estonian bibliographic data.
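To give a concrete sense of what standardising, converting and cleaning can mean for a single record, here is a minimal sketch. It is written in Python rather than the R used in the workshop, and it is not the instructors' workflow: the records, the lookup table `PLACE_VARIANTS` and both helper functions are hypothetical and invented for illustration only.

```python
import re

# Hypothetical raw records; the field values are invented for illustration.
records = [
    {"place": "Tartu.", "year": "[1885]"},
    {"place": "Dorpat", "year": "1885."},
    {"place": " tartu", "year": "ca 1890"},
    {"place": "Helsingfors", "year": "s.a."},
]

# Assumed lookup table mapping variant or historical names to one form.
PLACE_VARIANTS = {
    "dorpat": "Tartu",
    "tartu": "Tartu",
    "helsingfors": "Helsinki",
}


def harmonise_place(raw):
    """Strip surrounding punctuation and map known variants to one name."""
    key = raw.strip(" .,:;[]").lower()
    # Unknown places are kept as they are, minus surrounding whitespace.
    return PLACE_VARIANTS.get(key, raw.strip())


def harmonise_year(raw):
    """Extract a four-digit year (1500-2099) or return None if absent."""
    match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", raw)
    return int(match.group(1)) if match else None


harmonised = [
    {"place": harmonise_place(r["place"]), "year": harmonise_year(r["year"])}
    for r in records
]

for row in harmonised:
    print(row)
```

Real catalogues need far larger lookup tables and many more rules than this; the point is only that each field gets an explicit, repeatable cleaning step.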
About the instructors
Iiro Tiihonen is a PhD student elect of history at the University of Helsinki. He has a background in both the humanities (M.A., history) and data analysis (M.Sc., applied mathematics), and his academic focus is on applying bibliographic metadata to the quantitative study of the early modern period.
Leo Lahti is an associate professor in data science & computational humanities at the University of Turku and a long-time member of the Helsinki Computational History Group. Lahti got his doctoral degree in machine learning / bioinformatics from Aalto University, Finland, in 2010. The current research of his team focuses on computational analysis of complex natural and social systems. More information at the research homepage: datascience.utu.fi
Lecturer: Bodo Winter (University of Birmingham)
Date: 25 August
Room: Small Conference Hall
Description
In this workshop, we’ll be learning to use the R package brms to fit linear models and linear mixed-effects models in a Bayesian framework. As many different fields (including linguistics, psychology etc.) are moving away from thinking about data analysis in terms of significance tests, this workshop prepares you for the future. The workshop will teach you about two things: first, the fundamentals of statistical modelling, with a focus on how to interpret linear models; second, the fundamentals of Bayesian inference. Instructions and materials will be released closer to the start of the workshop.
About the instructor
Bodo Winter is a UKRI Future Leaders Fellow, a Senior Lecturer at the University of Birmingham, UK, and Editor-in-Chief of the journal Language and Cognition. He has written the textbook “Statistics for linguists: An introduction using R” and has extensive experience running workshops on statistical modelling.
Twitter: https://twitter.com/BodoWinter
Personal website: https://bodowinter.com/
Lecturers: Osma Suominen, Mona Lehtinen, Juho Inkinen (National Library of Finland)
Date: 25 August
Room: Corner Hall (online only)
Description
Many libraries and related institutions are looking at ways of automating their metadata production processes, for example through the adoption of AI technology. In this hands-on tutorial, participants will be introduced to the multilingual automated subject indexing tool Annif (annif.org) as a potential component in a library’s metadata generation system. By completing exercises, participants will get practical experience in setting up Annif, training algorithms using example data, and using Annif to produce subject suggestions for new documents using the command line interface, the web user interface and the REST API provided by the tool. The tutorial will also introduce the corpus formats supported by Annif so that participants will be able to apply the tool to their own vocabularies and documents.
The tutorial will be organized using the flipped classroom approach: participants are provided with a set of instructional videos and written exercises, and are expected to attempt to complete them in their own time before the tutorial event, starting at least a week in advance. The actual event will be dedicated to solving problems, asking questions and getting a feel for the community around Annif.
Participants are instructed to use a computer with at least 8 GB of RAM and at least 20 GB of free disk space to complete the exercises. The organizers will provide the software as a preconfigured VirtualBox virtual machine. Alternatively, Docker images and a native Linux install option are provided for users familiar with those environments. No prior experience with the Annif tool is required, but participants are expected to be familiar with subject vocabularies (e.g. thesauri, subject headings or classification systems) and subject metadata that reference those vocabularies.
Workshop materials
Exercises and introductory videos can be found in the Annif-tutorial GitHub repository.
The tutorial materials have been created in collaboration with Anna Kasprzik and Moritz Fürneisen of ZBW - Leibniz Information Centre for Economics in Germany.
About the instructors
Osma Suominen works as an Information Systems Specialist at the National Library of Finland. He is the original developer of Annif and is currently leading the automated cataloguing project where Annif is being developed and deployed. He has a doctoral degree in Media Technology (Aalto University) and has long experience with semantic web technologies, vocabulary services and metadata processes.
Mona Lehtinen is an Information Specialist at the National Library of Finland. She works with the Annif project and is happy to tackle various tasks such as project coordination, community and corpora building and testing the new features of Annif.
Juho Inkinen works as an Information Systems Specialist at the National Library of Finland. His tasks include developing Annif and taking care of the Annif instances hosted on the sites of the National Library.
Lecturer: Mahendra Mahey (GLAM Labs)
Date: 25 August
Room: Auditorium 2.0
Description
This workshop aims to help you take pragmatic steps towards examining, analysing and answering your research question(s) ‘computationally’, using your related data and digital collections.
You will get an opportunity to tell the group about your research question and present your data. We will then collectively give you some feedback and suggest some practical steps in moving your project forward, breaking things down into manageable steps and examining if any could harness the power of computation. You will then get a chance to start on this journey in this workshop and report back later on any progress or challenges faced.
Don’t have your own data? No problem! We can ensure you get ‘hands-on’ experience of working with cultural heritage data and digital collections, and present some of the typical challenges so that you get a chance to safely experiment and play.
Mahendra will discuss challenges he has faced across hundreds of digital projects when experimenting with digital collections: exploring data, finding patterns and making new discoveries. He will also invite the workshop participants to contribute their wealth of experience by providing feedback.
The workshop will conclude with reflections from the delegates and feedback on how to move forward in your enquiry and learn the power of thinking computationally for future research problems and contexts.
About the instructor
Mahendra Mahey has a background of working with people, digital technology and data as a manager, educator, adviser and community builder in Cultural Heritage, Further and Higher Education for researchers, educators, librarians and businesses both in the UK and internationally.
For the last 8 years he has been helping scholars, artists, entrepreneurs, educators and innovators to work with cultural heritage data while working as the manager of British Library Labs. He has worked with colleagues to bring together national, state, university and public Galleries, Libraries, Archives and Museums (GLAMs) that are planning or already have digital experimental ‘Labs’. The GLAM Labs network aims to share expertise, knowledge and experience in order to build better ‘Labs’ for their organisations and users.
Personal website: http://mahendramahey.com/
Date: 26 August
Room: Main Conference Hall (only on-site)
Description
This year, we will be hosting a student research session on the last day of the summer school. During the session, each participant will have a chance to present their own work. Two presentation formats are available. You can either introduce your project and brag about a nice solution you came up with to solve a difficult problem (which might be useful for somebody else as well), or you can present an unresolved problem to the audience for public troubleshooting and brainstorming. The session will be moderated by invited interdisciplinary researchers with expertise in various qualitative and quantitative methods.
There will be no workshops on the final day so everyone can join in on the discussion! 1 ECTS will be awarded to those presenting their work (in either format). However, we encourage everybody to come join the audience and participate in the brainstorming. The details of this event will depend on the number of interested participants; if you are interested in presenting, please indicate so in the registration form and we will contact you with further information.
If you have any questions about this event, please contact Mariann Proos: mariann.proos@ut.ee