
Data Science for Everyone: Basics of Data Programming

Summary

The demand for data scientists in the most diverse areas of industry, society, and research confronts universities with the question of how data science training should be offered. In addition to the traditional approach of offering data science as a subject in its own right, there are calls to embed data science courses in subjects other than computer science and mathematics in order to meet the increased demand for data skills in these areas. This is also supported by the "Data Literacy" initiative recently promoted by the GI. Against this background, we designed and successfully implemented a bachelor-level data science course at TU Berlin, modeled on the Data8 course at the University of California, Berkeley. The course "Data Science 1: Essentials of Data Programming" teaches the basics of programming, statistical data analysis, machine learning, and the ethical questions raised by the application of these methods. The course met with very strong interest from students across a wide variety of degree programs at TU Berlin, including art history and philosophy. Its successful implementation rested not only on a suitably designed integrated curriculum that conveyed mathematical concepts and programming techniques through case studies, but also on regular tutorial sessions and homework, as well as a centrally managed JupyterHub infrastructure that both spared non-computer-science students the installation of unfamiliar software and automated the grading of programming homework. In this article we report on how we succeeded in getting students with very different computer science backgrounds interested in data science. We describe the practical implementation of the course and the final examination. Finally, we show the advantages of such a course, including the scalable ability to impart data skills to a large share of the student body and to provide lateral entry into computer science.

From data literacy to data science

On job portals such as LinkedIn, there are now more openings for data scientists than for classic IT specialists, and these openings stay on the market a week longer than the average (footnote 1). Despite its current popularity, the term "data science" is not precisely defined. Depending on the academic perspective, different priorities are set among the sub-areas of statistics, computer science, and economics. Accordingly, the associated fundamentals have not yet been clearly delineated. Companies usually look for university graduates who can demonstrate core skills in computer science, in particular programming, scalable data processing, and machine learning; a mathematical foundation, in particular for statistical data analysis and model development; domain knowledge in an application area; and communication and presentation skills. In addition, there is growing interest in familiarizing students from subjects other than computer science and mathematics with the basic methods of data science. In this context, so-called information skills have already been defined [1]. The GI has also set a new focus on so-called "data literacy" (footnote 2), which encompasses these digital competencies.

To support this initiative, we at TU Berlin designed, offered, and successfully ran a new foundational course as part of a planned "Data Literacy" certificate, which students from degree programs outside computer science and mathematics can earn in addition to their degree.

Inspired by the "Data8" course, which the Data Science Institute at the University of California, Berkeley offers to all of its students, and in close collaboration with its lecturers, we designed the course "Data Science 1: Essentials of Data Programming". It is aimed at students with no programming experience whatsoever and covers the basics of the algorithmic handling of data at a level suitable for students from a wide range of fields of study. Our initiative was initially viewed with great skepticism, as such a course has to overcome several challenges. In particular, it was considered unrealistic to teach both programming and mathematics within a single course. Furthermore, it is a great challenge to select a level of abstraction for which the university entrance qualification alone is sufficient. Finally, an initiative aimed at a large number of students requires a sensible scaling concept: it is not just about presenting concepts in lectures, but also about applying them practically through regular hands-on exercises, which in turn require intensive supervision.

We have now successfully run this course once and report on its structure and implementation in this article. We attribute its success to a well-chosen level of abstraction and to the technical infrastructure we built for practical programming exercises and tutorials. The usual order of teaching is inverted: students first learn to use tools and methods in practice, and the theoretical concepts are deepened only afterwards. Concretely, students work on practical problems in a centrally maintained Jupyter notebook environment, which lets them concentrate on the actual programming tasks.

In the following, we first describe the course content and structure. We then use statistics to examine the extent to which the course met our self-imposed goals of effectively conveying data literacy to a diverse group of students and of opening up future opportunities for lateral entry into data science.

Course content

The curriculum of the module "Data Science 1: Essentials of Data Programming" covers central data science fundamentals: causality and correlation, big data, data extraction, data visualization, random variables, comparison of samples, hypothesis tests, estimation and prediction of test statistics, classification, and ethical questions in data science. At the same time, the students, many of them for the first time, were to learn the concepts of programming. This means that, in parallel with the respective mathematical concepts, we had to convey programming concepts such as data types, variables, assignments, control structures, and data processing functions step by step. For the course content, we roughly followed the online book "Computational and Inferential Thinking" (footnote 3), but expanded it with thematic blocks on big data and ethics.

First of all, it was important to us to get the students excited about the topic of data science as such. It is worth noting that, with a diverse student body, this cannot always be achieved merely by pointing to the excellent career prospects of data scientists. Although one of our goals is to motivate people to move into data science and computer science, it was important to us that the students discover the relevance of the methods for their own fields of study. A central element of every lecture was therefore the relation to practice and the application of the conveyed concepts to real data sets. We used data sets on a wide variety of topics from everyday life and science, for example data on traffic around TU Berlin, historical events, the American census, and health. The students deepened the concepts of prediction, for instance, by predicting the Ethereum price from Bitcoin price data or by estimating the age of the universe. In another example, they used the bootstrapping method to scrutinize the distribution of professional athletes' salaries.
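To give a flavor of these exercises, the following minimal sketch shows such a prediction in Python; the price values are made up, and the use of numpy's polyfit is our illustrative choice, not necessarily the method used in the original assignment.

    import numpy as np

    # Hypothetical daily closing prices (illustrative values only)
    btc = np.array([31.2, 33.5, 35.1, 34.0, 36.8])
    eth = np.array([1.9, 2.1, 2.3, 2.2, 2.4])

    # Fit a least-squares line: eth ~ slope * btc + intercept
    slope, intercept = np.polyfit(btc, eth, 1)

    # Predict the Ethereum value for a new Bitcoin price
    print(slope * 35.0 + intercept)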

In the first weeks of the course, the necessary basics in mathematics and computer science were taught in order to bring all students to a common level. In particular, the necessary terminology was introduced, along with an introduction to programming with Python. Even these steps were conveyed in a practice-oriented manner: first, the students learned to perform simple operations on tables. The programming tasks for these operations included data extraction, processing, and visualization. The students also learned to select the right types of charts for categorical and numerical data and to interpret their contents.
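The following sketch illustrates the style of these first table exercises, here using pandas; the file name and column names are hypothetical (the course itself provided prepared notebooks and data sets).

    import pandas as pd

    # Load a table from a CSV file (hypothetical file and columns)
    traffic = pd.read_csv("traffic_tu_berlin.csv")

    # Extraction: select columns and filter rows
    bikes = traffic[["hour", "bicycles"]]
    morning = bikes[bikes["hour"].between(7, 9)]
    print(morning.head())

    # Processing: aggregate the counts per hour
    per_hour = bikes.groupby("hour")["bicycles"].mean()

    # Visualization: a bar chart suits this categorical breakdown
    per_hour.plot(kind="bar", title="Mean bicycle count per hour")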

Armed with these basics, we then dealt with the fundamental concepts of statistics, such as the notion of probability, random variables, and calculating with probabilities. We then discussed and compared different methods for drawing samples from large populations. Only at this point did students learn how to use loops in Python.
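As an illustration of how loops entered the picture, the following sketch estimates a probability by repeated simulation; the example (at least one six in four dice rolls) is our own, not necessarily from the course materials.

    import random

    trials = 10_000
    hits = 0
    for _ in range(trials):
        rolls = [random.randint(1, 6) for _ in range(4)]
        if 6 in rolls:
            hits += 1

    # Empirical estimate; the exact value is 1 - (5/6)**4, about 0.518
    print(hits / trials)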

Building on these statistical basics and on the expressive control structures of the Python programming language, we turned to the next topics: simulations, statistical tests, and significance values for concrete use cases.
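A minimal sketch of such a simulation-based significance test, under our own simplified assumptions (a coin-fairness question with made-up numbers):

    import random

    observed_heads = 60      # heads observed in 100 tosses (made up)
    n_tosses, n_sims = 100, 10_000

    def heads_in_fair_tosses():
        return sum(random.random() < 0.5 for _ in range(n_tosses))

    # p-value: share of simulations at least as extreme as observed
    extreme = sum(heads_in_fair_tosses() >= observed_heads
                  for _ in range(n_sims))
    print(extreme / n_sims)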

In the following weeks, the students got to know the theory of the bootstrap method and learned to use it to estimate parameters of a population of which only samples are available. In addition, we covered confidence intervals, which, building on the notion of the p-value, allow hypothesis tests to be answered and justified and statements to be made about the quality of estimates. Here we then delved deeper into the matter and presented Chebyshev's inequality and the central limit theorem.
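The following sketch shows the core of the bootstrap idea as one might implement it in the course's Python setting; the sample values and the choice of the median as the estimated parameter are illustrative assumptions.

    import numpy as np

    sample = np.array([48, 52, 55, 61, 63, 70, 74, 85, 90, 120])

    # Resample the sample with replacement many times
    rng = np.random.default_rng(0)
    medians = [np.median(rng.choice(sample, size=len(sample), replace=True))
               for _ in range(10_000)]

    # 95% confidence interval from the empirical bootstrap distribution
    low, high = np.percentile(medians, [2.5, 97.5])
    print(low, high)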

Next, the students learned how to use data to make statements about the future. Participants learned to calculate the correlation coefficient in order to measure linear relationships between variables. However, we also highlighted the limits of, and requirements for, meaningful work with statistical quantities such as the correlation coefficient. Building on this, we discussed how the correlation coefficient can be used to compute a regression line for the relationship between two variables. Here we followed Sir Francis Galton's experiments, which laid the foundation for the regression equations. We went on to cover the method of least squares in order to compute the best possible regression line for the corresponding scenarios.
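This recipe, correlation coefficient in standard units first, regression line derived from it, can be written out in a few lines; the arrays below are placeholders:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    def standard_units(a):
        return (a - a.mean()) / a.std()

    # Correlation coefficient r as the mean product of standard units
    r = np.mean(standard_units(x) * standard_units(y))

    # Least-squares regression line derived from r
    slope = r * y.std() / x.std()
    intercept = y.mean() - slope * x.mean()
    print(r, slope, intercept)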

When dealing with the topic of classification, we first presented the k-nearest-neighbor algorithm and illustrated the effects of "overfitting" and "underfitting" that can occur if the neighborhood size is chosen unsuitably. We also discussed how classification models are trained and how their accuracy can be measured on test data.
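The following self-contained sketch captures this idea; the synthetic data set and the train/test split are our own illustrative assumptions:

    import numpy as np

    def knn_predict(X_train, y_train, x, k):
        # Majority label among the k nearest training points
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        return np.bincount(nearest).argmax()

    def accuracy(X_train, y_train, X_test, y_test, k):
        preds = np.array([knn_predict(X_train, y_train, x, k)
                          for x in X_test])
        return np.mean(preds == y_test)

    # Synthetic two-class data; small k tends to overfit, very
    # large k underfits, which comparing accuracies makes visible
    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    for k in (1, 5, 15):
        print(k, accuracy(X[:30], y[:30], X[30:], y[30:], k))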

Finally, we discussed the social relevance of, and the sensitivity required in, deriving knowledge from data. In doing so, we dealt with the fallacies and misinterpretations that can arise from poor data quality, biases, and unknown confounding factors.

In summary, the course "Data Science 1: Essentials of Data Programming" conveyed the central statistical concepts that are relevant for modern data science. In addition, the students were taught the basics of programming with Python. Finally, with the topics of classification and prediction, we gave an outlook on further exciting areas of specialization in data science, such as machine learning, and we hope to have sparked a lasting interest in these topics among the participating students.

Teaching methods and digital teaching aids

Our goal was to convey the aforementioned course content to a diverse and large audience. This required a technical solution for scaling the tutorial and homework operations. In the first iteration, we admitted 150 students.

Course format

The course consisted of two types of sessions. On the one hand, there was the lecture, given by Prof. Abedjan, in which theoretical concepts were conveyed and then demonstrated and applied using programming examples in Jupyter notebooks. The lecture materials were based on the UC Berkeley textbook "Computational and Inferential Thinking" and were made available to the students after each session via the e-learning platform ISIS of TU Berlin. After the first lecture, the 150 participating students were divided among the four tutorial sessions offered.

The tutorials were thus held for around 40 students each by the doctoral students Mahdavi and Esmailoghli. Their aim was to apply the material from the lecture in a practical way and thus to deepen it. To this end, exercises were worked through and discussed together, live and in Jupyter notebooks. In addition, the tutorials offered a well-used opportunity, especially at the beginning of the semester, to discuss Python-related questions and solve problems. The tutorials were consistently well attended, and the discussions were received enthusiastically by the students. Since the tasks were managed centrally and all students worked in the same development environment, the sessions were highly dynamic and free of technical interruptions.

JupyterHub infrastructure and homework

To distribute the course materials and as the central element of our infrastructure, we set up a JupyterHub server for this course. JupyterHub made it possible for us to offer the students a uniform programming environment. The participants were thus spared the installation and configuration of any software and could log into the JupyterHub platform directly in the browser and work in a ready-to-use development environment. Such an environment saves a lot of time and frustration for students and teachers alike, especially for non-computer-science students without prior programming experience.

Fig. 1 shows the architecture of our infrastructure. In contrast to the installation at UC Berkeley, for reasons of data protection and general maintainability we did not build our system on Kubernetes, as our server would then have had to be hosted by Google. Instead, we installed the server locally, based on Docker, on a machine with 256 GB of main memory and 28 cores. In doing so, we were guided by the experience of the University of Versailles, which has set up a similar infrastructure. We have made detailed documentation of the installation available online (footnote 4).
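To make the setup concrete, here is a hypothetical excerpt from a jupyterhub_config.py for such a Docker-based deployment; the image name and resource limits are illustrative, not our production values.

    # jupyterhub_config.py (illustrative excerpt)
    # One Docker container per student, all from the same course image
    c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
    c.DockerSpawner.image = "jupyter/scipy-notebook"  # hypothetical image
    c.DockerSpawner.notebook_dir = "/home/jovyan/work"
    c.DockerSpawner.mem_limit = "1G"   # per-student memory cap
    c.Spawner.default_url = "/tree"    # open the classic notebook view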

On the JupyterHub server, every student had access to an isolated workspace in which the course materials, provided centrally via a GitHub repository, were always up to date and available. In addition, students could create their own notebooks there in order to experiment and to deepen the course content.

We also published a total of seven digital homework sheets during the semester. A maximum of 100 points could be achieved per sheet, and admission to the written exam at the end of the semester required at least 50% of the total, i.e. 350 homework points. However, the homework points did not contribute to the final grade of the course. The homework sheets were also distributed via our JupyterHub infrastructure. For this we used the JupyterHub extension "nbgrader", which provides the necessary tools for the automated distribution and grading of the tasks.

This extension allows worksheets to be created with different types of cells. The usual markdown and program code cells, which are standard in Jupyter notebooks, served as the working area for the specific tasks. In addition, write-protected cells can be created, which we used to import the necessary libraries, give examples, and present the tasks. Assessment cells can also be inserted. These cells are not visible to the students and contain tests that verify the students' answers. They can also be given a score, which is awarded to the student if all tests within the cell succeed. After the processing time, which was usually two weeks, we automatically collected the exercise sheets with the help of the "nbgrader" extension and started the assessment scripts, which executed the sheets, checked the assessment cells contained therein, and credited the achieved points to the students.
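As an illustration, a task in this format might look as follows; the task itself and the tests are made up, but the solution markers correspond to nbgrader's conventions.

    # --- answer cell (editable by the student) ---------------------
    def sample_mean(values):
        ### BEGIN SOLUTION
        return sum(values) / len(values)
        ### END SOLUTION

    # --- assessment cell (hidden from students, worth e.g. 5 points)
    assert sample_mean([1, 2, 3]) == 2
    assert sample_mean([10]) == 10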

With over 1000 submissions (7 homework sheets for around 150 students), this method saved us a great deal of grading time, since preparing a homework sheet and its associated tests was usually completed within one or two working days. The students then received their corrected homework back for inspection and could compare it with a provided sample solution.

Exam

At the end of the semester, the written exam took place, covering the entire material of the semester. In the week before the exam, the lecture offered a summary of the topics covered as well as sample tasks at exam level, which were presented and worked through. In that week's tutorials, further revision was offered and final questions were clarified. The exam, to be completed with pen and paper, mostly asked for short Python programs applying the theoretical concepts dealt with in the course. For this purpose, the students were provided with a "cheat sheet" documenting the most important Python functions. Although we expected a certain degree of care, we did not punish every syntax error with a point deduction, but placed much more emphasis on understanding and the correct application of the concepts.

In the end, we ran the first course at TU Berlin to use a JupyterHub infrastructure for teaching, and we can draw a positive conclusion. The students could access a prepared programming environment and needed only a browser and internet access. The infrastructure also enabled us to automatically distribute, collect, and evaluate the students' homework. It was quickly adopted by the students, and here, too, we received very positive feedback.

Evaluation of the course

In the following, we discuss, based on the recorded course data and student evaluation forms, the extent to which our course brought students closer to the topic of data science and motivated them to pursue it further. First, based on statistics about the participating students, we discuss the general attractiveness of the course. We then use the evaluation sheets to analyze the usefulness and the degree of difficulty of the course for students from various fields of study.

Participant structure and course success

For capacity reasons, and for a controlled first run of the course, we had decided to limit the course to 120 participants. However, the initial expressions of interest exceeded 300, and we decided to admit another 30 students, anticipating a decrease in the number of participants over time; indeed, participation dropped to 120 by the end of the semester.

Fig. 2 shows the distribution of students by degree program. In total, students from 40 different programs took part in our course. A total of 30 students from computer science, business informatics, and ICT participated, although the course was explicitly not recommended for them. Nonetheless, we can claim to have achieved our goal of attracting students from a wide range of programs. It is also interesting that the share of female participants, at 34%, matches the share of female students at TU Berlin overall and is significantly higher than in computer science (16% in the Bachelor's and 9% in the Master's programs).

If one also takes into account the results of a survey (Fig. 3), according to which more than 78% of the students could very well imagine, and a further 17% could imagine, deepening the topic of data science in the future, one can be optimistic about the potential for lateral entry. The result of a further survey, according to which 98% of the students rated the course as at least good (60% as very good) (Fig. 4), can be interpreted as reinforcing this trend. This is further underpinned by the fact that 70% of the participants rated the course as absolutely useful and a further 27% as rather useful (Fig. 5).

Evaluation of the course structure

As explained above, the course consisted of one lecture and one tutorial per week, with a total of seven homework assignments over the semester. To assess the usefulness of this structure, we consider the correlation between homework success and the results of the final exam.

Before doing so, it is interesting to note that the students themselves tended to find the content of the course difficult, as the distribution of student responses in Fig. 6 shows.

In practice, however, it turned out that the majority of the students, contrary to their expectations, were successfully guided through the content. Fig. 7 shows a heat map relating homework points to exam grades. The x-axis shows total homework points from 350 (the minimum for exam admission) to 700 (the maximum across the seven homework sheets). The y-axis shows grades 1 to 5. Each cell of the heat map indicates the number of participants with the corresponding total homework score and grade. Overall, a strong correlation can be observed, indicating that homework and exam were well aligned.

This close alignment of tutorials, homework, and exam was possible mainly thanks to the JupyterHub environment, and the students perceived it as such, as Fig. 8 shows: over 74% of the participants confirmed the ease of use of the programming environment, and another 24% agreed at least in part.

Participant structure and performance

Since the participant structure of the course was very diverse, we finally want to share some insights into observable tendencies among the participant groups. In particular, we want to check whether the level of the course was appropriate for this diverse structure.

Fig. 9 shows the distribution of exam grades, broken down by the participants' level of study. Overall, the distribution patterns are very similar. Interestingly, significantly more Bachelor's students achieved the top grades of 1.0 and 1.3. In this respect, it can be argued that the greater study experience of Master's students did not constitute a recognizable advantage. Rather, it can be assumed that most participants showed genuine interest and the ability to qualify further. This also applied to the small number of doctoral students who took the course.

The next breakdown we want to investigate is how well the content was conveyed relative to the participants' fields of study, differentiated by topic: programming, data processing, statistics, and machine learning. Fig. 10 shows the average number of points for the sub-areas covered in the exam, broken down by the students' affiliation with computer science, mathematics, or other fields of study. The results across all three categories are very similar. We see consistently higher scores among mathematics students. Interestingly, the computer scientists performed worse on average even in the programming tasks. Our explanation is that these participants prepared less because they may have overestimated their abilities, and possibly did not take participation as seriously as the others, since the course is not part of their compulsory catalog. The participants from other disciplines, on the other hand, had a serious interest in further qualification.

Conclusion

With our push to realize a subject-independent university data science course, we initially encountered a lot of skepticism. Understandably, the realization required effort and commitment from the person responsible for the module and the doctoral students involved, beyond their core teaching and research duties. However, given the current relevance of the topic, we saw this as necessary in order to gain a better understanding of the entire field on the one hand, and to convey computer science and data science topics to a larger audience on the other. The high participation rate of female students in particular was a promising signal. In summary, such a course requires three important components. First, a curriculum and a level of abstraction must be selected that are accessible to a wide range of students. Second, the course must be interactive and include tutorials to convey the theoretical concepts hands-on. Third, such interactivity requires a technical infrastructure that allows participants to concentrate on the concepts without any prior knowledge of programming or IT.

References

Acknowledgements

The realization of this course would not have been possible without the active participation of the students of TU Berlin. Their regular attendance and cooperation motivated us throughout. Furthermore, the course would not have been possible without the preparatory work of the Data Science Institute at UC Berkeley, whom we also thank for the exchange of experiences. Finally, we would like to thank the Vice President of TU Berlin for Teaching, Digitization and Sustainability, Prof. Dr. Hans-Ulrich Heiß, for the financial support of the course, and Prof. Dr. Volker Markl for his technical advice.

Funding

Open Access funding provided by the DEAL project.

Author information

Affiliations

  1. TU Berlin, Berlin, Germany

    Ziawasch Abedjan, Hagen Anuth, Mahdi Esmailoghli, Mohammad Mahdavi, Felix Neutatz & Binger Chen

Corresponding author

Correspondence to Ziawasch Abedjan.

Rights and permissions

Open Access This article is published under the Creative Commons Attribution 4.0 International License, which permits use, copying, editing, distribution, and reproduction in any medium and format, provided you give appropriate credit to the original author(s) and the source, include a link to the Creative Commons license, and indicate whether changes have been made.

The images and other third-party material contained in this article are also subject to the named Creative Commons license, unless stated otherwise in the caption. If the material in question is not covered by the named Creative Commons license and the intended use is not permitted by statutory provisions, the consent of the respective rights holder must be obtained for the uses listed above.

For more details on the license, please refer to the license information at http://creativecommons.org/licenses/by/4.0/deed.de.



Cite this article

Abedjan, Z., Anuth, H., Esmailoghli, M. et al. Data Science for Everyone: Basics of Data Programming. Informatik Spektrum 43, 129-136 (2020). https://doi.org/10.1007/s00287-020-01253-8
