Projects

Below are descriptions of the projects for the current REU summer. Each description outlines the project and explains how students will participate in it.

Course Curriculum Assessment and Enrichment Project (Drs. Rangwala and Lester)

An enduring issue in higher education is student retention to successful graduation. Requiring additional terms or leaving college without receiving a bachelor’s degree has high human and monetary costs and deprives students of the economic benefits of a college credential. There is a critical need to develop innovative approaches that enable higher-education institutions to retain students, ensure their timely graduation, and train a workforce ready for their field. As part of an NSF-funded BIGDATA project (PI: Rangwala, Co-PIs: Lester and Johri), critical applications are being developed that help students choose academic pathways towards successful and timely graduation, aid instructors in improving their pedagogy, and help advisors and institutions improve their retention rates.

Programs of study at institutions of higher education can be represented as a chain composed of the courses required to complete a degree. These component courses in turn are composed of the topics or concepts they are intended to cover. Evaluation of the courses within a particular program is necessary for the evaluation of an overall academic curriculum. Analyzing the structure of a program’s prerequisite chain, for example, requires an understanding of each constituent course and any overlap of covered topics between courses and their prerequisites. Additionally, inter-institutional curricular comparison requires an aggregate evaluation of the courses within each institution’s program. However, comparing and evaluating different courses requires expert knowledge in the relevant field. No two courses can be measured for similarity based only on inherent, measurable properties. A domain expert is required to inspect the course content and description and determine their conceptual overlap.

REU students will receive an overview of unsupervised machine learning methods, text processing techniques, and case studies with Python during the first weeks of the program. With these skills, they should be able to contribute to key elements of this project. REU students will develop a topic modeling-based approach to automatically infer key course topics from publicly available course descriptions and syllabi. Students will improve upon this model by seeking alternate information retrieval features that rely on supervised learning from an external domain such as Knowledge Graphs (e.g., DBPedia).
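As a rough illustration of inferring key terms from course descriptions, the sketch below ranks words by TF-IDF using only the Python standard library. It is a simpler stand-in for a topic model, not the project's method. The course names, descriptions, stopword list, and the `top_terms` function are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical course descriptions; real input would be public catalog text.
COURSES = {
    "CS 101": "Introduction to programming: variables, loops, functions, and recursion.",
    "CS 201": "Data structures: lists, trees, graphs, recursion, and algorithm analysis.",
    "STAT 250": "Probability, sampling, hypothesis testing, and regression analysis.",
}
STOPWORDS = {"and", "to", "the", "of", "a", "in"}  # illustrative, not exhaustive


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_terms(courses, k=3):
    """Return the k highest-scoring TF-IDF terms for each course."""
    docs = {name: Counter(tokenize(text)) for name, text in courses.items()}
    n_docs = len(docs)
    # Document frequency: in how many course descriptions each term appears.
    doc_freq = Counter(term for counts in docs.values() for term in counts)
    result = {}
    for name, counts in docs.items():
        scores = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


for course, terms in top_terms(COURSES).items():
    print(course, terms)
```

Terms shared between courses, such as "recursion" in the two CS descriptions, score lower than terms unique to one course. That same shared vocabulary is one signal of topical overlap between a course and its prerequisite.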

REU students will also have the option to develop methods that connect the course concepts derived from text-based course information with career-related data obtained from job-posting websites. By comparing course and job topics, students will determine the percentage of job opportunities whose topics are fully covered by an academic program, which opens the opportunity to address the gap between the skills an academic program provides and those needed for a successful career trajectory. This work combines heterogeneous information sources to make automated inferences and to generate actionable changes to curricula based on analytical models.
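The coverage measure described above can be sketched with plain set operations. This is a minimal illustration, assuming topics have already been extracted and normalized to comparable labels. The topic sets, posting names, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical, already-normalized topic labels.
PROGRAM_TOPICS = {"python", "statistics", "databases", "algorithms"}
JOB_POSTINGS = {
    "data analyst": {"python", "statistics"},
    "backend developer": {"python", "databases", "cloud"},
    "ml engineer": {"python", "statistics", "deep learning"},
    "software engineer": {"algorithms", "databases"},
}


def coverage_percentage(program_topics, job_postings):
    """Percent of postings whose topics are all contained in the program's topics."""
    if not job_postings:
        return 0.0
    covered = sum(1 for topics in job_postings.values() if topics <= program_topics)
    return 100.0 * covered / len(job_postings)


def missing_topics(program_topics, job_postings):
    """Count how many postings require each topic the program does not cover."""
    gaps = Counter()
    for topics in job_postings.values():
        gaps.update(topics - program_topics)
    return gaps


print(coverage_percentage(PROGRAM_TOPICS, JOB_POSTINGS))  # 50.0
print(missing_topics(PROGRAM_TOPICS, JOB_POSTINGS))
```

The count of missing topics gives a first hint of which additions to a curriculum would raise coverage the most.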

MOOC Data Analytics and Visualization Project (Drs. Lin and Nelson)

Massive open online courses (MOOCs) aim to deliver high-quality education to anyone with an internet connection. It is well known that only a small percentage (1-5% on average) of students enrolled in a MOOC class participate in all its activities, especially assessments. Additionally, MOOCs face difficulties guaranteeing that assessments are fairly administered and properly awarding credit to students who complete the course. Successfully addressing these issues will increase retention of students within MOOCs and raise confidence in MOOCs as a viable educational vehicle, thereby making low-cost, quality education more accessible to everyone. Several MOOC platforms have made their log data publicly available after anonymization. Mining of MOOC-related data has grown rapidly in the past year. The top data mining conference, ACM KDD, hosted a data mining competition to predict student dropouts based on data from the largest China-based MOOC, XuetangX. (See KDDCUP).

There is a critical need to identify, early in the course, MOOC students at risk of dropping out so that MOOC instructors can intervene to improve retention. REU students will work on early warning systems that predict whether MOOC students are likely to disengage from the course. REU students may also explore models that predict the performance of MOOC students on assessments, to enable efficient dispatch of personal help. The project will use data made available from OpenEdX, Coursera and NovoEd platforms via Datastage hosted by Stanford University.
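One possible shape for such an early warning model is a logistic regression over first-week activity features. The sketch below implements it from scratch with batch gradient descent. The two features, the six toy learners, and the function names are assumptions for illustration, not the project's data or chosen model.

```python
import math

# Hypothetical week-1 features per learner:
# (fraction of videos watched, fraction of quizzes attempted)
X = [[0.9, 1.0], [0.8, 0.8], [0.7, 1.0], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0]]
Y = [0, 0, 0, 1, 1, 1]  # 1 = learner later dropped out


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(features, labels, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def dropout_risk(model, x):
    """Estimated probability that a learner with features x drops out."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


MODEL = train_logistic(X, Y)
for x in X:
    print(x, round(dropout_risk(MODEL, x), 3))
```

Learners whose estimated risk exceeds a chosen threshold could then be flagged for instructor follow-up. Real log data would need far richer features and a held-out evaluation.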

There is also a need for visualization tools for the irregular, large-scale sequential data generated by these systems. REU students will contribute to the development of a visual analytics platform that incorporates a variety of capabilities: querying and displaying individual learner histories; displaying aggregate views that automatically reveal frequent and anomalous patterns across a group of learners; locating groups of similar learners based on subsets of attributes (subspace clustering); and contrasting individual learners or groups of learners, for example to compare those who successfully completed the course with those who dropped out. Students experienced with graphics and user interfaces can contribute to the visualization project.
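An aggregate view of frequent patterns needs some way to count which event sequences recur across learners. The sketch below counts how many learners exhibit each consecutive pair of events. The event names, learner histories, and `frequent_patterns` function are hypothetical, and a real platform would likely use a more capable sequence-mining method.

```python
from collections import Counter

# Hypothetical per-learner event histories, in time order.
HISTORIES = {
    "learner_a": ["video", "quiz", "forum", "quiz"],
    "learner_b": ["video", "quiz", "video", "quiz"],
    "learner_c": ["video", "forum"],
}


def frequent_patterns(histories, n=2, min_support=2):
    """Return n-event patterns that occur in at least min_support learner histories."""
    support = Counter()
    for events in histories.values():
        # Use a set so each learner counts at most once per pattern.
        patterns = {tuple(events[i:i + n]) for i in range(len(events) - n + 1)}
        support.update(patterns)
    return {p: c for p, c in support.items() if c >= min_support}


print(frequent_patterns(HISTORIES))  # {('video', 'quiz'): 2}
```

Patterns with very low support are candidates for the anomalous side of the same view.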

Scratch Data Analysis Project (Drs. Domeniconi and Johri)

Scratch offers a unique platform for learning how to write programs through the use of interactive blocks. No prior knowledge of programming is required. The majority of Scratch users are kids and teens. We have a limited understanding of how such unstructured learning environments evolve over time. In particular: how are communities of users formed, are those communities based on common interests, how do they evolve, when does a community go through a growth period, and when do communities split? To facilitate the study of online communities, MIT Media Lab has created the Scratch dataset from the Scratch Online Community. This dataset contains 1,056,950 registered users and 1,928,699 projects created between March 5, 2007 and April 1, 2012. The attributes associated with users include follower/followed connections, join dates, and number of projects. Project attributes include creation dates, types of code blocks used to create the project, and whether the project was remixed from others.

REU students will develop models of informal learning and examine the patterns that emerge. Identifying diverse learning trajectories for users can provide the opportunity for targeted intervention and support for a specific subpopulation, leading to an improved experience for community members. Novice REU students can develop methods that combine different features within a simple clustering algorithm such as k-means, which will be discussed during the program’s initial tutorial. Advanced REU students with more machine learning experience can develop probabilistic graphical models or advanced co-clustering models for representing informal learning data. The project will leverage our prior work on such models, which involved several undergraduates. The project follows the central theme of modeling learning behaviors using latent factor models in informal environments and provides insight into features of the learning environment that are suited for improved learning; in this case, logic-based programming among kids.
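For the novice track, a k-means clustering of users over a few combined features might look like the sketch below. It uses only the standard library and a deterministic farthest-first initialization chosen for reproducibility. The two features, their scaling to [0, 1], and the six toy users are assumptions, not values from the Scratch dataset.

```python
import math

# Hypothetical per-user features, each scaled to [0, 1]:
# (share of projects that are remixes, share of projects using advanced blocks)
USERS = [
    (0.10, 0.90), (0.20, 0.80), (0.15, 0.85),
    (0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
]


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


labels, centroids = kmeans(USERS, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because k-means relies on Euclidean distance, features on very different scales (for example raw project counts next to fractions) should be normalized before they are combined.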

Analyzing Online Engineering Communities Project (Drs. Johri and Sheridan)

One important avenue for technology-driven informal engineering learning is online communities of engineering and design enthusiasts. These communities have formed around a range of engineering topics and tools, such as microelectronics, making, rapid prototyping, and open source software, and bring together engineering practitioners, hobbyists, and students to engage in information sharing and product development (e.g. Do-It-Yourself projects). These sites, called online engineering learning communities (OELCs), do not form with the primary intention of education. These communities have accumulated rich sources of materials and user data and can form the backbone for understanding and implementing technology-based informal engineering learning. Research to understand what works and what does not can accelerate innovations in cyber-enabled learning within engineering education. OELCs are important sites for learning because they are informally structured: rather than being hierarchical based on age or tenure, the hierarchy is established informally over time, based primarily on expertise. Also, people participate in these communities due to their affinity and form communities of interest or, in many cases, communities of practice. Finally, an important aspect of these OELCs is that people assemble here not only to discuss and share knowledge but also to design knowledge-based products and services.

REU students will focus specifically on analyzing forum data from Makers and Making and examine OELCs around these activities. Specifically, students will gather data from these sites in accordance with their Terms of Use and then clean the data for analysis. We are interested in understanding users’ motivations for participation, the nature of interaction (linguistic as well as sharing of artifacts), and evidence of learning (e.g. posts about successful completion of a project). We also want to understand how a Community of Practice evolves around Making and Makers, the varied roles participants take on, and the kinds of support for learning it provides to newcomers. To that end, REU students will develop advanced text clustering approaches for analyzing the discussion forum data and use a bipartite matching technique to group communities together. Evaluation will follow a mixed-methods approach in which interviews will be set up with the OELCs to evaluate the mining results.

Analytics for Online Programming Studies (Drs. LaToza and Snyder)

Researchers designing new programming tools and language features require data to understand the impact of these tools on developer productivity. Traditionally, this data has been limited, as finding sufficient participants among university students is challenging. Conducting such studies online offers new possibilities, making it possible to recruit from a wider participant population and potentially run far more studies than is currently possible. To enable this, we have begun constructing a platform for online programming studies, offering the possibility of creating a wealth of data evaluating the effects of programming tools and languages on productivity. A key step in the development of this platform is creating the analytics capabilities for analyzing this data, enabling researchers everywhere who use the platform to easily explore data from studies, test hypotheses, and generate findings.

REU students will develop a web-based data analytics platform for analyzing data from programming studies. A variety of data sources are collected for each study session and can be analyzed, including task time, task completion, and code created, as well as time series data describing code edits. The key analytical contributions will involve development of methods that mine heterogeneous, sequential data sources to model behaviors of participants learning to program. The REU students will develop dashboard views offering an overview of this data, enabling experimenters to see overall patterns in the data. Hypothesis testing is also central to the analysis of study data. REU students will work to build support for selecting dependent and independent variables and testing differences between conditions. Beyond simple hypothesis testing, being able to dig into data to understand differences is important. REU students will work to deeply integrate the data analysis views into detail views of items within the data set, such as code edits or task times, to enable researchers to qualitatively understand and make sense of this complex, multi-source, time-stamped experimental data.
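Testing a difference between two study conditions could, for small samples, be done with an exact permutation test on a dependent variable such as task time. The sketch below is one such test, not the platform's actual analysis. The condition names and task times are made up, and `permutation_test` is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean

# Hypothetical task times in seconds; the independent variable is the condition.
CONTROL_TIMES = [310, 295, 330, 305, 320]
TOOL_TIMES = [240, 255, 230, 260, 245]


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Enumerates every way to split the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    for chosen in combinations(range(len(pooled)), len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-9:
            extreme += 1
        total += 1
    return extreme / total


p_value = permutation_test(CONTROL_TIMES, TOOL_TIMES)
print(f"p = {p_value:.4f}")
```

A permutation test makes no normality assumption, which suits small and skewed task-time samples. For larger studies, random sampling of permutations or a standard parametric test would replace the full enumeration.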