Photo and article by Maya Maaloul
What makes changes in data significant? How do we, as humans with subjective opinions, decide this? Which methods detect these changes best, and what if we can’t find the best tool?
During an Amherst College Statistics and Data Science (SDS) Colloquium on September 24, Ben Baumer, a professor of statistics and data science at Smith College, addressed some of these queries.
Detecting and tracking changes in data patterns is a difficult process. The real world does not present itself linearly, and there is no single formula that describes the patterns in collected data. These concerns relate back to changepoint detection, an important concept in data science. Simply put, a changepoint is exactly what it sounds like! It’s the point at which the data change in some significant way. It follows that changepoint detection (CPD) identifies the points at which some property of the data shifts (e.g., the distribution of values, as reflected in a change in the average). CPD helps researchers understand and pinpoint the causes of changes in data in subjects ranging from healthcare (think ECG waves) to the stock market (such as identifying economic trends).
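To make the idea concrete, here is a minimal sketch in plain Python of finding a single changepoint in the average of a series. It is not the tidychangepoint package (which is written for R) and not a method described in the talk; the function name `best_mean_shift` and the sample numbers are made up for illustration. The approach assumes there is exactly one shift and picks the split that leaves each side closest to its own average.

```python
def best_mean_shift(values):
    """Return (index, cost) of the split that minimizes total squared error
    when each side of the split is summarized by its own mean.

    Illustrative only: assumes exactly one changepoint in the mean.
    """
    def sse(segment):
        # Sum of squared deviations of a segment from its own mean.
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    best_index, best_cost = None, float("inf")
    # Try every position that leaves at least one value on each side.
    for k in range(1, len(values)):
        cost = sse(values[:k]) + sse(values[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, best_cost


# Made-up series: values hover near 10, then jump to about 15.
series = [10.1, 9.8, 10.3, 9.9, 10.0, 14.9, 15.2, 15.0, 14.8, 15.1]
index, cost = best_mean_shift(series)
print(index)  # position of the first value after the shift
```

A sketch like this always reports some split, even for data with no real change, which is one reason deciding whether a change is significant takes more than finding the best-looking break.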
CPD is challenging for several reasons. For one, we can look at a graph and notice when its shape seems “off” or “inconsistent” with its previous appearance, but these variations may not mark an actual anomaly. When inconsistencies in data are less obvious, it is also quite challenging to determine which changes (if any) are significant, or statistically meaningful. Data enable specialists to make predictions about the future, so trouble tracking true shifts in patterns can lead to misinformed projections.
Professor Baumer previously worked as a statistical analyst for the New York Mets, but it was when he moved from the baseball field to the software development field that tidychangepoint came about to address the challenges of CPD.
While on sabbatical, Baumer spent time at Universidad EAFIT in Medellín, Colombia, where he worked with his colleague, statistician Dr. Biviana Marcela Suarez Sierra, to develop a package known as tidychangepoint. At this fall’s SDS Colloquium, Baumer delivered a compelling talk on tidychangepoint and its problem-solving potential.
The name tidychangepoint comes from the term “changepoint,” as discussed already, and from the “tidyverse,” a collection of packages for the statistical programming language R. The tidyverse helps users organize and visualize their data, and tidychangepoint’s syntax (the set of rules governing the wording and punctuation of code) is compatible with that of the tidyverse.
There are many algorithms that data scientists use to identify changepoints, a number of which are available in R. Some include Pruned Exact Linear Time (PELT) and genetic algorithms. These algorithms (and others) are complex in their own right, but the most pertinent detail is that each one suits specific situations. For example, PELT needs its data to meet certain conditions that genetic algorithms don’t necessarily require. However, genetic algorithms are very time-consuming to execute, and statisticians often don’t have the time to test several algorithms separately. The tidychangepoint package therefore serves as a unifier that can try out multiple algorithms in a compressed time frame, expediting the CPD process.
In other words, tidychangepoint can run several changepoint detection algorithms through one consistent interface. This makes it possible to compare the results of various algorithms and choose the best one (or the best set of changepoints) for a set of data. By measuring the shifts in values before and after sudden changes, data scientists can then create better forecasts and more accurate interpretations of the data.
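The comparison idea can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not tidychangepoint’s interface (the package is for R) and not the algorithms from the talk: it swaps in two simple stand-ins, a greedy binary segmentation and an exact dynamic program of the kind that PELT speeds up by pruning. All function names, the penalty value, and the sample numbers are hypothetical. Both detectors take the same inputs and are scored with the same penalized cost, so their results can be compared side by side.

```python
def segment_cost(values, start, end):
    """Squared error of values[start:end] around its own mean."""
    segment = values[start:end]
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)


def penalized_cost(values, changepoints, penalty):
    """Total within-segment error plus a fixed penalty per changepoint."""
    bounds = [0] + sorted(changepoints) + [len(values)]
    fit = sum(segment_cost(values, a, b) for a, b in zip(bounds, bounds[1:]))
    return fit + penalty * len(changepoints)


def binary_segmentation(values, penalty, start=0, end=None):
    """Greedy: split a segment where it helps most, then recurse on each half."""
    end = len(values) if end is None else end
    if end - start < 2:
        return []
    whole = segment_cost(values, start, end)
    best_k, best_gain = None, 0.0
    for k in range(start + 1, end):
        gain = whole - segment_cost(values, start, k) - segment_cost(values, k, end)
        if gain > best_gain:
            best_k, best_gain = k, gain
    # Only keep a split whose improvement outweighs the penalty.
    if best_k is None or best_gain <= penalty:
        return []
    return (binary_segmentation(values, penalty, start, best_k)
            + [best_k]
            + binary_segmentation(values, penalty, best_k, end))


def optimal_partitioning(values, penalty):
    """Exact dynamic program over all segmentations (no pruning, so it is slow)."""
    n = len(values)
    best = [0.0] * (n + 1)  # best[t]: minimal penalized cost of values[:t]
    last = [0] * (n + 1)    # last[t]: start of the final segment in that optimum
    best[0] = -penalty      # so the first segment carries no penalty
    for t in range(1, n + 1):
        best[t], last[t] = min(
            (best[s] + segment_cost(values, s, t) + penalty, s) for s in range(t)
        )
    changepoints, t = [], n
    while last[t] > 0:      # walk back through the segment starts
        t = last[t]
        changepoints.append(t)
    return sorted(changepoints)


def compare(values, penalty, detectors):
    """Run every detector on the same data and score each with the same cost."""
    results = {}
    for name, detect in detectors.items():
        found = detect(values, penalty)
        results[name] = (found, penalized_cost(values, found, penalty))
    return results


# Made-up series with three levels: near 0, near 5, near 2.
series = [0.1, -0.2, 0.0, 0.2, 5.1, 4.9, 5.0, 5.2, 1.9, 2.1, 2.0, 1.8]
results = compare(series, penalty=1.0, detectors={
    "binary_segmentation": binary_segmentation,
    "optimal_partitioning": optimal_partitioning,
})
best_name = min(results, key=lambda name: results[name][1])
for name, (found, score) in results.items():
    print(name, found, round(score, 4))
```

On clean data like this the two approaches agree. On noisier series they can disagree, and a shared score gives one way to choose between them.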
As Baumer puts it, “tidychangepoint can help data analysts compare the performance of many changepoint detection algorithms quickly and easily.”
Missed this SDS Colloquium but want to check out future seminars? See https://nhorton.people.amherst.edu/colloquia/ for more information.