Systems View

Operations Research/Analytics Consulting

Data trumps algorithms in machine learning

1/10/2020 - 11:00 a.m.

In my experience, developing good predictors (“features”) in your input data has a much greater impact on solution quality than excessively tuning algorithm parameters or finding the absolute best predictive algorithm among good alternatives.  I first learned this when competing in Kaggle competitions, where the quantitative effect of such an approach could be seen immediately on the leaderboard.  Competitors who focused only on algorithm tweaking tended to move incrementally upward over time, with marginal improvements in their scores, but the addition of a single good predictor variable could result in leapfrogging to the top tier.

My recent experience in the health care industry reinforced this perception, but also highlighted another aspect: data development can consume far more time and resources than algorithm development.  In our case, we were working with medical claims data to make predictions about future utilization and costs, a daunting task, especially at the individual level.  Identifying a category of medical event often involved complex logic over diagnostic codes, procedure codes, facility codes, revenue codes, date sequences, and more.  That logic sometimes needed to operate efficiently over millions or even billions of medical claim records.  The result of all that effort might be a single column of reasonably reliable binary indicators (production runs would typically compute many such predictors simultaneously).

In other applications, the choice of feature is not at all clear cut.  For example, a single feature might be needed to represent a specific chronological sequence of events, or the geographical juxtaposition of interacting elements.  Identifying these from raw data can be very computationally intensive.  Recent developments, such as scalable graph data structures and algorithms (e.g. https://www.oreilly.com/library/view/graph-algorithms/9781492047674/), can be useful here.  Given the ubiquity of high-quality, often free machine learning software libraries (e.g. https://scikit-learn.org/stable/index.html), application development should always put the greatest focus on the data.
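To make the claims example concrete, here is a minimal sketch in pandas of how one such binary indicator might be computed.  The column names, codes, and rule are hypothetical stand-ins for the far more involved production logic described above, not the actual logic we used.

    import pandas as pd

    # Toy claims data; real inputs would have many more fields and
    # millions of rows.
    claims = pd.DataFrame({
        "member_id": [1, 1, 2, 3],
        "diag_code": ["E11.9", "I10", "E11.65", "J45.20"],
        "place_of_service": ["11", "23", "23", "11"],
        "service_date": pd.to_datetime(
            ["2019-03-01", "2019-06-15", "2019-07-04", "2018-02-10"]),
    })

    # Illustrative rule: a diabetes diagnosis (ICD-10 E11.*) recorded
    # in an emergency department (place-of-service code 23) on or
    # after a cutoff date.
    cutoff = pd.Timestamp("2019-01-01")
    is_hit = (
        claims["diag_code"].str.startswith("E11")
        & claims["place_of_service"].eq("23")
        & claims["service_date"].ge(cutoff)
    )

    # Collapse claim-level hits to one 0/1 indicator per member --
    # the single predictor column described above.
    indicator = is_hit.groupby(claims["member_id"]).any().astype(int)
    print(indicator)

The pattern scales conceptually; the hard part in practice is getting the rule right and pushing it through a scalable engine, not fitting the model that eventually consumes the column.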

Python geospatial analysis tools

12/10/2019 - 7:23 p.m.

I recently reviewed the functionality of a Python software module called GeoPandas (http://geopandas.org/).  In the past, Systems View developed complex routing software in C#.  That application required writing low-level code for tasks such as converting between different map coordinate reference systems, computing distances, and determining attributes of polygons.  GeoPandas performs these and other complex mapping tasks under the hood: a few lines of code do what previously required lengthy mathematical, geometric, and trigonometric subroutines in C#.  GeoPandas can also read directly from geographic shapefiles, which are a standard for GIS.

I can see how all of this functionality could be leveraged for an upcoming client engagement, where it would be useful to determine the intersections between geographic subunits with known demographics (e.g. zip code areas as shapefiles) and superimposed “effect” regions.  In this way, demographically driven effects could be predicted for the region, perhaps in conjunction with machine learning techniques.  For those interested, Kaggle has a great free introductory course at https://www.kaggle.com/learn/geospatial-analysis.
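As a rough illustration, here is how that intersection computation might look in GeoPandas.  The file names, the demographic column, and the choice of projection are assumptions made for the sketch, not details of the actual engagement.

    import geopandas as gpd

    # Hypothetical inputs: zip code polygons carrying demographics,
    # and polygons for the superimposed "effect" regions.
    zips = gpd.read_file("zip_code_areas.shp")     # zip, population, geometry
    effects = gpd.read_file("effect_regions.shp")  # region_id, geometry

    # Reproject both layers to a common equal-area CRS so polygon
    # areas are meaningful (EPSG:5070 covers the continental US).
    zips = zips.to_crs(epsg=5070)
    effects = effects.to_crs(epsg=5070)
    zips["zip_area"] = zips.geometry.area

    # Overlay the layers: one row per zip-by-region intersection.
    pieces = gpd.overlay(zips, effects, how="intersection")

    # Apportion each zip's population to a region by area share.
    pieces["pop_share"] = (
        pieces["population"] * pieces.geometry.area / pieces["zip_area"]
    )
    region_pop = pieces.groupby("region_id")["pop_share"].sum()

Area-weighted apportionment assumes demographics are uniform within each zip code, which is a simplification, but it is often a reasonable first cut.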

Systems View reboot

11/1/2019 - 8:15 p.m.

I started Systems View in 1997.  In 2013, I was asked to consult for NextHealth Technologies, a Denver start-up that was looking to use advanced analytics to help health insurance companies better engage with their members to reduce costs and improve health outcomes.  In 2014, I agreed to join NextHealth as an employee, and advanced to the position of SVP of Analytics before retiring in October 2019.  I had some great experiences at NextHealth, including the design and/or development of many of the algorithms at the core of their product.  I had the opportunity to learn a lot more about simulation-optimization, machine learning, bootstrap and Bayesian statistics, decision trees, and propensity score matching - not to mention learning about start-up company evolution, venture capital, stock options, boards, and all manner of corporate management processes.

During my time at NextHealth, I kept the lights on at Systems View with a few small legacy projects for some previous clients.  Now that I have left full-time external employment, I have the opportunity to apply my knowledge to some new endeavors.  (I am starting with this first cut at a redesigned web site!)  Looking forward to this next phase.

I recently coauthored an interesting book

10/30/2019 - 10:15 a.m.

I was privileged to assist with the creation of an interesting book in collaboration with Dr. Tony Cox and Richard Sun.  I have worked with Tony, of Cox Associates, for over 20 years, and this book describes several key projects we have worked on.  These projects all involved health risk analysis, and the book describes in detail the methodologies we used.  The emphasis of these studies was on causal analytics, in contrast to the many studies that mistake correlation for causation.  In fact, many of our efforts were motivated by a desire to (in)validate conventional wisdom in areas such as animal antibiotics and air pollution, where correlation has been misinterpreted.  However, the book is much more than a compilation of studies.  Tony does an excellent job of drawing higher-level lessons and providing a how-to guide for risk analysts.  The book definitely has a point of view.  My favorite chapter title: “Improving risk management: from lame excuses to principled risk management.”  It is available from the publisher or Amazon.