danslimmon.com
Dan Slimmon
dan@danslimmon.com
Professional Experience
HashiCorp: Staff SRE, Production Engineering (Apr–Nov 2023); Senior SRE, Production Engineering (Aug 2022–Apr 2023)
I proposed and founded the Production Engineering team, whose mission was to
identify problems in the behavior of production systems and get them fixed
before they got worse.
- Leveraged developer time on a weekly rotation basis to comb through the data and act upon our findings
- Built a community of practice around the Production Engineering approach, bringing together engineers from many teams to investigate and solve production problems in an evidence-oriented way
- Eliminated dozens of customer-facing failure modes and threats to stability:
  - Reduced PostgreSQL resource consumption by more than half, thereby solving time-critical scaling and reliability problems
  - Used queueing theory to explain the obscure cause of API request timeouts
  - Documented and resolved many classes of race conditions that were eroding data integrity
HashiCorp: Senior Site Reliability Engineer, Terraform Platform (2019 - 2022)
My team was responsible for the care and feeding of Terraform Cloud,
HashiCorp's first real foray into software-as-a-service. I focused on
the design of observable systems and the constant adaptation of our team's
procedures and practices.
- With my team, reimagined and replaced Terraform Cloud's job pipeline, delivering a scalable, secure, and reliable architecture, all without interruption to ongoing business.
- Created an incident command training program and led a team of volunteer incident commanders from diverse career backgrounds.
- Established practices throughout the organization for finding and fixing customer-facing issues (standardized dashboards, troubleshooting techniques, and signal-to-noise ratio tuning).
- Managed the infrastructure underlying HashiCorp's flagship SaaS product, Terraform Cloud, an infrastructure comprising Terraform, Nomad, Consul, and Vault.
- Contributed to the Terraform Cloud codebase in both Ruby and Go.
- Gratuitous technology namedropping: Ubuntu, Consul, Nomad, Terraform, Vault, PostgreSQL, AWS, Datadog, Ruby on Rails, Go.
Etsy: Senior Operations Engineer, Observability Team (2016 - 2019)
As the most senior engineer on a team responsible for the observability
of Etsy's systems, I took on a role of technical leadership. While I did
implement solutions myself as needed, my focus was on developing the skills
of junior engineers, communicating across team boundaries, and ensuring that
our team's work was well organized and well planned.
- Organized the effort to migrate all observability systems (including metrics, logs, and alerting) from data center colocation to Google Cloud Platform.
- Collaborated with compliance engineers and auditors to redesign the SOX compliance alerting pipeline and better match technology to the problem.
- Reduced the team's on-call alert noise by an order of magnitude by introducing a regular on-call review.
- Directed the homogenization of disparate Kubernetes-resident Prometheus instances used in production by different teams, as well as the placement of these systems under centralized supervision.
- Managed several ELK clusters ingesting up to 8 TB per day of heterogeneous log data, with 24/7 uptime requirements.
- Rearchitected ELK deployments to include a decoupling layer in the form of Apache Kafka, with the ultimate goal of outsourcing Elasticsearch.
- Mentored junior engineers in coding and systems architecture.
- Seized every opportunity to improve the usability and efficiency of SOX compliance controls.
- Gratuitous technology namedropping: CentOS, Chef, Graphite, Grafana, ELK, Docker, Kubernetes, Google Cloud Platform, Prometheus, Kafka.
Exosite: Senior Platform Engineer (2013 - 2016)
I joined Exosite, a Minneapolis-based Internet of Things startup, as its first
full-time operations engineer. As the company rapidly grew, my responsibilities
ranged from software performance analysis to drafting security policy to
implementing customer-facing IoT applications. But I was always, first and
foremost, an ops engineer.
- Diagnosed baffling crashes and service degradations armed only with an R console and boundless optimism.
- Fostered a data-oriented culture by making it simple to expose and analyze metrics, and by working with developer teams to build flexible metric dashboards.
- Popularized blameless post-mortem analysis and raised company-wide appreciation for the New View of safety.
- Designed and wrote several internal webapps and customer-facing services.
- Represented the company with speaking engagements at Monitorama, O'Reilly Velocity, & DevOps Days.
- Organized office happy hours, an Observability Guild to spread data awareness through the company, and the biweekly Paper Club, where coworkers get together to discuss an academic paper.
- Gratuitous technology namedropping: Ubuntu, Amazon AWS, Linode, ELK stack, Graphite, Grafana, Jenkins, Ansible, Salt, Python, R, Go, PHP, NodeJS, Convox, OpenShift, Docker, Nagios.
Blue State Digital: Operations Team Manager (2011 - 2013)
When the manager of Blue State's web operations team left the company, I applied
for that job and got it. During my tenure, I shepherded the team through the
difficult 2012 elections, ensuring the continuity of donations and grassroots
organizing efforts by Obama for America, among other campaigns and organizations.
- Managed a team of 4 ops engineers, encouraging an infrastructure-as-code approach and collaboration with software development teams.
- Maintained 99.9% uptime of bulk email, web, and fundraising services.
- Oversaw the design and implementation of an email infrastructure capable of sending up to 100 million messages per day at sustained rates of 15 million per hour, with bounce rates consistently under 1%.
- Cultivated strong relationships with clients such as the It Gets Better Project and Obama for America through clear, direct communication.
- Organized weekly office bar trips.
- Gratuitous technology namedropping: CentOS, Amazon AWS, Chef, Apache, BIND, MySQL, AMQP, Logstash, Graphite, Ruby, Bash, R, Python, Perl, Nagios.
Blue State Digital: Senior Linux Administrator (2007 - 2011)
Fresh out of college and with no professional Linux experience, I dove into the
task of organizing the patchwork collection of procedures and scripts that held
together the rapidly growing infrastructure of Blue State Digital. I honed my
skills quickly, taking immense pride in my work.
- Administered 100 or so CentOS servers, including physical machines, VMware guests, and EC2 instances, running the LAMP stack.
- Implemented Chef after evaluating a range of configuration management systems.
- Led a small team in designing and coding a fault-tolerant web app for creating client accounts and managing their Apache and BIND configurations, targeted to a non-technical user base.
- Managed terabyte-scale, mission-critical MySQL databases with multiple replicas.
- Responded to system emergencies in an on-call rotation.
Personal Projects
- Blog: Nominally focused on DevOps, with topics ranging from queueing theory to psychology to medicine
- QSim: Queueing theory simulation framework in Go
- Oscar: Home automation tool that adds items to your grocery list when you run out
- SecretShare: A system for sending secret data securely to coworkers and friends
- AWSBill2Graphite: A script that turns AWS billing data into useful graphs
Education
Wesleyan University: B.A. Physics and Mathematics, 2007