Senior Site Reliability Engineer

Job Details
Job Order Number
JC148306290
Company Name
Humana
Physical Address

Birmingham, AL 35205
Job Description

Description

Site Reliability Engineers are software development experts who improve the application lifecycle, evolve software systems to increase their reliability, monitor application performance, and ensure overall system health: high availability, low latency, strong performance, high efficiency, effective change management, continuous monitoring and alarming, emergency response, and capacity planning. They act as a bridge between development and operations teams by applying a software engineering mindset to system administration topics.

Responsibilities

+ Building software to help operations and support teams: SRE teams are in charge of proactively building and implementing services to make IT and support better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management.

+ Fixing support escalation issues: A site reliability engineer can expect to spend time fixing support escalation cases. Because an SRE team touches so many different parts of the engineering and IT organization, they can be a great source of knowledge and can be helpful for routing issues to the right people and teams.

+ Optimizing on-call rotations and processes: Site reliability engineers will need to take on-call responsibilities. The SRE role will have a lot of say in how the team can improve system reliability through the optimization of on-call processes. SRE teams will help add automation and context to alerts – leading to better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools and documentation to help prepare on-call teams for future incidents.

+ Documenting “tribal” knowledge: SRE teams gain exposure to systems in both staging and production, as well as all technical teams. They take part in work with software development, support, IT operations and on-call duties – meaning they build up a great amount of historical knowledge over time. Instead of siloing this knowledge into the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it.

+ Conducting post-incident reviews: SRE teams need to keep teams honest and ensure that everyone, software developers and IT professionals alike, is conducting post-incident reviews, documenting their findings and acting on what they learn. Site reliability engineers are then often tasked with action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.

Responsibilities (representative examples):

+ Capacity planning and management: create, use and maintain a capacity model for cloud-based implementations.

+ Perform continuous integration and delivery; implement, test and monitor new microservices; and troubleshoot related deployment issues on Linux systems.

+ Collect and maintain a complete inventory of all systems. Identify and retire unused systems to recycle resources and reduce maintenance costs.

+ Create and maintain documentation of systems and processes for existing and new systems, and configure and maintain Puppet manifests, Ansible playbooks and Chef cookbooks for all deployed environments.

+ Deploy and monitor instances and services in cloud-based environments; identify and correct the root cause of system alarms, and recommend changes to avoid their recurrence.

+ Provide systems support by participating in the on-call rotation, executing emergency recovery, maintenance and upgrades during weekend and evening hours when required.

+ Serve as an escalation point for other Systems Administrators, Engineers, and other technology teams in the resolution of server and system problems.

+ Lead and contribute to the proof of concept, implementation and maintenance of automation tools used to manage our infrastructure.

+ Plan, schedule, test and perform software installation and upgrades.

+ Build, administer and troubleshoot all mission-critical environments (Production, Stage, Dev, Test, QA).

+ Leverage automation tools, especially Bash, PowerShell and Puppet, to decrease end-to-end deployment times, reduce downtime and increase reliability.

+ Implement and maintain monitoring solutions at the server and application level, using Nagios and ELK/Splunk, to increase visibility into day-to-day operations and issues.

+ Lead initiatives to transition critical software services into the Cloud, and provide training for other employees on the Cloud transition process for other portions of the product/organization.

+ Generate well-defined and documented standard processes for the enterprise.

+ Provide solutions for performance management, disaster recovery, monitoring and access management

+ Work with and support business users to understand issues, develop root cause analyses, and work with the team to develop enhancements and fixes.

+ Provide engineering design across different workloads including incident & problem management, change management, security and compliance

+ Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in the team's development and growth.

Required Skills:

+ 5+ years of industry (post-graduation) experience designing, developing, testing and supporting a highly scalable, highly available online service.

+ 5+ years of industry (post-graduation) experience working with cloud-based environments (AWS, Google and/or Azure).

+ 5+ years of industry (post-graduation) experience working with the Linux and Windows operating systems.

+ 2+ years of industry (post-graduation) experience with configuration management frameworks, using tools such as Puppet, Ansible and Chef.

+ 2+ years of industry (post-graduation) experience with distributed processing frameworks like Spark and orchestration frameworks like Kubernetes and Docker Swarm for microservices.

+ 2+ years of industry (post-graduation) experience with scripting languages (Bash, Python and PowerShell).

+ Working knowledge of TCP/IP, TCP/UDP as well as working knowledge of routers, switches, firewalls/VPNs and higher-level protocols like HTTP and DNS.

+ Working knowledge of monitoring and alarming tools like Nagios and ELK/Splunk.

+ Working knowledge of relational and non-relational databases: MS SQL, MySQL, Postgres, Oracle & Mongo

+ Ability to tro

