How Pinterest tracks 2.5 million metrics per second while improving the monitoring user experience
Dealing with 2.5 million metrics per second and 20TB of log data per day is all in a day’s work at social network Pinterest. But were the monitoring tools giving the engineers the best and most productive experience? Pinterest Software Engineer Amy Nguyen spoke at TechSummit Berlin about the steps to creating a better user experience (UX) for your monitoring operations to make life easier for the engineers and administrators around you.
With 150 million active users and 100 billion “pins”, Pinterest is one of the world’s most popular social networks, with serious web-scale monitoring requirements.
The company’s 400-strong engineering team relies on Pinterest monitoring to improve apps and operations amid 150,000 requests per second, which generate 2.5 million metrics every second. A huge 20 terabytes of data is logged per day.
Owing to its rapid growth, Pinterest engineers have been developing in-house monitoring tools using a combination of OpenTSDB, Graphite and its own user interface called StatsBoard. During the past six months Pinterest has been overhauling the user experience of its monitoring stack.
Pinterest Software Engineer Amy Nguyen joined the company in 2015 and for most of that time has been on the visibility team.
“We had a lot of issues with how people were using our monitoring tools in the past, so we identified a bunch of ways we could improve that,” Nguyen said.
“I started in a different team, but one of our projects was related to monitoring and I found it to be a terrible experience. We had bad, outdated documentation and the instructions were often wrong. I would post questions on our Slack channel and nobody would answer.”
To improve this, Nguyen spent a lot of time figuring out what was going on with StatsBoard and then committed code to the project, even though it wasn’t the mission of her team.
“When we restructured I was in charge of improving StatsBoard and I wanted to improve it for all the users like me. There were engineers across the organization who had a terrible time using StatsBoard.”
Nguyen communicated the virtues of becoming a “10x engineer” by making it easier for 10 other engineers to do their jobs. She said even if you don’t have your own in-house monitoring tools, you can take the Pinterest experience and apply it to your teams to improve how they work with other engineers at your organization.
From documentation to UX improvement
How can organizations improve their monitoring experience? Nguyen says start by improving your documentation.
“Where is your documentation? Is it in Google Drive, in a wiki, or in an email that you forward around? Was it left with engineers who have left the company? And so on,” Nguyen says. “This was happening at Pinterest and it was a huge problem for us because everything was duplicated across multiple sources. For new engineers it was hard to figure out where to find things. I would go into Google Drive and do a keyword search and hope to find a document that was relevant.”
To overcome this the monitoring team created a wiki page with all of the information about StatsBoard, logging tools and other tools so people would know there is exactly one page to go to and find the answer to a query.
It also helps not to assume any prior knowledge of the terms used in monitoring, and to make sure your instructions are kept up to date.
“It’s easy to write up instructions when you need them and then never look at them again,” Nguyen said. “There is an idea if you have a piece of software and it works and nothing changes it will continue to work forever. That’s not true and things will stop working eventually so it’s really important to check if your instructions make sense.”
Three tips for maintaining good documentation are adding common questions as they arise; linking to documentation to teach people there is a place to go to get answers and you don’t have to bother a real person; and encouraging everyone to contribute to the documentation, even if they are not on your team.
How do you improve the experience of working with the monitoring team? Start by offering many ways to be contacted – from email to IM and Slack and HipChat.
“It's really helpful if you can offer all those kinds of communication methods so people can choose the one that's best for them,” Nguyen said. “In our case we have office hours where people can come to our desks and troubleshoot in person.”
The other big question is how long it takes for someone to have their question answered. At Pinterest, someone would ask a question only to be met with dead silence, and the question would never get answered.
If people don't have a way of getting a question answered, they will either ask someone on a different team (and receive potentially misleading information) or just give up and not bother at all.
“That is a shame as you are trying to help people monitor their tools and services more effectively so you should want them to come to you and ask for help,” Nguyen said.
“What happens when someone sends you a complaint or request? You could say ‘no’ and give them an explanation of why you are not doing it or you could file a ticket and say we are going to consider that next quarter. These things have the same end result but the huge difference is how that person feels about the feedback. If people feel ignored they will feel they can't work with your team.”
At Pinterest user experience is really about the experience of working with people and teams, not just user interfaces.
Next up, improve the UX of working with tools
The first principle of a good monitoring tool is to reward “exploration” within the tool and Nguyen has seen a lot of monitoring tools which don't quite get this right.
Make sure the basics are built into the user experience, such as not losing progress when a user hits the browser’s refresh or back button.
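One common way to survive a refresh or a back-button press is to keep the view state in the URL rather than only in page memory. The sketch below illustrates that idea in Python. It is not how StatsBoard works: the function names, the `statsboard.example` address and the state fields (`metric`, `range`, `host`) are all hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical defaults for a dashboard view (not taken from StatsBoard).
DEFAULTS = {"metric": "requests.count", "range": "1h", "host": "all"}


def state_to_url(base_url, state):
    """Serialize the view state into the query string of a shareable URL."""
    # Sort the keys so the same state always produces the same URL.
    return f"{base_url}?{urlencode(sorted(state.items()))}"


def url_to_state(url, defaults):
    """Rebuild the view state from a URL, falling back to the defaults."""
    params = parse_qs(urlsplit(url).query)
    state = dict(defaults)
    for key in defaults:
        if key in params:
            state[key] = params[key][0]
    return state


# A user narrows the view; the URL now carries everything needed to restore it.
url = state_to_url(
    "https://statsboard.example/graph",
    {"metric": "errors.rate", "range": "6h", "host": "all"},
)
restored = url_to_state(url, DEFAULTS)
```

Because the state travels with the URL, reloading the page or stepping back through browser history lands on the same graph, and the link can be pasted into a chat message as is.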
“Also people shouldn't be afraid of doing something that will break everything,” Nguyen said. “People are afraid of making changes on Kibana because they don't want to ruin things for all of their co-workers. It should be easy to reverse things if they have made a mistake.”
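Making changes easy to reverse usually means keeping earlier versions around. The following is a minimal sketch of that idea, assuming a dashboard configuration stored as a plain dictionary. The `VersionedDashboard` class is hypothetical and is not part of Kibana or StatsBoard.

```python
import copy


class VersionedDashboard:
    """Keep every saved version of a dashboard config so edits can be undone."""

    def __init__(self, config):
        # Deep copies stop later edits from silently changing stored versions.
        self._history = [copy.deepcopy(config)]

    @property
    def current(self):
        return copy.deepcopy(self._history[-1])

    def save(self, config):
        """Record a new version instead of overwriting the old one."""
        self._history.append(copy.deepcopy(config))

    def undo(self):
        """Drop the latest version, but never the original one."""
        if len(self._history) > 1:
            self._history.pop()
        return self.current


dashboard = VersionedDashboard({"panels": ["cpu", "errors"]})
dashboard.save({"panels": []})   # someone deletes every panel by mistake
recovered = dashboard.undo()     # the previous version comes back
```

With a history like this, a mistaken edit costs one undo rather than a rebuild, which is what lets people explore without worrying about their co-workers' dashboards.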
Other tips include setting reasonable defaults for monitoring apps and not making too many assumptions: what you want as an engineer or expert is probably not what your users want.
“Even if we think it is right that doesn't mean it is right,” Nguyen said. “The last lesson we learnt was what you know is probably not what your users know.”
Pinterest gets around 150,000 requests per second and stores 1,000 logs per minute for error logging. Some engineers would say, ‘I see an error on this host, but I don't see it in the logging system’.
“That's because the logs gathered were just a sample. In other areas you never throw away data, but in monitoring we usually store things for 30 days or less than that,” Nguyen said.
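To see why a particular error can be missing from a sampled log store, here is a minimal sketch of fixed-size sampling. The article does not say how Pinterest picks its sample, so reservoir sampling (Algorithm R) is an assumption made for illustration, as are the `MinuteSampler` class and the figure of 50,000 errors in one minute. Only the cap of 1,000 stored logs per minute comes from the article.

```python
import random


class MinuteSampler:
    """Keep a uniform random sample of at most `capacity` items per window."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.seen = 0
        self.sample = []
        self._rng = random.Random(seed)

    def offer(self, item):
        """Consider one log line for the current window's sample."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(item)
            return
        # Algorithm R: the new item replaces a random slot with
        # probability capacity / seen, keeping the sample uniform.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.sample[slot] = item

    def flush(self):
        """Return the sample and start a new window (for example, each minute)."""
        kept, self.sample, self.seen = self.sample, [], 0
        return kept


sampler = MinuteSampler(capacity=1000, seed=0)
for i in range(50_000):          # made-up error volume for one minute
    sampler.offer(f"error-{i}")
kept = sampler.flush()
fraction_kept = len(kept) / 50_000
```

Under these assumptions only 2 percent of the errors are stored, so an engineer looking for one specific error seen on a host has a 98 percent chance of not finding it in the logging system.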
“There are some basic assumptions monitoring people make because they work with the data every day and you can forget other people don't realise these things. Checking your assumptions and making sure everyone is on the same page will save you a lot of time.”