MongoDB Days 2013 Interview with theCUBE

June 30, 2013
Interview with theCUBE (https://www.siliconangle.com) on June 22nd, 2013 at MongoDB Days 2013 in New York City. Video and slides of my talk are available at http://www.10gen.com/presentations/business-track-how-criteo-scaled-and-supported-massive-growth-mongodb

Transcript

Hi everybody, we're back. This is theCUBE, SiliconANGLE's continuous production of MongoDB Days. We're here live in New York City. I'm Dave Vellante and I'm with my co-host, Jeff Kelly. Julien Simon is here; he's the Vice President of Engineering at Criteo. We're going to talk about how they're using MongoDB and how they're adding value. Welcome to theCUBE. Thank you, thank you for inviting me. It's our pleasure, good to see you. So tell us a little bit about Criteo, what you guys are doing and what you're all about. Criteo is a French company, started in 2005, and we work in the online advertising space. We're a global leader in what we call performance display. In a nutshell, what we do is build and serve internet advertising banners, personalized in real time for every single display, for our customers, who are the main e-commerce websites, retailers, and brands. Okay, so let's talk a little bit more about how you do that and how MongoDB fits into that infrastructure. How do you use MongoDB? We use MongoDB to store what we call the product catalogs, which are the product catalogs of our customers. This is the starting point of our platform because we need to show product images and information in the banners. We ingest product information from our customers, over 3,000 major e-commerce companies worldwide, across more than 30 markets. We use MongoDB to store this information, which is then fed to our web servers and ends up in our banners. How does the system figure out what banners to serve, and how frequently does it update and refresh that model? We have our own algorithms, which we call prediction algorithms, used to decide whether to buy advertising space in real time for a given user. The decision is based on whether the price of that space is compatible with the chance of generating a click. If we decide to buy the space, we need to recommend the products that will have the best chance of success for that user at that time.
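To make that bid decision concrete, here is a toy expected-value sketch (my own illustration with made-up names and numbers, not Criteo's actual models, which are proprietary):

```python
def should_bid(price_cpm, p_click, value_per_click):
    """Buy the impression only if its expected value exceeds its cost.

    price_cpm: cost per 1,000 impressions; p_click: predicted click
    probability; value_per_click: revenue earned if the click happens.
    All names are hypothetical -- a toy model of the decision.
    """
    cost_per_impression = price_cpm / 1000.0
    expected_value = p_click * value_per_click
    return expected_value > cost_per_impression

# A 2% click probability at $0.50 per click is worth $0.01 per
# impression, so a $5 CPM ($0.005 per impression) is worth buying,
# while a $15 CPM ($0.015 per impression) is not:
print(should_bid(5.0, 0.02, 0.50))   # True
print(should_bid(15.0, 0.02, 0.50))  # False
```

The real decision, of course, also has to happen within a few milliseconds, which is why so much of the problem is pre-computed.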
Our technology relies on algorithms and data, and part of that data is the product data stored in MongoDB. Your customers get paid for the clicks, and your customers' customers get paid for conversion. The better you are at your job, the more conversion occurs and the more value gets created. Exactly, and that's why we call it performance advertising. Criteo only makes money if we deliver clicks. If we buy advertising space to display banners that don't generate any clicks, our customers, the advertisers, don't pay us. We need to be very smart or we'll be very dead very quickly. As the VP of engineering, your role involves working on the product, the architecture, and everything in between? I'm the plumber. My team and I are building a highly scalable platform that serves over one billion ads every day in more than 30 countries. The growth has been spectacular, and it's a constant challenge to scale both the infrastructure and the applications to keep up with the business. We have over 700 employees with offices in 15 countries, including Europe, the US, South America, Japan, Korea, and Australia. We need global infrastructure to handle this, and a large part of my job is ensuring we have the proper resources in Europe, the Americas, and APAC to keep growing and scaling our technical capabilities and the business. I was told you've had a growth rate of more than 200% in five years. That's a big number. Is that storage data-wise or revenue? That's revenue. It's difficult to comprehend, but the growth of the company has been spectacular. Are you running in the cloud or on bare metal? We're on bare metal. We have seven data centers: three in Europe, two in the US, and two in Japan. We rent hosting space, buy power, and maintain everything ourselves. Part of the team is on duty 24-7 because there are no business hours for us.
It's a critical platform built from scratch and operated by us. Let's talk about MongoDB. MongoDB plays a critical role in storing the data that allows you to serve up these ads. Walk us through why you chose MongoDB and what attributes make it well-suited to your workload. We were using Microsoft SQL Server, which was fine for a while. However, the growth of the company, the number of customers, and the size of product feeds led to technical difficulties with SQL Server. We looked at alternatives and were keen on using open source software. MongoDB came out on top in our evaluation matrix. We like how easy it is to use, deploy, and manage. It has built-in scalability and high availability, which are very important for us, with features like replica sets and sharding. Let's talk about scaling. How does MongoDB handle that problem, and how will it continue to scale as your business grows? Scalability is always the number one issue. With MongoDB, the key is sharding, which allows us to spread the traffic over multiple masters for writes and multiple slaves for reads. We have quite a few servers to handle this. In terms of the analytics, we do real-time analysis, making decisions in sub-milliseconds on the best ad to display based on the user and our inventory of ads. No single technology can handle big data analytics and millisecond-scale processing. The heavy lifting is done in our Hadoop clusters, which process about 20 terabytes of new data every day. The results of these jobs are fed into caching technologies like Memcached, which are queried in real time by the web servers. For decisions in a few milliseconds, you have to access data directly in the caches. It's a multi-tier architecture: Hadoop for heavy lifting, MongoDB for product information, and caches and commodity servers for scale-out.
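The idea behind sharding is simply to route each document to one of several servers based on a shard key, so that writes and reads spread across the cluster. A conceptual sketch of hashed routing (MongoDB's mongos router does this internally; the hash function and key names here are my own illustration):

```python
import hashlib

def shard_for(product_id, num_shards):
    """Route a document to a shard by hashing its shard key.

    Illustrative only: MongoDB decides placement from the configured
    shard key; this just shows how hashing spreads keys evenly.
    """
    digest = hashlib.md5(product_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Writes for different products land on different shards:
for pid in ("prod-1", "prod-2", "prod-3"):
    print(pid, "-> shard", shard_for(pid, 4))
```

Each shard is itself a replica set, so the same layout gives both the write scale-out and the high availability mentioned above.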
There are discussions about ad tech, and Jeff Hammerbacher's famous quote about the best minds of his generation spending time figuring out how to get people to click on ads. One discussion is about bringing together analytical and transactional systems into a single database. Do you see this as a near-term reality or folly? There's a strong push to reconcile big data batch processing with real-time constraints. Hadoop is batch processing, but extensions like Storm allow for stream processing, which is interesting. MongoDB had a MapReduce framework from the start, but it wasn't very good. The aggregation framework is a significant improvement, but at our scale, it's impossible to use a single data system. Everyone wants big results in very little time, and there's a lot of interest, so startups and technology will catch up. As of today, we don't see any silver bullet, and we use a multi-tier architecture. How often do you reprice or change the pricing within the system? Every single impression is evaluated independently. If you look at two web pages in quick succession with banners, each display is evaluated independently. Our arbitrage process is real-time, and our models are refreshed multiple times per day. Every display modifies the state of the model. If we show the same banner 20 times and you never click, we learn quickly and stop buying that space. It's a combination of heavy lifting to pre-compute 99.9% of the problem and real-time decisions for the most relevant context. It's a very competitive space, and obviously, MongoDB is doing well. We interviewed another company yesterday, Aerospike, and IBM Labs has invented technology in this area. What do you think about streaming technology, like HStreaming, which allows decisions to be made as data is ingested before persisting it? The technology Criteo uses is in-house. Our business app code is our own, and we don't rely on third-party solutions.
People often ask if we use BI software or analytics modules, but we don't. We do use third-party technology like Hadoop, MongoDB, and others, but the critical technology is our own. We have 170 people in R&D, and over 300 engineers working on the product, R&D, infrastructure, and QA. We build it, run it, and like to have full control. Part of the data and algorithms we use are strictly real-time, so if you did something a few seconds ago, we'll know and can use it immediately. This information is eventually persisted to logs and used to modify profiles and make more intelligent decisions. What's the next wave of innovation in the ad tech business? Is it just serving ads faster or making them more personalized? We always strive to improve our existing model, try new variables, and inject new data. We do a lot of A/B testing to prove or disprove ideas. The two things we're trying to improve are the click-through rate and the conversion rate. Advertising is about sales, and clicks need to become sales. Conversion rate optimization is important for us. Other products we could work on include mobile, which is a very interesting challenge. There are plenty of topics to explore, and we're not out of ideas. Julien, I really appreciate the information and stopping by theCUBE. You guys have a great story and are doing some really leading-edge stuff. Thank you very much. And appreciate the collaboration with MongoDB. Keep it right there, everybody. We'll be right back with our next guest. This is a full day, wall-to-wall coverage. This is theCUBE. Right back, this is Dave Vellante with Jeff Kelly. See you in a minute.

Tags

MongoDB, Criteo, Online Advertising