Who is Sirdata?
Sirdata is an innovative French company, founded in 2012, that specializes in data processing.
The company collects raw browsing data from web users, in strict compliance with data protection and privacy legislation. By analyzing the semantics of each visited web page, Sirdata identifies weak signals in users’ interests and scores their degree of intent in real time. This know-how enables the company to create off-the-shelf and tailored audience clusters (interest, intent, life events, demographics, audience extension, brand) according to the targeting strategies of each marketing, data, and communication professional.
Sirdata allows publishers to monetize their website traffic, and advertisers (or agencies mandated by advertisers) to target the right audiences at the right time. It also allows both to enrich their customer knowledge and refine their data marketing strategies through their own tools (SSP, DSP, Adserver, DMP, CDP, CRM…).
Sirdata is an Interactive Advertising Bureau (IAB) board member, and a member of the Turing Club and of the professional advertising regulatory authority (ARPP). It has developed large-scale in-house data solutions, including its semantic & scoring tool and its Consent Management Platform (CMP). As a data provider, infrastructure security and server delivery are strategic for the company and its customers.
Sirdata gathers consented data from website visitors. Once collected and processed through its semantic & scoring Hub, the data is delivered to a Demand-Side Platform (DSP), Supply-Side Platform (SSP), Data Management Platform (DMP), Customer Data Platform (CDP), or CRM: in short, any tool used by brands, publishers, and agencies to drive an accurate, brand-safe, user-based marketing strategy. This is made possible by the company’s know-how in data processing, API management, and tool interconnection.
Within the programmatic and martech industry, Sirdata supports brands and their agencies, who need data to decide whether to buy an ad placement in a Real-Time Bidding auction, and so increases their chances of reaching their target audience. For publishers, an ad placement with behavioral data is eight times more valuable than a placement without data (according to IAB Europe). Thanks to Sirdata, publishers can now sell their inventory with data, better meeting market requirements while increasing their revenues.
On top of that, the company has always respected legislation and has never collected sensitive data. The main aim of Sirdata’s data processing is to understand consumers’ appetite for buying a product, and so support advertisers in delivering meaningful, value-adding messages to their prospects and clients. For example, an airline will only be interested in delivering a holiday flight promotion to a relevant audience.
Seizing GDPR and upcoming legislation as an opportunity, Sirdata invested significantly and very early on in adapting its processes to the applicable regulations. After a rather turbulent 2018, during which Sirdata held firm positions on compliance, the company now counts among the few that are fully consent-based by design.
A technical point of view
Currently, Sirdata has approximately 30 servers. The company uses two different Dedibox server offers: one with plenty of storage space for its persistence needs, and another for its compute-intensive tasks. It mostly uses STORE-4-XL for storage and PRO-4-L for its frontend.
Sirdata’s traffic patterns are regular, which is why the company decided to stick with Dedibox servers and dedicated hardware. The cost-effectiveness of the infrastructure is also an important factor.
For crawling, Sirdata uses between 20 and 50 instances. The company typically buys the right to crawl web pages, but some of its instances still get blocked, so resiliency is built into its tasks. Its infrastructure is designed to be robust: the company can typically lose 30% of its infrastructure without any impact on production.
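The 30% figure implies a simple sizing rule: the servers that remain after a failure must be able to carry the full production load on their own. The sketch below works through that arithmetic. It assumes load can be rebalanced evenly across the surviving servers, which the article does not state; the function names and example numbers are hypothetical and only illustrate the reasoning.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def survivors(total_servers: int, loss_percent: int) -> int:
    """Servers left after losing loss_percent of the fleet (loss rounded up)."""
    lost = ceil_div(total_servers * loss_percent, 100)
    return total_servers - lost


def servers_to_provision(required_servers: int, loss_percent: int) -> int:
    """Fleet size such that, after losing loss_percent of it, at least
    required_servers remain to carry the full load.

    Assumes (hypothetically) that load is spread evenly over survivors.
    """
    if not 0 <= loss_percent < 100:
        raise ValueError("loss_percent must be in [0, 100)")
    return ceil_div(required_servers * 100, 100 - loss_percent)


if __name__ == "__main__":
    # With a fleet of 30 servers and a 30% loss, 21 servers remain,
    # so production must fit on 21 servers (70% of the fleet).
    print(survivors(30, 30))
    print(servers_to_provision(21, 30))
```

Put differently, tolerating a 30% loss means keeping average utilization at or below roughly 70% of total capacity during normal operation.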
On the network side, Sirdata runs at around 1 Gbit/s of network traffic, with peaks at 10 Gbit/s.
Sirdata tools and software
Sirdata’s infrastructure is managed with a heavily patched version of Kubernetes, and its traffic is load balanced using Nginx.
Its frontend is written in Go and runs as a sidecar inside a pod, with a buffering layer handled by RabbitMQ, which lets it cope with disconnections rather easily. This configuration is very efficient: each app/pod handles 2,000 queries per second while using only 200 MB of RAM.
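To illustrate why a buffering layer makes disconnections easier to handle, here is a minimal in-process sketch of the idea: messages are held in a bounded queue while the downstream is unreachable and delivered in order once it comes back. This is not Sirdata’s code and does not use RabbitMQ’s API; `BufferedSender`, its parameters, and the injected `send` callable are hypothetical names chosen for the example.

```python
from collections import deque
from typing import Callable, Deque


class BufferedSender:
    """Holds messages while the downstream is unavailable, then replays them."""

    def __init__(self, send: Callable[[str], None], max_buffered: int = 10_000) -> None:
        # `send` is any callable that raises ConnectionError when the
        # downstream is unreachable (hypothetical stand-in for a broker client).
        self._send = send
        # Bounded buffer: when full, the oldest message is dropped.
        self._pending: Deque[str] = deque(maxlen=max_buffered)

    def publish(self, message: str) -> None:
        """Queue a message and try to deliver everything that is pending."""
        self._pending.append(message)
        self.flush()

    def flush(self) -> int:
        """Deliver pending messages in order; return how many were sent."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # Downstream is down: keep the message for later.
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        """Number of messages still waiting for delivery."""
        return len(self._pending)
```

A caller would call `publish` for every message and `flush` periodically or on reconnect. A real broker such as RabbitMQ adds persistence and acknowledgements on top of this basic idea, which an in-memory queue does not provide.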
For its backend, Sirdata uses many JVM technologies. Its backend application is written using Spring Boot, and its Natural Language Processing algorithm is written in Kotlin. For the persistence layer, Sirdata uses Cassandra and Kafka.
What are Sirdata’s current challenges?
Sirdata was among Kubernetes’s early adopters. As the features it needed were not yet available natively, Sirdata had to patch them in. Today, most of the features the company requires are available natively in the upstream distribution of Kubernetes.
As a result, Sirdata would like to migrate its infrastructure to upstream Kubernetes.
On the NLP side, Sirdata is interested in exploring how GPUs could improve its automated, on-the-fly classification of web pages. Scraping remains the most time-consuming step in its pipeline.
“Scaleway is a European-based company and has an excellent quality-price ratio. We need this kind of infrastructure because the value of each of our queries is rather small, so we need to scale up to high volumes in order to have an attractive performance for our customers,” says Rémi Demol, Sirdata’s co-founder and CTO.
The company is looking forward to using Scaleway’s new and future offers, such as Big Data services. As it runs a lot of Big Data workflows, it is keen on services that deliver real value and take the risk and complexity off its hands. Another area of interest is on-demand semantic analysis and, more generally, an NLP toolbox that it could use as a service. Finally, according to its CTO, managed Kubernetes on dedicated servers would also be very useful.