A primer on contemporary online ad-trackers: cookies and machine learning

Arpan Roy
7 min read · Feb 28, 2020

Today the average consumer does much of their shopping online. If you are searching for something particular and can’t find it in stores, chances are that you will find it on some e-commerce portal. E-commerce websites (Amazon, eBay) are thriving and re-investing major chunks of their profits into targeting more online shoppers. In turn, ad-tech has been growing, nourished by this steady influx of revenue.

Initially, publishers charged advertisers per click, and the click through rate (CTR), i.e., the fraction of ad views that end in a click, was the metric that mattered. But viewing an ad without clicking on it also counts as marketing-as-a-service, since it helps spread brand awareness. Hence most publishers/ad-tech services have moved to charging for ads based on CPM, i.e., cost per thousand impressions/views.

Even 10 years back, marketing professionals relied a lot more on search engine optimization (SEO) to build marketing campaigns for clients. SEO strategies are taking a hit nowadays, as search engines like Google use machine learning to curate content scraped from their top ranking search results and display it as a ‘Google Answer’ atop the results page. This removes the need for the user to click on the publisher’s links at all. With improvements in machine learning, targeted advertising is now widely accepted as a more viable marketing option than passive methods such as SEO.

So don’t be surprised if an item that you searched for on some website in a spur-of-the-moment thing starts stalking you in ads across multiple domains and websites. It is likely that these websites share the same ad targeting service, such as DoubleClick (now owned by Google). Ad targeting services, formally referred to as demand-side platforms or DSPs, use third party cookies as an identifier for a user and their search history. A third party cookie gets created the first time the user searches for an item on some website.

Blocks of third party JavaScript code called tracking pixels are part of most webpages (commonly embedded as a small image element). This pixel initiates the third party cookie creation process. The DSP generates the cookie content (a unique hash) on its server, and the browser stores it in a local text file. This cookie then helps the ad-tech service (and by association the publisher) to track the user across multiple domains.

For example, open this url. Since this website is hosted in Europe, it will automatically ask you to accept its third party cookies. Click on ‘See Vendors’ and you will see a list of DSPs that this website shares its data with. Facebook is known to track users using third party cookies. Its pixels take the form of Facebook like or share buttons embedded in pages on other domains. Just as ad technology has evolved, so has ad-blocking tech. Pop-up blockers have made way for ad-blockers.
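To make the cookie creation step concrete, here is a minimal sketch of what a DSP's pixel endpoint might do. Everything named here (the cookie name `dsp_uid`, the domain `dsp.example`, the function names) is a hypothetical stand-in, not any real DSP's implementation. The idea is only this: if the browser sends no identifier, mint one and set it; either way, record which page embedded the pixel.

```python
import hashlib
import secrets
from http.cookies import SimpleCookie

COOKIE_NAME = "dsp_uid"      # hypothetical cookie name
DSP_DOMAIN = "dsp.example"   # hypothetical third party domain


def new_user_id() -> str:
    """Generate a unique, random hash to identify one browser."""
    return hashlib.sha256(secrets.token_bytes(32)).hexdigest()


def handle_pixel_request(cookie_header: str, referer: str, visit_log: dict) -> dict:
    """Return response headers for one pixel request and log the visit.

    cookie_header: raw Cookie header the browser sent to the DSP domain
    referer:       page that embedded the pixel (the publisher's page)
    visit_log:     maps user id -> list of pages where the pixel fired
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header)
    headers = {"Content-Type": "image/gif"}

    if COOKIE_NAME in cookies:
        # Returning browser: reuse the identifier it already holds.
        uid = cookies[COOKIE_NAME].value
    else:
        # First contact: mint an identifier and ask the browser to store it.
        uid = new_user_id()
        out = SimpleCookie()
        out[COOKIE_NAME] = uid
        out[COOKIE_NAME]["domain"] = DSP_DOMAIN
        out[COOKIE_NAME]["path"] = "/"
        out[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365
        headers["Set-Cookie"] = out[COOKIE_NAME].OutputString()

    visit_log.setdefault(uid, []).append(referer)
    return headers
```

Because the cookie belongs to the DSP's domain rather than the publisher's, the browser sends the same identifier back from every site that embeds the pixel, which is what lets the log above span multiple domains.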

If you don’t want to be tracked, don’t go on the internet

Cookies are very useful for DSPs serving various e-commerce services. But from an online user’s perspective, cookies may pose a threat to their privacy.

  • First party cookies are not much of a privacy concern. These cookies are created by a specific website to store a unique identifier that maintains the user’s active logged-in session and tracks the user’s browsing activity within that website’s domain only.
  • Third party cookies are more invasive in their data gathering approach, as they aim to track the user across multiple domains. Though guidelines dictate that no sensitive information should be gathered or stored in a third party cookie, some publishers end up capturing potentially sensitive data in them, such as the user’s IP address, physical location, email address and browser history. All of this information can be used to create a dangerously accurate online profile of the user. The publisher can also sell this user data to data vendors. Users can protect themselves by disabling third party cookies in their browser’s settings. They can also use an adblocker, switch to the browser’s incognito mode, or switch to alternative search engines like DuckDuckGo, which does not collect or store user search history.
  • There exists another, lesser known class of cookies known as supercookies. Third party cookies implement tracking in the presentation layer or frontend. Supercookies implement tracking at the network level: the service provider injects HTTP headers (Unique Identifier Headers or UIDH) into the user’s traffic, which makes them harder to block at the user level. The regulatory body FCC recently directed the service provider Verizon, known to profile users with UIDH, to offer users the option to opt out of such acute probing.
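To illustrate why a supercookie is harder to block than a regular cookie, here is a toy simulation of a carrier-side middlebox. The header name `X-UIDH` follows public reports about Verizon's header; how the value is derived (a salted hash of a subscriber id) is purely my assumption, as are the function and parameter names. Note that this kind of injection only works on unencrypted HTTP traffic, since the carrier cannot modify requests sent over HTTPS.

```python
import hashlib


def inject_uidh(request_headers: dict, subscriber_id: str, salt: str) -> dict:
    """Return a copy of the request headers with an identifier header added.

    This stands in for a carrier middlebox rewriting an outgoing plain-HTTP
    request. The derivation of the value below is hypothetical.
    """
    digest = hashlib.sha256((salt + subscriber_id).encode("utf-8")).hexdigest()
    modified = dict(request_headers)   # the user's device never sees this copy
    modified["X-UIDH"] = digest[:32]
    return modified


# The same subscriber is recognizable on two sites, with or without cookies.
with_cookies = inject_uidh({"Host": "shop.example", "Cookie": "sid=abc"},
                           "subscriber-42", "carrier-salt")
cookies_cleared = inject_uidh({"Host": "news.example"},
                              "subscriber-42", "carrier-salt")
same_user = with_cookies["X-UIDH"] == cookies_cleared["X-UIDH"]
```

Since the header is added after the request leaves the device, clearing cookies or using incognito mode does not change it.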

Apple is well known for its privacy-first approach. Since Apple doesn’t have a vested interest in online advertising (unlike Google), it has disabled third party cookies in its Safari browser and iOS app store by default. Additionally, it recently moved to limit the storage of first party cookie data to a maximum of 24 hours. However, such measures affect all publishers, malicious and non-malicious alike.

Ad-tech, data gathering and machine learning

Data management platforms or DMPs, such as Datalogix (now owned by Oracle) or Salesforce DMP, have data agreements with web-service providers such as Facebook or Google Search. These data vendors buy the user data collected by Facebook or Google and apply machine learning to develop online user profiles and behavioral models, which they subsequently sell to demand-side platforms. Advertisers then factor this data into their CRM campaign management to maximize the impact of their ad campaigns. And all this happens in real time, with full integration between DSPs and DMPs.

Applications of machine learning are pervasive today, and ad-trackers are no exception. If the user has not yet bought an item, it makes sense to keep showing them ads for that same item. But if they have clicked through, the advertiser needs to show them other items. Equipped with historical data on the items browsed by users over time, DSPs use recommendation engines based on collaborative filtering to show the user similar items that they may be interested in. To formulate this as a collaborative filtering problem, each user and each item needs to be quantified as a vector over a common set of features. For the e-commerce use-case, the features can be a list of item categories, e.g., household, apparel, technology, educational. Each user is assigned a rating for each item category based on how often they browse/buy an item in that category. Similarly, each item can be rated on its likelihood of belonging to each category. This vectorized representation of each user and item serves as input to the recommendation engine.

Another use-case for machine learning in ads is dealing with users who block ads. In such scenarios, specifically for mobile users, publishers have to resort to device fingerprinting. Device fingerprinting involves analyzing generic data captured from users, such as IP address, time zone, browser settings, color depth, language settings, screen resolution and OS version.
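As a rough illustration of the user and item vectors described above, the sketch below scores items by cosine similarity between a user's category ratings and each item's category weights. This is only one simple way to use that representation; a production recommendation engine would typically learn these vectors from many users' histories. The category list comes from the text; the items, numbers and function names are made up for illustration.

```python
import math

CATEGORIES = ["household", "apparel", "technology", "educational"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user_vec, items, already_seen, top_n=2):
    """Rank unseen items by similarity to the user's category ratings."""
    scored = [
        (cosine(user_vec, vec), name)
        for name, vec in items.items()
        if name not in already_seen
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]


# Hypothetical data: one rating (user) or weight (item) per category.
user = [0.1, 0.2, 0.9, 0.3]            # browses mostly technology
items = {
    "blender":  [0.9, 0.0, 0.1, 0.0],
    "jacket":   [0.0, 1.0, 0.0, 0.0],
    "laptop":   [0.0, 0.0, 1.0, 0.2],
    "e-reader": [0.0, 0.0, 0.7, 0.6],
    "textbook": [0.0, 0.0, 0.1, 0.9],
}

# The user already clicked through on the laptop, so show other items.
print(recommend(user, items, already_seen={"laptop"}))
# prints ['e-reader', 'textbook']
```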
Fingerprinting data of this kind is usually gathered by publishers to render the best possible version of their web-page for the user. The DSP can train a neural network on this device dataset. Then, on future visits, the advertiser can identify the user by running the newly captured parameter values through the trained network. The type of neural network is chosen based on the complexity of the data gathered. For multidimensional data, convolutional neural nets (CNNs) can also be used.

Ads are important for the publisher, as they allow the publisher to pay for web-hosting without charging the user for visiting the publisher’s pages. One approach that allows for data gathering while protecting the user’s privacy is differential privacy. In differential privacy, random noise is injected into the user data as it is gathered, such that the individual user’s true values are obscured while the underlying patterns in the data are left intact. For example, before sending any data (for each bit, 0 or 1), the differential privacy algorithm flips a coin and answers honestly if it comes up heads. If it comes up tails, it flips a second coin and returns a random answer: 0 for heads, 1 for tails. This anonymized data is gathered from a large number of sources. It is then mined for patterns using machine learning to calculate an expected value of the actual data. The probability of the first coin coming up heads governs how much actual user data gets leaked. Apple and Google are known to use this approach to gather macOS and Chrome data respectively.

Publishers also need to find their way around local regulations such as the General Data Protection Regulation (GDPR), put forward by the European Union, which regulates the flow of non-anonymized data outside the EU and compels publishers to inform the user about what user data is being collected on each website. So if you visit a web-page hosted in Europe these days, you can expect a large pop-up listing the DSPs operating on that site and explicitly requesting your permission to accept their cookies.
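The two-coin scheme described above for differential privacy is commonly known as randomized response, and it is short enough to sketch. Below, `p_honest` is the probability that the first coin comes up heads. The function names and the simulated population are my own assumptions for illustration. The aggregator never learns any individual's true bit, but because a reported 1 occurs with probability `p_honest * true_rate + (1 - p_honest) * 0.5`, it can invert that formula to estimate the true rate across many users.

```python
import random


def randomized_response(true_bit, p_honest=0.5, rng=random):
    """Report one bit using the two-coin scheme."""
    if rng.random() < p_honest:
        # First coin came up heads: answer honestly.
        return true_bit
    # First coin came up tails: second coin decides, 0 for heads, 1 for tails.
    return 0 if rng.random() < 0.5 else 1


def estimate_true_rate(reports, p_honest=0.5):
    """Estimate the true fraction of 1s from the noisy reports."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_honest) * 0.5) / p_honest


# Simulated population: 30% of users truly have the bit set.
rng = random.Random(0)
true_bits = [1] * 30_000 + [0] * 70_000
reports = [randomized_response(bit, 0.5, rng) for bit in true_bits]
estimate = estimate_true_rate(reports, 0.5)   # should land close to 0.30
```

Lowering `p_honest` gives each user more deniability but makes the estimate noisier, so more reports are needed for the same accuracy.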

Footnote: You know how that news site you visit all too often informs you that you have reached the maximum number of stories you can read this month without a subscription? Most of the time you can get around it by opening the page in incognito mode. But I don’t recommend it. Publishers are creative people and should get their due. Not to mention, these sites are using the less invasive tracking technologies, like cookies.

In more recent news, Google published promising click through rates for advertising using a new privacy-preserving approach in the form of the Federated Learning of Cohorts (FLoC) API. FLoC pools users into groups based on their common interests: each user gets assigned a cohort id, and advertising is shown based on this cohort id. There is a privacy-utility trade-off here. More users in the same group increase each user’s anonymity (k-anonymity in a crowd of k users), but at the same time make it difficult to push personalized recommendations to each user.
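The privacy-utility trade-off of cohorts can be seen in a toy sketch. To be clear, this is not Google's FLoC algorithm: here I simply group users by a single hypothetical "top interest" label and withhold the cohort id from any group smaller than k. All names and data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Optional


def assign_cohorts(top_interest: Dict[str, str], k: int) -> Dict[str, Optional[int]]:
    """Map each user to a cohort id, or None if their group has fewer than k users."""
    groups = defaultdict(list)
    for user, interest in top_interest.items():
        groups[interest].append(user)

    # Give each distinct interest a stable numeric cohort id.
    cohort_ids = {interest: i for i, interest in enumerate(sorted(groups))}

    assignment = {}
    for interest, users in groups.items():
        # Only expose a cohort id when the user hides in a crowd of at least k.
        cohort = cohort_ids[interest] if len(users) >= k else None
        for user in users:
            assignment[user] = cohort
    return assignment


users = {
    "a": "sports", "b": "sports", "c": "sports",
    "d": "cooking", "e": "cooking",
    "f": "opera",
}
loose = assign_cohorts(users, k=2)    # the lone opera fan gets no cohort id
strict = assign_cohorts(users, k=3)   # now the cooking group is withheld too
```

Raising k makes every exposed cohort a bigger crowd to hide in, but leaves fewer users with an id that advertisers can target.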

Useful links:

  1. https://www.facebook.com/business/news/good-questions-real-answers-how-does-facebook-use-machine-learning-to-deliver-ads
