Gidi Meir Morris

Code

Github

All Professional & Shit

CV

Structured Greping Of Structured Logging

I’m not one to advocate for SSH-ing into a server and manually sifting through logs.

I’m a big believer in investing in your logging structure & infrastructure at a level that allows you to offload the log sniffing and analysis to a more suitable time and place.

That said, there are still times when I find myself piping around log files on remote servers. Recently, feeling fed up with one of the more tedious examples of this, I threw together a little tool that I think might be of use to others as well.

A Shed Load Of JSON

As I’ve mentioned before a core part of my job is to maintain a distributed system of Ad Exchanges. These Exchanges run real-time bidding auctions in which we offer buyers the opportunity of buying advertising real-estate.

Each auction produces a large amount of information which we log into files. These logs are used both for monitoring and in order to support analysis at a later time. As our Ad Exchanges communicate with the outside world by using the openRTB specification, a superset of the JSON format, we chose to log everything that happens within the exchanges as structured JSON as well.

There is a whole variety of events which we wish to log when an auction executes, ranging from an auction start event, when an auction begins, through to bid requests and bid responses which describe the back-and-forth communication with potential buyers of advertising real-estate, and of course a whole variety of follow up events such as a bid won or bid rejected event.

Whenever an event takes place within an Ad Exchange we write a JSON object into a log file containing all the metadata about the event. We try to enrich the log with as much information as possible, empowering ourselves and our analysts by enabling all sorts of business-related questions to be asked about what is happening within out system.

Ever wondered how often certain buyers bid on certain kinds of real-estate? No problem, thanks to these structured logs, we can analyse these events using third-party tools long after they took place in our system. Shortly after these log files are created on a server they get aggregated into a system maintained by our Sh↑ft team who own all our shared infrastructure (hence the name), which streams these logs into a variety of such reporting tools.

grep-ing JSON Is A Pain

Sometimes though, I might wish to take a look at these logs while they still reside on an Ad Exchange. There are several reasons why I might chose to do that.

For example, I may wish to get immediate feedback about something that is happening on an exchange (as it can take some time for the sheer amount of data to reach our reporting services).

Another example would be a case where I was investigate a discrepancy which, as I suspected at the time, was caused by an error in the creation of our structured logs rather than a real world error taking place within the Ad Exchange. Sifting through the logs confirmed that theory quite quickly.

When such a situation occurs I will usually SSH into one of our Ad Exchanges and use a variety of command-line tools to take a closer look at these log files, predominantly grep and jq.

grep is a tool which allows you to globally search through text files using regular expressions and print out every matching line.

jq on the other hand is a tool used to dig into JSON structures and manipulate them. This tool is very useful if you’re trying to extract data out of a complicated JSON object.

Why Is grep Unfit For Purpose?

I heavily use both of these tools, but often find both mildly infuriating as they will often fall short of my needs. Generally speaking, I use these tools when I’m searching for an answer to a business related question. For example, I might ask the question: “Are we getting any bid responses which include bids on a specific Private Marketplace Deal, which have an Ad Id on them and can be skipped?”.

In order to answer this question I need to find all the logged events written to the log when an Ad Exchange receives a Bid Response bidding to buy some space offered as part of a specific deal in order to place a skippable video ad in it.

I know, that’s a whole load of vague domain knowledge that means nothing to you, so let me clarify what this means.

A Bid Response is logged as a JSON structure which describes a bid by a potential buyer in an auction. In this bid the buyer can specify a variety of properties describing the use they would like to make of our advertising real-estate (in the case that they win the auction). Two such properties are a property specifying that they would like to buy this real-estate as part of a prenegotiated deal offered on our private marketplace, and an attribute specifying that we should offer the end user an opportunity to skip this ad if they would like to.

Bellow is an example of a logged event matching what I’m looking for:

{
	"id": "300eb402-3a4d-3486-b2e3-9d2a2ba53ebc",
	"seatbid": [{
		"bid": [{
			"id": "1",
			"impid": "300eb402-3a4d-3416-b2e3-9d2a2ba53ebc",
			"price": 8.49,
			"adid": "8756987",
			"adm": "<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>",
			"adomain": ["gidi.io"],
			"dealid":"UNRX-GIDI-9d2a2ba53ebc",
			"attr": [16],
		}],
		"seat": "10000001"
	}],
	"cur": "USD"
}

The important parts of this bid response are the following:

  1. The dealid field which described the ID of the Deal the bidder would like to buy under. In this case: UNRX-GIDI-9d2a2ba53ebc.
  2. The value 16 specified in the attr field, which is a Creative Attribute as described in the openRTB specification, tells us that this ad should be skippable.

If I were to try and use grep to find this JSON object in the logs I would have to try and use strings patterns to try and match with these fields.

Suppose, hypothetically, we had these events in a file called bidresponses.json which contained a whole load of bid responses, finding the bids on the deal would be easy:

$ grep 'UNRX-GIDI-9d2a2ba53ebc' bidresponse.json
{"id":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","seatbid":[{"bid":[{"id":"1","impid":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","price":8.49,"adid":"8756987","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[16]}],"seat":"10000001"}],"cur":"USD"}
{"id":"788be2a1-6c3a-3416-b2e3-867t786f57f5","seatbid":[{"bid":[{"id":"1","impid":"788be2a1-6c3a-3416-b2e3-867t786f57f5","price":2.49,"adid":"6736787","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidm.co.uk"],"dealid":"UNRX-GIDI-9d2a2ba53ebc"}],"seat":"10000001"}],"cur":"USD"}

As you can see our grep command matched two bids on our deal, but only the first is skippable.

Identifying the attr begins to get complicated. For example:

$ grep -E '(UNRX-GIDI-9d2a2ba53ebc|16)' bidresponse.json
{"id":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","seatbid":[{"bid":[{"id":"1","impid":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","price":8.49,"adid":"8756987","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[16]}],"seat":"10000001"}],"cur":"USD"}
{"id":"788be2a1-6c3a-3416-b2e3-867t786f57f5","seatbid":[{"bid":[{"id":"1","impid":"788be2a1-6c3a-3416-b2e3-867t786f57f5","price":2.49,"adid":"6736787","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidm.co.uk"],"dealid":"UNRX-GIDI-9d2a2ba53ebc"}],"seat":"10000001"}],"cur":"USD"}
{"id":"cd56c8ce-7ed6-49b6-84f5-689dd5315196","seatbid":[{"bid":[{"id":"1","impid":"cd56c8ce-7ed6-49b6-84f5-689dd5315196","price":7.49,"adid":"8762655","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[15, 16]}],"seat":"10000001"}],"cur":"USD"}
{"id":"ca3c53a1-207f-485d-9c19-bac702b1b06f","seatbid":[{"bid":[{"id":"1","impid":"ca3c53a1-207f-485d-9c19-bac702b1b06f","price":3.09,"adid":"20044561","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["wwf.org.uk"],"attr":[16]}],"seat":"623234"}],"cur":"USD"}

Sadly that didn’t work either. As the number 16 appears in both the attr field and in other places, such as a bid’s unique identifier, some irrelevant events are printed too. Not only that, but this also matched a bid response which wasn’t on the deal at all, but happened to be skippable.

At this point you might try the following:

$ cat bidresponse.json | grep 'UNRX-GIDI-9d2a2ba53ebc' | grep '16'
{"id":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","seatbid":[{"bid":[{"id":"1","impid":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","price":8.49,"adid":"8756987","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[16]}],"seat":"10000001"}],"cur":"USD"}
{"id":"788be2a1-6c3a-3416-b2e3-867t786f57f5","seatbid":[{"bid":[{"id":"1","impid":"788be2a1-6c3a-3416-b2e3-867t786f57f5","price":2.49,"adid":"6736787","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidm.co.uk"],"dealid":"UNRX-GIDI-9d2a2ba53ebc"}],"seat":"10000001"}],"cur":"USD"}
{"id":"cd56c8ce-7ed6-49b6-84f5-689dd5315196","seatbid":[{"bid":[{"id":"1","impid":"cd56c8ce-7ed6-49b6-84f5-689dd5315196","price":7.49,"adid":"8762655","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[15, 16]}],"seat":"10000001"}],"cur":"USD"}

This solves the problem of matching the bid responses that aren’t on the deal, but still matches the random 16 that appear in the object.

You might then try to match the end of the array too:

$ cat bidresponse.json | grep 'UNRX-GIDI-9d2a2ba53ebc' | grep '16\]'
{"id":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","seatbid":[{"bid":[{"id":"1","impid":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","price":8.49,"adid":"8756987","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[16]}],"seat":"10000001"}],"cur":"USD"}
{"id":"cd56c8ce-7ed6-49b6-84f5-689dd5315196","seatbid":[{"bid":[{"id":"1","impid":"cd56c8ce-7ed6-49b6-84f5-689dd5315196","price":7.49,"adid":"8762655","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[15, 16]}],"seat":"10000001"}],"cur":"USD"}

This worked, as we lucked out and 16 was at the end of the array, but what if a random bidder were to order these differently and instead of [15,16] they would have sent [16,15]? Or what if some other field had an array with the value [16] which doesn’t neceserily describe skipability?

And we haven’t even began to check for the presence of an Ad ID which appears under the adid field.

Clearly, answering this question using grep is very hard as the shape of our logs can be quite complex. Using the expressions supported by grep can get very unwieldily complex when trying to match up with these shapes and while jq is great for focusing in on certain parts of these logs, it isn’t great for getting the big picture in which those parts reside.

Clearly the issue is deeper:

There isn’t much point having structured logs if you can’t reason about them in a structured manner.

So I made a tool which would allow me to do so.

Jeff Goldblum aka jg - grep for JSON

I began to think about the complex questions I often ask when sniffing around in these logs and came to the following conclusions:

  1. I need a tool that could describe a JSON structure and match input against this structure
  2. I need complex ways of describing not just the shape, but the data itself as well, ideally so I can identify containment or exact values
  3. I need an API which will feel intuitive. A hard endeavour indeed as intuitive to me isn’t necessarily intuitive to you.

In order to address these points as well as possible I chose to emulate existing patterns I’m already familiar with and which are familiar to people working in a similar context to mine.

The first pattern to emulate was, obviously, the tools we we’re already using - jq and grep.

The command-line API for my tool is as close to that of grep as possible. The various flags, such as -c (print count) and -i (ignore case), are the same. The ability to either pipe input into jg or specify a specific file are supported too.

You get the idea.

As we saw earlier, using regular expressions doesn’t make sense when trying to describe the shape of JSON input. jq solves this by specifying a syntax of filters which allows you to dig deep into JSON objects. Instinctively you might think I should adopt the jq syntax but I actually decided against that for two reasons.

The first reason was that these filters are quite clearly geared towards slicing apart a JSON objects, rather than matching to them. This makes the syntax quite complex at times and unfit for my purposes.

The second reason is that, to be blunt, I don’t find the jq syntax intuitive and always find myself looking things up in the documentation no matter how often I use the tool.

Instead I chose to emulate another syntax which I have been using to match shapes for many years and am highly familiar with: CSS Selectors.

While the structure of HTML Documents, which is what CSS Selectors are optimised for, is very differently from JSON structures, I could see enough similarities and I felt it would be quite intuitive to people who are used to both JSON and CSS, such as the web development community.

The project’s readme documents the entire syntax in detail. I’d appreciate feedback on this syntax, so please do tell me how intuitive you actually find it.

Wait… why Jeff Goldblum? What does he have to do with JSON? Well, I originally named the project jgrep, but that name and similar variants are taken by similar tools. But then I thought… json grep could be shortened to jg… and jg is obviously Jeff Goldblum… And Jeff is awesome, so obviously jg would have to be named after him.

Using jg To Answer The Question

So how would you use jg to answer the question I asked above? Can we find all the bid responses which include bids on a specific Private Marketplace Deal, have an Ad Id on them and can be skipped?”

Like so:

$ cat bidresponse.json | jg '{"dealid":"UNRX-GIDI-9d2a2ba53ebc"}' | jg '.adid'  | jg '{"attr":[16]}' 
{"id":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","seatbid":[{"bid":[{"id":"1","impid":"300eb402-3a4d-3486-b2e3-9d2a2ba53ebc","price":8.49,"adid":"8756987","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[16]}],"seat":"10000001"}],"cur":"USD"}
{"id":"cd56c8ce-7ed6-49b6-84f5-689dd5315196","seatbid":[{"bid":[{"id":"1","impid":"cd56c8ce-7ed6-49b6-84f5-689dd5315196","price":7.49,"adid":"8762655","adm":"<?xml version=\"1.0\" encoding=\"UTF-8\"?><VAST version=\"2.0\"></VAST>","adomain":["gidi.io"],"dealid":"UNRX-GIDI-9d2a2ba53ebc","attr":[15, 16]}],"seat":"10000001"}],"cur":"USD"}

As you can see in this example, just like grep, we can combine searches by piping jg commands into one another. Effectively this narrows our results down by filtering down the results using one pattern and then handing those over to another search with another pattern.

If, on the other hand, you want to print events that match one of multiple patterns (an OR relationship, instead of AND) then you can use the -e flag… again, just like grep.

So for example, if you wanted to find bid responses that either have an Ad ID or have the specified deal ID, you could use the following:

$ jg -e '{"dealid":"UNRX-GIDI-9d2a2ba53ebc"}' -e '.adid' bidresponse.json

Using jg On Bid Requests

To better showcase the benefit of using jg lets take a look at openRTB’s Bid Request, which is a somewhat more complex entity than the Bid Response.

Bid Requests are an entity that the Ad Exchange sends to potential buyers in order to offer them the opportunity of buying advertising real-estate. We’re talking about a structure that can be vastly different depending on the ad format on offer, the type of the real-estate and a variety of other factors.

For example, here is a Bid Request offering the opportunity of purchasing space on a publisher’s site in which a video ad could be displayed in one of those annoying floating spaces.

{
	"id": "521fc063-7db1-4bd3-a6f2-d457d46f8c3e",
	"imp": [{
		"id": "521fc063-7db1-4bd3-a6f2-d457d46f8c3e",
		"video": {
			"mimes": ["video/mp4", "application/javascript"],
			"minduration": 1,
			"maxduration": 7200,
			"protocols": [2, 5, 3, 6],
			"w": 375,
			"h": 603,
			"startdelay": 0,
			"linearity": 1,
			"sequence": 1,
			"boxingallowed": 1,
			"api": [2],
			"placement": 5
		},
		"displaymanager": "in_article",
		"displaymanagerver": "1.0",
		"instl": 1,
		"bidfloor": 5.82,
		"bidfloorcur": "USD",
		"secure": 1,
		"pmp": {
			"deals": [{
				"id": "UNRX-GIDI-1",
				"bidfloor": 4.00,
				"bidfloorcur": "USD",
				"at": 2
			}, {
				"id": "UNRX-GIDI-2",
				"bidfloor": 4.00,
				"bidfloorcur": "USD",
				"at": 2
			}],
			"private_auction": 0
		}
	}],
	"device": {
		"ua": "Mozilla/5.0 (iPhone; CPU iPhone OS 8_4_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) GSA/8.0.57838 Mobile/12H321 Safari/600.1.4",
		"ip": "00.000.00.000",
		"geo": {
			"country": "GBR"
		},
		"devicetype": 1
	},
	"user": {
		"ext": {
			"consent": ""
		}
	},
	"site": {
		"id": "08700a49-c92d-4b90-9f76-c7f2d514f86b",
		"domain": "www.gidi.io",
		"page": "https://www.gidi.io/",
		"ref": "https://www.google.co.uk/"
	},
	"at": 2,
	"allimps": 0,
	"bcat": ["IAB25-1", "IAB25-2", "IAB25-3", "IAB25-4", "IAB25-5", "IAB26-1", "IAB26-2", "IAB26-3", "IAB26-4", "IAB9-9"],
	"badv": ["noads.com"],
	"regs": {
		"ext": {
			"gdpr": 1
		}
	}
}

This is a relatively simple Bid Request if you can believe it.

Bellow are a variety of example jg cli calls that would match the above bid request. If you’re unsure how the pattern works I suggest looking at the shape of the Bid Request and finding the specified fields in the structure, hopefully it will make more sense then.

The following will match any JSON object that has an object under the key “site” which has a key “ref”:

$ jg '.site.ref' bidrequests.json

The following will match any JSON object that has a key “country” whose value is “GBR”, meaning the user is located in Britain:

$ jg '{"country":"GBR"}' bidrequests.json

The following will match any JSON object that has a key “pmp” followed by a key “deals” whose value is a non-empty array:

$ jg '.pmp.deals[.]' bidrequests.json

Following from the last example, this will only match if the array of deals has a deal with the id “UNRX-GIDI-2”:

$ jg '.pmp.deals[{"id":"UNRX-GIDI-2"}]' bidrequests.json

More often than not I need to search for events which match several different things at once.

For example, what if I want to find all the Bid Requests which were offered for:

  1. A user using a mobile device (denoted by devicetype=1)
  2. The browser being used by the user is the Google Search App (denoted by GSA appearing in the User Agent)
  3. The user is located in Britain
  4. The space on offer for the Video Ad is a full screen placement (denoted by instl=1)
$ cat bidrequests.json |              
  jg '.device{"devicetype":1}' |   
  jg '.device{"ua"*:"GSA/"}' |     
  jg '{"country":"GBR"}' |         
  jg '.imp[{"instl":1}]'

Conclusion

The most important take away for me has been that structured logging opens up a whole world of options when you want to analyse what is going on in your system. But as I said earlier: **There isn’t much point having structured logs if you can’t reason about them in a structured manner in the place where you need them **.

Feel free to play around with jg and submit a pull request if you spot an error or have an improvement to offer.

For example, if you look at that last example I gave, you might have noticed that even though the devicetype and ua fields are on the same device object, you can’t match against both at the same time. Ideally you would be able to do something like jg '.device{"devicetype":1,"ua"*:"GSA/"}' which would definitely be more intuitive. This is just one example of a still missing feature which I’ll be more than happy to receive a pull request for.

I haven’t yet published the cli tool to Homebrew or APT, but I hope to do so once I feel the tool is mature enough. In the meantime you can download the latest release on Github.