“Parsing” in Python
I recently read “Parse, Don’t Validate” and liked it enough to share it with my coworkers and let it bring me out of retirement on lobste.rs. It captures a concept that I have struggled to explain, leading to many cases where I couldn’t say why I thought something was better beyond a vague “It’s a matter of taste.” That’s not very satisfactory as a justification when you’re trying to explain to someone why they should rework a piece of code in a review.
Still, as compelling as I find the explanation, it has two practical flaws: first, most of us aren’t writing Haskell every day. The message is more widely applicable than the language. I will offer explanations here with Python syntax and tools instead. Second, dynamically typed languages don’t offer static typechecking, so encoding facts in types is more limited. As such, I offer this rephrasing:
Any validation of data should produce an object you can trust without repeating that validation.
Consider the basic act of JSON parsing in a typical API project. Your code receives a `str` object and, in most Python projects I have worked with, immediately turns it into a `dict` with no further encoding of meaning.
```python
def update_handler(request):
    body = json.loads(request.body)
```
This JSON parsing produces a more constrained object than a string. After all, JSON at least has some concept of validity and structure, and you know that you’ve got either a Python `dict` or an exception after this parse. Do you trust the contents of that `dict`, though?
```python
def update_handler(request):
    body = json.loads(request.body)
    # Parentheses matter here: without them, ``new_title`` would be bound
    # to the boolean result of ``body.get("title") is None``.
    if (new_title := body.get("title")) is None:
        raise HTTPBadRequest("whoopsidoodle")
```
No, you don’t. Worse, although you’ve validated the input here, you haven’t encoded the meaning in any way. This validation is local: only code that directly follows it in the `update_handler` function can trust that this piece of data has been validated. Any other consumer must repeat the same check, and it’s depressingly common for validations that work perfectly when they’re initially written to fall through the cracks as further code is added that relies on the data. “This ‘get item’ syntax needs to be replaced with a call to `.get`” is one of my most frequent code review comments.
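To make that failure mode concrete, here’s a hypothetical later consumer of the same `dict` (the function name and data are illustrative, not from the original post). Nothing ties it to the validation performed in `update_handler`, so the check silently fails to travel with the data:

```python
# Hypothetical code added months later, far from update_handler.
def summarize(body):
    # Bracket access quietly assumes "title" was validated upstream,
    # but nothing enforces that: a missing key raises KeyError here.
    return body["title"][:20]

try:
    summarize({"author": "sam"})  # this caller never validated "title"
except KeyError:
    print("the validation didn't travel with the data")
```

The validation exists somewhere in the codebase, but the `dict` itself carries no evidence of it.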
So what can you do? One of the first tools to reach for is just the language’s “type definition” tools. In Python, that means classes. If you convert external input to an internal class, you gain control over the parsing and validation of that input, and can choose how to handle cases where data is missing.
```python
class Post:
    def __init__(self, body):
        self.title = body.get("title", "Untitled")
        self.author = body.get("author", "Anonymous")
        ...
```
At least the defaults are located in one place. It’s not very extensible or adaptable, but if you have a `Post`, then by gosh there’s going to be a `title`.1 With the new `dataclasses` module in Python, you can even make it a little more concise.
```python
from dataclasses import dataclass

@dataclass
class Post:
    title: str = "Untitled"
    author: str = "Anonymous"
    ...
```
The drawback of this kind of approach, where you handle default cases in a central place, is that you aren’t really capturing the meaning of what you received as input. For example, there’s no distinction, in this code, between someone intentionally naming a post “Untitled” and someone simply forgetting to name a post. What if, later, the default needs to change?
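To see the ambiguity concretely, here’s a small sketch reusing the dataclass `Post` shape from above (trimmed to a single field for brevity):

```python
from dataclasses import dataclass

@dataclass
class Post:
    title: str = "Untitled"

deliberate = Post(title="Untitled")  # the author chose this name
omitted = Post()                     # the author sent no title at all
# The two objects are indistinguishable, so the intent is lost.
assert deliberate == omitted
```

Once the default has been applied, no downstream code can recover whether it was chosen or filled in.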
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Post:
    title: Optional[str] = None
    author: Optional[str] = None

    def has_title(self):
        return self.title is not None
    ...
```
Great, now we can determine that a title wasn’t provided at all:

```python
def update_handler(request):
    post = Post(**json.loads(request.body))
    existing_post = request.dbsession.query(PostModel).get(request.params.id)
    if post.has_title():
        existing_post.title = post.title
    ...
```
We’ve centralized this validation and eliminated some repetition by using a class, but nothing has made that title validation genuinely unnecessary. How can we do that? One rule is that “smaller” data types are easier to confidently operate on. Think of booleans versus strings. Thus, if we can make the data types we’re operating on “smaller,” we can stop worrying about the values they can’t represent. So let’s back up.
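A quick aside to make the “smaller” data type idea concrete (a hypothetical example, not from the post): parsing a free-form status string into an `Enum` at the edge means downstream code can never even see an invalid status.

```python
from enum import Enum

class Status(Enum):
    DRAFT = "draft"
    PUBLISHED = "published"

def parse_status(raw: str) -> Status:
    # Raises ValueError for anything outside the two valid values,
    # so code holding a Status never needs to re-check the string.
    return Status(raw)
```

A `Status` can represent exactly two states; a `str` can represent anything, so every consumer of the string has to worry about the garbage cases.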
What is our update handler trying to accomplish?
It needs to accept valid updates and reject invalid updates for a given model. What if we considered the update as our data structure instead?
```python
from dataclasses import dataclass
from typing import Any, Union

@dataclass
class FieldUpdate:
    field_name: str
    value: Any
    # ``Any`` and ``Union`` are taken from the ``typing`` module.
    invalid_reason: Union[FieldErrors.Unknown, FieldErrors.Immutable, None]

class ModelUpdate:
    def __init__(self, model_class, request_body):
        updates = []
        for field, value in request_body.items():
            ...  # We'd check whether the field is known and mutable here.
            updates.append(FieldUpdate(field, value, field_error))
        self.updates = updates

    def apply(self, model_obj):
        if any(u.invalid_reason for u in self.updates):
            raise InvalidUpdateError(...)
        for update in self.updates:
            setattr(model_obj, update.field_name, update.value)
```
The code is a lot more indirect (the word “post” doesn’t appear anywhere in these classes), but calling code becomes more readable, and updates for any model that followed a similar pattern would be identical.
```python
def update_handler(request):
    post = request.dbsession.query(Post).get(request.params.id)
    ModelUpdate(Post, json.loads(request.body)).apply(post)
```
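For readers who want to run the pattern end to end, here’s a simplified, self-contained sketch. It deviates from the snippet above in a few assumed details: invalid reasons are plain strings rather than a `FieldErrors` union, and the set of mutable fields is read from a hypothetical `MUTABLE_FIELDS` class attribute.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class FieldUpdate:
    field_name: str
    value: Any
    invalid_reason: Optional[str] = None  # simplified: a string, not a union type

class InvalidUpdateError(Exception):
    pass

class ModelUpdate:
    def __init__(self, model_class, request_body):
        # Assumption for this sketch: models declare their mutable fields.
        mutable = getattr(model_class, "MUTABLE_FIELDS", set())
        self.updates = [
            FieldUpdate(name, value,
                        None if name in mutable else "unknown or immutable")
            for name, value in request_body.items()
        ]

    def apply(self, model_obj):
        bad = [u.field_name for u in self.updates if u.invalid_reason]
        if bad:
            raise InvalidUpdateError(bad)
        for u in self.updates:
            setattr(model_obj, u.field_name, u.value)

class Post:
    MUTABLE_FIELDS = {"title", "author"}

    def __init__(self):
        self.title = "Untitled"
        self.author = "Anonymous"

post = Post()
ModelUpdate(Post, {"title": "Hello"}).apply(post)
# post.title is now "Hello"; an update touching an unknown field raises.
```

Note that a `ModelUpdate` is either fully applicable or it raises before touching the model, so a half-applied update can’t exist.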
In the real world, of course, validations at the edge of your system are more complex. We wouldn’t pass raw JSON input straight to a class like `ModelUpdate`. Instead, we could use a dedicated deserialization library to centralize that logic.2
Boiling it down to a few rules: put validation at the edge of the system. Produce a data type that you can trust without further validation. Every instance of that data type should be semantically valid, in terms of business logic, and no operations on it should produce invalid values.
Last, and most crucially, don’t be afraid to create new, more specific data types for your specific use cases. It’s okay to represent different data, used for different purposes, with different data structures, and doing so makes later generalization easier!