data Blog = Blog { me :: Programmer, posts :: [Opinion] }

“Parsing” in Python

I recently read “Parse, Don’t Validate” and liked it enough to share it with my coworkers and let it bring me out of retirement on lobste.rs. It captures a concept that I have struggled to explain, leading to many cases where I couldn’t say why I thought something was better beyond a vague “It’s a matter of taste.” That’s not very satisfactory as a justification when you’re trying to explain to someone why they should rework a piece of code in a review.

Still, as compelling as I find the explanation, it has two practical flaws: first, most of us aren’t writing Haskell every day. The message is more widely applicable than the language. I will offer explanations here with Python syntax and tools instead. Second, dynamically typed languages don’t offer static typechecking, so encoding facts in types is more limited. As such, I offer this rephrasing:

Any validation of data should produce an object you can trust without repeating that validation.

Consider the basic fact of JSON parsing in a typical API project. Your code receives a str object and, in most Python projects I have worked with, produces a dict with no further encoding of meaning.

def update_handler(request):
    body = json.loads(request.body)

This JSON parsing produces a more constrained object than a string. After all, JSON at least has some concept of validity and structure, and you know that you’ve got either a Python object (a dict, if the client sent one) or an exception after this parse. Do you trust it, though?
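In fact, even “it’s a dict” is an assumption about the client: json.loads hands back whatever the top-level JSON value happens to be. A quick check with only the stdlib:

```python
import json

# json.loads returns whatever the top-level JSON value is,
# so "I have a dict" is already a claim about the client's behavior.
assert json.loads('{"title": "Hello"}') == {"title": "Hello"}
assert json.loads('[1, 2, 3]') == [1, 2, 3]    # a list, not a dict
assert json.loads('"surprise"') == "surprise"  # a bare string parses fine too
```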

def update_handler(request):
    body = json.loads(request.body)

    if (new_title := body.get("title")) is None:
        raise HTTPBadRequest("whoopsidoodle")

No, you don’t. Worse, although you’ve validated the input here, you haven’t encoded the meaning in any way. This validation is local: only code that directly follows it in the update_handler function can trust that this piece of data has been validated. Any other consumer must repeat the same check, and it’s depressingly common for validations that work perfectly when they’re initially written to fall through the cracks as further code is added that relies on the data. “This ‘get item’ syntax needs to be replaced with a call to .get” is one of my most frequent code review comments.
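That review comment points at a real failure mode. A minimal illustration with a plain dict (the body here is made up):

```python
body = {"author": "Alice"}  # hypothetical request body with no "title"

# Bracket access assumes someone, somewhere, already validated the key:
try:
    title = body["title"]
except KeyError:
    title = None  # ...and every consumer has to repeat this dance.

# .get makes the "maybe missing" case explicit, but only locally:
assert body.get("title") is None
assert body.get("title", "Untitled") == "Untitled"
```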

So what can you do? One of the first tools to reach for is just the language’s “type definition” tools. In Python, that means classes. If you convert external input to an internal class, you gain control over the parsing and validation of that input, and can choose how to handle cases where data is missing.

class Post:
    def __init__(self, body):
        self.title = body.get("title", "Untitled")
        self.author = body.get("author", "Anonymous")
        ...

At least defaults are located in one place. It’s not very extensible or adaptable, but if you have a Post, then by gosh there’s going to be a title.1 With the new dataclasses in Python, you can even make it a little more concise.

from dataclasses import dataclass

@dataclass
class Post:
    title: str = "Untitled"
    author: str = "Anonymous"
    ...
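For illustration, constructing that dataclass from a partial body shows the defaults kicking in (the Post definition is repeated here so the snippet stands alone):

```python
from dataclasses import dataclass

@dataclass
class Post:
    title: str = "Untitled"
    author: str = "Anonymous"

post = Post(**{"author": "Alice"})
assert post.title == "Untitled"  # missing field falls back to the default
assert post.author == "Alice"

# A misspelled field fails loudly -- crude, but it is a form of edge validation:
try:
    Post(**{"titel": "oops"})
except TypeError:
    pass
```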

The drawback of this type of approach, where you handle default cases in a central place, is that you aren’t really capturing the meaning of what you received as input. For example, there’s no distinction, in this code, between someone intentionally naming a post “Untitled” and someone simply forgetting to name a post. What if, later, the default needs to change?

from dataclasses import dataclass
from typing import Optional

@dataclass
class Post:
    title: Optional[str] = None
    author: Optional[str] = None

    def has_title(self):
        return self.title is not None

    ...

Great, now we can determine that a title wasn’t supplied—but this pushes the burden back onto the consumer once again, as discussed in the original article:

def update_handler(request):
    post = Post(**json.loads(request.body))
    existing_post = request.dbsession.query(PostModel).get(request.params.id)

    if post.has_title():
        existing_post.title = post.title

    ...

We’ve centralized this validation and eliminated the need to use dict.get by using a class, but nothing has made that title check genuinely unnecessary. How can we do that? One rule of thumb is that “smaller” data types are easier to operate on confidently. Think of booleans versus strings. If we can make the data types we’re operating on “weaker,” we can stop worrying about the invalid values they can’t represent. So let’s back up.
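Before backing up, a concrete taste of what “weaker” buys you, sketched with the stdlib enum module (the Visibility field is invented for illustration): parse a free-form string into an Enum once, and downstream code has exactly three cases to handle instead of arbitrarily many strings.

```python
from enum import Enum

class Visibility(Enum):
    PUBLIC = "public"
    UNLISTED = "unlisted"
    PRIVATE = "private"

# Parse once at the edge; anything unrecognized raises ValueError here,
# not deep inside business logic.
vis = Visibility(" public ".strip())
assert vis is Visibility.PUBLIC

# Downstream consumers can exhaustively handle three values;
# the type simply cannot represent "pubilc" or "".
```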

What is our update handler trying to accomplish?

It needs to accept valid updates and reject invalid updates for a given model. What if we considered the update as our data structure instead?

@dataclass
class FieldUpdate:
    field_name: str
    value: Any  # ``Any`` and ``Union`` are taken from the ``typing`` module.
    invalid_reason: Union[FieldErrors.Unknown, FieldErrors.Immutable, None]

class ModelUpdate:
    def __init__(self, model_class, request_body):
        updates = []

        for field, value in request_body.items():
            ...  # We'd check whether the field is known and mutable here.
            updates.append(FieldUpdate(field, value, field_error))

        self.updates = updates

    def apply(self, model_obj):
        if any(u.invalid_reason for u in self.updates):
            raise InvalidUpdateError(...)

        for update in self.updates:
            setattr(model_obj, update.field_name, update.value)

The code is a lot more indirect (the word “post” doesn’t appear anywhere in these classes), but calling code becomes more readable, and updates for any model that followed a similar pattern would be identical.

def update_handler(request):
    post = request.dbsession.query(PostModel).get(request.params.id)
    ModelUpdate(PostModel, json.loads(request.body)).apply(post)

In the real world, of course, validations at the edge of your system are more complex. We wouldn’t pass in raw JSON input to a class like ModelUpdate. Instead, we could use tools like marshmallow and marshmallow-dataclass to centralize logic around deserialization.2

Boiling it down to a few rules: put validation at the edge of the system. Produce a data type that you can trust without further validation. Every instance of that data type should be semantically valid, in terms of business logic, and no operations on it should produce invalid values.
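Those rules can be sketched with nothing but the stdlib (all names here are hypothetical): a classmethod that does every check once, at the edge, so that merely holding a TitledPost is proof the checks passed. The frozen dataclass means no later operation can mutate it into an invalid state.

```python
from dataclasses import dataclass

class ValidationError(ValueError):
    pass

@dataclass(frozen=True)  # frozen: no operation can produce an invalid value
class TitledPost:
    title: str
    author: str

    @classmethod
    def parse(cls, body: dict) -> "TitledPost":
        # All validation lives here, at the edge of the system.
        title = body.get("title")
        if not isinstance(title, str) or not title.strip():
            raise ValidationError("a non-empty string title is required")
        return cls(title=title.strip(), author=body.get("author", "Anonymous"))

post = TitledPost.parse({"title": "  Hello  "})
assert post.title == "Hello"  # normalized once, trusted everywhere after
```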

Last, and most crucially, don’t be afraid to create new, more specific data types for your specific use cases. It’s okay to represent different data, used for different purposes, with different data structures; doing so even makes later generalization easier!


  1. Unless someone has overwritten it, the paranoid programmer ponders. [return]
  2. I will cover marshmallow and marshmallow-dataclass in a future post. There is also a related project, marshmallow-sqlalchemy, but I don’t recommend it because I think it conflates concerns. [return]