How to Use Python Data Classes in 2023 (A Beginner’s Guide)
In Python, a data class is a class that is designed to only hold data values. They aren't different from regular classes, but they usually don't have any other methods. They are typically used to store information that will be passed between different parts of a program or a system.
However, when creating classes to work only as data containers, writing the __init__ method over and over adds a lot of work and room for errors.
The dataclasses module, a feature introduced in Python 3.7, provides a simpler way to create data classes without writing these methods by hand.
In this article, we'll see how to take advantage of this module to quickly create new classes that come not only with __init__, but with several other methods already implemented, so we don't need to write them manually. We can do all that with just a few lines of code.
We expect you to have some intermediate Python experience, including an understanding of how to create classes and of object-oriented programming in general.
Using the dataclasses Module
As a starting example, let's say we're implementing a class to store data about a certain group of people. For each person, we'll have attributes such as name, age, height, and email address. This is what a regular class looks like:
class Person():
    def __init__(self, name, age, height, email):
        self.name = name
        self.age = age
        self.height = height
        self.email = email
If we use the dataclasses module, however, we need to import dataclass to use it as a decorator on the class we're creating. When we do that, we no longer need to write the __init__ method; we only specify the attributes of the class and their types. Here's the same Person class, implemented in this way:
from dataclasses import dataclass
@dataclass
class Person():
    name: str
    age: int
    height: float
    email: str
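To make the generated __init__ a little more tangible, here is a minimal sketch that inspects its signature with the standard inspect module. The address 'joe@example.com' is a placeholder chosen for this illustration, not a value from the article.

```python
import inspect
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int
    height: float
    email: str


# The decorator builds __init__ from the annotations, in declaration order.
param_names = list(inspect.signature(Person).parameters)
print(param_names)

# Keyword arguments work just as they would with a hand-written __init__.
# The email address is a placeholder for illustration only.
joe = Person(name='Joe', age=25, height=1.85, email='joe@example.com')
print(joe.age)
```

The parameter order follows the order of the annotations in the class body, which is why the position of each attribute matters later on.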
We can also set default values for the class attributes:
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str = '[email protected]'
print(Person())
Person(name='Joe', age=30, height=1.85, email='[email protected]')
As a reminder, Python doesn't allow an attribute without a default value to follow one that has a default, in either function parameters or data class fields, so this raises an error:
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5540/741473360.py in <module>
1 @dataclass
----> 2 class Person():
3 name: str = 'Joe'
4 age: int = 30
5 height: float = 1.85
~\anaconda3\lib\dataclasses.py in dataclass(cls, init, repr, eq, order, unsafe_hash, frozen)
1019
1020 # We're called as @dataclass without parens.
-> 1021 return wrap(cls)
1022
1023
~\anaconda3\lib\dataclasses.py in wrap(cls)
1011
1012 def wrap(cls):
-> 1013 return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
1014
1015 # See if we're being called as @dataclass or @dataclass().
~\anaconda3\lib\dataclasses.py in _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
925 if f._field_type in (_FIELD, _FIELD_INITVAR)]
926 _set_new_attribute(cls, '__init__',
--> 927 _init_fn(flds,
928 frozen,
929 has_post_init,
~\anaconda3\lib\dataclasses.py in _init_fn(fields, frozen, has_post_init, self_name, globals)
502 seen_default = True
503 elif seen_default:
--> 504 raise TypeError(f'non-default argument {f.name!r} '
505 'follows default argument')
506
TypeError: non-default argument 'email' follows default argument
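As a rough sketch of how to deal with this rule, the snippet below first triggers the error on purpose and then shows one possible fix: declaring the attributes without defaults first. The BrokenPerson name and the trimmed-down attribute list are made up for this illustration.

```python
from dataclasses import dataclass

message = ''
try:
    # The error is raised while the class is being created,
    # before any instance exists.
    @dataclass
    class BrokenPerson:
        name: str = 'Joe'
        email: str  # no default after an attribute that has one
except TypeError as error:
    message = str(error)
    print(message)


# One possible fix: attributes without defaults come first.
@dataclass
class Person:
    email: str
    name: str = 'Joe'


person = Person('joe@example.com')  # placeholder address
print(person)
```

The trade-off is that reordering the attributes also changes the order of the positional arguments of __init__.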
Once the class is defined, it's easy to instantiate a new object and access its attributes, just like with a standard class:
person = Person('Joe', 25, 1.85, '[email protected]')
print(person.name)
Joe
So far we've used regular data types like string, integer, and float; we can also combine dataclass with the typing module to annotate attributes with more complex types. For instance, let's add a house_coordinates attribute to Person:
from typing import Tuple
@dataclass
class Person():
    name: str
    age: int
    height: float
    email: str
    house_coordinates: Tuple
print(Person('Joe', 25, 1.85, '[email protected]', (40.748441, -73.985664)))
Person(name='Joe', age=25, height=1.85, email='[email protected]', house_coordinates=(40.748441, -73.985664))
Following the same logic, we can create a data class to hold multiple instances of the Person class:
from typing import List
@dataclass
class People():
    people: List[Person]
Notice that the people attribute in the People class is defined as a list of instances of the Person class. For example, we could instantiate an object of People like this:
joe = Person('Joe', 25, 1.85, '[email protected]', (40.748441, -73.985664))
mary = Person('Mary', 43, 1.67, '[email protected]', (-73.985664, 40.748441))
print(People([joe, mary]))
People(people=[Person(name='Joe', age=25, height=1.85, email='[email protected]', house_coordinates=(40.748441, -73.985664)), Person(name='Mary', age=43, height=1.67, email='[email protected]', house_coordinates=(-73.985664, 40.748441))])
This allows us to annotate an attribute with any single type we want, or with a combination of types.
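One point worth keeping in mind is that these annotations are hints: the dataclass decorator doesn't check them at runtime. The sketch below, which uses a trimmed-down Person and the more specific Tuple[float, float] annotation as assumptions for the example, shows a value of the wrong type being accepted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Person:
    name: str
    age: int
    house_coordinates: Tuple[float, float]  # a pair of floats


# No error is raised here, even though age is annotated as int.
odd_person = Person('Joe', 'twenty-five', (40.748441, -73.985664))
print(type(odd_person.age).__name__)
```

If the types need to be enforced, that has to happen elsewhere, for example with a static type checker or with explicit checks in the class.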
Representation and Comparisons
As we mentioned earlier, dataclass implements not only the __init__ method, but several others, including the __repr__ method. In a regular class, we write this method to define how an object of the class is represented as a string.
For instance, we'd define the method as in the example below, and it's used when we print the object:
class Person():
    def __init__(self, name, age, height, email):
        self.name = name
        self.age = age
        self.height = height
        self.email = email
    def __repr__(self):
        return (f'{self.__class__.__name__}(name={self.name}, age={self.age}, height={self.height}, email={self.email})')
person = Person('Joe', 25, 1.85, '[email protected]')
print(person)
Person(name=Joe, age=25, height=1.85, email=[email protected])
When using dataclass, however, there's no need to write any of that:
@dataclass
class Person():
    name: str
    age: int
    height: float
    email: str
person = Person('Joe', 25, 1.85, '[email protected]')
print(person)
Person(name='Joe', age=25, height=1.85, email='[email protected]')
Notice that without all that code, the output is almost identical to the one from the standard Python class. The only difference is that the generated __repr__ shows string values in quotes.
We can always override it if we want to customize the representation of our class:
@dataclass
class Person():
    name: str
    age: int
    height: float
    email: str
    def __repr__(self):
        return (f'''This is a {self.__class__.__name__} called {self.name}.''')
person = Person('Joe', 25, 1.85, '[email protected]')
print(person)
This is a Person called Joe.
Notice that the output of the representation is customized.
When it comes to comparisons, the dataclasses module makes our lives easier. For example, we can directly compare two instances of a class just like this:
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str = '[email protected]'
print(Person() == Person())
True
Notice that we used default attributes to make the example shorter.
In this case, the comparison works because the dataclass decorator creates an __eq__ method behind the scenes, which performs the comparison. Without the decorator, we'd have to create this method ourselves.
The same comparison gives a different outcome with a standard Python class, even though the two instances hold exactly the same values:
class Person():
    def __init__(self, name='Joe', age=30, height=1.85, email='[email protected]'):
        self.name = name
        self.age = age
        self.height = height
        self.email = email
print(Person() == Person())
False
Without the dataclass decorator, the class doesn't define how to test whether two instances are equal. So, by default, Python compares object identity, which is what id returns, and, as we see below, the two objects are different:
print(id(Person()))
print(id(Person()))
1734438049008
1734438050976
All this means that we'd have to write an __eq__ method that makes this comparison:
class Person():
    def __init__(self, name='Joe', age=30, height=1.85, email='[email protected]'):
        self.name = name
        self.age = age
        self.height = height
        self.email = email
    def __eq__(self, other):
        if isinstance(other, Person):
            return (self.name, self.age,
                    self.height, self.email) == (other.name, other.age,
                                                 other.height, other.email)
        return NotImplemented
print(Person() == Person())
True
Now we see the two objects are equal to each other, but we had to write more code to get this result.
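The generated __eq__ behaves much like the hand-written version above: it compares the attributes as tuples, and only when both objects belong to the same class. Here is a small sketch of that behavior; the Robot class and the shortened attribute list are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str = 'Joe'
    age: int = 30


@dataclass
class Robot:  # hypothetical class with exactly the same attributes
    name: str = 'Joe'
    age: int = 30


same_values = Person() == Person('Joe', 30)
different_age = Person() == Person(age=31)
different_class = Person() == Robot()

print(same_values, different_age, different_class)
```

Two objects with identical values still compare as unequal when their classes differ, which avoids accidental matches between unrelated data classes.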
The @dataclass Parameters
As we saw above, when using the dataclass decorator, the __init__, __repr__, and __eq__ methods are implemented for us. The creation of these methods is controlled by the init, repr, and eq parameters of dataclass. These three parameters are True by default. If the class already defines one of these methods itself, the corresponding parameter is ignored.
However, dataclass has other parameters that we should look at before moving on:
- order: enables sorting of the class, as we'll see in the next section. The default is False.
- frozen: when True, the values inside an instance of the class can't be modified after it's created. The default is False.
There are a few other parameters that you can check in the documentation.
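To see what the repr and eq parameters do, the sketch below turns both of them off. It's only an illustration with a one-attribute Person, but it shows the class falling back to the behavior of a regular Python class.

```python
from dataclasses import dataclass


@dataclass(repr=False, eq=False)
class Person:
    name: str = 'Joe'


first = Person()
second = Person()

# With eq=False no __eq__ is generated, so == compares identity again.
print(first == second)
print(first == first)

# With repr=False we get the default representation, something like
# <...Person object at 0x...>, instead of Person(name='Joe').
print(repr(first))
```

__init__ is still generated here because the init parameter keeps its default value of True.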
Sorting
When working with data, we often need to sort values. In our scenario, we may want to sort our different people based on some attribute. For that, we'll use the order parameter of the dataclass decorator mentioned above, which enables sorting in the class:
@dataclass(order=True)
class Person():
    name: str
    age: int
    height: float
    email: str
When the order parameter is set to True, it automatically generates the __lt__ (less than), __le__ (less than or equal), __gt__ (greater than), and __ge__ (greater than or equal) methods used for sorting.
Let's instantiate our joe and mary objects to see if one is greater than the other:
joe = Person('Joe', 25, 1.85, '[email protected]')
mary = Person('Mary', 43, 1.67, '[email protected]')
print(joe > mary)
False
Python tells us that joe is not greater than mary, but based on what criteria? The class compares the objects as tuples containing their attributes, like this:
print(('Joe', 25, 1.85, '[email protected]') > ('Mary', 43, 1.67, '[email protected]'))
False
As the letter "J" comes before "M", it tells us that joe < mary. If the names were the same, it would move on to the next element in each tuple. As it is, it's comparing the objects alphabetically by name. Although that can make sense depending on the problem we're dealing with, we want to be able to control how the objects are sorted.
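The tuple-style comparison can be checked with a short sketch. It uses a reduced Person with only name and age, which is an assumption made to keep the example small.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Person:
    name: str
    age: int


# The first attribute decides: 'Joe' sorts before 'Mary', whatever the ages.
first_field_decides = Person('Joe', 99) < Person('Mary', 20)

# With equal names, the comparison moves on to the next attribute.
tie_broken_by_age = Person('Joe', 25) < Person('Joe', 43)

print(first_field_decides, tie_broken_by_age)
```

The order in which the attributes are declared is therefore also the order of priority in comparisons.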
To control the sort order, we'll take advantage of two other features of the dataclasses module.
The first is the field function. This function customizes one attribute of a data class individually. Among other things, it lets us declare an attribute that isn't passed to __init__ and whose value depends on another attribute, so it's only set after the object is instantiated.
In our sorting problem, we'll use field to create a sort_index attribute in our class. Its value is set only after the object is instantiated, and because it's the first attribute in the class, it's the first thing dataclasses compares when sorting:
from dataclasses import dataclass, field
@dataclass(order=True)
class Person():
    sort_index: int = field(init=False, repr=False)
    name: str
    age: int
    height: float
    email: str
The two arguments that we passed as False state that this attribute isn't a parameter of __init__ and that it shouldn't be displayed when we call __repr__. There are other parameters in the field function that you can check in the documentation.
After declaring this new attribute, we'll use the second new tool: the __post_init__ method. As the name suggests, this method is executed right after the __init__ method. We'll use __post_init__ to set sort_index right after the creation of the object. As an example, let's say we want to compare people based on their age. Here's how:
@dataclass(order=True)
class Person():
    sort_index: int = field(init=False, repr=False)
    name: str
    age: int
    height: float
    email: str
    def __post_init__(self):
        self.sort_index = self.age
If we make the same comparison, we know that Joe is younger than Mary:
joe = Person('Joe', 25, 1.85, '[email protected]')
mary = Person('Mary', 43, 1.67, '[email protected]')
print(joe > mary)
False
If we wanted to sort people by height, we'd use this code:
@dataclass(order=True)
class Person():
    sort_index: float = field(init=False, repr=False)
    name: str
    age: int
    height: float
    email: str
    def __post_init__(self):
        self.sort_index = self.height
joe = Person('Joe', 25, 1.85, '[email protected]')
mary = Person('Mary', 43, 1.67, '[email protected]')
print(joe > mary)
True
Joe is taller than Mary. Notice that we annotated sort_index as a float this time.
We were able to implement sorting in our data class without the need to write multiple methods.
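Since the comparison methods now exist, built-in functions such as sorted and max work on our objects directly. The following sketch assumes a shortened Person with only name and age, plus a third, invented person named Ann.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Person:
    sort_index: int = field(init=False, repr=False)
    name: str
    age: int

    def __post_init__(self):
        # Runs right after the generated __init__.
        self.sort_index = self.age


people = [Person('Mary', 43), Person('Joe', 25), Person('Ann', 31)]

# sorted() relies on the generated __lt__, so age is compared first.
names_by_age = [person.name for person in sorted(people)]
oldest = max(people)

print(names_by_age)
print(oldest)
```

When two people have the same sort_index, the comparison continues with the remaining attributes in declaration order, as described earlier.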
Working with Immutable Data Classes
Another parameter of @dataclass that we mentioned above is frozen. When set to True, frozen doesn't allow us to modify the attributes of an object after it's created.
With frozen=False, we can easily perform such a modification:
@dataclass()
class Person():
    name: str
    age: int
    height: float
    email: str
joe = Person('Joe', 25, 1.85, '[email protected]')
joe.age = 35
print(joe)
Person(name='Joe', age=35, height=1.85, email='[email protected]')
We created a Person object and then modified the age attribute without any problems.
However, when frozen is set to True, any attempt to modify the object throws an error:
@dataclass(frozen=True)
class Person():
    name: str
    age: int
    height: float
    email: str
joe = Person('Joe', 25, 1.85, '[email protected]')
joe.age = 35
print(joe)
---------------------------------------------------------------------------
FrozenInstanceError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5540/2036839054.py in <module>
8 joe = Person('Joe', 25, 1.85, '[email protected]')
9
---> 10 joe.age = 35
11 print(joe)
<string> in __setattr__(self, name, value)
FrozenInstanceError: cannot assign to field 'age'
Notice that the error raised is a FrozenInstanceError.
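When we need a changed value with a frozen class, one common approach is to build a new object rather than modify the existing one. The sketch below uses the replace function from the dataclasses module for that, again with a shortened Person as an assumption.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Person:
    name: str
    age: int


joe = Person('Joe', 25)

blocked = False
try:
    joe.age = 35
except FrozenInstanceError:
    blocked = True

# replace() creates a new object; the original stays untouched.
older_joe = replace(joe, age=35)

print(blocked)
print(joe)
print(older_joe)
```

This keeps the guarantee that an existing object never changes, while still letting the program work with updated values.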
There's a catch, though: the contents of an immutable data class can still change. If our class contains a mutable attribute, this attribute can change even though the class is frozen. This may seem like it doesn't make sense, but let's look at an example.
Let's recall the People class that we created earlier in this article, but now let's make it immutable:
# Person is defined first so that People can refer to it
@dataclass(frozen=True)
class Person():
    name: str
    age: int
    height: float
    email: str
@dataclass(frozen=True)
class People():
    people: List[Person]
We then create two instances of the Person class and use them to create an instance of People that we'll name two_people:
joe = Person('Joe', 25, 1.85, '[email protected]')
mary = Person('Mary', 43, 1.67, '[email protected]')
two_people = People([joe, mary])
print(two_people)
People(people=[Person(name='Joe', age=25, height=1.85, email='[email protected]'), Person(name='Mary', age=43, height=1.67, email='[email protected]')])
The people attribute in the People class is a list. We can easily access the values in this list in the two_people object:
print(two_people.people[0])
Person(name='Joe', age=25, height=1.85, email='[email protected]')
So, even though both the Person and People classes are immutable, the list is not, which means we can change the values in it:
two_people.people[0] = Person('Joe', 35, 1.85, '[email protected]')
print(two_people.people[0])
Person(name='Joe', age=35, height=1.85, email='[email protected]')
Notice that the age is now 35.
We didn't change the attributes of any object of the immutable classes, but we replaced the first element of the list with a different one, and the list is mutable.
Keep in mind that all the attributes of the class should also be immutable in order to safely work with immutable data classes.
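One way to follow that advice, sketched here under the assumption that the group of people doesn't need to change, is to store them in a tuple instead of a list. The Person class is shortened for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Person:
    name: str
    age: int


@dataclass(frozen=True)
class People:
    people: Tuple[Person, ...]  # a tuple of any length, holding Person objects


two_people = People((Person('Joe', 25), Person('Mary', 43)))

item_assignment_blocked = False
try:
    two_people.people[0] = Person('Joe', 35)
except TypeError:
    # Tuples don't support item assignment.
    item_assignment_blocked = True

print(item_assignment_blocked)
print(two_people.people[0])
```

With this change, neither the attributes of the objects nor the contents of the collection can be replaced after creation.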
Inheritance with dataclasses
The dataclasses module also supports inheritance, which means we can create a data class that uses the attributes of another data class. Still using our Person class, we'll create a new Employee class that inherits all the attributes from Person.
So we have Person:
@dataclass(order=True)
class Person():
    name: str
    age: int
    height: float
    email: str
And the new Employee class:
@dataclass(order=True)
class Employee(Person):
    salary: int
    departament: str
Now we can create an object of the Employee class using all the attributes of the Person class:
print(Employee('Joe', 25, 1.85, '[email protected]', 100000, 'Marketing'))
Employee(name='Joe', age=25, height=1.85, email='[email protected]', salary=100000, departament='Marketing')
From now on, we can use everything we saw in this article in the Employee class as well.
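To check how the attributes are combined, the fields function from the dataclasses module lists them in order. This is a small sketch with shortened classes, so the attribute lists are assumptions for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    name: str
    age: int


@dataclass
class Employee(Person):
    salary: int


employee = Employee('Joe', 25, 100000)

# Attributes of the base class come first, then those of the subclass.
field_names = [f.name for f in fields(Employee)]

print(field_names)
print(isinstance(employee, Person))
```

The base class attributes come first, which explains why the arguments for Person are passed before salary when creating an Employee.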
Take note of the default attributes. Let's say we have default attributes in Person, but not in Employee. This scenario, as in the code below, raises an error:
@dataclass
class Person():
    name: str = 'Joe'
    age: int = 30
    height: float = 1.85
    email: str = '[email protected]'
@dataclass(order=True)
class Employee(Person):
    salary: int
    departament: str
print(Employee('Joe', 25, 1.85, '[email protected]', 100000, 'Marketing'))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_5540/1937366284.py in <module>
9
10 @dataclass(order=True)
---> 11 class Employee(Person):
12 salary: int
13 departament: str
~\anaconda3\lib\dataclasses.py in wrap(cls)
1011
1012 def wrap(cls):
-> 1013 return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
1014
1015 # See if we're being called as @dataclass or @dataclass().
~\anaconda3\lib\dataclasses.py in _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
925 if f._field_type in (_FIELD, _FIELD_INITVAR)]
926 _set_new_attribute(cls, '__init__',
--> 927 _init_fn(flds,
928 frozen,
929 has_post_init,
~\anaconda3\lib\dataclasses.py in _init_fn(fields, frozen, has_post_init, self_name, globals)
502 seen_default = True
503 elif seen_default:
--> 504 raise TypeError(f'non-default argument {f.name!r} '
505 'follows default argument')
506
TypeError: non-default argument 'salary' follows default argument
If the base class has default attributes, all the attributes in the class derived from it must have default values too.
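A minimal sketch of the fix is to give the attributes of the derived class default values as well. The classes are shortened and the default of 0 for salary is an arbitrary choice made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str = 'Joe'
    age: int = 30


@dataclass
class Employee(Person):
    salary: int = 0  # arbitrary default, chosen only for this example


default_employee = Employee()
ann = Employee('Ann', 40, 5000)

print(default_employee)
print(ann)
```

On Python 3.10 and later, passing kw_only=True to the decorator is another option to consider, since keyword-only attributes aren't subject to this ordering rule.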
Conclusion
In this article, we saw how the dataclasses module is a very powerful tool to create data classes in a quick, intuitive way. Although we've seen a lot in this article, the module contains many more tools, and there's always more to learn about it.
So far, we've learned how to:
- Define a class using dataclasses
- Use default attributes and their rules
- Create a representation method
- Compare data classes
- Sort data classes
- Use inheritance with data classes
- Work with immutable data classes