Deep Copying Django Model Instances, What Does It Mean?
UPDATE [2011-09-09]: The django-data-proxy library has been (quickly) superseded by https://github.com/cbmi/django-forkit which solves the more generic problem of creating shallow and deep copies of model instances. It recursively traverses all relationships during a deep fork and defers saving those new objects until the end. It also provides a useful
diff method for comparing objects of the same type.
I have recently come across a need for making a copy of a model instance. There are quite a few hits for solutions on how to create a copy or more interestingly a deepcopy of a model instance.
My particular use case for creating a copy was to create a temporary proxy of the model instance for a session. The idea is to allow for the proxy to be altered without affecting the original instance (saved in the database). Of course one could say.. "just alter the instance without saving it", but I wanted to be able to persist the changes across sessions.
Initial Solution: django-data-proxy
So, I wrote a simple abstract model class to implement a instance-proxy API. It is available on GitHub: https://github.com/cbmi/django-data-proxy. It implements common operations like:
proxy to define the initial proxy instance,
diff to return the difference between the original instance with the proxy,
reset to reset back to the original instance data,
push to apply the differences to the original object, and
clear to return the proxy to it's default values for the model.
Currently the API supports only shallow proxying, i.e basic local fields and local foreign keys. However, I have a stable branch (https://github.com/cbmi/django-data-proxy/tree/deep-proxy) that handles many-to-many and creating proxies for related objects which subclasses the
DataProxyModel. This is the beginnings of API for deep proxying.
Deep Proxy (Copy)
And then I started thinking… what does it mean to perform a deep proxy? What does it mean to perform a deep copy for relational data in general? Starting with a simple example:
class Author(models.Model): first_name = models.CharField(max_length=30) last_name = models.CharField(max_length=30) author = Author('John', 'Doe') author.save()
For basic model like this, a copying an instance is trivial. Simply set the primary key to
None and save:
author.pk = None author.save() author.pk # 2
When dealing with related objects, though, things become a bit ambiguous.
Let's add a
class Book(models.Model): title = models.CharField(max_length=50) author = models.ForeignKey(Author) summary = models.TextField() pub_date = models.DateField() book = Book('John Doe: The Autobiography', author, 'A long, long time ago..', date(2011, 9, 3)) book.save()
What happens to the relationship to
book is copied? It will still reference
author, which is expected. But, what would a deep copy of
It depends, kind of.
In this example it doesn't make sense to have two author entries for John Doe, but for some data models it makes sense to create copies or forks of other data. Social data that can be shared and then subsequently altered needs to be forked to ensure the source data is not changed. In this case, a true deep copy would make sense (distributed version control systems (DVCS) also come to mind).
That being said, from now on I am going say to fork for clarity, since deep copy is somewhat ambiguous.
Reverse Foreign Keys
What about reverse foreign keys e.g.
Book? That is simple, those objects are not included in the fork. Since the
Author model is unaware of the
Book model, it doesn't make sense to include the set of related books. (I would like for someone to share a data model that would argue against statement).
An one-to-one relationships is a foreign key, but with the added constraint. Let's update our
class Book(models.Model): author = models.OneToOneField(Author) ...
The constraint defines exclusivity between a single book and an author. That is, an author cannot be referenced by more than one book.
This complicates things a bit for a shallow copy. The
book copy cannot retain the reference to the same
author since the database would throw a unique constraint error. The simply solution for a shallow copy is to set any one-to-one relations to
None. If the field is nullable, then the copy can be saved, otherwise, as in this example, a new author would need to be defined and set on the book copy.
This is a bit awkward since it's not really a copy, but the unique constraint undermines the implementation.
The final relationship type is many-to-many. Let us update the
Book model one more time.
class Book(models.Model): authors = models.ManyToManyField(Author) ...
In either case, copy or fork, the set of authors for
book can be applied to the
authors = book.authors.all() book.pk = None book.save() book.authors = authors
This makes sense since an entry in the through table defines a unique constraint between
Author. For this reason, it can be assumed that a copy or fork will simply reassign the
authors set to the copy (shown above).
A caveat to this is handling custom
through models since there may be additional data that needs to be copied.
So what does this all mean?
Well first off, simply stating "I want to implement a deep copy for model instances" is not good enough without a real use case. It may not be necessary or not actually representative of a real deep copy. The bigger picture is the fact that handling related objects when attempting perform a deep copy (or proxy in my case) is not trivial. It is not as simple as ensuring nested mutable data structures are copied. I think https://github.com/cbmi/django-data-proxy contains a few solid first steps for solving this problem. If you feel like contributing in any way, you know the deal.