GWT / GAE development blog Articles about GWT/GAE and the development of TeamScape: A sports team portal

AppEngine datastore

Previous: Introduction

This first section of interview questions focuses on the AppEngine datastore and what it offers. The idea is that developers who are considering AppEngine get a chance to hear what some experienced developers have to say about the datastore.

What are the biggest strengths of the AppEngine datastore?

Jeff Schnitzer (objectify)

  • That someone else manages it for you.  Server maintenance, backups, capacity, hardware failures, etc. are all someone else's problem.
  • That it's schemaless.  It's actually pretty easy to make structural changes to your data on-the-fly with no downtime. Despite some significant schema changes, I've only taken Mobcast (my "day job") offline once, and that had nothing to do with the datastore.
  • That scaling the datastore is not an issue.  I feel pretty secure knowing that even if 10 million users all switch on my application tomorrow, they will all get about the same experience that users get today.  It means that I can spend every day writing new features instead of making sure the old ones continue to work.

John Patterson (Twig)

  • No need to administer your own servers - Hard problems like clustering, security and hardware maintenance are taken off your hands.
  • Scalability - Confidence that as your app grows in popularity you will have very few problems keeping pace. The limitations actually prohibit architectural practices that do not scale.
  • Price - Free to get started and investigate whether the platform suits your requirements. Very competitively priced as your resource demands increase.

Ignacio Coloma (SimpleDS)

You can deploy today, test your application and then turn your cell phone off for the weekend, knowing that if it goes down there will be an entire team working hard to get it back online again. It also has a really simple, straight-to-the-point set of features.

What about biggest weaknesses?

Jeff Schnitzer (objectify)

My biggest complaint, by far, is the lack of ad-hoc queries. In an RDBMS world you get used to firing up a SQL prompt and asking certain kinds of arbitrary questions about your data. It may take an hour to table scan a billion records, but you eventually get your answer.

The GAE datastore is great for operational systems, but it's not very convenient for analytics. You can iterate through your dataset 30s at a time using the task queue, but even simple questions require a fair amount of code.
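To make the "30s at a time" point concrete, here is a minimal plain-Java sketch of answering one simple question in chunks. It is not GAE code: the in-memory list stands in for the dataset, an integer offset stands in for a query cursor, and the names ChunkedScan, scanChunk and countMatching are made up for illustration. On GAE each loop iteration would be a separate task that re-enqueues itself with the cursor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class ChunkedScan {
    /** Hypothetical result of one task: matches found and where to resume. */
    static final class ChunkResult {
        final long matches;
        final int nextCursor; // -1 once the whole dataset has been visited

        ChunkResult(long matches, int nextCursor) {
            this.matches = matches;
            this.nextCursor = nextCursor;
        }
    }

    /** Visits at most `budget` records starting at `cursor`, as one task would. */
    static <T> ChunkResult scanChunk(List<T> data, int cursor, int budget, Predicate<T> question) {
        if (budget <= 0) {
            throw new IllegalArgumentException("budget must be positive");
        }
        int end = Math.min(data.size(), cursor + budget);
        long matches = 0;
        for (int i = cursor; i < end; i++) {
            if (question.test(data.get(i))) {
                matches++;
            }
        }
        return new ChunkResult(matches, end < data.size() ? end : -1);
    }

    /** Chains chunks until the cursor runs out and sums the partial answers. */
    static <T> long countMatching(List<T> data, int budget, Predicate<T> question) {
        long total = 0;
        int cursor = 0;
        while (cursor >= 0) {
            ChunkResult chunk = scanChunk(data, cursor, budget, question);
            total += chunk.matches;
            cursor = chunk.nextCursor;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> ages = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ages.add(i % 60);
        }
        // The SQL-prompt equivalent: SELECT COUNT(*) FROM users WHERE age >= 18
        System.out.println(countMatching(ages, 100, age -> age >= 18));
    }
}
```

Even in this toy form, a one-line SQL question turns into a cursor, a partial result and a loop, which is the inconvenience being described.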

Perhaps the looming map/reduce feature will address this issue, but it's hard to imagine it being as convenient as a SQL prompt. A SQL engine that you can run on top of an existing dataset and perform complex queries, no matter how slowly, would be extraordinarily useful. Unfortunately the folks building Cloud2db don't seem to be thinking along these lines - their solution is much more of an "all or nothing" approach.

My second biggest complaint is that the datastore is actually kinda slow. I read boasts that applications on Cassandra, Tokyo Tyrant, and MongoDB get more than 10k writes per second *per instance*. I'm lucky to get 100 writes per second in a GAE request.  Two orders of magnitude difference!

Other popular complaints are the lack of usable transactions and the lack of (automatic) joins. Not every application needs these things, so I'm more inclined to call them "constraints" rather than "weaknesses".

John Patterson (Twig)

The datastore requires a complete change in mindset and a new toolset.  Not having the ability to run complex queries means you need to pre-calculate results that you would be used to querying on the fly.
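A small sketch of what "pre-calculate results" can mean in practice, assuming a made-up example of players and teams. The maps below are in-memory stand-ins for stored entities, and PrecomputedCounts, savePlayer and playersIn are hypothetical names. The count is maintained when data is written, so reading it never needs a query that scans rows.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a datastore; every name here is hypothetical. */
final class PrecomputedCounts {
    private final Map<Long, String> playerTeam = new HashMap<>();  // player id -> team
    private final Map<String, Integer> teamSize = new HashMap<>(); // team -> stored count

    /** Writes a player and adjusts the stored counts in the same step. */
    void savePlayer(long playerId, String team) {
        String previous = playerTeam.put(playerId, team);
        if (team.equals(previous)) {
            return; // same team as before, counts are already right
        }
        if (previous != null) {
            teamSize.merge(previous, -1, Integer::sum); // player left the old team
        }
        teamSize.merge(team, 1, Integer::sum);
    }

    /** Reads the answer directly instead of counting at query time. */
    int playersIn(String team) {
        return teamSize.getOrDefault(team, 0);
    }

    public static void main(String[] args) {
        PrecomputedCounts store = new PrecomputedCounts();
        store.savePlayer(1L, "reds");
        store.savePlayer(2L, "reds");
        store.savePlayer(2L, "blues"); // transfer
        System.out.println(store.playersIn("reds") + " " + store.playersIn("blues")); // 1 1
    }
}
```

The trade-off is that every write path has to know about every derived value, which is part of the change in mindset.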

Frequent loading requests - traditional frameworks such as Wicket and Spring assume they can do work upfront to reduce per-request effort. Engine apps need to start fast and amortise initialisation work over many requests, or users are left waiting during the frequent loading requests.  This is probably the issue that hits more projects than any other. Often developers get quite far through the development process before being shocked to realise that their initial framework choice has made their application unusable.

Transactions are very limited and your data model must be structured to use them in a particular way from the start.

No https support for your domain and nothing a developer can do to remedy this themselves – apart from going elsewhere for this functionality.  The more general issue is that if you need something the platform doesn't offer you cannot simply install it yourself.

Many Java libraries are incompatible - there is no SOAP RPC implementation that runs on the Engine.

Ignacio Coloma (SimpleDS)

It has a really simple set of features, which quite often makes the "not allowed to do" list rather long. This is where third-party libraries should extend the capabilities of GAE.

What are the problems with the JDO/JPA solutions?

Jeff Schnitzer (objectify)

The biggest problem is that they're clumsy and unwieldy.  JDO contains a bazillion annotations, configuration options, and query language structures to allow it to straddle every conceivable kind of datastore, from an RDBMS to an object database.  JPA is tailored specifically to the feature set of an RDBMS.  The GAE datastore is something new... so JDO gets a whole new arsenal of special extensions and JPA just gets left out in the cold.

In the context of GAE, 80% of JDO and JPA are meaningless cruft, and the GAE-specific features that you *do* need have tortured syntax.  Do you really want to type things like @Extension(vendorName = "datanucleus", key = "gae.unindexed", value="true") everywhere? The syntax for batch gets - one of the most important operations on the datastore - is atrocious:

Query q = pm.newQuery("select from " + Book.class.getName() + " where :keys.contains(key)");

Using JDO/JPA on GAE effectively requires climbing not only the learning curve for JDO/JPA (and figuring out which 20% works), but also learning how the datastore works and how to bend JDO/JPA around it.  It's just not worth it - and I say this as someone who has been working extensively with Hibernate and JPA since the days of Hibernate 1.0.

Aside from the general complexity issue, there are some specific problems with JDO/JPA on GAE.  GAE has some interesting nuances (parent keys & entity groups, partial indexes, collection properties) that don't have direct mappings in either JDO or JPA.  JPA is particularly troublesome; DataNucleus uses the bytecode enhancer, but JPA doesn't have a detach() method - so you can't serialize your entities, ever!

John Patterson (Twig)

JDO is a one-size-fits-all interface that really doesn't fit the Engine so well at all.  Doing simple things that are specific to the datastore becomes incredibly complex and not at all intuitive.  You will spend a large portion of your development time hunting for the right annotations to use or figuring out if a feature is implemented.  Often you will be disappointed - no support for polymorphism?  Come on, this is Java!

More generally, adhering to a spec limits how the implementation can utilize the underlying features of the datastore.  In some cases this leads to unnecessarily terrible performance.  Storing Lists of instances must update every member Entity if the order is changed so that behaviour adheres to the spec.

Startup time is terrible.

Ignacio Coloma (SimpleDS)

With JDO you are going through an extensive API designed for the features present in a relational database, or worse, any kind of datastore. GAE has a limited nature that basically chops off all those features behind the curtains. To me, it would be more natural to learn the things that are allowed in GAE instead of all the things that are forbidden (which again, is a long list).

The DataNucleus JPA implementation has JDO underneath it, so you get Yet Another Layer to understand and debug.

Are there any situations where you would recommend a developer to use JDO/JPA rather than a third party framework?

Jeff Schnitzer (objectify)

I'm tempted to say:  "If you expect to port your system to another platform, you should probably use JDO or JPA." However, the first retort that pops into my mind is "If you expect to port your system to another platform, you should stop using GAE right now." There are quite a lot of issues involved in migrating a GAE app to another environment and the datastore is just one of them. If you aren't sure that GAE will work for you, keep reading the documentation until you *are* sure one way or another. You will save yourself a lot of pain later on.

It's not that JDO/JPA doesn't have some nice features (change detection and cascading deletes, for example), it's just that I don't think these features are worth the trouble.  On the other hand, your apps may have entirely different demands than my apps, and maybe to you those features are golden.

If you're new to AppEngine, I would recommend developing a spike solution with Objectify, even for JDO/JPA experts who intend to develop with JDO/JPA full time.  Objectify will teach you how the datastore really works, knowledge that will arm you well for your future struggles.

John Patterson (Twig)

If you have more time and money than you know what to do with, I suggest wasting both by using JDO/JPA.

If you have an existing app that already uses JDO or JPA you may get some reuse of your investment by sticking with them.  But in reality your data model will need to change significantly, many if not all of your queries will need to be restructured, transactions will not work in the same way ... basically you will need to rewrite in any case.

If you want to keep your options open to move your app to another platform, I still would not recommend it.  You will not be able to reuse your data models because of all the Engine-specific hacks and dependencies you are forced to rely on.

Twig data models are more portable than JDO models because they are pure POJOs with no dependencies on the low-level datastore or JDO.  Think of them as the DTOs that developers often need to write to hide implementation details.  It's just that you persist these pure DTOs directly.

Some apps are just not possible to build with JDO-GAE due to its lack of performance optimisations.  My experiments with JDO showed it was not possible to get the performance I needed for my app to be practical.  The same logical query that could not complete within the 30 second deadline using JDO now runs in 250ms thanks to Twig's embedded collections and parallel queries.  Other users have found that queries they had to run as background tasks previously now run in real time.
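For readers unfamiliar with the idea of parallel queries, here is a generic plain-Java sketch of the pattern: start several independent lookups at once and merge the results once all have finished. This is not how Twig implements it and uses no GAE API. ParallelQueries and runAll are hypothetical names, and each Supplier stands in for one query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

final class ParallelQueries {
    /** Starts every query at once, then merges results in the order the queries were given. */
    static <T> List<T> runAll(List<Supplier<List<T>>> queries) {
        List<CompletableFuture<List<T>>> running = new ArrayList<>();
        for (Supplier<List<T>> query : queries) {
            running.add(CompletableFuture.supplyAsync(query)); // runs on the common pool
        }
        List<T> merged = new ArrayList<>();
        for (CompletableFuture<List<T>> future : running) {
            merged.addAll(future.join()); // waits only for what is still unfinished
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Supplier<List<String>>> queries = new ArrayList<>();
        queries.add(() -> Arrays.asList("hotel-1", "hotel-2")); // stand-in for one filter
        queries.add(() -> Arrays.asList("hotel-7"));            // stand-in for another
        System.out.println(runAll(queries));
    }
}
```

With this shape the total wait is roughly the slowest query rather than the sum of all of them, which is the kind of saving the timings above hint at.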

Ignacio Coloma (SimpleDS)

People aiming to make applications independent from the datastore will find some value in JDO/JPA. This is a case that I have trouble imagining, since your application will end up tied to AppEngine one way or the other, but the greatest degree of independence comes from going through JPA/JDO.

Which problems do you see with using the low-level API directly?

Jeff Schnitzer (objectify)

  • Untyped data structures.
  • Clunky query API.
  • Your data gets unceremoniously munged in odd ways (i.e., all Integers go to Long, all collections to ArrayList).
  • Poor documentation.
  • It's simply not fun.
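The "untyped" and "munged" complaints are easy to reproduce without any GAE code. In the sketch below a plain Map stands in for a low-level entity, and roundTrip is a hypothetical helper that mimics the Integer-to-Long widening mentioned in the list above; it is not the datastore's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class UntypedBag {
    /** Hypothetical store-and-load step that widens Integer to Long. */
    static Map<String, Object> roundTrip(Map<String, Object> properties) {
        Map<String, Object> stored = new HashMap<>();
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Integer) {
                value = ((Integer) value).longValue(); // autoboxed to Long
            }
            stored.put(entry.getKey(), value);
        }
        return stored;
    }

    /** Reads a numeric property defensively instead of casting to the type that was written. */
    static int readInt(Map<String, Object> bag, String name) {
        Object value = bag.get(name);
        if (!(value instanceof Number)) {
            throw new IllegalStateException(name + " is missing or not numeric: " + value);
        }
        return ((Number) value).intValue();
    }

    public static void main(String[] args) {
        Map<String, Object> player = new HashMap<>();
        player.put("name", "Ann");
        player.put("shirtNumber", 7); // written as an Integer

        Map<String, Object> loaded = roundTrip(player);
        System.out.println(loaded.get("shirtNumber").getClass().getSimpleName()); // Long
        try {
            Integer number = (Integer) loaded.get("shirtNumber"); // compiles, fails at runtime
            System.out.println(number);
        } catch (ClassCastException e) {
            System.out.println("cast failed"); // this branch runs
        }
        System.out.println(readInt(loaded, "shirtNumber")); // 7
    }
}
```

Nothing in the compiler catches the bad cast, which is why helpers like readInt (or a mapping framework) tend to appear in every project that uses untyped entities directly.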

John Patterson (Twig)

You basically have to deal with bags of untyped values.  There is a reason we program in languages like Java with class constructs and not just Maps of Objects.

Transactions are handled inconsistently.  Due to historical reasons, all queries ignore the current transaction even if they include an ancestor filter.  This could not be changed without breaking existing low-level code.  Without that restriction Twig is free to always use the current transaction in every operation including queries.

If you have your own dynamic data models then they can map very well to low-level Entities.  But Twig can also be configured to store any type as Entities and Properties exactly as you want using its flexible PropertyTranslator chains.

Ignacio Coloma (SimpleDS)

There are no checks at all. You may query for a field that is not there anymore, or use a Key referencing the wrong kind. Of course, the biggest problem is using HashMaps to handle Entity data instead of Java classes.

There are a number of other frameworks currently available (objectify, Twig, SimpleDS, Slim3, Siena, cloud2db). If these existed when you decided to create your own framework, did you know of them? If so, why did these not live up to your requirements?

Jeff Schnitzer (objectify)

Yes, I reviewed all of these in fairly fine detail.  I would much rather build applications than tools.  I actually started working on patches to Siena in hopes of making it meet my needs, but Siena's cross-platform nature ran at odds with my need to squeeze every last ounce of performance out of the GAE datastore.

The biggest problem I had with the others is key management.  My entities have simple ids; many have natural integer keys (facebook ids, place ids).  I want to use simple POJO classes like this:

class Foo {
    @Id long id;
    String someData;
    // ...etc
}

Placing the datastore Key class (or even a stringified version of the Key) in the entity carries redundant information and complicates working with the data.  JPA entities have simple ids, why can't mine? SimpleDS and Slim3 force the Key into the entity. Siena gets it right but can't handle entity groups.

Another issue was lack of deep support for generics.  It was tricky to weave a generic Key<?> class throughout Objectify's API, but the payoff is vastly more readable application code.
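To illustrate why a generic key pays off in readability, here is a minimal sketch of the idea in plain Java. It is not Objectify's Key class or API: TypedKeyDemo, Key, Store, Team and Player are all invented for this example, and the store is just an in-memory map. The entity keeps a simple long id, while the key carries the entity type so lookups need no casts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class TypedKeyDemo {
    /** Hypothetical typed key: the entity class travels with the plain numeric id. */
    static final class Key<T> {
        final Class<T> kind;
        final long id;

        Key(Class<T> kind, long id) {
            this.kind = Objects.requireNonNull(kind);
            this.id = id;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Key)) {
                return false;
            }
            Key<?> that = (Key<?>) other;
            return kind.equals(that.kind) && id == that.id;
        }

        @Override
        public int hashCode() {
            return Objects.hash(kind, id);
        }
    }

    /** In-memory stand-in for the datastore. */
    static final class Store {
        private final Map<Key<?>, Object> rows = new HashMap<>();

        <T> void put(Key<T> key, T entity) {
            rows.put(key, entity);
        }

        <T> T get(Key<T> key) {
            return key.kind.cast(rows.get(key)); // null when nothing is stored
        }
    }

    static final class Team {
        final long id;
        final String name;

        Team(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    static final class Player {
        final long id;

        Player(long id) {
            this.id = id;
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        Key<Team> teamKey = new Key<>(Team.class, 42L);
        store.put(teamKey, new Team(42L, "Reds"));

        Team team = store.get(teamKey);     // no cast at the call site
        // Player p = store.get(teamKey);   // would not compile: wrong kind
        System.out.println(team.name);      // Reds

        Key<Player> playerKey = new Key<>(Player.class, 42L);
        Player missing = store.get(playerKey); // same id, different kind
        System.out.println(missing);           // null
    }
}
```

The commented-out line is the point: handing a Team key to code that expects a Player is rejected by the compiler instead of surfacing at runtime.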

And then there are aesthetics.  Other than Siena, I just couldn't bring myself to love the other APIs.  Yes, it's subjective... but here are the money shots, you can judge for yourself:

http://objectify-appengine.googlecode.com/svn/trunk/javadoc/com/googlecode/objectify/Objectify.html
http://slim3.googlecode.com/svn/trunk/slim3/javadoc/org/slim3/datastore/Datastore.html
http://loom.sourceforge.net/docs/simpleds/javadoc/org/simpleds/EntityManager.html
http://code.google.com/p/twig-persist/source/browse/src/main/java/com/vercer/engine/persist/ObjectDatastore.java (no javadocs)

John Patterson (Twig)

Although Twig is the most recently released of the three interfaces discussed here, pre-release versions were made public that focused on getting the core functionality working well and fast.  Only in the recent final release were the user-friendly fluent commands introduced, with the benefit of seeing the strengths and weaknesses of the other APIs as a base.

One key difference in the APIs is the use of fluent Commands over a flat Query interface.  The difference is analogous to storing all your files in a single flat directory or organising them into folders.  This allows Twig to add lots of functionality that does not clutter the API and also leaves room for new functionality to be added over time. When designing these commands, the main thought process was "is this obvious for someone who does not already know the low-level API?".  To this end, method invocations read like sentences:

datastore.find().type(Hotel.class)
    .withAncestor(chain)
    .fetchResultsBy(50)
    .continueFrom(myCursor)
    .startFrom(100)
    .addFilter(...)
    .returnResultsNow(); // returnKeysLater() performs a parallel query.

Some other frameworks seem to value having short names to reduce key strokes as more important.
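For anyone who has not built a fluent interface before, the mechanics are simple: each option method records a setting and returns the builder, and a final method executes. The sketch below shows this over an in-memory list. It is not Twig's API; FluentFind and its method names (from, where, skipFirst, takeAtMost, listNow) are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical fluent command over an in-memory list. */
final class FluentFind<T> {
    private final List<T> source;
    private Predicate<T> filter = item -> true;
    private int offset = 0;
    private int limit = Integer.MAX_VALUE;

    private FluentFind(List<T> source) {
        this.source = source;
    }

    static <T> FluentFind<T> from(List<T> source) {
        return new FluentFind<>(source);
    }

    /** Adds a condition; repeated calls are combined with AND. */
    FluentFind<T> where(Predicate<T> extra) {
        filter = filter.and(extra);
        return this;
    }

    /** Skips the first `count` matching items. */
    FluentFind<T> skipFirst(int count) {
        offset = count;
        return this;
    }

    /** Caps the number of returned items. */
    FluentFind<T> takeAtMost(int count) {
        limit = count;
        return this;
    }

    /** Terminal step: runs the command and returns the matching items. */
    List<T> listNow() {
        List<T> results = new ArrayList<>();
        int skipped = 0;
        for (T item : source) {
            if (!filter.test(item)) {
                continue;
            }
            if (skipped < offset) {
                skipped++;
                continue;
            }
            if (results.size() >= limit) {
                break;
            }
            results.add(item);
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<Integer> page = FluentFind.from(numbers)
                .where(n -> n % 2 == 0)
                .skipFirst(1)
                .takeAtMost(2)
                .listNow();
        System.out.println(page); // [4, 6]
    }
}
```

New options can be added as further methods without changing existing call sites, which is the "folders instead of one flat directory" argument in code form.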

All the other frameworks make you write data models specifically designed for and dependent on the datastore.  Working with low-level Keys makes code messy and less maintainable, so right from the start Twig was designed to handle Keys completely transparently.  This leaves your data models free of any dependencies on the low-level interface.

Ignacio Coloma (SimpleDS)

I searched for alternatives before starting up SimpleDS and couldn't find any. Anyway, I was looking for the simplest possible solution for GAE storage, not for a full JDO replacement.

My 2 cents

I had only worked with relational databases before I decided to try out AppEngine. I agree with all of the authors' opinions here, but especially with John Patterson's comment about mindset. Not only did I go from standard relational SQL databases to BigTable, but also from a standard PHP/HTML approach to Google Web Toolkit. Sometimes it feels like I would need behavioral therapy to:

  1. Stop thinking about my model relationships in relational/SQL terms.
  2. Stop thinking in terms of pages and page navigation when using GWT. There are no pages (well, one perhaps)! Embrace the dynamic world with GWT history!

I still love GWT + GAE though (well, besides the cold start time that is driving me a bit crazy). It's a great way for startups or hobby projects to get started without investing money in webservers, etc.

Next: The frameworks
