To batch or not to batch

Batch applications are quite common in IT systems: you may never write a whole batch application in your career as a developer, but chances are you’ll have some batch parts in your Web or desktop applications. Batch is about handling high volumes of data, and a lot of things can go wrong or be tricky: poor performance, a very high memory footprint, complex recovery scenarios to avoid stopping a whole batch because of one bad item, etc. Through a simple use case, this article covers different approaches to tackling batch applications. By comparing the runtime behavior of the approaches, we’ll see the benefits of relying on a batch framework like Spring Batch.

The use case: importing XML data into a database

The use case consists of importing data (contacts) from an XML file into a database. Here’s a sample of the data to be imported:

    <?xml version="1.0" encoding="UTF-8"?>
    <contacts>
      <contact>
        <firstname>De-Anna</firstname>
        <lastname>Raghunath</lastname>
        <birthDate>2010-03-04 12:06:45.99 CET</birthDate>
      </contact>
      <contact>
        <firstname>Susy</firstname>
        <lastname>Hauerstock</lastname>
        <birthDate>2010-03-04 12:06:45.99 CET</birthDate>
      </contact>
      <contact>
        <firstname>Kiam</firstname>
        <lastname>Whitehurst</lastname>
        <birthDate>2010-03-04 12:06:45.99 CET</birthDate>
      </contact>
    </contacts>

This is quite a simple use case, but it could be the first step of a real batch application: you import data into a database to leverage the query features of the database engine (e.g. aggregates) and then export the data in a “digested” form. We’ll test the batch with XML files of different sizes: 100 contacts (14 KB), 1000 contacts (123 KB), 5000 contacts (614 KB), 10000 contacts (1.2 MB), 100000 contacts (12 MB), 1000000 contacts (120 MB).
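
The article does not show how the test files were produced. Here is a minimal sketch of one way to generate a file with a given number of contacts. It assumes placeholder names (First0, Last0, ...) instead of realistic ones and reuses the birth date of the sample above; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper, not part of the original article.
class ContactXmlGenerator {

    /** Streams a contacts document with the given number of entries to the writer. */
    static void write(Writer out, int count) throws IOException {
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<contacts>\n");
        for (int i = 0; i < count; i++) {
            out.write("  <contact>\n");
            out.write("    <firstname>First" + i + "</firstname>\n");
            out.write("    <lastname>Last" + i + "</lastname>\n");
            out.write("    <birthDate>2010-03-04 12:06:45.99 CET</birthDate>\n");
            out.write("  </contact>\n");
        }
        out.write("</contacts>\n");
    }

    /** Convenience variant for small documents; large ones should go to a file writer. */
    static String generate(int count) {
        StringWriter out = new StringWriter();
        try {
            write(out, count);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = generate(100);
        System.out.println(xml.length() + " characters for 100 contacts");
    }
}
```

For the larger files, passing a buffered file writer to write(...) avoids building the whole document in memory.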

The 3 approaches

What we really want to compare is a “simple” approach, without any support from a technical framework, and an approach where we use the Spring Batch project, both for its importing features and for the infrastructure it provides for batch applications. We’ll divide the “simple” approach into 2 sub-approaches: the first one uses one database transaction for the whole import (batch size = n) and the second one a database transaction for each row (batch size = 1). Why stick to these simple strategies? Mainly because they are the most commonly used when we write batch applications from scratch, as they are the simplest to implement. We’ll then see that we can easily choose the batch size with Spring Batch.

Approach #1: simple batch (batch size = n)

The first approach is made of the following steps:

  • load the XML file into memory with DOM
  • open a transaction
  • iterate over the DOM tree to extract each contact and insert them into the database

Here is an excerpt of this implementation:

    import java.text.DateFormat;
    import java.text.SimpleDateFormat;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.springframework.transaction.TransactionStatus;
    import org.springframework.transaction.support.TransactionCallbackWithoutResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ImportContacts {

        // (...) fields such as input, jdbcTemplate and transactionTemplate are not shown

        public void run() throws Exception {
            // DOM parsing: the whole XML document is loaded into memory
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(input.getInputStream());
            final DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            final NodeList contacts = document.getElementsByTagName("contact");
            // One transaction around the whole import (batch size = n)
            transactionTemplate.execute(new TransactionCallbackWithoutResult() {
                @Override
                protected void doInTransactionWithoutResult(TransactionStatus ts) {
                    for (int i = 0, n = contacts.getLength(); i < n; i++) {
                        final Element contact = (Element) contacts.item(i);
                        try {
                            jdbcTemplate.update(
                                "insert into contact (firstname,lastname,birthdate) values (?,?,?)",
                                contact.getElementsByTagName("firstname").item(0).getTextContent(),
                                contact.getElementsByTagName("lastname").item(0).getTextContent(),
                                dateFormat.parse(contact.getElementsByTagName("birthDate").item(0).getTextContent())
                            );
                        } catch (Exception e) {
                            // Any failure rolls back the whole import
                            throw new RuntimeException(e);
                        }
                    }
                }
            });
        }
    }

This implementation is perhaps the most straightforward and would be the one used for a batch with low requirements in terms of robustness and scalability. Here are some of the drawbacks of this approach:

  • the XML file is loaded into memory. This can become problematic if its size grows (100s of MB or even more). It will take a lot of memory and could prevent several instances of the batch from running simultaneously.
  • poor exception handling. The explicit catch is only there because of the date parsing. Without it, perhaps we would have forgotten that the import can fail!
  • fragility. If something goes wrong, the transaction is rolled back and the whole import fails. This can actually be a business requirement. But what if the business people realize they don’t want the whole import to fail and want to skip the bad rows? This logic would have to be hard-coded into the batch. This is where this simple implementation is weak: we don’t have a clear boundary between the business logic (here, just the insertion of a contact) and the purely technical code (transaction management, but also skip policy and so on).
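
The date parsing mentioned above has a subtlety worth a small demonstration. The pattern used in the excerpt covers neither the fraction of a second nor the time zone found in the sample data. DateFormat.parse(String) is documented as possibly not using the entire text, so the trailing “.99 CET” is silently ignored and the value is interpreted in the JVM’s default time zone. A truly malformed value, on the other hand, raises a ParseException. The sketch below only illustrates this behavior; the class and helper names are hypothetical.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical demo class, not part of the original article.
class BirthDateParsing {

    /** Parses with the same pattern as the import excerpt. */
    static Date parse(String text) throws ParseException {
        DateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return format.parse(text);
    }

    /** Returns true when the text starts with something the pattern accepts. */
    static boolean isParseable(String text) {
        try {
            parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ParseException {
        // Accepted: parsing stops after the seconds, ".99 CET" is not read
        System.out.println(parse("2010-03-04 12:06:45.99 CET"));
        // Rejected: the text does not follow the pattern at all
        System.out.println(isParseable("04/03/2010"));
    }
}
```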

Approach #2: simple batch (batch size = 1)

The second approach differs from the first one in its transaction management:

  • load the XML file into memory with DOM
  • iterate over the DOM tree to extract each contact, open a transaction for each contact and insert it into the database

Here is an excerpt of the implementation:

    import java.text.DateFormat;
    import java.text.SimpleDateFormat;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.springframework.transaction.TransactionStatus;
    import org.springframework.transaction.support.TransactionCallbackWithoutResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class ImportContacts {

        // (...) fields such as input, jdbcTemplate and transactionTemplate are not shown

        public void run() throws Exception {
            // DOM parsing: the whole XML document is loaded into memory
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(input.getInputStream());
            final DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            final NodeList contacts = document.getElementsByTagName("contact");
            for (int i = 0, n = contacts.getLength(); i < n; i++) {
                final Element contact = (Element) contacts.item(i);
                // One transaction per contact (batch size = 1)
                transactionTemplate.execute(new TransactionCallbackWithoutResult() {
                    @Override
                    protected void doInTransactionWithoutResult(TransactionStatus ts) {
                        try {
                            jdbcTemplate.update(
                                "insert into contact (firstname,lastname,birthdate) values (?,?,?)",
                                contact.getElementsByTagName("firstname").item(0).getTextContent(),
                                contact.getElementsByTagName("lastname").item(0).getTextContent(),
                                dateFormat.parse(contact.getElementsByTagName("birthDate").item(0).getTextContent())
                            );
                        } catch (Exception e) {
                            // Rolls back the current contact only
                            throw new RuntimeException(e);
                        }
                    }
                });
            }
        }
    }

This approach is also straightforward and is actually only a variation of the first one. It could recover from any failure in the handling of a contact and skip it (we didn’t implement this logic for brevity’s sake). It suffers from the same drawbacks (DOM handling, no clear boundary between the business and batch logic). The batch size here is 1 (one row at a time) and we’ll see later that this is not an efficient choice.
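
To give an idea of the skip logic that was left out, here is a minimal sketch of a hand-written version. It is not the article’s code: the ItemHandler interface is a hypothetical stand-in for “insert one contact in its own transaction”, and the skip limit is an assumption added to show how quickly such policies pile up in hand-written batches.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the original article.
class SkippingImport {

    /** Stand-in for the per-item work (e.g. one insert in its own transaction). */
    interface ItemHandler<T> {
        void handle(T item) throws Exception;
    }

    /** Handles items one by one; a failing item is recorded and skipped. */
    static <T> List<T> importAll(List<T> items, ItemHandler<T> handler, int skipLimit) {
        List<T> skipped = new ArrayList<>();
        for (T item : items) {
            try {
                handler.handle(item);
            } catch (Exception e) {
                skipped.add(item);
                if (skipped.size() > skipLimit) {
                    // Too many bad items: give up instead of silently dropping data
                    throw new IllegalStateException("Skip limit of " + skipLimit + " exceeded", e);
                }
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("2010-03-04", "not-a-date", "2011-01-15");
        // Parsing stands in for the real insert; the second row fails and is skipped
        List<String> skipped = importAll(rows, row -> LocalDate.parse(row), 10);
        System.out.println("Skipped rows: " + skipped); // prints: Skipped rows: [not-a-date]
    }
}
```

Even in this small form, the skip policy is tangled with the loop, which is exactly the kind of technical code the business logic should not have to know about.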

Approach #3: Spring Batch

The third and last approach uses Spring Batch. In short, Spring Batch leverages the Spring Framework for assembling your batch applications, but also provides features for common import/export operations and a real technical infrastructure (transaction management, retry/skip policies and so on). Thanks to the Inversion of Control pattern, your business code can be clearly separated from the technical code (handled by Spring Batch). That’s where Spring Batch is very useful, because it helps you make good decisions for your batch applications.

Implementing the batch

Let’s see how Spring Batch can help us implement our import scenario. Spring Batch’s sweet spot is chunk-oriented scenarios (though it is by no means limited to them). A chunk scenario is divided into 3 phases: read, transform and write. This is what we want to do: read data from the XML file and write it into the database. Fortunately, the transformation phase is optional. So here’s the skeleton of our batch configuration:

    <batch:job id="importer">
      <batch:step id="import">
        <batch:tasklet>
          <batch:chunk reader="contactItemReader" writer="contactItemWriter" commit-interval="100" />
        </batch:tasklet>
      </batch:step>
    </batch:job>

Our batch (called a job in Spring Batch’s vocabulary) is made of only one step (“import”). The chunk element refers to the Spring beans in charge of reading and writing the data, respectively. The contactItemReader reads the contacts from the XML file; Spring Batch comes with built-in support for that, based on Spring OXM. Here is the configuration of the contactItemReader bean:

    <bean id="contactItemReader" class="org.springframework.batch.item.xml.StaxEventItemReader">
      <property name="fragmentRootElementName" value="contact" />
      <property name="resource" value="file:./contacts_10000.xml" />
      <property name="unmarshaller" ref="contactMarshaller" />
    </bean>
    <bean id="contactMarshaller" class="org.springframework.oxm.xstream.XStreamMarshaller">
      <property name="aliases">
        <util:map id="aliases">
          <entry key="contact" value="com.zenika.domain.Contact" />
        </util:map>
      </property>
    </bean>
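
The Contact class referenced by the alias is not shown in the article. Here is a minimal sketch of what it might look like, assuming plain fields named after the XML elements and the getters that the ItemWriter shown later in this article calls. The package declaration (com.zenika.domain) is omitted so that the snippet compiles on its own.

```java
import java.util.Date;

// Hypothetical sketch of com.zenika.domain.Contact; the real class is not shown in the article.
class Contact {

    // Field names mirror the child elements of <contact> in the XML file
    private String firstname;
    private String lastname;
    private Date birthDate;

    public String getFirstname() {
        return firstname;
    }

    public void setFirstname(String firstname) {
        this.firstname = firstname;
    }

    public String getLastname() {
        return lastname;
    }

    public void setLastname(String lastname) {
        this.lastname = lastname;
    }

    public Date getBirthDate() {
        return birthDate;
    }

    public void setBirthDate(Date birthDate) {
        this.birthDate = birthDate;
    }
}
```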

Notice that Spring Batch uses StAX, which means that the whole XML file won’t be loaded into memory. Spring Batch will “stream” it instead and send the Contact objects in chunks to the writer. The size of a chunk is the value of the commit-interval attribute defined previously. Spring Batch does ship JDBC-based item writers (JdbcBatchItemWriter, for instance), but writing to the database ourselves is a good opportunity to take a look at Spring Batch’s API and implement our own ItemWriter:

    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.jdbc.core.BatchPreparedStatementSetter;

    public class JdbcContactItemWriter implements ItemWriter<Contact> {

        // (...) the jdbcTemplate field is not shown

        @Override
        public void write(final List<? extends Contact> chunk) throws Exception {
            String sql = "insert into contact (firstname,lastname,birthdate) values (?,?,?)";
            // Send the whole chunk to the database as a single JDBC batch
            jdbcTemplate.batchUpdate(sql, new BatchPreparedStatementSetter() {
                @Override
                public void setValues(PreparedStatement ps, int i) throws SQLException {
                    Contact contact = chunk.get(i);
                    ps.setString(1, contact.getFirstname());
                    ps.setString(2, contact.getLastname());
                    // java.sql.Date is spelled out to avoid any clash with java.util.Date
                    ps.setDate(3, new java.sql.Date(contact.getBirthDate().getTime()));
                }

                @Override
                public int getBatchSize() {
                    return chunk.size();
                }
            });
        }
    }

Our business logic is now clearly in the JdbcContactItemWriter, which will be called by Spring Batch. This is where the Inversion of Control pattern comes into play: our business code is surrounded by Spring Batch, which takes care of technical concerns (batch size, transaction management and so on). Note that the JdbcContactItemWriter uses JdbcTemplate’s batch update feature, which sends the whole chunk in one JDBC batch and is much more efficient than row-by-row inserts. That’s it for the configuration of the batch! This article is not a Spring Batch tutorial, but the following section shows how to configure the Spring Batch infrastructure.

Spring Batch infrastructure

We’ve just seen how to configure a batch, but Spring Batch needs some infrastructure beans to work properly. These beans are configured only once, and any Spring Batch job can then use them. Although we use a real database here (PostgreSQL), and so real transaction management, we use an in-memory JobRepository (Spring Batch uses a JobRepository to store the state of jobs, which can be useful for monitoring but also for restarting failed jobs at the exact step where they failed). Spring Batch also comes with a database implementation of JobRepository. Here is the configuration of our batch infrastructure:

    <bean id="dataSource" class="org.springframework.jdbc.datasource.SingleConnectionDataSource">
      <property name="driverClassName" value="org.postgresql.Driver" />
      <property name="url" value="jdbc:postgresql://localhost:5432/batchcontact" />
      <property name="username" value="app" />
      <property name="password" value="app" />
    </bean>
    <bean id="transactionManager" class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
      <property name="dataSource" ref="dataSource" />
    </bean>
    <bean id="jobRepository" class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
      <property name="transactionManager" ref="transactionManager" />
    </bean>
    <bean id="jobLauncher" class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
      <property name="jobRepository" ref="jobRepository" />
    </bean>

Note that we used the default names for some beans (transactionManager, jobRepository). Job beans will use them by default. It simplifies the configuration but also adds some magic.
As a bonus, here is how to start the job from a main program:

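    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobParameters;
    import org.springframework.batch.core.launch.JobLauncher;
    import org.springframework.context.ApplicationContext;
    import org.springframework.context.support.ClassPathXmlApplicationContext;

    public class ImportSpringBatch {
        public static void main(String[] args) throws Exception {
            // Load the job and infrastructure beans, then launch the "importer" job
            ApplicationContext context = new ClassPathXmlApplicationContext("/import-spring-batch.xml");
            JobLauncher launcher = context.getBean("jobLauncher", JobLauncher.class);
            Job job = context.getBean("importer", Job.class);
            launcher.run(job, new JobParameters());
        }
    }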

About the Spring Batch solution

So, what should we think of the Spring Batch solution?

  • complex. Obviously, it looks more complicated than the simple approaches. This is typical of a framework: the learning curve is steeper.
  • better separation of concerns. Spring Batch provides some support out of the box (XML import) and lets us plug in our own implementations. There’s less boilerplate code than in the simple approaches.
  • flexibility. The batch size can be adjusted without changing the code, and we can switch from XStream to JAXB2 or even provide our own marshaller.
  • robustness. As Spring Batch uses StAX, the size of the XML file should not be a problem. We can also change the batch size easily. We didn’t cover all the features that Spring Batch provides for robustness: retry and skip policies, restarting, reacting to the lifecycle of a step (success, failure), etc. All these features can be added declaratively.
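
To make the StAX point concrete, here is a minimal sketch of streaming through a contacts document with the JDK’s own StAX API (javax.xml.stream). This is not how StaxEventItemReader is implemented; it only illustrates the pull-parsing idea, where elements are handed over one at a time instead of building a full tree. The class and method names are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class, not part of the original article.
class StaxContactScan {

    /** Pulls events one by one and passes each last name to the callback as soon as it is read. */
    static void forEachLastName(Reader xml, Consumer<String> callback) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "lastname".equals(reader.getLocalName())) {
                        // getElementText() consumes the text up to the matching end tag
                        callback.accept(reader.getElementText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<contacts>"
                + "<contact><firstname>Susy</firstname><lastname>Hauerstock</lastname></contact>"
                + "<contact><firstname>Kiam</firstname><lastname>Whitehurst</lastname></contact>"
                + "</contacts>";
        forEachLastName(new StringReader(xml), System.out::println);
    }
}
```

Because nothing is accumulated between two callbacks, memory use does not depend on the number of contacts in the file, unlike with the DOM-based approaches.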

It’s time to compare the solutions at runtime now!

Running the batch

I ran the three approaches on the same files, with PostgreSQL as the database. For Spring Batch, I used a simple rule for the batch size: the number of rows divided by 100 (so each run commits 100 chunks). This is perhaps too simplistic, but bear in mind that we can still adjust it easily if we know the number of contacts in the incoming file. I won’t say that this benchmark is the most reliable in the world, but it helps to identify some trends. I only measured the execution time; here are the results:

Number of contacts        100        1 K        5 K        10 K       100 K      1 M
Simple (batch size = n)   0.35 s     1.01 s     2.63 s     3.54 s     20.39 s    180.2 s
Simple (batch size = 1)   0.77 s     3.28 s     11.2 s     21.16 s    179.06 s   1815.95 s
Spring Batch              1.64 s     1.87 s     3.48 s     4.32 s     12.05 s    84.99 s
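
The raw timings are easier to compare once converted into throughput. The small sketch below does this for the 1 M column, using only the figures from the table. It works out to roughly 5,500 rows per second for the single transaction, 550 for one transaction per row and 11,800 for Spring Batch. The class name is hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, not part of the original article.
class Throughput {

    /** Rows imported per second for a given run. */
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    public static void main(String[] args) {
        long rows = 1_000_000;
        String[] labels = {"Simple (batch size = n)", "Simple (batch size = 1)", "Spring Batch"};
        double[] seconds = {180.2, 1815.95, 84.99}; // 1 M column of the table above
        for (int i = 0; i < labels.length; i++) {
            System.out.printf(Locale.ROOT, "%-24s %8.0f rows/s%n",
                    labels[i], rowsPerSecond(rows, seconds[i]));
        }
    }
}
```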

A quick analysis shows that the simple approaches are good for small datasets but do not scale, especially the one that creates a transaction for each row. Spring Batch becomes more efficient when the dataset gets bigger (100 K and more). Perhaps we could get even better results by tuning the batch size or by trying another marshaller. I didn’t analyze the memory footprint, but we can guess that Spring Batch’s should be smaller and smoother, as it does not load the whole dataset into memory.
This suggests that parameters like the batch size are relevant when tuning batch applications, and therefore that such policies should not be hard-coded but rather externalized or even handled by a framework.
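
To illustrate what externalizing the batch size means, here is a minimal hand-written sketch of a chunked loop. It is not Spring Batch’s implementation: the reader is a plain Iterator, the writer a plain Consumer, and the commit.interval system property is a hypothetical name chosen for the example. With a real database, each call to the writer would be wrapped in one transaction, so 1 M items give 1,000,000 transactions with an interval of 1, a single one with an interval of 1 M, and 100 with the rule used in the benchmark.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not part of the original article.
class ChunkedImport {

    /**
     * Reads items one at a time and hands them to the writer in chunks of at most
     * commitInterval items. Returns the number of chunks written.
     */
    static <T> int run(Iterator<T> reader, Consumer<List<T>> writer, int commitInterval) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be >= 1");
        }
        int chunks = 0;
        List<T> chunk = new ArrayList<>();
        while (reader.hasNext()) {
            chunk.add(reader.next());
            if (chunk.size() == commitInterval || !reader.hasNext()) {
                // A transaction boundary would go around this call
                writer.accept(new ArrayList<>(chunk));
                chunk.clear();
                chunks++;
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // The interval comes from outside the code, e.g. -Dcommit.interval=50
        int commitInterval = Integer.getInteger("commit.interval", 100);
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            items.add(i);
        }
        int chunks = run(items.iterator(), chunk -> { /* insert the chunk here */ }, commitInterval);
        System.out.println(chunks + " chunks for " + items.size() + " items");
    }
}
```

Even this small loop has to deal with the last, partial chunk; retry, skip and restart would each add more technical code around the same business logic.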

Summary

This article compared different ways to tackle batch applications. The use case is simple but rather representative. Simple, hand-written approaches work, but they can require a lot of work and do not end up with a good separation of concerns. Scaling can also become tricky, and a framework like Spring Batch helps in making good decisions. It also provides support classes for common use cases and helps implement robust behaviors (retry, skip, batch transaction management) in a flexible way.
The next question is: yet another framework to learn? If you have a really simple batch, you can skip Spring Batch, but these cases are actually quite rare. You’ll usually end up with more complicated cases, where you need to take different paths if some steps in the batch fail, have to deal with complex skip policies and so on. And developing your own batch framework to handle these cases is perhaps the most common trap. So if you start a batch application, I would advise taking a quick look at Spring Batch!

6 thoughts on “To batch or not to batch”

  • 15 March 2010 at 12:09

    What makes Spring batch faster for a huge set of data ?

  • 15 March 2010 at 13:17

    in this case, the batch size. If we take the 1 million sample:

      – batch size = n (approach #1) means having a transaction with a large rollback segment, which can take some time for the database to process

      – batch size = 1 (approach #2) means having 1 M transactions, which is just long (start transaction, establish the context, commit 1 M times)

    So the batch size is a very interesting configuration parameter to have, which basically means having some kind of framework like Spring Batch to handle it (if we don’t want to implement it ourselves). I didn’t test it but the Spring Batch approach could even be faster by trying to find the « best » batch size (in a real system, it would depend on the average size of the incoming dataset).

  • 28 March 2010 at 14:47

    Thanks for this post.

    There is a (Interesting) typo in the « About the Spring Batch solution » section : you used « curse » instead of « curve » 🙂

    Cheers.

  • 26 April 2010 at 8:34

    corrected  🙂

  • 27 July 2010 at 7:51

    want to check, you injected your business database transactionManager into your jobRepository.
    Is this configuration correct?
    If there is a rollback/commit in your jobRepository, it will affect your business database?

    Should there be a separate transactionManager for the jobRepository or can they share the same one?

  • 20 August 2010 at 10:40

    @kokwai

    the job repository and the business database must share the same transaction manager, otherwise there’s a window where the job metadata and the business data could not be updated atomically. This could happen in the case of errors and would be problematic for a re-start.
