Integrating large data sets with Salesforce.com has always been a challenge, one that often requires a lot of trial and error and some creative caching techniques. I have written a number of integrations that attempted to synchronize large data sets, and while they worked, they didn't scale well: as more data was added to the system, the integration became increasingly inefficient, and in some cases it was crippled by the sheer volume of data.
I wanted to share some of the techniques I have used to integrate large data sets, and maybe you can offer some of your own thoughts or techniques on how you have solved this issue.
The first technique, memory caching, involves downloading records and building a data structure in memory so the data can be referenced later to retrieve the Salesforce Id or other key information. In C# you would use a Dictionary object to create a key-value map in memory.
Pros:

- Data is stored in memory for quick access.
- If stored globally, the cache can be shared across multiple processes, reducing the need for additional queries to Salesforce.com and improving performance.
- Easy to develop and maintain.

Cons:

- The memory footprint grows with the number of records.
- Building the dictionary for a large data set can be very time consuming.
- Very large data sets can cripple the hosting server by exhausting its memory.
Reviewing the pros and cons, it's safe to say that memory caching is best used with smaller data sets, or with data sets whose size you can reasonably control. This might mean looping through your source data ahead of time so that you only query back the records you will eventually use.
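As a minimal sketch of the idea (shown here in Python; the C# equivalent would be a Dictionary, as mentioned above), records queried from Salesforce are loaded into an in-memory map keyed by an external Id so that later lookups avoid extra API calls. The record data and field names below are hypothetical stand-ins for a real query result.

```python
# Build an in-memory lookup of Salesforce Ids keyed by an external Id.
# The query results below are stubbed; a real integration would page
# through the API to fetch them before building the cache.
def build_id_cache(records):
    """Map each record's external key to its Salesforce Id."""
    cache = {}
    for record in records:
        cache[record["External_Key__c"]] = record["Id"]
    return cache

# Stubbed query results standing in for a Salesforce API response.
queried = [
    {"Id": "001A000001", "External_Key__c": "CUST-100"},
    {"Id": "001A000002", "External_Key__c": "CUST-101"},
]

id_cache = build_id_cache(queried)

# Later, resolve a Salesforce Id without another round-trip to the API.
sfdc_id = id_cache.get("CUST-100")
```

The trade-off is exactly the one listed above: every cached record occupies server memory for the lifetime of the process.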
The second technique, database caching, involves downloading records into a database rather than keeping them in memory. This can be done at runtime when the integration executes, or in a separate process that replicates data from SFDC on a scheduled basis.
Pros:

- You can utilize joins and other efficient database techniques to prepare your data for upload to SFDC.
- The integration runs faster and can process large record sets in a more efficient and timely manner.

Cons:

- Architecture complexity increases, since a failure in the replication process can affect data integrity or the reliability of the integration program.
- The server hosting the database needs to be maintained and monitored in addition to the integration program.
The pros and cons show that this technique is best suited to caching larger data sets. Because of its cost in resources and monitoring, it should be implemented where there is a team in place that can manage this type of integration and troubleshoot any issues that arise in the replication processes.
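A rough sketch of the database-caching approach, using Python's built-in sqlite3 in place of a full replication database. Replicated Salesforce records are joined against a local staging table to prepare rows for upload; the table layout, object names, and values are invented purely for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the replicated cache;
# a real setup would use a server database populated on a schedule.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sfdc_accounts (sfdc_id TEXT, external_key TEXT)")
conn.execute("CREATE TABLE staging_orders (external_key TEXT, amount REAL)")

# Replicated Salesforce data and local source data (both hypothetical).
conn.executemany("INSERT INTO sfdc_accounts VALUES (?, ?)",
                 [("001A000001", "CUST-100"), ("001A000002", "CUST-101")])
conn.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                 [("CUST-100", 250.0), ("CUST-101", 75.5)])

# A single join resolves every order's Salesforce account Id in one
# pass, instead of issuing one API query per record.
rows = conn.execute("""
    SELECT a.sfdc_id, o.amount
    FROM staging_orders o
    JOIN sfdc_accounts a ON a.external_key = o.external_key
""").fetchall()

# rows is now ready to be shaped into an upload batch for SFDC.
```

This is where the join-based preparation mentioned in the pros pays off: the matching work happens in the database engine rather than in repeated API round-trips.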
The third technique, multi-threading, attempts to improve performance by spawning multiple threads that download or upload data to and from Salesforce in parallel. I have had mixed results with it, with more failures than successes at building a reliable integration that could scale to more complex business logic.
I have also had instances where each of the spawned threads took a fair amount of time to complete, and Salesforce killed those sessions, which made the integration program unreliable.
Pros:

- Improves performance by allowing transactions to run in parallel.
- Allows programs to process larger data sets by downloading or uploading data in parallel.

Cons:

- Complex to develop and troubleshoot.
- In my experience, unreliable beyond 2-3 threads.
In my experience, multi-threading just isn't a good option for integrations with Salesforce.com. I have used it with varying success, mostly when trying to get large amounts of data uploaded to Salesforce within a certain window of time. With the GA release of the Bulk API, the need for multi-threaded integrations to download or upload data should diminish.
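For completeness, here is only a bare sketch of the pattern, kept to the small thread count noted above; `upload_batch` is a hypothetical stand-in for a real API upload call, which would also need the error handling and session-retry logic that made my own attempts unreliable.

```python
from concurrent.futures import ThreadPoolExecutor

def upload_batch(batch):
    """Hypothetical stand-in for a Web Service API upload call.
    Returns one fake Salesforce Id per record."""
    return [f"id-{record}" for record in batch]

records = list(range(10))
# Split the records into fixed-size batches, one per upload call.
batches = [records[i:i + 5] for i in range(0, len(records), 5)]

# Keep the worker count small (2-3); beyond that is where the
# reliability problems described above tend to start.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(upload_batch, batches))

# Flatten the per-batch results back into a single Id list, in order.
uploaded = [sfdc_id for batch_result in results for sfdc_id in batch_result]
```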
The Bulk API is a topic I would like to cover in a later post, as I have not really worked with it very much just yet. I would like to get a few real-world experiences under my belt before I tackle that subject.
The fourth technique, batch processing, involves breaking up large data sets into smaller batches for processing. When downloading and uploading data to Salesforce, batching pretty much happens by default, since that is the underlying architecture of the Web Service API. However, when processing source or target files, many developers (including myself) tend to deal with all of the records at once, downloading every record from Salesforce before applying any business rules or logic.
Batch processing isn't always ideal if you are trying to make decisions based on the whole data set, but in most cases there is no need to process all records at once. For those situations I suggest working with one batch of records, performing any logic along with the inserts or updates, before moving on to the next batch. This could mean grabbing only 100 records at a time from the source file to upload to SFDC, or downloading only 100 records at a time for processing.
This technique is especially useful when you are not sure how the data or the program will need to scale, and it's a fairly safe way to ensure the program can handle large data sets. The trade-off is that you cannot predict how long the program will take to complete, since that depends on the number of batches that need to be processed.
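The batch loop described above can be sketched in a few lines; `process_batch` is a hypothetical placeholder for the business logic plus the insert/update call, and the 100-record batch size matches the example in the text.

```python
def batches(records, size=100):
    """Yield successive fixed-size batches from a record list."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def process_batch(batch):
    """Stand-in for per-batch business logic plus the upload call.
    Here it just reports how many records it handled."""
    return len(batch)

source_records = list(range(250))  # hypothetical source file rows

processed = 0
for batch in batches(source_records, size=100):
    # Apply business rules and push this batch to SFDC
    # before fetching or reading the next one.
    processed += process_batch(batch)
```

Because only one batch is ever held at a time, memory use stays flat no matter how large the source file grows, which is what makes this the safe default.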
Overall, I feel that working with large data sets requires a combination of all of these techniques, and finding a balance that meets the needs of your users, even if that means making some compromises in performance.
I also feel that in the near future Salesforce will develop more tools and APIs to help developers work with large data sets. It is a common issue for many of us working in the enterprise space, and one that Salesforce is starting to address effectively.
Post a comment if you have any other tips on dealing with large data sets, or if you have any questions.