February 4, 2019

They are Millions!

By Andy Gasparini

The Million Items Query

It is tempting to assume that a task you can summarize in a few words will be easy to implement. With SQL or Entity Framework, that assumption often breaks down in projects that put an extensive amount of stress on a database, particularly on a single table. Most of the time we look for ways around it, like designing the database in different creative ways, or simply increasing the power on our cloud subscription. As the title's nod to the video game They Are Billions may have given away, entries are sometimes almost like zombies, slowly lurking into a database one by one, clogging the system and eating brains, or in this case computational power, and consequently money and valuable design time.

A Different Approach

The following statements are commonly used to insert an item into a table with Entity Framework:

var myObject = new MyObject { /* your fields */ };

_repo.Insert(myObject);

try
{
    await _repo.SaveChangesAsync().ConfigureAwait(false);
}
catch (Exception ex)
{
    // Your emergency maneuvers!
}

But what if the service or application you are building requires the insertion of millions of items? Add a simple for loop, right?!

try
{
    // This list could contain millions of items!
    List<MyObject> objects = new List<MyObject>()
    {
        // Your objects
    };

    foreach (MyObject obj in objects)
    {
        _repo.Insert(obj);
        await _repo.SaveChangesAsync().ConfigureAwait(false);
    }
}
catch (Exception ex)
{
    // Emergency maneuvers!
}

Would this work for 10,000 items? Maybe, yes! What about 100,000 items? Maybe, but most likely not! Then what about 1 million and above? I don’t think so! So, what could the solution be: stored procedures? Redesigning the database to work around the million entries? Pushing your cloud subscription to the max? No. A simple yet efficient solution is to write a few functions in C# that use the System.Data SQL libraries in a clever way, through an almost forgotten class called SqlBulkCopy, which performs the insertion when you simply pass it a DataTable object. So here is the approach:

1. Convert your object fields into DataTable columns (in this case I am using reflection but you don’t have to)

public DataTable ConvertCollectionToDataTable<T>(ICollection<T> itemsToConvert)
{
    if (itemsToConvert.Count < 1)
    {
        // I made a custom exception, but you
        // can really throw anything that suits you
        throw new BulkException("The collection that is "
                                    + "to be inserted is empty");
    }

    PropertyDescriptorCollection props =
        TypeDescriptor.GetProperties(typeof(T));

    DataTable dataTable = new DataTable();

    for (int i = 0; i < props.Count; i++)
    {
        PropertyDescriptor prop = props[i];
        dataTable.Columns.Add(prop.Name, prop.PropertyType);
    }

    DataTableInsertion(itemsToConvert, props, dataTable);

    return dataTable;
}

2. Insert the values of those fields for every single object into the DataTable

private void DataTableInsertion<T>(ICollection<T> itemsToConvert,
                                   PropertyDescriptorCollection props, 
                                   DataTable dataTable)
{

    object[] values = new object[props.Count];

    foreach (T item in itemsToConvert)
    {
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = props[i].GetValue(item);
        }

        dataTable.Rows.Add(values);
    }
}
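The two helpers above can be sanity-checked end to end with any small POCO. In the condensed, self-contained sketch below, the `Person` type, the `Demo` class, and the `ToDataTable` name are illustrative stand-ins of my own, not part of the article's classes:

```csharp
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;

// Hypothetical POCO, used only to demonstrate the conversion
public class Person
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public static class Demo
{
    // Condensed version of steps 1 and 2 above
    public static DataTable ToDataTable<T>(ICollection<T> items)
    {
        PropertyDescriptorCollection props = TypeDescriptor.GetProperties(typeof(T));
        var table = new DataTable();

        foreach (PropertyDescriptor prop in props)
        {
            // Note: a nullable value type would need
            // Nullable.GetUnderlyingType(prop.PropertyType) ?? prop.PropertyType
            table.Columns.Add(prop.Name, prop.PropertyType);
        }

        var values = new object[props.Count];
        foreach (T item in items)
        {
            for (int i = 0; i < values.Length; i++)
            {
                values[i] = props[i].GetValue(item);
            }
            table.Rows.Add(values);
        }
        return table;
    }

    public static void Main()
    {
        var people = new List<Person>
        {
            new Person { Id = 1, Name = "Ada" },
            new Person { Id = 2, Name = "Alan" }
        };

        DataTable table = ToDataTable(people);
        Console.WriteLine(table.Columns.Count); // 2 (Id and Name)
        Console.WriteLine(table.Rows.Count);    // 2
    }
}
```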

3. Figure out the best options settings for your insertion and if you do want to create a transaction or not for the query execution, and then insert

  • With transaction
public void BulkInsert(DataTable dataTable, 
                       SqlTransaction transaction, 
                       string destinationTableName)
{
    // Standard checks //
    if (dataTable == null)
    {
        throw new BulkException("Cannot insert a null data table!");
    }

    if (dataTable.Columns.Count < 1)
    {
        throw new BulkException("Cannot insert an empty data table!");
    }

    try
    {
        // Create your SqlBulkCopy options, TableLock is optional
        SqlBulkCopyOptions copyOptions =
            SqlBulkCopyOptions.TableLock |
            SqlBulkCopyOptions.FireTriggers;

        // Create your SqlBulkCopy object and pass in your connection
        // (in this case in the transaction), your copy
        // options, and your transaction
        using (SqlBulkCopy bulkCopy =
            new SqlBulkCopy(transaction.Connection,
                copyOptions,
                transaction))
        {
            // Set your destination table
            bulkCopy.DestinationTableName = destinationTableName;
            // Insert your items!
            bulkCopy.WriteToServer(dataTable);
        }

        // Commit the transaction
        transaction.Commit();
    }
    catch
    {
        // In case of failure rollback and throw an
        // exception (so you can go to a higher level catch block
        // and throw a specific exception)
        transaction.Rollback();
        throw;
    }
}
  • Without transaction
public void BulkInsert(DataTable dataTable, string destinationTableName)
{
    // Simple standard checks! //
    if (string.IsNullOrEmpty(_connectionString))
    {
        throw new BulkException("Connection string is null or empty!");
    }

    if (dataTable == null)
    {
        throw new BulkException("Cannot insert a null data table!");
    }

    if (dataTable.Columns.Count < 1)
    {
        throw new BulkException("Cannot insert an empty data table!");
    }

    // Create an SqlConnection passing in your connection string
    using (SqlConnection sqlConnection =
                            new SqlConnection(_connectionString))
    {

        // Select your options for the insertion.
        // SqlBulkCopyOptions.TableLock can be removed
        // if you don’t want a table lock.
        // SqlBulkCopyOptions.UseInternalTransaction wraps each
        // batch in its own transaction; note it cannot be combined
        // with an external transaction, or an exception is thrown
        SqlBulkCopyOptions copyOptions =
                                    SqlBulkCopyOptions.TableLock |
                                    SqlBulkCopyOptions.FireTriggers |
                                    SqlBulkCopyOptions.UseInternalTransaction;

        // Open the connection to your db
        sqlConnection.Open();

        // Create an SqlBulkCopy object with everything you
        // have created so far; the last parameter is the
        // transaction, which we don’t have and don’t want

        // Mind: if you do happen to use a transaction,
        // you must specify that in the options,
        // otherwise passing in a transaction object as a parameter
        // will throw an exception no matter what
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(sqlConnection,
                                                      copyOptions,
                                                      null))
        {
            // Set the destination table
            bulkCopy.DestinationTableName = destinationTableName;

            // Insert your entries!
            bulkCopy.WriteToServer(dataTable);
        }

        // Close the connection
        sqlConnection.Close();
    }
}

That’s it!

Still unsatisfied?!

What can we do to insert even more entries…faster?!

Simply Batch!

If your solution requires heavy-duty entry insertion but you don’t really want to max out that cloud subscription, then simply batch your entries. This is straightforward and easy to adapt to different needs: the batch size can be set by a key in the web.config file and injected into the class that does the batching, or simply defined as a field.

Something like this:

   <add key="StandardBatchSize" value="100000"/>
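That key could then be read once at startup. Here is a minimal sketch using ConfigurationManager from System.Configuration; the helper name and the fallback default are my own assumptions, not from the article:

```csharp
using System.Configuration;

public static class BatchConfig
{
    // Read the batch size from web.config; fall back to a
    // default if the key is missing or malformed (the fallback
    // value here is illustrative only)
    public static int ReadBatchSize()
    {
        string raw = ConfigurationManager.AppSettings["StandardBatchSize"];
        return int.TryParse(raw, out int size) && size > 0 ? size : 100000;
    }
}
```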

Thus, create a method which splits your collection. Note: I am returning stacks here; you can easily adapt the methods ConvertCollectionToDataTable and DataTableInsertion to convert a stack to a DataTable. I am mainly using a stack for efficiency concerns, as pushes and pops are O(1) and it’s easy to keep track of items when inserting or removing.

private Stack<T>[] Split<T>(ICollection<T> itemsToInsert)
{
    // Make a list of stacks (our batches)
    List<Stack<T>> splitStacks = new List<Stack<T>>()
    {
        new Stack<T>()
    };

    // Necessary evil, convert itemsToInsert to array
    T[] itemsToInsertArray = itemsToInsert.ToArray();

    // Start 2 counts: one is going to count the items in the
    // current batch, the other indexes our stack of batches
    int splitStacksTrackCount = 0;
    int trackingCount = 0;

    for (int i = 0; i < itemsToInsertArray.Length; i++)
    {
        // Is a batch completed and ready to go,
        // yes then package it for shipping!
        if (trackingCount >= _batchSize)
        {
            splitStacks.Add(new Stack<T>());
            splitStacksTrackCount++;
            trackingCount = 0;
        }

        // Put in the items into the box!
        splitStacks[splitStacksTrackCount]
                            .Push(itemsToInsertArray[i]);
        trackingCount++;
    }

    // Yes an array of stacks ;),
    // computer ram thanks you for your cooperation!
    return splitStacks.ToArray();
}
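To see the batching in action, here is a condensed, self-contained version of Split exercised with a tiny batch size; the `SplitDemo` class name and the batch size of 3 are for illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitDemo
{
    // Tiny batch size, for illustration only
    static readonly int _batchSize = 3;

    // Condensed version of the Split method above
    public static Stack<T>[] Split<T>(ICollection<T> items)
    {
        var splitStacks = new List<Stack<T>> { new Stack<T>() };
        int currentStack = 0;
        int trackingCount = 0;

        foreach (T item in items)
        {
            // Batch full? Start a new one
            if (trackingCount >= _batchSize)
            {
                splitStacks.Add(new Stack<T>());
                currentStack++;
                trackingCount = 0;
            }
            splitStacks[currentStack].Push(item);
            trackingCount++;
        }
        return splitStacks.ToArray();
    }

    public static void Main()
    {
        Stack<int>[] batches = Split(Enumerable.Range(1, 10).ToList());
        Console.WriteLine(batches.Length);   // 4 batches: 3 + 3 + 3 + 1
        Console.WriteLine(batches[3].Count); // the last batch holds 1 item
    }
}
```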

Make a method to insert your items into the database (similar to our previous bulk insert):

private void BulkInsert(DataTable dataTable, string destinationTableName)
{
    // Make SqlConnection with data string
    using (SqlConnection sqlConnection = new SqlConnection(_connectionString))
    {
        // Select your options, in this case no external
        // transaction, only the internal one
        SqlBulkCopyOptions copyOptions =
                                    SqlBulkCopyOptions.TableLock |
                                    SqlBulkCopyOptions.FireTriggers |
                                    SqlBulkCopyOptions.UseInternalTransaction;

        // Open connection
        sqlConnection.Open();

        // Create your SqlBulkCopy object with all your
        // goodies, with a null transaction in this case
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(sqlConnection,
                                                      copyOptions,
                                                      null))
        {
            // Set the destination table
            bulkCopy.DestinationTableName = destinationTableName;

            // Insert your items
            bulkCopy.WriteToServer(dataTable);
        }

        // Close your connection
        sqlConnection.Close();
    }
}
public void BatchedBulkInsert<T>(ICollection<T> itemsToInsert,
                                 string destinationTableName)
{
    // Standard checks //
    if (string.IsNullOrEmpty(_connectionString))
    {
        throw new BulkException("Connection string is null "
                                        + "or empty!");
    }

    if (itemsToInsert == null)
    {
        throw new BulkException("Cannot insert a null collection!");
    }

    if (itemsToInsert.Count < 1)
    {
        throw new BulkException("Cannot insert an empty collection!");
    }

    // Split into batches
    Stack<T>[] splitStacks = Split(itemsToInsert);

    // Convert to data tables, with the method adapted to
    // receive a stack. A stack is mainly used here to keep
    // the overhead low, as pushes and pops are O(1) and there
    // are instances where thousands of stacks are created
    DataTable[] dataTables = ConvertToDataTables(splitStacks);

    // Insert!
    for (int i = 0; i < dataTables.Length; i++)
    {
        BulkInsert(dataTables[i], destinationTableName);
    }
}

Finally, wrap everything into a method like the one above, and voila!

Conclusion

The intimidating task of designing a system that can insert millions of entries often comes into play when building an auditing feature or something similar that stores a large number of records all at the same time. Sometimes, though, the daunting work of fighting the database design can be transformed into a fun, college-like algorithm design excursion that is handled directly in C#, without writing any stored procedures, and that can be easily managed, changed, and reused. All you need to remember is just woulda, coulda, shoulda.

Photo by Agentex

Andy Gasparini - Full Stack Software Developer and Consultant

About Andy Gasparini

Andy Gasparini is a Full Stack Developer and Consultant at Centare. Andy is mainly focused on C# and Angular, but also loves Java and React. He is an embedded systems and IoT enthusiast and loves cloud-based systems like AWS and Azure. In his free time he likes to learn about mathematics, create libraries, and work with embedded systems. Some of his projects are available at www.braintobytes.com