How not to use Includes in Lambda Statements
Published: 3rd June 2015
I had been looking into performance issues with one of our products. This is the second in a series of posts in where I have found major issues in query performance. Read Why not to use In Memory Collections in LINQ to Entities for more information. As I stated in that post, data access is using entity framework talking to a SQL Database with a lot of queries being built using LINQ or Lambda statements. For the most part this is good, however it is very important to stop and look at the SQL that is generated by these LINQ queries. In the example I am about to describe, the SQL that was generated was quite excessive.
The following example was found in our production code, I have cleansed this to hide business logic. The includes are string parameters passed in the Lambda statement that tell Entity Framework to return not only the selected table of data, but the associated child records in that table. Where a string of table names separated by periods is specified, entity framework looks across the relationships to get the data.
var filter = "some string";
var list = context.Record.FindWhere(
x => x.RecordCategories.Id == RecordCategoryId
&& x.FieldToFilter == filter
&& x.SomeBoolean1,
"SubRecord1",
"SubRecord2",
"SubRecord3.ParameterTable.SubRecord3")
.OrderBy(x => x.Id).ToList();
You may notice that the SubRecord3 line looks a little odd. Not only are we getting records in the SubRecord3 table, we’re also joining out to a parameters table and then finding all the records in the SubRecords3 table that are linked to the same parameters table, even if the record doesn’t belong to the parent table. This alone would be bad, however given the purpose of this query is to return six total values it gets even worse. On our ten thousand record dataset I found the following:
- 823 lines of SQL were generated.
- 109,989 rows of data were returned.
- 264 columns per row were returned.
- The page executing the query took 91 seconds to process.
This was very bad, so I set about rebuilding the data queries to get just what I wanted. Granted this required more code than the original query:
var filter = "some string";
var subRecord3Classification = OurEnum.OurClassification.ToString();
var subRecord2s = (from item in context.SubRecord2.FindAll()
group item by new { item.ParentId }
into items
select new
{
items.Key.ParentId,
Total1 = items.Sum(x => x.Field1),
Total2 = items.Sum(x => x.Field2)
}).ToList();
var subRecord3s = (from item in context.SubRecord3.FindWhere(x => x.ParameterTable.Classification == subRecord3Classification)
group item by new { item.ParentId }
into items
select new
{
items.Key.ParentId,
Total3 = items.Sum(x => x.Field3)
}).ToList();
var parentRecords = (from item in context.Record.FindWhere(x => x => x.RecordCategories.Id == RecordCategoryId && x.FieldToFilter == filter && x.SomeBoolean1, "SubRecord1")
select new
{
RecordId = item.Id,
RecordStatus = item.RecordStatus,
OwnerId = item.SubRecord1.OwnerId,
OwnerName = item.SubRecord1.OwnerName,
Value1 = item.RecordStatus ? item.Value1 : 0,
Value2 = item.RecordStatus ? item.Value2 : 0
}).ToList();
Since most of the data in the previous query was surplus to requirements, I re-built this as three separate queries on each required table pulling back just the bare minimum I needed. In both cases there is code I have omitted that converts the results into a simple object model returning a row per parent record, however it can be argued that another step would have been to return a single set of totals. The difference however was as follows:
- The first query generated 16 lines of SQL, returned 10,227 rows with 3 columns per row.
- The second query generated 14 lines of SQL, returned 20 rows with 2 columns per row.
- The third query generated 16 lines of SQL, returned 10,227 rows with 11 columns per row.
- The same totals were achieved.
- The page executing the query took less than 1 second to process.
For the most part the original functionality works well enough on our smaller clients, but on our bigger clients it is very sluggish. The question could easily be raised as to how we didn’t spot this before, the answer to that there were tight time-scales imposed by government legislation meaning that one way or another the base functionality had to go to the customer by a set date.
Now if you can imagine my groans at finding out what the original query above was doing, imagine how I felt when I found something like the following shortly afterwards. This one took 2 minutes to execute and SQL Profiler refused to show the SQL Generated:
var list = context.Record.FindWhere(
x => x.RecordCategories.Id == RecordCategoryId
&& x.FieldToFilter == filter
&& x.SomeBoolean1,
"SubRecord1.TertiaryRecord1",
"SubRecord2",
"SubRecord3.ParameterTable1.SubRecord3",
"SubRecord4",
"SubRecord5",
"SubRecord6",
"SubRecord7",
"SubRecord8",
"SubRecord9",
"SubRecord10",
"SubRecord11",
"SubRecord12",
"SubRecord13.ParameterTable2").OrderBy(x => x.ID).ToListd);
When it comes to writing any database query using LINQ to Entities or Lambda Expressions I have the following advice from my investigations:
- Always check the SQL being generated by your code.
- Limit what you bring back from the database to just what you need.
- Always test on a dataset that is bigger than your largest customer.