Refactoring: really?
I have a 3 month old son, and I’m often up at silly-o-clock at night with him.
One thing I’ve taken to doing is watching various videos on youtube. I watched through the gocon 2014 videos, a bunch of pycon ones, some ted talks, and recently started watching through some ruby based videos.
I’ve never really been into Ruby as a language. It seems to have some nice things, but python always seemed cleaner to me. One thing I like about the community - at least as far as the rubycon videos indicate - is they seem to hold theoretical best practices in high regard, and spend a lot of time trying to establish conventions and teach good thought processes, rather than simply teaching how to use specific libraries.
Anyway, I watched “Refactoring from good to great” by Ben Orenstein.
He started with the following code:
	class OrdersReport
		def initialize(orders, start_date, end_date)
			@orders = orders
			@start_date = start_date
			@end_date = end_date
		end
		def total_sales_within_date_range
			orders_within_range = 
				@orders.select { |order| order.placed_at >= @start_date &&
				                 order.placed_at <= @end_date }
			orders_within_range.
				map(&:amount).inject(0) { |sum, amount| amount + sum }
		end
	end
to actually use the previous code, you need to write:
	OrdersReport(orders, start_date, end_date).total_sales_within_date_range()
Which seems a little… odd to me. The class initialiser/constructor is taking the arguments, rather than the function which actually refers to them. Presumably because you want to set up a report, and then run multiple queries against the same data you originally specified.
Except that the filtering of data happens within the total_sales_within_date_range
function, not in the constructor…
I thought he was going to go for “why does this need to be a class?” and suggest using simple functions instead.
I imagined he was going to promote “API-first” style refactoring. Asking “how do we want to use this functionality?” first.
Perhaps ending up with the library API usage looking like:
	Reports.orders_in_range(start_date, end_date).total_sales()
or even not bothering with the classes at all, and just go for something like:
	def sales_between_dates(start_date, end_date)
		Order.where(placed_at: (start_date..end_date)).sum("amount")
	end
Which if we’re using Rails / Active Record maps to a single SQL query:
	SELECT sum(amount) FROM orders WHERE placed_at > ? AND placed_at < ?;
but instead, he suggested refactoring all of the logic out into “one line methods”:
	class OrdersReport
		def initialize(orders, date_range)
			@orders = orders
			@date_range = date_range
		end
		def total_sales_within_date_range
			total_sales(orders_within_range)
		end
		private
		def total_sales(orders)
			orders.map(&:amount).inject(0, :+)
		end
		def orders_within_range
			@orders.select { |order| order.placed_between?(@date_range) }
		end
	end
	class DateRange < Struct.new(:start_date, :end_date)
		def include?(date)
			(start_date..end_date).cover? date
		end
	end
	class Order < OpenStruct
		def placed_between?(date_range)
			date_range.include?(placed_at)
		end
	end
Now, instead of one, single, easy to grok method, you’ve split everything out into 6 methods spread over 3 classes. Besides that, you now have to query the entire orders table, and then sort the results in Rubyland, rather than doing it in SQL. You then run 4 ruby functions on every row in your query.
Want to know why rails is considered slow? If this is the mindset that goes into how “great” ruby code looks, then I’m really not surprised at all.
Say we’re not using a SQL database though, and it has to be done in Ruby.
In which case, the following questions:
- Why split apart - total_sales_within_date_range, so that it (as a method) doesn’t accept a date range object as an argument, but relies on the class attribute- @orders, but there’s a private method- total_sales, which does accept a- ordersargument, and doesn’t rely on the class attribute?
- Around about 13 minutes in, Ben talks about how decoupling different parts of the code is good. How then does adding - DateRangedependent methods to the- Orderclass, and making the- OrderReportmethods dependent on both of them fit into this? It’s spaghetti!
- Since - .cover?is a built-in ruby method for seeing if a value is within a range, why not name- DateRange.include?instead- DateRange.cover?
So why is everything split into 1-line-methods anyway?
Honestly, I think it’s because the code is hard to grasp straight away when written as a single function, and “TDD” (Test driven development) encourages splitting things up for ease of writing unit tests, rather than for ease of developer comprehension.
A function as simple as this shouldn’t be hard to read but Ruby, like Perl, makes such functions easily overwhelming.
Here’s the original code, but as a single function:
	def total_sales_within_date_range(orders, start_date, end_date)
		orders_within_range = 
			orders.select { |order| order.placed_at >= start_date &&
							 order.placed_at <= end_date }
		orders_within_range.
			map(&:amount).inject(0) { |sum, amount| amount + sum }
	end
I’m not a Ruby expert, so I’ll switch to Python at this point, to show how it can be made clearer.
Here’s a direct/naive python translation:
	def total_sales_within_date_range(orders, start_date, end_date):
		orders_within_range = [order for order in orders if
		                       order.placed_at >= start_date and
		                       order.placed_at <= end_date]
		return reduce(lambda total, amount:total+amount,
		              (order.amount for order in orders_within_range), 0)
which is still quite unclear, and certainly a bit wordy.
Actually simply going purely imperative makes the code possibly a little clearer:
	def total_sales_within_date_range(orders, start_date, end_date):
		total = 0
		for order in orders:
			if start_date <= order.placed_at <= end_date:
				total += order.amount
		return total
Which, incidentally, should be quite a bit faster for large data sets,
as it only goes through the list once. (O(n) rather than O(2n))
This is the direction that Go encourages. Rather than try to have loads of clever language constructs, simplify things down as much as you can, and write direct, obvious code that doesn’t surprise people.
I appreciate that - but Go can go a bit far, and end up feeling somewhat verbose.
Back to the example. It can be done in a “functional” style slightly better though:
	def total_sales_within_date_range(orders, start_date, end_date):
		return sum(order.amount for order in orders if
		                            order.placed_at >= start_date and
		                            order.placed_at <= end_date)
or
	def total_sales_within_date_range(orders, start_date, end_date):
		return sum(order.amount for order in orders if
		                            start_date <= order.placed_at <= end_date)
Under the hood, then only things you really need to understand here are generator expressions, which in this case reads almost like the original SQL above.
OK, but it was just a silly example, most things can’t be reduced to a single expression.
Fair enough. In which case, the following concepts of refactoring might make more sense:
- How do we actually want code using this library to look?
There’s only two real reasons for abstractions:
- To make life easier for the library user.
- To allow structuring code in a cleaner / different way.
Making life easier is for instance boilerplate removal, hiding away the actual semantics of SQL so that the user doesn’t need to say every time:
	 query("SELECT * FROM orders WHERE id=?;", id)
but can just say:
	Orders.get(id)
for instance.
- What are the best language-level constructs available for this task?
Would a class be best here? Or a generator? Or a pure function? Or a module? A Context Manager? Or would it be best to use a dict / hashmap rather than a class?
For instance, we have classes as an available language construct. And they can be used quite nicely for simple ‘modelling’ of real world things (Sale, Order, Item, Customer, etc), but they also can be used for abstract code-level concepts.
Why might OrderReport be better as a class, rather than just using pure functions?
The big win is that you can do stuff ahead of time, and keep computed stuff around between methods.
For instance, we could pre-select the orders within the date range during the constructor, and then any methods we run would work on a pre-selected range of stuff, rather than having to check again and again.
	report = OrdersReport()
		.exclude_types('test')
		.currency('EUR')
	last_week = report.from(now-timedelta(weeks=1)).select()
	last_month = report.from(now-timedelta(weeks=4)).select()
	trend = last_week.count() - (last_month.count() / 4)
type of thing.
Anyway. When designing a library, or structuring code, first think, “How do I want this to be used”, next: “which bits might be re-used”, and “which of those are actually worth abstracting out”. As Rob Pike says, a little duplication is better than a lot of extra complex abstraction. This video, by Sandi Metz, is extremely good, and gives a much better code structure, and usage. “Duplication is far cheaper than the wrong abstraction.” - Sandi Metz