Musings++

Something something programming

For a while now I've been building new projects on a framework that I've dubbed the "Free Startup". It's a technical framework that allows you to build an MVP as quickly as possible, with 0$ in technical investment. The only thing you need to bring is your time.

Development Environment - Whatever you want

Pick whatever you want! I'm a hardcore vim user, but you're probably not what my team-mates call "a dinosaur". Netbeans, IntelliJ, Atom, Notepad++, Emacs. Whatever you want - chances are your dev environment is available for free. Even some of the last holdouts (Visual Studio) offer free versions (and there's also VSCode). Pretty much everything is available for free.

Code hosting - GitLab

There was a time when I would suggest GitHub as the be-all and end-all for your code hosting needs. It was free for public repos and seems to be the de facto location for most projects. It's honestly a good platform, but I find I prefer GitLab way more. It's free, gives you private repositories, and essentially replicates the feature set of the current incumbent, GitHub. On top of that, it includes projects, fancy issue boards, and so much more.

CI/CD - GitLab <3 Heroku

GitLab is trying to combat GitHub's current hold on the developer market by building everything you need under one roof. That means it's also got a really sweet CI/CD experience, very similar to CircleCI. It's a YAML-based configuration that allows you to build your project in containers. From there you can hook it directly into Heroku so that merges to different branches automatically deploy!

Communication - Discord

Inter-project communication is really vital for the existence of your project. Slack is the de facto choice for a lot of people here - however, as a developer that works with other people, Discord is something I've found myself drawn to more and more. It's very easy to get started with, and gives you a great mechanism for on-demand voice chats. It also has a simple webhook system that lets you hook in GitLab + Heroku events so that you can get CI/CD updates right in your chat interface.


"The disk is slow" is one of those things that most programmers take for granted. Yes it is slow given the speed of other components. But rarely have programmers taken the time to dig into WHY the disk is slow and what that actually means. Yet, doing so can lead us down some interesting rabbit holes.

What is slow?

For a while now the speed of a hard-drive has been measured in RPM, or Revolutions Per Minute. This is an indication of how quickly the disk can spin. It is common nowadays to see drives advertising 7,200 rpm, or 10,000 rpm or even 15,000 rpm. How fancy.

Now, the disk itself is split into a couple major components.

  1. The disks that data is stored on
  2. The read/write head

These disks are where the data is actually stored, and when you see a number like "7,200 rpm" what you are seeing is how quickly these disks can spin. In a simplified manner, what happens when you "write something to disk" is that the disk spins to an empty point, and the head begins to write. Likewise, when you "read" data from the disk, it spins to a designated point (the "start" of the data) and the head begins to "read" the data until it is done.

Let's walk through a theoretical "disk". Say your "disk" can hold 8 units (00000000) of storage. We are going to perform a few actions on this disk.

  1. You write 'a' twice - aa000000
  2. You write 'b' 3 times - aabbb000
  3. You delete the two 'a' - 00bbb000
  4. You write 'c' 4 times - ccbbbcc0

See your drive is smart enough to know that, even though there isn't enough "contiguous" space, there is still enough space scattered around on the drive to store your 4 units of 'c'. What happens is that your drive will spin to a free location, and let you start writing. When you run out of contiguous space, it will spin to a new location. This results in your data actually spread out all over your drive instead of next to each other. This is a good thing, of course. It means that you can use all the space on your drive without worrying about WHERE things are stored.

But, it also means that instead of the disk spinning just ONCE to get to the start of your data, it actually needs to spin twice.
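If you want to play with this yourself, here is a rough Python sketch of the 8 unit disk above. The function names (write, delete, seeks) are mine, and the "fill the first free slots" strategy is an assumption made for illustration, not how any real drive or filesystem allocates space. It counts one seek per contiguous run of a value.

```python
def write(disk, value, count):
    """Write `count` units of `value` into the first free slots ('0')."""
    units = list(disk)
    for i, unit in enumerate(units):
        if count == 0:
            break
        if unit == "0":
            units[i] = value
            count -= 1
    if count:
        raise ValueError("not enough free space on the disk")
    return "".join(units)


def delete(disk, value):
    """Free every unit that holds `value`."""
    return disk.replace(value, "0")


def seeks(disk, value):
    """Count contiguous runs of `value`; each run costs one seek."""
    runs, prev = 0, None
    for unit in disk:
        if unit == value and prev != value:
            runs += 1
        prev = unit
    return runs


disk = "00000000"
disk = write(disk, "a", 2)   # aa000000
disk = write(disk, "b", 3)   # aabbb000
disk = delete(disk, "a")     # 00bbb000
disk = write(disk, "c", 4)   # ccbbbcc0
print(disk, seeks(disk, "c"))  # ccbbbcc0 2
```

The 'c' data ends up in two separate runs, which is exactly the "spin twice" problem.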

As we grow our disk size from "8 units" to the hundreds of gigabytes that most modern drives have, we run into a problem - there is no guarantee that the data we need will be next to each other. In fact, there is a high probability that we will need to keep jumping about on the disk to be able to read ALL the data we want. Our data ends up "fragmented".

Data Fragmentation

The result of all this fragmentation is that things just get slower over time. Unfortunately, the eventual degradation of data storage efficiency is never attributed to the hard-drive because users don't actually USE the hard-drive directly - they go through the OS which is supposed to manage these things. As a result, the degrading performance is chalked up to "my Windows is slow". Operating systems combatted this by shipping with a defragmenter, which does exactly what you'd expect. It takes all these scattered fragments from around your drive, and puts them next to each other. This reduces the number of seeks needed to retrieve information, thereby making things speedier.

But that's an expensive (resource wise) thing to do. In order to defragment a system, the program needs to

  • find an application that has its data fragmented
  • copy the data between the fragmented data to memory or some other free space on the drive
  • move the data closer together.
  • repeat

Let's go back to our previous scenario, and see how defragmentation could work:

Initial State: ccbbbcc0  
Step 1:        cc0bbccb  
Step 2:        cccbbc0b  
Step 3:        ccc0bcbb  
Step 4:        ccccb0bb  
Step 5:        cccc0bbb  

Obviously this is not optimized in any way, so there's plenty we could do to speed this up. But this is essentially what your drive is doing. It's like putting a deck of cards back in order after you've been shuffling them. Sure it's possible, but it just takes some time.
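As a rough sketch of the idea in Python: the defragment function below is hypothetical and much cruder than a real defragmenter. It just computes a layout where every file's units are contiguous (free space pushed to the end) and counts how many units would have to be rewritten to get there. Note that it lands on ccccbbb0 rather than the cccc0bbb from the walkthrough, since it doesn't care where the free unit ends up.

```python
def defragment(disk, free="0"):
    """Return (new_layout, moves) with each file's units made contiguous."""
    # Keep files in the order they first appear on the disk.
    order = []
    for unit in disk:
        if unit != free and unit not in order:
            order.append(unit)

    # Lay each file out as one run, then pad with free space.
    target = "".join(unit * disk.count(unit) for unit in order)
    target += free * (len(disk) - len(target))

    # A "move" is any position that has to be rewritten with file data.
    moves = sum(
        1 for old, new in zip(disk, target) if new != free and old != new
    )
    return target, moves


print(defragment("ccbbbcc0"))  # ('ccccbbb0', 4)
```

Four rewrites for an 8 unit disk doesn't sound like much, but scale that up to a real drive and you can see why defragmenting takes a while.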

A much better idea, would be to try and optimize STORING this data in such a way that would reduce fragmentation. That is, maybe we keep data that is related next to each other on the drive when we WRITE the data the first time. That way things don't get as fragmented as quickly.

Blocks and Pages

The first step in ensuring data is kept close together is the idea of "blocks". Basically the filesystem that actually interacts with the hard-drive will define a "block size". The block size is a measure of how much data will fit in a block, and the filesystem reads/writes in blocks instead of individual bytes. Think of it this way: If your hard-drive was a piece of lined paper, we were originally writing things down one word per line. With blocks, we basically said "well, we'll just write until we reach the horizontal end of this line". So now instead of one word per line, we have a few words per line. Perhaps, we could say, we have one sentence per line.

Using our previous 8 unit drive example, we could sub-divide that into blocks of 2 units, making it look like this:

Drive State: [c,c][c,c][b,b][b,0]  

Now when we want to read all the c values, we have two seeks instead of one per record. This is already a big improvement (we've reduced seeks by 50%), but we could probably reduce it even more. Since the filesystem has to expose a standard block size to all applications, systems that do a lot of disk I/O need an alternative. The easiest thing to do is take the concept of a block containing records and create another abstraction: a page containing blocks.
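Here is a small Python sketch of that block arithmetic. The helper names (to_blocks, block_reads) are made up for this example, and it assumes one seek per block that contains the value we want, which is the same simplification the example above uses.

```python
def to_blocks(disk, block_size):
    """Split the disk into fixed-size blocks."""
    return [disk[i:i + block_size] for i in range(0, len(disk), block_size)]


def block_reads(disk, value, block_size):
    """Count the blocks we have to read to get every unit of `value`."""
    return sum(1 for block in to_blocks(disk, block_size) if value in block)


disk = "ccccbbb0"
print(to_blocks(disk, 2))          # ['cc', 'cc', 'bb', 'b0']
print(block_reads(disk, "c", 1))   # 4
print(block_reads(disk, "c", 2))   # 2
```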

At its "lowest" level, a relational database deals with "pages". Pages are really just collections of the data that you are storing. Relational databases (non-relational databases might as well, but I haven't really dug into the internals of a lot of them) utilize this concept of a "page" to further decrease IO latency with the disk. Rather than dealing with the storage of individual records or information, the database groups records together into a "page" and uses that. It will read/write a whole page. This allows it to capitalize on the assumption that when you are reading/writing data, the data you are accessing is probably next to other data that you also require.

They even go so far as to let you customize this via "clustered keys". A clustered key is just a mechanism to allow you, the database administrator, to define HOW the database orders the data within pages. As the administrator, you know the data you are trying to store, and the primary ways that it might be accessed. Databases give you the ability to say "well, group all these records together on the disk by the values in this column". This creates pages that are grouped around a particular value (a userID for example), so that all records with that same value are near each other.

Think of a database where you want to associate a list of items with a user. You have two tables, users and items.

+-------+    +---------+    
| users |    |  items  |    
+-------+    +---------+        
| id    |    | id      | 
| name  |    | user_id | 
+-------+    | name    | 
             +---------+    

It would make sense to create a clustered key around the userID in the items table. This allows us to keep all items that belong to a single user in the same page, or group of pages, on the disk. This way, when we try and retrieve the items for a user, the database management system can fetch all the pages related to this user, stick them in memory, operate on them, and then write them all.
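To make that a bit more concrete, here is a toy Python model. Everything in it is an assumption for illustration: the page size of 2 records, the sample rows, and the idea that a "clustered" table is simply the rows sorted by user_id before being cut into pages. A real database engine does a lot more than this.

```python
PAGE_SIZE = 2  # records per page; a made-up number for this sketch

# (id, user_id, name), in the order the rows were inserted
items = [
    (1, 1, "pen"), (2, 2, "cup"), (3, 1, "ink"),
    (4, 3, "hat"), (5, 1, "pad"), (6, 2, "mug"),
]


def paginate(rows):
    """Cut a list of rows into pages of PAGE_SIZE records."""
    return [rows[i:i + PAGE_SIZE] for i in range(0, len(rows), PAGE_SIZE)]


def pages_touched(pages, user_id):
    """Count the pages that hold at least one row for this user."""
    return sum(1 for page in pages if any(row[1] == user_id for row in page))


unclustered = paginate(items)
clustered = paginate(sorted(items, key=lambda row: row[1]))

print(pages_touched(unclustered, 1))  # 3
print(pages_touched(clustered, 1))    # 2
```

Same rows, same page size, but grouping by user_id means fewer pages to fetch for user 1.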

Databases are very intricate systems, and I don't want you leaving thinking there isn't a whole lot more to this whole concept. This is a HUGE simplification of what the database is actually doing, but it should provide you with an understanding of why it is doing some of that at a storage level.

The problem with blocks

The block system, however, is not without its own problems. By using "blocks" we've introduced a bit of wasted space into our storage. Let's go back to our block example:

Drive State: [c,c][c,c][b,b][b,0]  

That trailing 0 in block 4 will remain empty unless we add more b data. We will be unable to add any more c values but we'll be able to add more b. In fact, your drive will appear full to you because at the operating system level, it has no idea about the intricacies of your data storage. It just knows that these blocks are in use. So your 8 unit drive, has suddenly become 7 units.

That kind of sucks, and is actually a fundamental problem with "blocks". As long as you have data to write, blocks are great, but they will almost always result in the LAST block in a segment not being completely filled. This is natural of course, since whatever application is using that space generally doesn't care to know (nor should it!) about the block size it needs to be using. The result of this is that the more "small files" you have on your drive, the more "slack space" you have on the drive - space that isn't being used for anything, but is still seen as "used".
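The slack space is easy enough to put a number on. This is a minimal Python sketch; the slack function and the example file sizes are mine, and it assumes every file starts on a fresh block and nothing is shared between files.

```python
def slack(file_sizes, block_size):
    """Space allocated to the files minus the space they actually use."""
    allocated = sum(
        ((size + block_size - 1) // block_size) * block_size  # round up
        for size in file_sizes
    )
    return allocated - sum(file_sizes)


# The 8 unit drive: 4 units of c and 3 units of b in blocks of 2.
print(slack([4, 3], 2))            # 1

# A thousand 100 byte files on 4096 byte blocks.
print(slack([100] * 1000, 4096))   # 3996000
```

That second number is nearly 4 MB of "used" space holding nothing at all.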

So now we come to a decision, either we just leave that space empty and accept it as part of the operating costs, or we try and figure out how to utilize it. Engineers, un/fortunately, are quite obsessed with performance. These "tails" (the last block) are inefficient, and could probably be removed with a bit of smart thinking. This results in two possible ways to resolve this problem.

  1. We allow the filesystem to support variable block sizes
  2. We figure out how to use that tail block for something useful

The first way, variable block sizes, is something file systems like ZFS utilize in an attempt to have more efficient storage. Since you know the kind of data you will be using the drive for, ZFS will let you specify your block size. If you know you have a lot of small things that need to be stored, drop the block size, likewise, increase if you have large things. It even has some magic like block level compression to try and use those blocks to their fullest. It is a very simple idea - and as we know, the simple ideas are the hardest to implement!

The second way is another simple solution to the problem. If we know we have a bunch of tail blocks that are half-filled... why don't we just combine them? That way we aren't creating a new tail block, but are instead re-using another tail block. This would result in another seek to read/write this data, but it ensures that we are using this disk to its fullest capacity. File systems like Btrfs will combine multiple tail blocks. The reason this is so effective is because the average block size is actually some multiple of 512 bytes. If you think about it, a text file might be a couple bytes? In a traditional file system that's 1 block per couple bytes. That's a heck of a lot of things you can stuff into a single block at that rate!
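Here is a rough Python sketch of what combining tails buys you. To be clear, this is not how Btrfs (or any real filesystem) implements it; blocks_used is a hypothetical function that uses a simple first-fit strategy to drop each file's tail into the first shared block with room, and never splits a tail across blocks.

```python
def blocks_used(file_sizes, block_size, pack_tails=False):
    """Count blocks needed, optionally sharing blocks between file tails."""
    full = sum(size // block_size for size in file_sizes)
    tails = [size % block_size for size in file_sizes if size % block_size]
    if not pack_tails:
        return full + len(tails)  # every tail gets its own block

    free_space = []  # bytes left in each shared tail block
    for tail in tails:
        for i, free in enumerate(free_space):
            if tail <= free:
                free_space[i] -= tail
                break
        else:
            # No shared block has room, so start a new one.
            free_space.append(block_size - tail)
    return full + len(free_space)


sizes = [10] * 100  # a hundred 10 byte text files
print(blocks_used(sizes, 512))                   # 100
print(blocks_used(sizes, 512, pack_tails=True))  # 2
```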

Changing the game

As you can see, we've put a lot of work into reducing the seek time for hard-drives. They've been such a fundamental component of computing that it was a requirement. But, what if you could just ignore seeking entirely? What if there was a way to almost instantly seek? In computer science we refer to this as O(1). That is, the size of the data we are looking through is irrelevant - we can access any section of the data as quickly as any other. Welcome to the world of solid state storage. Solid state storage utilizes electronics instead of mechanical instrumentation. That is, instead of a spinning disk and actuators for the read/write heads it uses electrical circuits. By removing the mechanical parts, it eliminates the "seek" time of disks that we find so slow. The only problem was that it was expensive and hard to make ENOUGH storage this way. We could easily make HDDs that were several gigabytes, but were struggling to make solid state drives at megabytes. It just couldn't keep up.

Until it could.

Nowadays solid state storage devices are relatively cheap and large enough for the average user. They avoid a whole bunch of problems caused by mechanical components. They produce less heat, less vibration, and they are a lot faster. In fact, for a lot of work-loads, it's silly to rely on hard-drives when you can get so much better performance from solid state storage.

How interesting

It's crazy to think of all that we've accomplished because of that little mechanical hard-drive. But what's crazier is that we are only able to see this in retrospect. No one was able to see what the result of spinning disk drives would be. No one thought that we would invent so many different file systems to solve the problems. That we would make so many advancements in technology just to store MORE data on the drives. At the time, they were just better than tape. They were simply a step in the chain, that in retrospect, was pretty cool.

Follow the conversation at HackerNews /news.ycombinator.com/item?id=13091192


Anachronistic Programming

I've decided to repost this article that I originally posted on September 17, 2013 while cleaning up the blog archives from when I was using Fargo.

I want to show you a piece of code. Something that's touted whenever this language is spoken of, and everyone seems to be able to pull out of their ass. I want you to look at it, understand it and then see how far down the execution chain you can take it. I'm not talking about if you can debug the app, set a breakpoint and then step through it. I want you to sit there, hands off the keyboard and go through the request cycle that needs to occur for this particular piece of code to execute and run as you expect it to. Spare no detail.

Did you get far enough? Did you get down to the HTTP protocol? The packet dance that happens before the actual request from a browser is sent? Or did you stop at "Browser makes a request to the IP address"?

See the problem today isn't that the computer is this magic box that only a few can understand. It isn't relegated to the guys in beards and thick glasses. It isn't just for the geeks and nerds. It used to be that being a programmer meant that you needed to also understand the specific hardware stack that you were working on. The exact chipset, the exact instructions available to you. The exact specs on the memory and video controllers.

Today, computers have become commonplace. And in order for it to get to this stage a few things needed to happen. The first one being "It just works". That's the basis of the consumer computer. With no fiddling, no worrying about any kind of internals, you should be able to get up and running in no time.

But you're not just a typical consumer. You're a programmer. And unfortunately, this idea of "It just works" has found its way into programmers' minds everywhere. You don't need to think about the HTTP protocol, "It just works". You don't need to worry about Little vs Big Endian - "It just works".

Until it doesn't.

I think the problem with programmers today is quite simple. We've been led to believe that we can rely on certain things within the system. Which is great. I mean, if we couldn't rely on the HTTP protocol where would we be today? But this reliance has led to an entirely new problem - "I don't care". Programmers today don't need to learn about how the protocol works because "It just works". They don't need to think about it. And I think that has led to an entire generation of programmers who don't understand the fundamentals of programming. They don't understand that the code they type into their fancy IDEs is really powered by the ideas of a few people and run on hardware. There's a severe disconnect between hardware and software and that is hindering them without them knowing it.

Don't get me wrong, I'm not trying to say that I'm some incredible programmer - far from it. I'm actually a terrible programmer, because programming isn't all software and algorithms. There's a hardware component to it that's overlooked way too often. The things you're doing with code you're RELYING on the hardware to accomplish.

Don't you think you should at least have a vague understanding of how it works?


Becoming a web developer

The web is a big deal. Like, a HUGE deal. And the people that make the web have been thrown from their basements into the limelight. They've been lauded and applauded for being on the forefront of the technological innovation. But everything has its cost. Jonathon Hill posted today about the (505) 591-3237. I think Jonathon missed an important cost - he forgot what it's like to be starting out. It's a common problem, and one that inevitably plagues even the 334-429-6516 ones. I don't mean to belittle what Mr. Hill does for a living. While I don't have a lot of experience with his work, I'm sure he's a great web developer. And I'm sure that his work is incredible because of the tools he has.

What I take issue with is that he seems to believe that this is what a new developer requires to be great. On the contrary, I think having tools like this at your disposal from the beginning causes one of two things.

  1. You start, the tools don't confer god-like web development skills. You get upset and leave.
  2. You start, the tools don't confer god-like web development skills. You get upset and work harder.

One of those two is good - and that same one doesn't require the initial investment that Mr. Hill thinks.

The real cost

  • Laptop (Hey, you probably have one of these right now!) ~ 700$
  • Books ~500$*
  • Linux (this one is optional, but if you're new to development I'd recommend it. There are a lot of great tools and utilities out there that show up on linux first. They won't be pretty, but god dammit they'll be awesome.)

*optional

Total Startup Cost: ~1200$

The books, of course, are not required but eventually you'll find that there are some things that people will always refer to that you'd like to have around. JavaScript: The Good Parts for example. Or the Dragon Book. Books are very important.

The beauty of the web is that it isn't memory intensive when you're trying to figure out what you're doing. But it can grow to whatever you want.

What about training?

Technology moves fast. Really fast. You know that awesome new phone you got two months ago? Out of date. You know that great new framework you learned last year? Technology has made that irrelevant. I will agree with Mr. Hill to a certain degree here. A technology focused college education is quite the waste of time. However, I don't recommend skipping it right away. There are other things that a college education offers you apart from your program.

Presentations, working with others, taking charge of projects when you end up with a bunch of slackers. Making the tough decisions to kick that one dude out of your group because he does nothing. Essays, reports, being on time. And the most important to a FREELANCE WEB DEVELOPER THAT MR HILL SEEMS TO MISS. MANAGING YOUR TIME. For many people College is the first time when they're left to their own devices. They are responsible for themselves and they have to figure out how they work best, and how to manage their social lives AND their work lives. People are given 4 years to make this work. Four years when you're allowed to screw things up and start over. Because the thing is, once those 4 years are up, if you don't have some understanding of how to be you - you're pretty fucked.

However, don't waste your education on a technological degree. Instead, I'd recommend doing something unrelated. Psychology, Marketing, or even English/Theater.

See there's a weird stereotype about a lot of tech people - they tend to be rather introverted. This isn't true for everyone of course, but for those whom it is, you have to understand that even when you enter the workforce as a developer, you still need to interact with people. A LOT.

You have meetings and phone calls, you have to explain your choices to management and clients. If you're a freelancer, you have even MORE work. You need to be a sales guy, support staff and a developer. If you find it hard to talk to people - good luck.

It's a great time to be a good developer.
~ Jonathon Hill

How true it is Mr. Hill. It is a great time to be a good developer. But starting out with a 3000$ investment doesn't make you a good developer.

Having the drive to be better makes you a good developer.


Again, I feel like I have to point out - I think Mr. Hill has some great work. Browsing through his projects, I'm not claiming he doesn't know what he's talking about. Just that maybe he's forgotten what it's like to start out as a developer.


Git core.autocrlf

Linux is my development environment of choice. It wasn't always - I used to do all of my work with XAMPP and Windows, but eventually I got sick of waiting for cool things and just made the jump. Now I can't imagine actually getting any work done NOT in a terminal.

204-371-8034 and 3036147326 allowed me to have my linux environment for work and my Windows one for gaming. However, it means I get to run into a few issues I never have before - namely constant End-of-Line issues with git.

Windows uses the CRLF (Carriage Return/Line Feed, \r\n) ending for lines, whereas unix uses just the LF (Line Feed, \n). This generally means that there's a whole mess of ^M characters in vim due to some files having the dos ending and some being unix-y. Git will try and correct this automatically but for the longest time I had no idea how to actually set that up.
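If you're not sure which kind of endings a file has, you can just count them. Here's a tiny Python sketch; count_line_endings is a name I made up, and it works on raw bytes so nothing gets translated behind your back.

```python
def count_line_endings(data: bytes) -> dict:
    """Count dos (CRLF) and unix (bare LF) line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # every CRLF also contains an LF
    return {"crlf": crlf, "lf": lf}


print(count_line_endings(b"one\r\ntwo\nthree\r\n"))  # {'crlf': 2, 'lf': 1}
```

A file with a non-zero count for both is the mixed kind that fills vim with ^M.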

core.autocrlf

Git has a configuration setting called core.autocrlf that tells git how you want it to handle line endings. There are three options:

  • true
  • false
  • input

true

Turns on automatic conversion in both directions: when you check a file out, git converts LF endings to CRLF, and when you commit, it converts CRLF back to LF. The repository ends up storing LF, while your working copy has CRLF. This is the setting usually recommended on Windows.

false

Turns it off (obviously). This essentially leaves you alone.

input

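Converts CRLF to LF when you commit, but doesn't touch anything on checkout. Everything that goes into the repository ends up with unix-y endings.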

In the end I settled on git config core.autocrlf true for my case. It converts the line endings for dos based files to unix-y ones when I commit and then leaves me alone. Exactly the kind of option I'm looking for.
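For the curious, here is a simplified Python model of what the three settings do to a text file. The on_commit and on_checkout functions are mine, not anything git exposes, and the model ignores things real git also considers (binary file detection, .gitattributes, core.safecrlf), so treat it as a sketch of the documented behaviour rather than the real thing.

```python
import re


def on_commit(data: bytes, autocrlf: str) -> bytes:
    """Model what ends up in the repository for a given setting."""
    if autocrlf in ("true", "input"):
        return data.replace(b"\r\n", b"\n")  # CRLF -> LF
    return data  # "false": stored untouched


def on_checkout(data: bytes, autocrlf: str) -> bytes:
    """Model what ends up in the working copy for a given setting."""
    if autocrlf == "true":
        # Bare LF -> CRLF; existing CRLF is left as it is.
        return re.sub(rb"(?<!\r)\n", b"\r\n", data)
    return data  # "input" and "false": checked out untouched


mixed = b"dos line\r\nunix line\n"
for setting in ("true", "input", "false"):
    stored = on_commit(mixed, setting)
    print(setting, stored, on_checkout(stored, setting))
```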