The problem with peering from a logistics standpoint

Many ISPs run into this problem as part of their growing pains.  This scenario usually starts happening around their third or fourth peer.

Scenario: an ISP grows beyond the single connection it has.  This can be 10 meg, 100 meg, a gig, or whatever.  They start out looking for redundancy. The ISP brings in a second provider, usually at around the same bandwidth level.  This way the network has two roughly equal paths out.

A unique problem usually develops as the network grows to the point of maxing out the capacity of both of these connections.  The ISP has to make a decision: do they increase the capacity to just one provider? Most don’t have the budget to increase capacity to both providers at once. If you increase one, you are favoring that provider over the other until the budget allows you to increase capacity on both. You are essentially in a state where you have to favor one provider in order to keep up with capacity.  And if you have to fail over to the smaller pipe, things could be just as bad as being down.

This is where many ISPs learn the hard way that BGP is not load balancing. But what about padding, communities, local-pref, and all that jazz? We will get to that.  In the meantime, our ISP may have the opportunity to get to an Internet Exchange (IX) and offload things like streaming traffic.  Traffic comes back into a little more balance because the IX connection essentially gives you a third provider. But the growing pains don’t stop there.

As ISPs, especially WISPs, get more and more resources to put toward cutting down latency, they start seeking out better-peered networks.  The next growing pain that becomes apparent is that networks with lots of high-end peers tend to charge more money.  In order for the ISP to buy bandwidth from these types of providers, they usually have to do it in smaller quantities. This introduces the problem of mismatched pipe sizes again, with a twist. The twist is that the more (and better) peers a provider has, the more of your traffic is going to want to travel to that provider. So the more expensive peer, which you are probably buying less capacity from, now wants to handle more of your traffic.

So, the network geeks will bring up things like padding, communities, local-pref, and all the tricks BGP has.  But at the end of the day, BGP is not load balancing.  You can *influence* traffic, but BGP does not allow you to say “I want 100 megs of traffic here, and 500 megs here.”  Keep in mind BGP deals with reachability of IP blocks, not with the traffic itself.

So, how does the ISP solve this? Knowing your upstream peers is the first thing.  BGP looking glasses, peering reports such as those from Hurricane Electric, and general news help keep you on top of things.  Things such as new peering points, acquisitions, and new data centers can influence an ISP’s traffic.  If your equipment supports tools such as NetFlow or sFlow, you can begin to build a picture of your traffic and which ASNs it is going to. This is your first major step: get tools that show you which ASNs your traffic is going to.  You can then take this data and look at how your own upstreams are connected to those ASNs.  You will start to see things like “provider A is poorly peered with ASN 2906.”
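If you are on Mikrotik, for example, the built-in Traffic Flow feature can export NetFlow data to a collector. Here is a minimal sketch, assuming RouterOS v6 syntax, with a placeholder collector address:

  # Export flow records from all interfaces to a NetFlow v9 collector
  /ip traffic-flow
  set enabled=yes interfaces=all
  /ip traffic-flow target
  add dst-address=192.0.2.10 port=2055 version=9

Feed that into a collector that can aggregate flows by destination ASN and you will quickly see which upstreams your traffic actually wants to use.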

Once you know who your peers are and have a good feel for their peering, you can influence your traffic.  If you know you don’t want to send traffic destined for ASN 2906 in or out via provider A, you can start to implement AS padding and all the tricks we mentioned before.  But you need the bigger picture before you can do that.
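As a hedged sketch of one of those tricks on Mikrotik (RouterOS v6 routing-filter syntax; the peer and chain names are made up for illustration): lower the local preference of routes learned from provider A whose AS path contains 2906, so outbound traffic for those destinations prefers your other upstream.

  # Routes from provider A with 2906 anywhere in the AS path get a
  # lower local-pref, so the other upstream wins for that traffic
  /routing filter
  add chain=provider-a-in bgp-as-path="2906" set-bgp-local-pref=80 action=accept
  add chain=provider-a-in action=accept

  # Attach the filter to the provider A session
  /routing bgp peer
  set [find name="provider-a"] in-filter=provider-a-in

The same machinery works in the other direction: an out-filter chain using set-bgp-prepend pads your announcements toward a provider so the inbound path looks less attractive. Either way you are nudging traffic, not allocating bandwidth.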

One last note. Peering is dynamic.  You have to keep on top of the ecosystem as a whole.

WPA is not encrypting your customer traffic

There was a Facebook discussion that popped up tonight about how a WISP answers the question “Is your network secure?” There were many good answers and the notion of WEP vs WPA was brought up.

In today’s society, you need end-to-end encryption for data to be secure. An ISP has no control over where the customer traffic is going. Thus, by default, the ISP has no control over whether customer traffic is secure.  “But Justin, I run WPA on all my APs and backhauls, so my network is secure.”  Again, think about end-to-end connectivity. Every one of your access points can be encrypted, and every one of your backhauls can be encrypted, but what happens when an attacker breaks into your wiring closet and installs a sniffer on a router or switch port? What most people forget is that WPA encryption is only going on between the router/AP and the user device.  “But I lock down all my ports,” you say.  Okay, what about your upstream? Who is to say your upstream provider doesn’t have a port mirror running that dumps all your customer traffic somewhere?  “Okay, I will just run encrypted tunnels across my entire network! Ha! Let’s see you tear down that argument!” Again, what happens when it leaves your network?  The encryption stops at the tunnel endpoint, which is the edge of your network.

Another thing everyone hears about is hotspots. Every so often the news runs a fear piece on unsecured hotspots.  This is the same concept.  If you connect to an unsecured hotspot, it is not much different than connecting to a hotspot where the WPA2 key is on a sign behind the cashier at the local coffee shop. The only difference is the “hacker” has an easier time grabbing any unsecured traffic you are sending. Notice I said unsecured.  If you are using SSL to connect to a bank site, that session is encrypted.  No sniffing going on there.  If you have an encrypted VPN, the possibility of traffic being sniffed is next to none. I say next to none because certain types of VPNs are more secure than others. Does that mean the ISP providing the Internet to feed that hotspot is insecure? No. There is no feasible way for the ISP to provide end-to-end security of user traffic on the open Internet.

These arguments are why things like SSL and VPNs exist. Google Chrome now expects websites to be SSL enabled in order to be marked as secure. VPNs can ensure end-to-end security, but only between two points.  Eventually, you will have to leave that safety and venture out into the wild west of the Internet.  Things like intranets exist so users can have access to information but still be protected, and even most of that runs over encrypted SSL these days so someone can’t install a sniffer in the basement.

So what is a WISP supposed to say about security? The WISP is no more secure than any other ISP, nor is it any less secure.  The real security comes from the customer: making sure their devices are up-to-date on security patches, including the often-forgotten router. Secure passwords, paying attention to browser warnings, e-mail awareness, and similar habits are where the real user security lies. VPN connections to work. Using SSL ports on e-mail. Using SSH and secure RDP for network admins. Firewalls can help, but they don’t encrypt the traffic. Does all traffic need to be encrypted? No.

Everything you wanted to know about NTP

Network Time Protocol (NTP) is a service used to synchronize time on network-connected devices.   Before we dive into what NTP is, we need to understand why we need accurate time.

The obvious thing is that network devices need an accurate clock.  Log files with proper time stamps are important in troubleshooting.  Accurate timing also helps with security measures: some attacks exploit time-stamp handling to slip in bad payloads or manipulate data. Some companies also require accurate time stamps on files and transactions for compliance purposes.

So what are these Stratum levels I hear about?
NTP organizes servers into levels called strata. A stratum is simply the distance from the reference clock source.  Clocks that deliver UTC (Coordinated Universal Time) with little to no delay (we are talking nanoseconds) are Stratum-0 devices. These are usually atomic and GPS clocks, and they are not placed directly on the network.  A Stratum-1 server is connected to a Stratum-0 clock directly, via GPS or a national time and frequency transmission, not over a network, which is what makes it so accurate.  A Stratum-2 server receives NTP packets from a Stratum-1 server, a Stratum-3 from a Stratum-2, and so on.  It’s all relative to where the server sits in relation to a Stratum-1 server.

Why are there levels?
The further you get from Stratum-0, the more delay there is.  Things like jitter and network delays affect accuracy.  Most of us network engineers are concerned with milliseconds (ms) of latency; time servers are concerned with nanoseconds (ns). Even a server directly connected to a Stratum-0 reference will add 8-10 nanoseconds to UTC time.

My Mikrotik has an NTP server built in. Is that good enough?
This depends on what level of accuracy you want. Do you just need to make sure all of your routers have the same time? Then synchronizing with an upstream time server is probably good enough. Having 5,000 devices with the same time, and not having to set them or keep them in sync manually, is a huge deal.
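On a Mikrotik, for example, syncing to an upstream server is a couple of lines. Here is a sketch in RouterOS v6 syntax, using the public NTP pool as the upstream:

  # Keep the router clock synced from public pool servers
  /system ntp client
  set enabled=yes server-dns-names=0.pool.ntp.org,1.pool.ntp.org

With the separate ntp package installed, the router can also act as an NTP server (/system ntp server) so your CPE and other gear can sync from it.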

If you run a VoIP switch, or need to meet compliance requirements on server transactions, things like SOX compliance, you may need a more accurate time source.

What can I do for more accurate time?
Usually, a dedicated appliance is what many networks use.  These are purpose-built pieces of hardware that receive a signal from GPS. The more accurate you need the time to be, the more expensive it will become.  Devices that need to be accurate to the nanosecond are usually more expensive than ones accurate to a microsecond.

If you google “NTP appliance” you will get a bunch of results.  If you want to step up from what you are doing currently, you can look into these links:

http://www.satsignal.eu/ntp/Raspberry-Pi-NTP.html

How to Build a Stratum 1 NTP Server Using A Raspberry Pi

Building a Stratum 1 NTP Server with a Raspberry Pi

The problem with speedtests

Imagine this scenario. Outside your house, the most awesome super highway has been built.  It has a speed limit of 120 miles per hour.  You calculate that at those speeds you can get to and from work 20 minutes earlier. Life is good.  Monday morning comes, you hop in your Nissan GT-R, put on some new leather driving gloves, and crank up some good driving music.  You pull onto the dedicated on-ramp from your house and are quickly cruising at 120 miles an hour. You make it into work before almost anyone else. Life is good.

Near the end of the week, you notice more and more of your neighbors and co-workers using this new highway.  Things are still fast, but you can’t get up to speed like you could earlier in the week.  As you ponder why, you notice you are coming up on the off-ramp to your work.  Traffic is backed up; everyone is trying to get to the same place.  As you wait in line to get off the super highway, you notice folks passing you by, continuing down the road at high rates of speed.  You surmise your off-ramp must be congested because it is getting used more now.

Speedtest servers work the same way. A speedtest server is a destination on the information super-highway. Man, there is an oldie term.  To understand how speedtest servers work, we need a quick understanding of how the Internet works.   The Internet is basically a bunch of virtual cities connected together.  Your local ISP delivers a signal to you via wireless, fiber, or some other sort of media. When it leaves your house, it travels to the ISP’s equipment, is aggregated with your neighbors’ traffic, and is sent over faster lines toward larger cities. It’s just like a road system. You may get access via a gravel road, which turns into a 2-lane blacktop, which then may turn into a 4-lane highway, and finally a super-highway.  The roads you take depend on where you are going. Your ISP may not have much control over how the traffic flows once it leaves their network.

Bottlenecks can happen anywhere: fiber optic cuts, oversold capacity, routing issues, and plain old unexpected usage. Why are these important? All of these can affect your speedtest results, and they can be totally outside the control of both your ISP and you.  They can also be totally your ISP’s fault. They can also be your fault, just like your car can be.  An underpowered router can struggle to keep up with your connection. Much like a moped on the above super-highway can’t keep up with a 650-horsepower car to fully utilize the road, your router might not be able to keep up either.  Other things, such as computer viruses and low-performing components, can cause issues as well.

Just about any network can become a speedtest.net node, or a node for some of the other speedtest sites.  These networks have to meet minimum requirements, but there is no indicator of how utilized a given speedtest server is.  A network could put one up, and it could be 100 percent utilized when you go to run your speedtest. This doesn’t mean your ISP is slow, just that the off-ramp to that speedtest server is slow.

The final thing we want to talk about is the utilization of your Internet pipe from your ISP.  This is something most don’t take into consideration.  Let’s go back to our on-ramp analogy.  Your ISP is selling you a connection to the information super-highway.   Say they are selling you a 10 megabit download connection.  If you have a device in your house streaming HD Netflix, which is typically 5 megs or so, that means you only have 5 megs available for a speedtest while that HD stream is happening. A speedtest only tests your currently available capacity.  Many folks think a speedtest somehow stops all the traffic on your network, runs the test, and then starts the traffic back up. It doesn’t work that way. A speedtest tests the available capacity at that point in time.  The same is true for any point between you and the speedtest server.  Remember our earlier analogy about slowing down when you got to work because so many people were trying to get there: they exceeded the capacity of that destination.  That does not mean your connection is slow, because people were zooming past you on their way to less congested destinations.

This is why speedtest results should be taken with a grain of salt. They are a useful tool, but not an absolute. A speedtest server is just a destination, and that destination can have bottlenecks other destinations don’t.  Even after this long article, there are many other factors which can affect Internet speed. Things we didn’t touch on, like peering, the technology used, and speed limits on your plan, can also affect your speed to various destinations.

Some random Visio diagrams

Below, we have some Visio diagrams we have done for customers.

This first design is a customer mesh into a couple of different data centers. We are referring to this as a switch-centric design; it has been talked about in the forums, and “switch-centric” seems as good a name as any.

This next design is a Netonix switch and a Baicells deployment.

Design for a customer

Use tarpit vs drop for scripts blocking attackers

There are many scripts out there, especially on Mikrotik, which list drop as the action for denying bad-guy traffic.  While this isn’t wrong, you can often put the tarpit action to better use in rules that drop attack traffic.

So what is Tarpit?
Tarpit is fairly simple. When connections come in and are “tarpitted,” they don’t go back out. The connection is accepted, but when data transfer begins to happen, the TCP window size is set to zero.  This means no data can be transferred during the session.  The session is held open, and requests from the sender (aka the attacker) to close the session are ignored. They must wait for the connection to time out.

So what’s the downside?
TCP is not really designed to hold connections open like this, so it can add overhead on an already taxed system.  Most modern firewalls can handle tarpitting without an issue. However, if you get thousands of connections, it can overwhelm a system or a particular protocol.

How can I use it?
If you have scripts, such as the SSH brute-force drop script from the Mikrotik wiki, simply change the action to “tarpit” instead of “drop”.
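As a sketch, assuming the ssh_blacklist address-list name that wiki script builds, the final rule goes from drop to tarpit:

  # Was: action=drop. Hold blacklisted SSH attackers open instead.
  /ip firewall filter
  add chain=input protocol=tcp dst-port=22 src-address-list=ssh_blacklist action=tarpit comment="tarpit blacklisted SSH attackers"

The attacker’s scanner now sits there waiting on sessions that will never move data, instead of moving on to its next target.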

The state of Data Centers and Co-Location in Indianapolis

We like to refer to Indianapolis, Indiana as an “NFL City” when explaining the connectivity and peering landscape.  It does not have a large network presence like Chicago or Ashburn, but it has enough networks to make it a place for great interconnects.

At the heart of Indianapolis is the Indy Telcom complex, www.indytelcom.com (currently down as of this writing).  This is also referred to as the “Henry Street” complex because West Henry Street runs past several of the buildings.   It is a large campus with many buildings on it.

One of the things many of our clients ask about is getting connectivity from building to building on the Indy Telcom campus. Lifeline Data Centers ( www.lifelinedatacenters.com ) operates a carrier hotel at 733 Henry. With at least 30 on-net carriers and access to many more, 733 is the place to go for cross-connect connectivity in Indianapolis.   We have been told by Indy Telcom that the conduits between the buildings on the campus are 100% full. This makes connectivity challenging at best when going between buildings. The campus has lots of space, but the buildings are on islands if you wish to establish dark fiber cross-connects between them. Many carriers have lit services, but due to the way many carriers provision things, getting a strand, or even a wave, is not always possible.  There are some options from companies like Zayo or Lightedge for getting connectivity between buildings, but it is not like Chicago or other big data center markets.

However, there is a solution for those looking to establish interconnections.   Lifeline also operates a facility at 401 North Shadeland, referred to as the EastGate facility. This facility is built on 41 acres, is FedRAMP certified, and has a bunch of features.  There is a dark fiber ring running between 733 and 401, which is ideal for folks looking for both co-location and connectivity.  Servers and other infrastructure can be housed at EastGate while connectivity is pulled from 733.  This solves the 100%-full conduit issue at Indy Telcom. MidWest Internet Exchange ( www.midwest-ix.com ) is also on-net at both 401 and 733.

MidWest-IX is also at 365 Data Centers ( http://www.365datacenters.com ) at 701 West Henry.  365 has a national footprint and thus draws somewhat different clients than some of the other facilities; it operates data centers in Tennessee, Michigan, New York, and other states. MidWest has dark fiber over to 365 in order to bring them onto their Indy fabric.

Another large presence at Henry Street is Lightbound ( www.lightbound.com ).  They have a couple of large facilities. According to PeeringDB, only three carriers are in their 731 facility.   However, their web site claims 18+ carriers across their facilities, though it does not say who they are.

I am a big fan of PeeringDB for knowing who is at what facilities, where peering points are, and other geeky information.  Many of the facilities in Indianapolis are not listed on PeeringDB.  Some other data centers we know about:

Zayo (www.zayo.com)
LightTower ( www.lightower.com )
Indiana Fiber Network (IFN) (https://ifncom.co/)
Online Tech ( www.onlinetech.com )

On the north side of Indianapolis, you have Expedient ( www.expedient.com ) in Carmel. Expedient says they have “dozens of on net carriers among all markets”.  There are some other data centers in the Indianapolis metro area as well; Data Cave in Columbus is within decent driving distance.

Where do TRILL and VXLAN fit in your strategy?

As networking trends yo-yo between layer 3 and layer 2, different protocols have emerged to address the issues with large layer-2 networks. Protocols such as Transparent Interconnection of Lots of Links (TRILL), Shortest Path Bridging (SPB), and Virtual Extensible LAN (VXLAN) have emerged to address the need for scalability at layer 2.   Cloud scalability, spanning tree issues, and big broadcast domains start to become a problem in a large data center or cloud environment.

To figure out whether something like TRILL is a solution for you, you must understand the problem TRILL addresses. The same goes for the rest of the mentioned protocols. When it boils down to it, the reason for looking at such protocols is that you want high switching capacity, low latency, and redundancy.  The current de facto standard, Spanning Tree Protocol (STP), simply is unable to meet the needs of modern layer-2 networks.  TRILL addresses STP’s limitation of allowing only one active path between switches or ports; STP prevents loops by managing which layer-2 paths are active.   TRILL applies the Intermediate System-to-Intermediate System (IS-IS) protocol, a layer-3 link-state routing protocol, to layer-2 devices.

For those who say TRILL is not the answer, SPB (also known as 802.1aq) and VXLAN are the alternatives. A presentation at NANOG 50 in 2010 addressed some of the SPB vs. TRILL debate and goes into great detail on the differences between the two.

The problem, which is one most folks overlook, is that you can only make a layer-2 network so flat.  The trend for a while, especially in data centers, has been to flatten out the network. Is TRILL better? Is SPB better? The real question isn’t which is the better solution to use.  What needs to be addressed is the design philosophy behind why you need such things at all.   Large layer-2 networks are generally a bad idea, and scaling issues can almost always be solved with layer 3.

So, and this is where the philosophy starts, is TRILL, SPB, or even VXLAN for you? Yes, but with a very big asterisk. TRILL is a stop-gap measure, one of those targeted things to use in specific instances. TRILL reduces complexity and makes layer 2 more robust when compared to MLAG. Where would you use such things? One common place the decision comes up is in a virtualized environment such as vSphere.

Many vendors, such as Juniper, have developed their own solutions to these problems.  Juniper’s Virtual Chassis does away with the spanning tree issues TRILL addresses.   Cisco has FabricPath, which is Cisco’s proprietary TRILL-based solution; keep in mind, this is still TRILL.   If you want to learn more about FabricPath, this article by Joel Knight gets to the heart of it.

Many networks see VXLAN as their upgrade path.  VXLAN allows layer 2 to be stretched across layer-3 boundaries. If you are a “Microsoft person,” you probably hear an awful lot about Network Virtualization using Generic Routing Encapsulation (NVGRE), which can encapsulate layer-2 frames in IP.
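As a rough sketch of the idea on Mikrotik (RouterOS v7 syntax; the names and addresses here are made up, and other vendors look similar): a VXLAN interface gets a VXLAN Network Identifier (VNI), points at a remote VTEP across the routed network, and is then bridged with the local layer-2 ports.

  # VXLAN interface carrying layer 2 over the routed core
  /interface vxlan
  add name=vxlan100 vni=100

  # Remote tunnel endpoint (VTEP), reachable over plain layer 3
  /interface vxlan vteps
  add interface=vxlan100 remote-ip=192.0.2.2

  # Bridge the tunnel with the local LAN ports
  /interface bridge port
  add bridge=bridge1 interface=vxlan100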

The last thing to consider in this entire debate is how Software Defined Networking (SDN) plays into it. Many folks think controllers will make ECMP and MLAG easy to create and maintain. If a centralized controller has a complete view of the network, there is no longer a need to run protocols such as TRILL.   The individual switch no longer makes the decision; the controller does.

Should you use TRILL, VXLAN, or any of the others mentioned? If you have a large layer-2 virtualized environment, they might be something to consider.  If you are an ISP, there is a very small case for running TRILL anywhere other than your data center. Things such as Carrier Ethernet and MPLS are the way to go.

Netflix, IPv6, and queuing

While trying to get my PlayStation to download the latest “No Man’s Sky” update quicker, I figured I would share a little torch action.  This is showing my wife’s iPad talking to Netflix while she is watching a streaming TV show. Keep in mind this is just an iPad, not some 4K TV.

Some things to note as you watch this (no sound).

1. Uncapped, the connection bursts to 50-60+ megs.
2. The slower you queue the connection, the more time it spends downloading data.  At slower queue speeds, the bursts last longer.
3. If you are handing out IPv6 to customers, you should be queuing them as well.

Just something quick and dirty to keep in mind.
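On the queuing point (number 3 above): a Mikrotik simple queue can take both the customer’s IPv4 address and their delegated IPv6 prefix as targets, so the v6 traffic lands in the same queue instead of running uncapped. A sketch with documentation addresses and a made-up 10M up / 50M down plan:

  # One queue covering the customer's IPv4 address and IPv6 delegation
  /queue simple
  add name=customer-smith target=192.0.2.50/32,2001:db8:1234::/56 max-limit=10M/50M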